Home Insights NGS Big Data Issues for Biomanufacturing

NGS Big Data Issues for Biomanufacturing

January 13, 2017

January 15, 2017 (Vol. 37, No. 2)

Right Approach Can Bring Significant Economic Benefits and More Regulatory Freedom

The GEN Special Section on Big Data consists of four articles:
Precision Medicine Research in the Million-Genome Era
Utilizing Machine-Learning Capabilities
NGS Big Data Issues for Biomanufacturing
Visualization for Advanced Big Data Analysis

Chinese hamster ovary (CHO) cells are responsible for billions of dollars in biopharma products. Unfortunately, there are some fundamental gaps in understanding these crucial input materials in the biomanufacturing process. In this article, the focus will be on how next-generation sequencing (NGS) can be used to not only provide greater insights into these crucial materials but, potentially, to award biomanufacturers greater regulatory freedom.

It is widely accepted that cell lines evolve during culture; with every cell division, thousands of mutations accumulate, and this process selects for better-growing cells, given the conditions. As culture conditions are optimized for production, cells with different genomes will naturally be selected.

The sequencing of the Chinese hamster genome and several derived CHO cell-lines illustrates the accumulation of mutations. What researchers found was that during the selection of new recombinant protein-producing cell lines, hundreds of thousands of new SNPs occur and major re-arrangements take place, with as few as 40% of CHO cell-line chromosomes hybridizing to Chinese hamster chromosome probes.¹A more significant insight was that many genes encoding proteins that prevent apoptotic pathways in CHO cell-lines are mutated, making the cells less likely to die in the event of stress. Resistance to apoptosis is good for ensuring flexible culture conditions, but bad in the sense that the cells could accumulate DNA damage or organellar defects that will affect active recombinant protein expression.

Crucial Input Material

Thus, it is clear that an accurate assessment of the genetic status of the cell lines provides a good, ground-level understanding of a crucial input material. In addition to greater control over the input materials, NGS could be used to gain greater flexibility in biomanufacturing by enabling superior knowledge of the details of the CHO cell genome that can directly impact product quality.

Increasing regulatory flexibility due to superior cell-line understanding is desirable. The increased level of knowledge could lock out biosimilar competition by establishing growth regimes that support a much tighter target-product quality profile.

A more detailed analysis of the CHO-K1 cell line revealed insights into immunogen production, regulation of glycosylation, and the presence of sub-populations—all are crucial for demonstrating superior input material knowledge.² The CHO-K1 sequencing produced a very profound insight into the regulation of glycoprotein a (1, 3) galactosyltransferase (Ggta1), an enzyme that is not expressed in humans, but is active in mouse and modifies recombinant IgAs.

Ggta1 can lead to an immune response in humans. CHO cells lack the sufficient enzymatic machinery to produce glycan structures with the a-Gal epitopes, except in very small subpopulations. Monitoring for these subpopulations is crucially important to avoid a potentially lethal response.

Investigators also found a novel gain-of-function mutation that induces MGAT3 expression, coding for GnTIII/GlcNAcTIII to add the bisecting (b4) N-acetylglucosamine (GlcNAc) branch, which is found on about 10% of human IgG glycoforms. This enables appropriate glycosylation of antibodies in a cell line that lacked such capability originally.

Gain-of-function mutations are relatively rare, and this observation exemplifies the power of actually casting a broad net to understand the cell line’s genome as well as the impact of the selection process in creating the CHO-K1 cell line.

NGS has made important contributions to our basic understanding of pathways in CHO-K1 cells that directly influence the production of active, quality recombinant proteins. One of the principal advantages of using NGS is that it allows hypothesis-free questioning of the entire genome, verifying the biochemical and genetic makeup of the cells.

These results show how NGS could be a decisive tool in a CAPA investigation when a product falls outside of specifications. Instead of a myriad of PCR markers, each of which has a detailed question associated with it, a broad net can be cast, and PCR can be used to follow up in detail, if desired.

Regulatory flexibility in defining the design space for biomanufacturing is a highly prized and valuable achievement. The basis for regulatory flexibility is found in the ICHQ8 R2:

“In addition, the applicant can choose to conduct pharmaceutical development studies that can lead to an enhanced knowledge of product performance over a wider range of material attributes, processing options, and process parameters.

“Inclusion of this additional information in the pharmaceutical development section provides an opportunity to demonstrate a higher degree of understanding of material attributes, manufacturing processes, and their controls. This scientific understanding facilitates establishment of an expanded design space.”

There are a number of significant challenges to gaining regulatory flexibility with Big Data. Most importantly, the data management and analytical system used must support the scale of the NGS data generated and provide an end-to-end view of the process.

Fundamental Challenge

The fundamental business challenge of Big Data in biomanufacturing is providing an integrated view of critical quality attributes to ensure a quality product. With a consistent PQP, you can reduce inventory, waste, product recall risk, and ease of biosimilar production through superior process knowledge. The design space concept is crucial for successful pharmaceutical development.

The FDA reports a significant increase in first-time acceptance of applications as a result of the implementation of ICH Q 8, 9, and Q10: “As a result of better collaboration between the industry and the agency, `first cycle’ approvals are exceeding 70%, without any decrease in FDA’s approval standards.”

Clearly, QbD is paying off.³ It is crucially significant to remember that the genome is the design space within which a cell operates.

The data-management challenges of working with NGS are legend. From the perspective of biomanufacturing, there are a number of crucial pieces of information that can be provided by NGS, including status of the host genome, status of the recombinant gene construct, the presence of immunogenic subclones, and the absence of viruses.

One of the challenges in working with NGS data is the size of the data, which makes the basic manipulation and review of results challenging. Consequently, numerous pipelines have been established to immediately convert sequence data to variation data, and transfer the primary data to storage.⁴

One downside is that GATK workflows require repeated re-analysis of the primary data in a batch mode, because there is not convenient representation of the data when the data is stored as files. So, to confirm variants, the entire pipeline has to be re-run. Thus, accurately finding minority calls that can directly reveal the presence of subclonal populations, batch to batch variation, sequencing quality per instrument type, and operator, is a challenging and time-consuming process.

An array-native, computational database approach can avoid both moving data and repeated re-analysis, especially when working with matrix representations of NGS data on the scale of 3.2 billion columns and billions of rows. NCBI has been using such a database—called SciDB—to support workflows that enablethe simultaneous comparisons of multiple whole or exome genomes and minority point mutation calling.

Essential Feature

The essential feature of SciDB is that it has a multi-dimensional array or matrix data model—meaning that SciDB stores data with the natural ordering from collection. SciDB can easily work with 3.2 billion columns and millions of rows. For BAM fragments, they can be piled up along the genomic coordinate as one dimension.

Figure 1 illustrates that type of pileup. Paradigm4 has used these pileups to find minority variants, occurring at 10–30% of the normal calls in sets of 20 or more whole genomes from different sequencing platforms using joint-variant calling. SciDB can perform correlations and clustering operations in database, without moving the data and without requiring data to fit into memory.

We successfully clustered the minority variants to distinguish between PacBio, Illumina, and CG sequencing platforms (data not shown) of NA12878, the genome in a bottle cell line.

SciDB enables a number of important capabilities regarding quality of data, data trends, and the ability to easily interrogate data without moving the data. Calculations of coverage and bias relative to a chromosome can be quickly and traceably performed in database, to keep track of the accuracy of the sequencing process over time, accommodate method changes, and track the impact of other procedures and supply chain components on the NGS test.

Transition/transversion ratios can monitor the stability of the cell line’s genome at the global level, and reveal problems with DNA repair processes that could predict impending issues with production.

Importantly, the SciDB technology has many of the required attributes to support 21 CFR Part 11 and ICH Q10 requirements—ACID (atomicity, consistency, isolation, durability), user permissions and controls, versioning, and traceability logs. Because data can be analyzed in the database, data movement does not need to be constrained by process controls and a fundamental risk to supporting the design space can be eliminated.

An often overlooked property of databases is the ability to support unannounced inspections by regulatory authorities. It is easy to convince yourself of the ease of inspection by going to the NCBI’s 1000 genomes browser website, which is built on top of a ~2 TB SciDB.

In this article, we explained how NGS data can be used to establish quality metrics for biomanufacturing by characterizing and monitoring cell-line genome status. Because NGS results encompass the entire genome, the design space for the cell, NGS is uniquely positioned to monitor both the input materials and the cell’s genome status during culturing.

The specific details that can be monitored include the recombinant protein and the absence or presence of mutations that control quality product production such as glycosylation patterns. Using SciDB, biomanufacturers can confidently manage and exploit their NGS data to improve the economics of biomanufacturing.

Figure 1. A SciDB BAM PileUp browser view of cell-line sequences from NA12878, the Genome in a Bottle (GIAB) cell line, that are stored persistently as an array in the SciDB Molecular BioMarker Workbench.

Zachary Pitluk, Ph.D. ([email protected]), is vp of business development, life sciences, and healthcare at Paradigm4. The author would like to acknowledge Dino J. Farina for all of his pedagogy in the area of ICHQ8 R2.

References:
1. Genomic landscapes of Chinese Hamster Ovary cell lines as revealed by the Cricetulus griseus draft genome”, Lewis, N. E. et al. Nature Biotechnology, 31 (2013) 759-765.
2. Xu X, Nagarajan H, Lewis NE et al. The genomic sequence of the Chinese hamster ovary (CHO)-K1 cell line. Nat. Biotechnol. 29(8), 735–741 (2011).
3. White Paper: FDA and Accelerating the Development of the New Pharmaceutical Therapies.
4. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 Genome Research 20:1297-303.

NGS Big Data Issues for Biomanufacturing

Single-Cell Cloning Remains a Challenge

New Approach to Nanopore Sequencing That Is Sure to CATCH Your...