January 15, 2017 (Vol. 37, No. 2)
Data Handling, Analyzing Datasets, and the TCGA
Tackling the TCGA Mutation Calling Project
Established in 2006, The Cancer Genome Atlas (TCGA) serves as a vital compendium of genetic mutations responsible for cancer, derived using next-generation DNA sequencing. Today, TCGA includes more than 2.5 petabytes of data collected from nearly 11,000 patients, describing 34 different tumor types (including 10 rare cancers) based on paired tumor and normal tissue sets.
Over the past 10 years, both the importance of mutation calling and the knowledge surrounding it have grown, changing the way analyses are conducted. To ensure this dataset remains an up-to-date resource for the global research community, the TCGA team decided to go back and resequence the more than 10,000 exomes contained in the database and produce a multicaller somatic mutation dataset.
Resequencing the TCGA dataset was a massive undertaking. The computing resources needed for a large-scale project of this nature were not in place at TCGA member institutes. The DNAnexus Platform met important requirements for the mutation calling project, including patient security, a scalable environment that could handle tens of thousands of exomes, and reproducibility of results, according to Andrew Carroll, vp of science, DNAnexus. Over a four-week period, approximately 1.8 million core-hours of computational time were used to process 400 TB of data, yielding reproducible results.
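To put those figures in perspective, the short calculation below converts the reported totals into an average level of parallelism. It is only a back-of-the-envelope sketch: it assumes the compute was spread evenly over a full 28 days, which the source does not state, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the reported figures.
# Assumption (not from the source): usage was spread evenly over 28 full days.
CORE_HOURS = 1.8e6   # approximate core-hours reported
WEEKS = 4            # reported duration of the run
DATA_TB = 400        # reported volume of data processed

wall_clock_hours = WEEKS * 7 * 24                     # hours in four weeks
avg_concurrent_cores = CORE_HOURS / wall_clock_hours  # cores busy on average
core_hours_per_tb = CORE_HOURS / DATA_TB              # compute spent per terabyte

print(f"Wall-clock hours:     {wall_clock_hours}")
print(f"Average cores in use: {avg_concurrent_cores:,.0f}")
print(f"Core-hours per TB:    {core_hours_per_tb:,.0f}")
```

Under that assumption, the project would have kept roughly 2,700 cores busy around the clock, at about 4,500 core-hours per terabyte processed.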
“The value of this reanalysis under a single methodology across new standardized mutation callers allows for the samples to be compared across cancer types. This will facilitate further new findings, such as if one individual’s breast cancer may show greater genomic similarity to a subtype of ovarian cancer than to other types of breast cancer,” said Dr. Carroll. “In the future, we believe patients will be treated based on their genomic profile rather than the origin of their cancer.”
Acquiring, Managing, Integrating, and Analyzing Large Datasets
To achieve better health outcomes, it is imperative that researchers are able to retrieve and understand all the data necessary to aid in drug discovery and development. But even in the era of Big Data, researchers are met with obstacles and silos that make accessing and using research and clinical data cumbersome and time-consuming.
Fortunately, the industry has taken great strides in fostering data-sharing to advance translational research, and organizations are now able to accelerate the understanding of, and in many cases advance, drug discovery.
“For example, Albany Molecular Research (AMRI), a global contract research and manufacturing organization that works with the life sciences industry to improve patient outcomes and quality of life, is integrating cloud-based data management, aggregation and analysis that compiles and relates experimental data to scientifically meaningful concepts,” says Jens Hoefkens, Ph.D., director of research, strategic marketing, and informatics for PerkinElmer. “AMRI is leveraging the PerkinElmer Signals™ for Screening platform in its cell::explorer instrumentation to enhance its ability to quickly acquire, manage, integrate, and analyze large, complex datasets from the various translational platforms to improve its research and development processes. This process helps to enhance distributed research and accelerate scientists’ knowledge and information pass-through, which can ultimately lead to better and faster drug discovery.”
According to Dr. Hoefkens, PerkinElmer also offers the Signals™ for Translational platform, which is designed to help translational researchers easily integrate experimental and clinical data from existing proprietary, private, and public databases (such as GEO and tranSMART) and connect with other enterprise systems. It is delivered in a Software as a Service (SaaS) model to provide organizations with flexibility and scalability, and it allows researchers not only to store structured and unstructured data, but also to easily query, extract, identify, and store information and normalize the representation of stored data.
“The platform’s ability to enhance collaboration and eliminate silos between research and clinical data represents a significant step toward advancing precision medicine, ensuring optimal pairing between patients and drugs,” notes Dr. Hoefkens.
Analyzing Large Geno & Pheno Datasets
From its founding in 1997, the Swiss-Finnish bioinformatics company BC Platforms has specialized in the management, sharing, and analysis of massive and ever-growing genomic and phenotypic datasets.
While next-gen sequencing (NGS) produces a colossal amount of raw data, the aligned and called, analysis-ready data is quite compact. Due to the high cost of NGS, projects using whole-genome sequencing remain limited in terms of the number of patients. Today the largest datasets for downstream analyses are produced by population-scale biobanks using low-cost genotyping and imputation, thus “inferring” whole-genome sequence data for hundreds of thousands of patients.
Big-data techniques are required to facilitate downstream analysis of such datasets, together with linked clinical data. Instead of using Hadoop and forcing the reinvention of statistical algorithms developed over the last 15 years, BC Platforms developed and published a specialized data-tiling method. Optimized for cloud use, tiling splits data by chromosome and subject group, says Doug Last, director, sales, North America.
In addition to high compression rates (whole-genome data with quality scores from one million subjects consume less than 40 TB, or under 40 MB per subject), tiling facilitates efficient, massively parallel downstream data analysis using popular statistical genomics tools that were originally developed for much smaller datasets. Doing all this in an environment using large numbers of low-cost CPUs has proven quite effective, notes Last.
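The general idea of tiling by chromosome and subject group can be illustrated with a toy example. The sketch below is not BC Platforms' published method; the function names, the in-memory data layout, and the use of threads as stand-ins for separate cloud workers are all assumptions made for illustration. It splits a small genotype table into tiles, runs a simple per-tile statistic in parallel, and merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def make_tiles(genotypes, variant_chrom, group_size):
    """Split a subjects-by-variants table into (chromosome, subject group) tiles.

    genotypes: dict of subject ID -> list of allele counts (0, 1, or 2) per variant
    variant_chrom: list of chromosome labels, one per variant column
    """
    cols_by_chrom = defaultdict(list)
    for col, chrom in enumerate(variant_chrom):
        cols_by_chrom[chrom].append(col)

    subjects = sorted(genotypes)
    tiles = {}
    for start in range(0, len(subjects), group_size):
        group = subjects[start:start + group_size]
        for chrom, cols in cols_by_chrom.items():
            tiles[(chrom, start // group_size)] = {
                s: [genotypes[s][c] for c in cols] for s in group
            }
    return tiles

def tile_allele_counts(tile):
    """Per-tile work: sum allele counts per variant across the tile's subjects."""
    return [sum(column) for column in zip(*tile.values())]

def allele_counts_by_chrom(tiles):
    """Analyze tiles in parallel, then merge partial sums across subject groups."""
    keys = list(tiles)
    # Threads stand in for what would be separate workers in a real deployment.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(tile_allele_counts, (tiles[k] for k in keys)))
    merged = {}
    for (chrom, _group), counts in zip(keys, partials):
        if chrom in merged:
            merged[chrom] = [a + b for a, b in zip(merged[chrom], counts)]
        else:
            merged[chrom] = counts
    return merged

# Toy data: four subjects, four variants on two chromosomes (all hypothetical).
genotypes = {
    "s1": [0, 1, 2, 0],
    "s2": [1, 1, 0, 2],
    "s3": [2, 0, 0, 1],
    "s4": [0, 0, 1, 1],
}
variant_chrom = ["chr1", "chr1", "chr2", "chr2"]

tiles = make_tiles(genotypes, variant_chrom, group_size=2)
totals = allele_counts_by_chrom(tiles)
print(totals)
```

Because each tile is self-contained, a tool written for small datasets can in principle be run unchanged on every tile, with only the final merge step needing to know about the full cohort.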
“Today, BC Platforms’ technology runs and automates some of the largest systems ever built. An example of such a system is the OBREA (Open Biobank Research Enhancement Alliance) network, a federated biobank system which is on target to link whole genome and clinical data for more than 5 million subjects by 2020,” explains Last. “Initiated together with biobanks and the pharma industry, OBREA is open for any academic and commercial partners interested in utilizing their biobanks for advancing research and drug discovery.”
Moving Bioprocess Information into the Cloud and toward Big Data
Discussions regarding the storage, handling, and exchange of data in the life sciences are often focused on genomics and proteomics. But data management is also an issue in other disciplines, such as bioprocessing, which has its own requirements for storing and handling large amounts of data.
The new eve® bioprocess software from Infors provides a simple, yet powerful way to plan, record, and analyze bioprocesses using the ElasticSearch® NoSQL (Not Only SQL) database, while easing the path toward cloud-based storage, notes Tony Allman, Ph.D., product manager, fermentation, Infors.
Structured Query Language (SQL) databases have been used for supervisory control and data acquisition (SCADA) bioprocess data for over four decades. They are a good solution for the type of data currently being stored, such as sensor values, set points, and other process parameters, points out Dr. Allman, adding, however, that not all bioprocess information fits neatly into a traditional SQL storage structure, since batch records must also include experimental details and specifics such as media, validation, cell culture characteristics, and processed data (e.g., calculated from soft sensors).
“In general, NoSQL databases are better suited for handling Big Data, which tends to be less structured,” he explains. “Such flexible, structured databases enable faster data processing and retrieval than SQL, also being superior for addressing record-keeping involving evolving data requirements with an emphasis on scalability.”
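The contrast with a fixed table schema can be sketched with plain JSON documents. The example below does not reflect the eve data model or the Elasticsearch API; every field name is hypothetical, and a simple Python list stands in for a document store. It shows two batch records with different shapes living side by side, with the second adding soft-sensor and validation fields that the first never had.

```python
import json

# Hypothetical batch records; all field names are illustrative only.
batch_a = {
    "batch_id": "B-001",
    "setpoints": {"temperature_c": 37.0, "ph": 7.0},
    "sensor_readings": [
        {"t_min": 0, "temperature_c": 36.8, "ph": 7.02},
        {"t_min": 10, "temperature_c": 37.1, "ph": 6.98},
    ],
    "medium": {"name": "example medium", "glucose_g_per_l": 20},
}

# A later batch adds fields the first one lacks, with no schema migration.
batch_b = {
    "batch_id": "B-002",
    "setpoints": {"temperature_c": 30.0, "ph": 6.5},
    "sensor_readings": [{"t_min": 0, "temperature_c": 30.2, "ph": 6.51}],
    "soft_sensors": {"estimated_biomass_g_per_l": 1.4},
    "validation": {"status": "pending"},
}

# Stand-in for a document store: each record is kept as a JSON string.
store = [json.dumps(batch) for batch in (batch_a, batch_b)]

def find_with_field(documents, field):
    """Return the batch IDs of documents that contain a given top-level field."""
    return [doc["batch_id"] for doc in map(json.loads, documents) if field in doc]

with_soft_sensors = find_with_field(store, "soft_sensors")
with_setpoints = find_with_field(store, "setpoints")
print(with_soft_sensors, with_setpoints)
```

In a relational layout, adding the soft-sensor and validation details would typically mean new columns or tables; in a document layout, older records are simply left without those fields.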
For data integration, eve not only imports multiple offline formats (Excel, csv, iris…) but also supports all OPC standards (DA 1–3, XML-DA, UA) and a REST API web service, which allows users to exchange data live. eve can therefore handle various data sources and store the data in one centralized database for use in data, information, and knowledge management.
The new eve bioprocess platform software paves the way for a Big Data way of evaluating, storing, and sharing bioprocess information and knowledge between devices, platforms, and users, maintains Dr. Allman.