The GEN Special Section on Big Data consists of four articles:
Precision Medicine Research in the Million-Genome Era
Utilizing Machine-Learning Capabilities
NGS Big Data Issues for Biomanufacturing
Visualization for Advanced Big Data Analysis
Delivering the right therapy—specifically tailored for a patient at the right time—requires that we understand how individuals differ in their disease course and response to treatment.
Fortunately, we can benefit from the current flood of genomic, transcriptomic, proteomic, and epigenomic data, which offers the potential to truly understand the mechanisms of individual response at the molecular level, and turn these discoveries into therapies that can be precisely delivered to those patients who will benefit from them.
We are in the million-genome era. Over the past few years, big genomics projects have become the norm. Initiatives like the U.K. government’s 100,000 Genomes Project, the U.S. government’s Million Veteran Program, and the AstraZeneca-led effort (which will enroll two million people over the next decade) are collecting an unprecedented amount of genomic data.
We already see the benefits of the previous generation of genomics projects. Efforts such as the 1000 Genomes Project and The Cancer Genome Atlas (TCGA; containing cancer genomic and clinical data from more than 11,000 patients) have advanced our understanding of the genomics of different populations and the causes of cancer, informing target identification, biomarker discovery, and the development of personalized therapeutics.
Why are these projects getting bigger? Research in precision medicine depends heavily on the number of samples. In many cases, the more individuals from whom we sequence and collect clinical data, the more opportunity there is to uncover the genetic variants and other molecular alterations linked to a host of diseases and traits.
One example is the discovery of genes associated with cancer. Efforts such as TCGA have led to the discovery of many such genes, along with the recognition that cancer is in fact over 200 different diseases, each characterized by specific molecular alterations. The power to detect significantly mutated genes varies by cancer type, but in many cases thousands of samples are needed to detect all the genes involved.
Increasing sample sizes for precision medicine research is especially relevant to minority populations, who are currently underrepresented in most research cohorts. For example, there are more than 30,000 African Americans in the Million Veteran Program cohort right now, and by 2020 there will be more than 130,000. This unprecedented sample size will provide a real chance to do precision medicine research to develop treatment strategies tailored to this population.
Finally, increasing cohort size tackles the problem that not all genetic variants associated with disease are equally common, or equally easy to detect. As the number of sequenced genomes grows, we find that rare variants make important contributions to many diseases. Finding these variants requires large sample sizes, as well as new methods designed to analyze these large samples.
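To make the link between cohort size and rare variants concrete, here is a minimal sketch of the chance of even observing a rare allele in a cohort. It assumes a simple binomial model (independent chromosomes, no sequencing error), and the allele frequency of 1 in 10,000 is an illustrative value, not a figure from any of the projects above. Detecting a statistical association with disease generally needs far more carriers than merely observing the allele.

```python
from math import comb


def prob_at_least(n_individuals, allele_freq, min_copies=1):
    """Probability of seeing at least `min_copies` copies of an allele
    among the 2 * n_individuals chromosomes of a diploid cohort.

    Assumption: each chromosome carries the allele independently with
    probability `allele_freq` (a plain binomial model).
    """
    n_chrom = 2 * n_individuals
    p_fewer = sum(
        comb(n_chrom, i) * allele_freq**i * (1 - allele_freq) ** (n_chrom - i)
        for i in range(min_copies)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - p_fewer)


if __name__ == "__main__":
    freq = 1e-4  # illustrative rare-allele frequency (assumption)
    for n in (1_000, 10_000, 100_000, 1_000_000):
        seen_once = prob_at_least(n, freq, min_copies=1)
        seen_ten = prob_at_least(n, freq, min_copies=10)
        print(f"{n:>9,} people: P(>=1 copy)={seen_once:.3f}  P(>=10 copies)={seen_ten:.3f}")
```

Under these assumptions, a cohort of a thousand people will often miss the allele entirely, while a cohort of a hundred thousand will almost always contain enough copies to study.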
Significant Data Challenges
It’s an exciting time for genomics research, but the scale and complexity of these large population samples introduce new data challenges that require innovative and efficient approaches to analysis. By 2025, the annual acquisition of genomic data is anticipated to exceed 2 exabytes (2 million terabytes), and storing, accessing, and analyzing these data will be nontrivial.
Based on current infrastructure, most research organizations will struggle to store and manage these data, let alone optimally analyze them.
These big genomics projects also gather a spectrum of other data that provides valuable information on causal mechanisms and biomarkers of health and disease. TCGA, for example, comprises not just cancer genomes, but also other data types, including RNA sequencing, proteomic, imaging, and clinical data. The current generation of million-genome studies will not just contain more genomes, but also many more dimensions of data to be stored and analyzed.
Seven Bridges works with the world’s largest genomics projects, including the U.S. National Cancer Institute’s (NCI) Cancer Genomics Cloud pilots, Genomics England’s 100,000 Genomes Project, the Cancer Moonshot Blood Profiling Atlas, and the Million Veteran Program, where we see first-hand the challenges of working with the massive data that these efforts produce.
As the size of accumulated genomic data grows, straightforward data-management tasks become increasingly time- and resource-intensive for research professionals. Data sharing, for example, scales poorly. While it is easy to share a text file listing genes of interest by email, sharing the raw data from a whole sequenced genome requires mailing hard drives, and collaborating on these data in a dynamic manner is nigh on impossible.
The solution is to employ portable analysis workflows that travel to the data, an approach used in our work with the Million Veteran Program, which takes place across a network of Veterans Affairs (VA) sites, each with its own data repositories. A researcher within a VA research site can write a description of an analysis she wants to do (using an open specification called the Common Workflow Language) and submit it to another VA research site, sending only kilobytes of data. The analysis is done using local resources, without transferring any 200 GB files. In this way, VA researchers can rapidly analyze data across the network.
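The size difference between a workflow description and the data it operates on can be sketched as follows. This is an illustration only: the tool description follows the general shape of a Common Workflow Language CommandLineTool document (which may be written as JSON as well as YAML), but the choice of `samtools flagstat`, the input and output names, and the file name are assumptions made for the example, not the Million Veteran Program's actual workflows.

```python
import json

# Hypothetical analysis step: summarize one alignment file with
# `samtools flagstat`. Names such as "alignments" and "summary" are
# illustrative placeholders.
tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "flagstat"],
    "inputs": {
        "alignments": {
            "type": "File",
            "inputBinding": {"position": 1},
        }
    },
    "stdout": "flagstat.txt",
    "outputs": {
        "summary": {"type": "stdout"}
    },
}

# This serialized text is what travels between sites.
description = json.dumps(tool, indent=2)
description_bytes = len(description.encode("utf-8"))

# The 200 GB file size mentioned in the text, which never leaves its site.
data_bytes = 200 * 10**9

print(f"workflow description: {description_bytes} bytes")
print(f"the data file is about {data_bytes // description_bytes:,} times larger")
```

Real workflows chain many such steps and are larger than this single tool, but they remain in the kilobyte range while the data stay where they were generated.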
A cost-effective option for many research organizations is to centralize storage, with leading biopharmaceutical companies increasingly turning to the cloud to store and analyze data. Cloud providers offer storage and computation infrastructure, which biomedical software and service providers, like Seven Bridges, build on to create streamlined genomics analysis systems to help research organizations effectively use these data.
By using these resources, companies can store multidimensional omics data centrally, where they can be accessed and used by staff around the world. There is no unnecessary data duplication at local sites or infrastructural barrier to data sharing. A local version of TCGA, for example, would cost around $2 million just in storage costs, whereas a cloud-based version is freely available to researchers through the NCI Cancer Genomics Cloud pilots.
Centralizing storage also lets an organization “rescue” datasets—data collected for one project can be easily discovered and used by other projects if stored centrally in the cloud.
Making Data More Useful
Many research organizations fall into the trap of thinking that vast amounts of data automatically produce insight and returns. Large datasets are a prerequisite for efficient precision medicine research, but they are not sufficient: researchers also need to be able to use the data.
De-siloing data is identified as a key component to drive success in the U.S. government’s Cancer Moonshot and the White House Precision Medicine Initiative. De-siloing also benefits biopharmaceutical companies, which hold many data assets that are not fully exploited and are often used just once before being archived. An opportunity exists for the organizations that integrate these data into their ongoing pharmaceutical research and development.
Data have great value when they are discoverable, and even greater value when they can be analyzed in the context of other data. As a simple example, researchers can use a whole genome sequence from a patient with a rare disease to find a list of potentially causal variants for further investigation. That list can be narrowed much more effectively by comparing it against genomic data from 100,000 people without the disease: a variant that is common among unaffected people is unlikely to cause a rare disease. Bringing in the additional 100,000 samples can be nontrivial, both because of the size of the data involved and because of different methods of data collection among studies.
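The filtering step in this example can be sketched in a few lines. This is a minimal illustration, assuming variants are identified by (chromosome, position, reference, alternate) tuples and that a per-variant carrier count is already available for the unaffected cohort. The variants, counts, and the 0.1% frequency cutoff are made up for the example.

```python
def filter_candidates(patient_variants, control_counts, n_controls, max_freq=0.001):
    """Keep patient variants that are rare or absent in an unaffected cohort.

    control_counts maps a variant to the number of unaffected individuals
    carrying it; variants missing from the mapping count as zero carriers.
    """
    kept = []
    for variant in patient_variants:
        carrier_freq = control_counts.get(variant, 0) / n_controls
        if carrier_freq <= max_freq:
            kept.append(variant)
    return kept


# Made-up example data (not real variants).
n_controls = 100_000
control_counts = {
    ("chr1", 12345, "A", "G"): 42_000,  # common in unaffected people
    ("chr2", 67890, "C", "T"): 3,       # very rare
}
patient_variants = [
    ("chr1", 12345, "A", "G"),
    ("chr2", 67890, "C", "T"),
    ("chr7", 5555, "G", "A"),           # never seen in the control cohort
]

candidates = filter_candidates(patient_variants, control_counts, n_controls)
print(candidates)
```

Here the common variant is discarded and the two rare ones remain as candidates. The larger the unaffected cohort, the more confidently a variant can be called rare.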
A major challenge for large-scale precision medicine research is harmonizing data from different sources. This can be overcome by standardizing nomenclature and developing sophisticated metadata descriptions that enable data integration, and by enabling portable, reproducible reanalysis of datasets.
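As a small illustration of what standardizing nomenclature involves, the sketch below maps differently labeled metadata fields and values from two hypothetical studies onto one shared vocabulary. The field names, aliases, and sample records are assumptions made for the example. Production systems typically rely on curated ontologies and much richer metadata schemas.

```python
# Illustrative alias tables (assumptions, not a published standard).
FIELD_ALIASES = {
    "sex": "sex",
    "gender": "sex",
    "dx": "diagnosis",
    "diagnosis": "diagnosis",
    "primary_site": "tissue",
    "tissue": "tissue",
}
VALUE_ALIASES = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
}


def harmonize(record):
    """Rename known fields to the shared vocabulary and normalize values.

    Unknown fields are dropped here for brevity; a real pipeline would
    more likely log them for review.
    """
    out = {}
    for key, value in record.items():
        field = FIELD_ALIASES.get(key.strip().lower())
        if field is None:
            continue
        text = str(value).strip().lower()
        out[field] = VALUE_ALIASES.get(field, {}).get(text, text)
    return out


# Two made-up records describing similar samples in different ways.
study_a = {"Gender": "F", "Dx": "Melanoma", "Primary_Site": "Skin"}
study_b = {"sex": "female", "diagnosis": "melanoma ", "tissue": "skin"}

print(harmonize(study_a) == harmonize(study_b))
```

Once both records use the same field names and values, samples from the two studies can be queried and analyzed together.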
Smart, Scalable, Enterprise-Ready Algorithms
Maximizing returns from analysis of millions of genomes requires optimization of analytic tools designed for work with smaller datasets, and in some cases a fundamental rethinking of the approach. Delivering the most accurate and most cost-effective research in drug discovery for precision medicine requires tools specifically designed for analysis of millions of genomes.
Graph-based genome analysis tools are one area where new algorithms promise to improve both speed and accuracy in precision medicine research across millions of genomes. They improve on the linear data formats of traditional genomics tools, which can’t scale to the number of samples we need to analyze simultaneously.
Moving to graph-based genome representations advances genetic analysis in two key ways. First, it helps create an ever-more accurate view of both an individual’s genetic makeup and that of the population as a whole. Second, it is a more efficient method to store and analyze vast quantities of genetic data.
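The storage argument can be illustrated with a toy example. The sketch below is a deliberately tiny graph, not the data structure used by any particular tool: nodes hold sequence segments, edges say which segment may follow which, and a single-base variant appears as two alternative nodes. The sequences are made up.

```python
# Toy genome graph (illustrative only): a shared start, one SNP site
# with two alleles, and a shared end.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def haplotypes(node, prefix=""):
    """Spell out every sequence readable along a path from `node` to the end."""
    seq = prefix + nodes[node]
    if not edges[node]:
        return [seq]
    result = []
    for nxt in edges[node]:
        result.extend(haplotypes(nxt, seq))
    return result


all_seqs = haplotypes(1)
graph_bases = sum(len(s) for s in nodes.values())  # shared segments stored once
linear_bases = sum(len(s) for s in all_seqs)       # each sequence stored in full

print(all_seqs)
print(f"graph stores {graph_bases} bases, separate linear copies store {linear_bases}")
```

In this toy case the graph stores 10 bases where two separate linear sequences store 18, because the shared flanking sequence is kept only once. The same graph also represents both alleles at the variant site, which is the sense in which it describes a population rather than a single reference.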
The volume of NGS data gathered by the massive genomics projects in progress worldwide is informing the development of new data management and analysis methods. These new developments are transferable beyond these projects, including to pharmaceutical R&D, as the world’s leading biopharmaceutical companies increasingly turn to Big Genomics to identify therapeutic targets and stratify clinical trials.
By combining purpose-built tools for large-scale analysis of multi-omic datasets with well-annotated data and a well-designed infrastructure layer for storing, accessing, and computing on rich datasets, research organizations can unblock their data-driven projects and maximize the returns from these programs.