If you are in any of the fields under the “omics” umbrella, you probably feel there’s a constant downpour of data-related buzzwords: machine learning, artificial intelligence, and big data. Don’t get me wrong. As a computational biologist who cut his teeth during the era of the Human Genome Project, I’m fascinated by technologies that can make sense of omics data. These technologies can’t be improved quickly enough, given that data is accumulating at an ever-rising rate, thanks to high-throughput sequencing, which also heightens data complexity. But I also recognize that when technical terms devolve into buzzwords, it may be a sign that the pendulum has swung from one unfortunate extreme to another. Not long ago, the computational terms seemed deep and full of promise. Now they seem stereotyped and stale.
Let’s not be overawed or disenchanted by buzzwords. What matters are practical solutions, not the names they are given. To get a feel for the computational solutions that can derive meaning from the increasingly vast stores of omics data, let’s briefly review the evolution of omics and its analytical implications.
Beginning of genome sequencing: The first generation of sequencing platforms and analysis software allowed us to progress from focused genetic studies to global genomic studies. The technology that was available enabled the Human Genome Project. Not surprisingly, when this technology became widely deployed, data became sizable, global, and challenging. No longer could data be adequately analyzed with Excel spreadsheets.
Transition to high-throughput sequencing: After the Human Genome Project was completed, the release of the “next generation” of sequencing instrumentation initiated a sequencing “arms race.” Multiple companies positioned themselves to rapidly develop new technologies. This was when the brakes on genome sequencing were really removed. One run on the high-throughput machines, even the early models, produced 10,000 times more data than a run on the machines used to sequence the first human genome.
I remember casually remarking that a genome sequencing operation no longer needed 300 people; it needed three people, a sequencer, and a garage. A few short months later, Cofactor Genomics was launched in downtown St. Louis. And, yes, it had three people and a sequencer. Instead of sitting in a garage, however, it occupied a converted loft.
Maturation of high-throughput sequencing: As high-throughput technology advanced, sequencing projects generated massive quantities of data. In fact, some of the largest datasets ever seen in the biological sciences were collected.
Initially, the datasets helped us tackle previously difficult problems such as detecting low-frequency variants in DNA. Eventually, we built new applications that hadn’t existed and weren’t even considered earlier, applications such as cell-free DNA sequencing and single-cell transcriptomics.
Much like the “supernova” phenomenon described in Thomas Friedman’s book, Thank You for Being Late, the sudden acceleration of sequencing data can be as confounding as it is illuminating. In many ways, sequencing’s data explosion recapitulated the internet’s data explosion. Pretty soon there was so much data that it became difficult to discern the quality amidst all the quantity.
Sequencing’s convergence with advanced computing: Like any supernova worthy of the name, the sequencing supernova is accompanied by phase transitions. One transition concerns the sorts of questions we can answer, or even imagine asking, if we use sequencing data. Another transition is about progressing from one computational era to another.
If we are to realize the potential of our massive collections of sequencing data, we may need to let go of preconceived notions of how data can be utilized. In the omics field, we may need to take part in a transition that is occurring in computation generally. At the highest level, computation has already passed from the tabulating era (1900–1950), to the programming era (1950–2011), to the cognitive era (2011–present).
A shift to cognitive computing is occurring in genomic-data-driven biotechnology. This shift accounts for the buzzword-generating technologies that have taken center stage while promising to help us make sense of big data and answer the most important questions in human health.
Utility of gene expression models
When used effectively, machine learning and artificial intelligence are powerful approaches that make sense of big data. Take, for example, transcriptomics and RNA. RNA is a dynamic molecule that is constantly changing as a result of external stimuli, infection, and disease. At Cofactor Genomics, we like to say, “RNA molecules are the data packets that constantly stream through our bodily network.”
Transcriptomic data is several orders of magnitude more complex than DNA data. Because transcriptomic data is so complex, RNA has proven useful as a biomarker for disease or disease subtyping, provided data analysis is undertaken using a cognitive approach. This approach ultimately results in classifiers or models that can be applied to future individual samples, and it provides much-needed context and classification for a clinical application.
RNA models that reflect the cognitive approach are of intermediate extent and complexity. They sit somewhere between unstructured big data (for example, sequencing data repositories) and the data that defines single-analyte biomarkers (for example, data sets highlighting gene fusions). RNA models represent the Goldilocks principle applied to the field of genomics-driven precision medicine.
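To make the idea of a model "applied to future individual samples" concrete, here is a minimal Python sketch of one very simple kind of multigene classifier, a nearest-centroid model. It is only an illustration under stated assumptions, not a description of any particular company's method: the four "genes," the two subtype labels, the expression values, and the function names are all hypothetical, and the training cohort is simulated rather than real.

```python
import random
import statistics


def train_centroids(samples, labels):
    """Average each gene's expression within each class (one centroid per label)."""
    by_label = {}
    for profile, label in zip(samples, labels):
        by_label.setdefault(label, []).append(profile)
    return {
        label: [statistics.fmean(gene_values) for gene_values in zip(*profiles)]
        for label, profiles in by_label.items()
    }


def classify(profile, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(profile, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))


def simulate_cohort(n_per_class, seed=0):
    """Build a made-up training cohort; real data would come from sequencing."""
    rng = random.Random(seed)
    # Hypothetical mean expression (log scale) of four invented genes per subtype.
    class_means = {
        "subtype_A": [8.0, 2.0, 5.0, 5.0],
        "subtype_B": [3.0, 7.0, 5.0, 5.0],
    }
    samples, labels = [], []
    for label, means in class_means.items():
        for _ in range(n_per_class):
            samples.append([rng.gauss(m, 1.0) for m in means])
            labels.append(label)
    return samples, labels


# Train once on the cohort, then apply the model to a single new sample.
samples, labels = simulate_cohort(n_per_class=20)
centroids = train_centroids(samples, labels)
new_sample = [7.5, 2.5, 5.1, 4.8]
print(classify(new_sample, centroids))
```

The point of the sketch is the shape of the workflow: the model summarizes a cohort across several genes at once, and the summary, not the raw repository, is what gets applied to the next individual sample. A clinical-grade model would involve far more genes, normalization, and validation than this toy example suggests.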
Future of precision medicine
When we speak about building biomarkers, we are often referring to the development of a diagnostic. Fortunately, I see signs of a move into this cognitive era of data and diagnostics throughout oncology and precision medicine. For example, in the conference agendas on my desk, the 2019 session titles cite “models,” “classifiers,” and “multianalyte biomarkers” with clear clinical uses. Just a few years ago, session titles cited “big data,” “databases,” “mining,” and “analysis,” but it was unclear how these terms were making an impact.
Maturation of the diagnostics field was fully evident at the Clinical Biomarkers and World CDx Summit in October 2019. At this event, industry leaders—academic researchers, physicians, and representatives of contract research organizations, biotechnology companies, and regulatory agencies—participated in a roundtable session entitled “Multidimensional Biomarkers and Machine Learning–Based Approaches for Precision Medicine.” Everyone on this panel agreed that multianalyte approaches, specifically, multidimensional biomarkers or models, are required to help developers organize clinical trials, increase response rates, and reduce treatment costs. Personally, I can’t wait to see what the cognitive era of data and diagnostic development will mean for the future of genomics, RNA, and precision medicine.