“Through multiomics—combined genomics, transcriptomics, proteomics, digital pathology, and other technologies yet to fully unfold—we can now obtain a complete dynamic vision of cancer,” argue the authors of a recent review article (Marshall et al. The Essentials of Multiomics. Oncologist 2022; 27(4): 272–284). They expect that in the near future, the multiomics approach will become routine in tissue testing, where it will generate enormous amounts of data for individual patients. They also anticipate that when this data is analyzed with the help of artificial intelligence, it will “result in improved efficiency and outcomes in our treatment of cancer and other serious illnesses.”
Once just a tantalizing possibility, multiomics is now becoming a reality for research and clinical medicine. However, it still isn’t completely “there.” Some omics technologies lag behind others, and most face the challenge of integrating the massive amounts of data they generate. To get a sense of where multiomics stands, we spoke with several leaders in the field. They agree that multiomics is already enriching our understanding of health and disease, and that it is just beginning to help us advance drug discovery and precision medicine.
Dynamic biomarker discovery
While much of biomarker discovery in the last two decades has focused on genomics, more than 80% of disease risk comes from nongenetic factors, according to Mohit Jain, MD, PhD, founder and CEO, Sapient Bioanalytics. “It’s been said that our zip code is more important than our genetic code,” he says. “A person’s genome is largely static from birth, but disease processes can happen on an ever-changing, dynamic continuum, particularly for chronic illnesses such as heart disease, cancer, neurodegeneration, and other conditions with complex etiologies.”
Jain highlights that multiomic approaches have the power to identify dynamic biomarkers that read out critical factors modulating health, disease, and drug response over time. However, challenges are still being addressed.
“Up until about two years ago, the real challenge was measuring these types of biomarkers with breadth, depth, and scale,” he points out. “This technical challenge is similar to that experienced by genomics when it was progressing from low-throughput Sanger sequencing to massively parallel next-generation sequencing. In the postgenomic space, we need multiomic technologies (such as metabolomics, lipidomics, and proteomics) to robustly analyze thousands of samples at a time in order to discover many, many new biomarkers.”
The company’s scientists set out to do just that by focusing on mass spectrometry. They reengineered both hardware and software to greatly accelerate the speed and scalability of mass spectrometry analyses. “We now can measure thousands of metabolites, lipids, and proteins in a single human biosample very quickly and robustly,” Jain reports, “and we can scale these approaches to tens and hundreds of thousands of biosamples as part of the discovery process.”
In parallel, Jain’s team spent years developing software to extract meaningful information from the massive amount of data generated. They also amassed their own human biology database. “This database,” Jain points out, “was generated from over a hundred thousand human biosamples collected globally, so it allows us to biologically validate biomarkers in large, independent populations. One of the big learnings over the last 20 years is that there is a lot of technology that can generate data. In many ways, data has grown exponentially while knowledge has increased linearly. Our goal is to close that gap between data and knowledge.”
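To make this kind of discovery work concrete, here is a minimal sketch of a univariate biomarker scan across thousands of measured metabolites, assuming a simple samples-by-features matrix and a binary outcome. The data, statistical test, and thresholds are purely illustrative and do not represent Sapient’s pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Illustrative only: a random samples-by-metabolites matrix and a binary outcome.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 2000
X = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=[f"metabolite_{i}" for i in range(n_features)],
)
outcome = rng.integers(0, 2, size=n_samples)  # e.g., case vs. control

# Univariate scan: test each metabolite for a case/control difference.
pvals = np.array([
    stats.mannwhitneyu(X.loc[outcome == 1, col], X.loc[outcome == 0, col]).pvalue
    for col in X.columns
])

# Control the false discovery rate across thousands of simultaneous tests.
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} candidate biomarkers at FDR < 0.05")
```

In practice, the interesting part begins after a scan like this: candidate markers must be biologically validated in large, independent populations, which is where a reference database of the kind Jain describes comes in.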
Jain says that Sapient is less a traditional contract research organization than a collaborative partner. “We provide the full spectrum of services from early target discovery through to clinical trials,” he says. “We believe that the world is going to be changed by multiomics data, but we recognize that not all data is created equal. We need to produce high-quality data, and then we have to be able to analyze that data to answer biological questions and generate insights. With our services, we can accelerate drug development and deployment for our clients.”
Next-generation protein sequencing
Despite remarkable gains made in omics in the past decade, proteomics has lagged behind, and yet proteins hold keys to our understanding of biological function, human health, and disease. “The complexity and diversity of proteins have made comprehensive protein analysis of the impact of individual amino acids on biological function unattainable until now,” notes Jeff Hawkins, president and CEO, Quantum-Si. “Our company’s technical founders wanted to unlock the true potential of proteomics by developing a next-generation benchtop protein sequencer that directly sequences and digitally quantifies proteins.”
Launched in 2022, the company’s Platinum benchtop instrument characterizes proteins by differentiating individual amino acids via distinct binding characteristics captured by an integrated semiconductor device. Hawkins relates, “The core applications are protein identification and protein variant identification as well as characterization of post-translational modifications.”
How does it work? An isolated protein is first digested into peptides that are distributed over millions of wells on the chip. Peptides attach to the wells, leaving their N-terminal amino acids exposed for binding by fluorescently labeled “recognizer” molecules that generate a signal specific to the N-terminal amino acid. After many such binding events, a “cutter” (an aminopeptidase) removes the N-terminal amino acid, exposing the next one in the sequence for binding by its own recognizer. Simple-to-use software automatically analyzes the data gathered across the millions of wells, enabling detection and characterization of the protein.
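As a rough mental model of this cycle, consider the toy simulation below of repeated recognize-and-cleave rounds. The recognizer set and peptide are hypothetical, and the sketch ignores the binding kinetics that the real instrument uses to distinguish residues.

```python
# Toy simulation of the recognize-and-cleave cycle: each round, a labeled
# recognizer reports the exposed N-terminal residue, then an aminopeptidase
# "cutter" removes it to expose the next one. Hypothetical recognizer set
# and peptide; not Quantum-Si's actual chemistry.

RECOGNIZED = {"L", "I", "V", "F", "Y", "W", "R", "A"}  # residues with recognizers (illustrative)

def sequence_peptide(peptide: str) -> list[str]:
    """Return the series of N-terminal calls made over successive cycles."""
    calls = []
    remaining = peptide
    while remaining:
        n_term = remaining[0]
        # A binding event generates a residue-specific signal only if a recognizer exists.
        calls.append(n_term if n_term in RECOGNIZED else "?")
        remaining = remaining[1:]  # the cutter removes the N-terminal amino acid
    return calls

print(sequence_peptide("LIVFAGK"))  # ['L', 'I', 'V', 'F', 'A', '?', '?']
```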
According to Hawkins, the technology has reaped some unanticipated benefits. “We have been surprised by how complementary our technology is to others, such as mass spectrometry,” he points out. “Instead of viewing it as competitive, we found the opposite.” Additionally, new applications have quickly emerged. For example, the instrument is being used with protein barcodes to screen protein libraries with different characteristics.
“Barcoding allows clients to solve problems they couldn’t solve before,” Hawkins notes. “Additionally, translational laboratories can more easily examine two populations, such as responders versus nonresponders. Others are looking at protein signatures associated with a given response.
“For multiomics to really take off, you are going to need to have all of the developing technologies—genomic DNA sequencing, proteome profiling, etc.—in use at the same institutes. If they are not, it will take a much longer time until multiomics can solve clinical problems.”
Optical genome mapping
Structural variants (SVs), defined as genomic rearrangements ranging from 50 bp to many thousands of base pairs, dot the entire human genome, suggesting their importance not only for genetic diversity but also for human diseases. Additionally, copy number variants (CNVs), in which segments of the genome are duplicated or deleted, are also emerging as important players in diversity and disease.
Although SV and CNV analyses can contribute to multiomics insights, challenges remain. For example, according to Amanda Dossat, PhD, market development manager, Bionano Genomics, “Data integration from different omics platforms remains a challenge due to inherent variations in data formats, normalization methods, and scaling requirements.”
Bionano employs optical genome mapping (OGM) and software solutions for genome-wide SV and CNV analyses. The company’s OGM data is built in a reference-free format that provides complementary information to orthogonal methods such as next-generation sequencing. Additionally, Bionano’s VIA software allows for integration and direct visualization of next-generation sequencing, chromosomal microarray, and OGM data for a single sample.
OGM employs fluorescently labeled DNA molecules to create a high-resolution image of the genome. “This allows users to see large SVs that can’t be detected by other methods, such as karyotyping or fluorescence in situ hybridization,” Dossat explains. “Additionally, when OGM is paired with our powerful analysis software (Access and VIA), users can apply bioinformatic pipelines to ultimately visualize the entire genome in a point-and-click and automated fashion. This view provides unparalleled examination of multiomics data without the need for users to be savvy in bioinformatics or programming.”
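The underlying idea can be illustrated with a toy comparison of label-to-label distances on a sample molecule versus a reference map: a large discrepancy in an interval suggests a structural variant. The coordinates and threshold below are invented for illustration and bear no relation to Bionano’s actual Access/VIA pipelines.

```python
import numpy as np

# Positions (in bp) of fluorescent label sites along a reference map and a
# sample molecule. The sample carries an extra ~50 kb between labels 2 and 3.
reference_labels = np.array([0, 15_000, 42_000, 80_000, 121_000])
sample_labels = np.array([0, 15_100, 41_800, 130_200, 171_500])

# Compare the distances between consecutive labels.
delta = np.diff(sample_labels) - np.diff(reference_labels)

for i, d in enumerate(delta):
    if abs(d) > 10_000:  # flag intervals that differ by more than 10 kb
        kind = "insertion" if d > 0 else "deletion"
        print(f"Putative {kind} of ~{abs(d):,} bp between labels {i} and {i + 1}")
```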
Integrating datasets
“It has been estimated that 50–80% of the time spent working with multiomics data is devoted to cleaning and preprocessing, as you’re working with datasets of different types and determining the best method for integration,” says Simon Adar, PhD, co-founder and CEO, Code Ocean. Another vexing issue is that data can originate from different sources.
“Depending on your use case, some data from multiomic layers may be proprietary, and some may be publicly available,” Adar continues. “All of these challenges that are well known in the multiomics space leave us to answer the same question: What’s the best way to create a dataset that is processed in a way that’s reliable and easy for others to work with?”
To help solve these issues, the company has built a large-scale multiomics computational framework. “Our software helps improve the overall experience of working with multiomics data,” Adar asserts. “The first thing is that it’s possible to make any dataset immutable. Once you’ve prepped your dataset, you can be confident that it won’t change over time and that your collaborators will work with the exact same processed copy of the data you did.
“The second is that it’s much easier to ensure that you’re using the correct data at any given moment because of our custom metadata features. It’s quick to search by sample, version, data source, etc.
“The third and final thing is that we have automated the entire process of data provenance. The Lineage Graph gives a crystal-clear picture of where all your result data came from, and how you processed it over time.”
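Generic versions of two of these ideas, dataset immutability via content hashing and provenance via a lineage record, might look like the sketch below. It uses only the Python standard library and is not Code Ocean’s API; the file names and fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(root: Path) -> str:
    """Hash every file under `root` so that any later change is detectable."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Minimal lineage record: which code and which dataset version produced a result.
lineage = {
    "result": "differential_expression.csv",
    "produced_by": {"code": "run_analysis.py", "version": "v1.2.0"},
    "inputs": [{"dataset": "rnaseq_counts", "fingerprint": "<hash recorded at prep time>"}],
}
print(json.dumps(lineage, indent=2))
```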
The Code Ocean platform installs directly into a client’s Amazon cloud environment and automates all of the cloud management tasks. “This is usually out of reach to those without extensive cloud experience,” Adar remarks. “For the scientist, our platform automatically preserves the code, data, and computing environments used for any computational analysis. This means that you can reproduce all your bioinformatics workflows 10 days, 10 months, or 10 years from now with complete confidence. This, as well as features like the Lineage Graph, helps R&D teams prove their results to third parties (such as the FDA or persons or groups contributing to quality assurance or due diligence) later down the line.”
Adaptive database
Stavros Papadopoulos, PhD, founder and CEO, TileDB, agrees that dealing with vast quantities of complex data and multiple data platforms has become a big problem in multiomics. “Bioinformaticians, computational biologists, and data scientists in the field have to wrangle and analyze the data at extreme scale to get insights from it,” he says. “This work is hard and expensive, and it takes a lot of time. As a result, areas such as preclinical target discovery and precision diagnostics lose time and money, leading to lower productivity and missed chances for success.”
Papadopoulos states that TileDB is an adaptive database management system that helps users efficiently store, manage, and analyze large-scale, complex data. “It unifies diverse data types, such as genomics, transcriptomics, imaging, and clinical tabular data, by modeling them as multidimensional arrays,” he explains. “It is a powerful data model that optimizes performance and scalability.”
The company builds two classes of software: 1) open-source tools (elements of a broad ecosystem around the TileDB core multidimensional array engine), and 2) a closed-source commercial database system (the TileDB Cloud).
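As a small example of the array-centric model at the heart of the open-source ecosystem, the following sketch uses the tiledb-py package to define, write, and slice a tiny dense array. The array name, shape, and values are illustrative only; real multiomics deployments store far richer, domain-specific arrays.

```python
import numpy as np
import tiledb

uri = "example_dense_array"  # illustrative local array name

# Define a 4x4 dense array with a single int32 attribute "a".
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="a", dtype=np.int32)],
)
tiledb.Array.create(uri, schema)

# Write the full array, then read back a rectangular slice.
with tiledb.open(uri, mode="w") as A:
    A[:] = np.arange(16, dtype=np.int32).reshape(4, 4)

with tiledb.open(uri, mode="r") as A:
    print(A[1:3, 1:4]["a"])  # rows 1-2, columns 1-3 of attribute "a"
```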
Papadopoulos indicates there are two major advantages to the TileDB system: “On the one hand, modeling and analyzing data as multidimensional arrays leads to unprecedented performance, even in the most complex settings, which in turn dramatically reduces cost and time to insight. On the other hand, bringing different data modalities (such as in multiomics) into a single system eliminates data silos, facilitates collaboration across teams, and leads to richer insights.”
Examples of how TileDB has solved problems in the real world include the case of Rady Children’s Hospital. “We helped them shorten the diagnostic odyssey in newborn screening by storing and analyzing vast quantities of genomic variant data,” Papadopoulos reports. “We are also working very closely with organizations like the Chan Zuckerberg Initiative in pioneering ways to efficiently manage very large single-cell transcriptomics data.”
According to Papadopoulos, the continued generation of traditional omics data, coupled with emerging technologies such as single-cell omics and other high-throughput sequencing techniques that generate vast amounts of data at an unprecedented rate, will only add to the complexity of the data analysis problem. He declares, “The demand for larger and more intricate omic datasets will continue to grow, as will the need for a new kind of database system, such as an adaptive database like TileDB.”