February 15, 2016 (Vol. 36, No. 4)
Making Big Data Work in the Life Sciences Isn’t So Much a Matter of “Upping One’s Game” As It Is a Matter of Raising One’s Expectations
Now more than ever, researchers in the life sciences are generating, accumulating, and storing data. In fact, data troves are reaching exabyte levels, prompting a number of questions. What do we do with all this data? How should we analyze it? How the heck do we make sense of it?
Answering such questions is important not only for furthering research efforts, but also for developing diagnostic and therapeutic products—and for expeditiously moving these products along the precision medicine pathway. Accordingly, GEN presented its biggest big data questions to a panel of experts. In response, the experts provided a set of compelling strategies for handling and analyzing big data more efficiently and effectively.
GEN: It is widely acknowledged that pharmaceutical and life sciences researchers both in academia and industry are being inundated and often overwhelmed with the amounts of data they generate themselves and with informational input from other sources. What do you see as the main obstacles to scientists implementing concrete strategies and plans to successfully handle big data and its analysis?
Dr. Asimenos: Data hoarding is one of the biggest obstacles in realizing the true potential of big data. While stockpiling the results of next-generation sequencing (NGS) may make sense from a business perspective, this practice does a great disservice to the research community at large, where scientific breakthroughs are built upon the research of others via collaboration.
The key to unlocking big data’s potential is making NGS results accessible through a model such as the cloud commons—that is, a central repository for data sets where members of the research community can access and share data, and collaborate on processes and results.
As datasets become larger, they become more difficult to share with collaborators. By having the data in one location, researchers at institutions worldwide can have immediate access and use of large-scale data sets.
Dr. Stadnisky: There is one main obstacle to implementing a big data strategy: time. Successful teams focus on this key limiting reagent. They ask how value can be extracted where time is spent, and how value can be preserved where it is saved.
The first shift occurs before a pipette is picked up. It involves setting a strategy for annotation and controlled vocabulary curation. I interact with an increasing number of scientists who want to ask questions over many years, modalities, and approaches. Supporting investigations of such scope requires an acknowledgement that metadata and annotation are just as important as the data themselves.
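A controlled-vocabulary strategy can be as simple as validating every sample's metadata against an agreed term list before data enter the archive. The sketch below illustrates the idea in Python; the field names and terms are invented for illustration and do not come from any specific standard.

```python
# Hypothetical sketch: validating sample metadata against a controlled
# vocabulary agreed on before any data are collected. Fields and terms
# are illustrative only.
CONTROLLED_VOCABULARY = {
    "tissue": {"blood", "liver", "tumor"},
    "assay": {"rna-seq", "wgs", "flow-cytometry"},
}

def validate_annotation(metadata: dict) -> list:
    """Return a list of problems; an empty list means the annotation conforms."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = metadata.get(field)
        if value is None:
            problems.append(f"missing required field: {field}")
        elif value.lower() not in allowed:
            problems.append(f"{field}={value!r} is not a controlled term")
    return problems

print(validate_annotation({"tissue": "Blood", "assay": "wgs"}))  # []
print(validate_annotation({"tissue": "brain"}))  # two problems reported
```

Running the check at submission time, rather than at analysis time years later, is what makes questions across modalities and time spans answerable.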
The second shift is in designing and implementing an integrated architecture across measurement modalities, finding the time to explore new tools in limited trials.
Mr. Rudy: In short: a lack of established community standards and of commercial-grade solutions that streamline analysis based on those standards. Bioinformatics algorithms, reference datasets, and “best practice” tools are fast-changing and often driven by researchers. We are just getting around to having the community discussions, such as those in the GA4GH working groups, on how to represent things like variants and their annotations at a fundamental data level.
The process of aligning and calling variants is embarrassingly parallel and bundled solutions from Illumina and Ion Torrent generally meet the “good enough” threshold. After this point, variant call sets are no longer “big data,” but the complexity of analysis increases exponentially. This is where Golden Helix focuses its efforts in providing commercially supported variant annotation, interpretation, and reporting solutions.
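Because each sample's align-and-call run is independent of every other's, the fan-out pattern behind "embarrassingly parallel" is simple. The sketch below uses a Python thread pool and a placeholder function standing in for real aligners and variant callers; sample names are illustrative.

```python
# Illustrative sketch of why per-sample alignment and variant calling is
# "embarrassingly parallel": each sample is processed independently, so
# samples can simply be fanned out to workers. align_and_call is a
# hypothetical stand-in; a real pipeline would shell out to an aligner
# and a variant caller, and distribute work across cores or machines.
from concurrent.futures import ThreadPoolExecutor

def align_and_call(sample_id: str) -> dict:
    # Placeholder for an alignment + variant-calling run on one sample.
    return {"sample": sample_id, "variants_called": True}

samples = ["NA12878", "NA12891", "NA12892"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(align_and_call, samples))

print(results)
```

No worker needs another worker's output, which is why bundled vendor pipelines can make this stage "good enough" with straightforward scaling.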
Dr. Shon: Scientists desire to leverage the cumulative knowledge and experience of previous work done in a field. With increasingly large and complex data sets, this task becomes more difficult by the day. The ability to leverage that data is a function of openness and accessibility.
Currently, most organizations are increasingly committed to open sharing of data, but without established processes, standards, and tools to enable easy access to data, metadata, and results, data will not be effectively shared. The ability to fully evaluate previous results enables researchers to determine whether or not a data set answers or validates specific questions at hand. For large institutions, and for the community as a whole, increasing investment in these processes and tools is required for scientists to stand on the shoulders of other scientists.
Dr. Hoefkens: Giving scientists timely and easy access to data is the main impediment to successful analysis of big data in life sciences research. If working scientists cannot readily access relevant data, the goals of data integration and knowledge creation remain elusive.
Currently, life sciences research organizations generate data sets that are increasingly multidimensional and information rich and are stored in an ever-growing number of in-house data silos and public data repositories. Scientists’ ability to derive value from the data is often limited by the absence of unified access and ability to search across the various data repositories. In addition, the lack of consistent metadata and semantic mapping complicates the integration of data from different sources.
Dr. Davis-Dusenbery: In our work on the Cancer Genomics Cloud pilot for the National Cancer Institute, which holds one of the largest public data sets, we follow four guiding principles for overcoming blockers to large-scale analysis.
- Data should be usable, not just available. Scientists need tools to help them quickly find the right raw data. Scientists not only need to discern what’s relevant, they also need to recognize what’s already been processed.
- The best science happens in teams. Fine-grained permissions help principal investigators, analysts, and tool developers all work in parallel.
- Reproducibility shouldn’t be hard because big data is so…big. Scientists need systems to track every analysis automatically.
- The impact of data is extended by both new data and new tools. Easily adding more of both unlocks new types of analysis across the entire dataset.
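The fine-grained permissions principle above can be sketched as a minimal role-to-action mapping; the roles and actions below are illustrative only, not any platform's actual model.

```python
# Hypothetical sketch of fine-grained, per-role project permissions that
# let PIs, analysts, and tool developers work in parallel. Roles and
# actions are invented for illustration.
PERMISSIONS = {
    "principal_investigator": {"read", "write", "execute", "admin"},
    "analyst": {"read", "execute"},
    "tool_developer": {"read", "write"},
}

def can(role: str, action: str) -> bool:
    """Check whether a role is allowed to perform an action."""
    return action in PERMISSIONS.get(role, set())

print(can("analyst", "execute"))  # True: analysts may run analyses
print(can("analyst", "admin"))    # False: but not change project settings
```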
Dr. Greene: In our experience at CSRA, the largest obstacle is researchers not sufficiently planning for the size and complexity of their data. Some large efforts start with an insufficiently resourced pilot that is then placed into full production. Another impediment is the need to cohesively combine data from multiple research sites to avoid “data munging.” Promoting cohesiveness goes hand in hand with the use of standards.
Resolving these issues adds significant overhead, in terms of extending completion times, but it pays off in the long run. For example, we support the development of the Biomedical Research Informatics Computing System, or BRICS (brics.cit.nih.gov).
A project undertaken by the NIH’s Center for Information Technology, BRICS embodies a modular, web-based approach that has been used, for example, to support sister neurological databases. BRICS has also entailed the creation of a data dictionary that provides significant cross-comparison benefits.
Mr. King: Today’s main obstacle is not about getting the data, but having the systems to manage vast amounts of data, create connections, and extract valuable insights. Such systems are necessary if data is to inform decision making.
For life sciences companies, data-informed processes need to happen in a timely fashion. That is, they need to support real-time decisions, not just retrospective analyses. By providing predictive insights, these processes can have maximal impact.
Right now, companies don’t have the systems in place to aggregate all their data and provide the insights and machine learning necessary to guide decisions and seize future opportunities. Scientists need tools that let them explore and use data, as well as filter, view, and integrate it within existing systems.
GEN: Which specific big data technological or methodological approach do you favor in helping to address the data analysis roadblock?
Dr. Asimenos: Today, many organizations are participating in global large-scale sequencing projects to study thousands or even millions of genomes. The cloud is the only technology capable of keeping pace with big data. It eliminates the time and capital expenditure of creating and upgrading local infrastructure for data analysis. The elasticity of the cloud allows for near limitless scalability and immediate availability of resources. And by taking advantage of the cloud’s online nature, researchers are able to share data and tools and collaborate instantaneously with others around the world.
As big data accumulate online, new standards will need to emerge for discovering and querying datasets, as well as for authenticating requests, encrypting sensitive information, and controlling access. The Global Alliance for Genomics and Health and others are working together to develop approaches that facilitate interoperability.
Dr. Stadnisky: We have helped our customers realize the value of moving beyond sifting practices to achieve insight and knowledge curation, by automating routine, repeated analysis and providing a suite of discovery tools in an open architecture. Specifically, we have built an analysis ecosystem for single-cell phenomics data (from flow and mass cytometry) that allows users to automate routine phenotyping. The use of smart pipelines can change an analysis for a sample or group of samples based on derived statistics, and be executed immediately following data acquisition.
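The idea of a pipeline that adapts based on derived statistics can be sketched minimally in Python. The QC statistics, thresholds, and branch names below are invented for illustration; they are not the actual logic of any cytometry platform.

```python
# Hedged sketch of a "smart pipeline" step: the analysis branch applied
# to a sample is chosen from statistics derived from the data itself.
# Thresholds and branch names are hypothetical.
def choose_analysis(event_count: int, viability: float) -> str:
    """Pick a downstream analysis path from per-sample QC statistics."""
    if event_count < 10_000:
        return "flag-for-reacquisition"  # too few events to phenotype
    if viability < 0.7:
        return "restricted-gating"       # analyze live cells only
    return "full-phenotyping"

print(choose_analysis(5_000, 0.9))   # flag-for-reacquisition
print(choose_analysis(50_000, 0.95)) # full-phenotyping
```

Running such a decision step immediately after acquisition is what lets routine phenotyping proceed without an analyst touching every sample.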
Thus, we help scientists to answer the key questions of a study and facilitate the work of discovery in an experiment by providing intuitive tools for discovery analysis. Such tools enable clustering, visualization, data reduction, and ontology querying. They also enable scientists to plug in any algorithm and obtain views from alongside ongoing analyses.
Mr. Rudy: The standard exome or genome align-and-call pipelines have become what most informatics infrastructure should become: boring and predictable. Cloud platforms, such as DNAnexus, provide an important service: they put this infrastructure into more individuals’ hands.
What I find more interesting is the analytics closer to gene discovery and clinical care. Variant annotation and interpretation require not only the constant ingestion of the latest releases of new datasets and annotations, but also the application of data science to your archive of sequenced samples.
To scale the warehousing of thousands of whole genomes to support this rapid reannotation and retrospective querying, we have developed compressed column-store techniques paired with a traditional SQL front-end. With deeply integrated genomics-aware representation and annotation algorithms, this storage technique is more efficient than compressed VCF files and optimizes genomic queries such as “How often have I seen this variant?” and “Has ClinVar changed the classification of a variant I have reported on?”
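The retrospective query pattern can be illustrated with plain SQL. The sketch below uses SQLite in place of the compressed column store described, with an invented schema and made-up sample IDs and coordinates; it shows only the shape of a "how often have I seen this variant?" query, not the actual storage engine.

```python
# Minimal sketch (SQLite standing in for a genomics-aware column store)
# of a retrospective warehouse query: "How many samples in my archive
# carry this variant?" Schema, samples, and coordinates are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE variant_calls (
    sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)""")
conn.executemany(
    "INSERT INTO variant_calls VALUES (?, ?, ?, ?, ?)",
    [("S1", "chr17", 43094464, "A", "G"),
     ("S2", "chr17", 43094464, "A", "G"),
     ("S3", "chr17", 43104121, "C", "T")],
)

# Count how many distinct samples carry a given variant.
(count,) = conn.execute(
    """SELECT COUNT(DISTINCT sample_id) FROM variant_calls
       WHERE chrom=? AND pos=? AND ref=? AND alt=?""",
    ("chr17", 43094464, "A", "G"),
).fetchone()
print(count)  # 2
```

A column store serving the same query compresses each column independently and scans only the columns the predicate touches, which is why it outperforms compressed VCF files at warehouse scale.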
Dr. Shon: For large and complex data sets, nonduplication and standardized curation of metadata are essential for aggregation and sharing of data.
Reproducible research relies upon this, and economic factors make it increasingly imperative. Federating data and bringing analytic tools to the data, rather than aggregating everything in a single instance, is how large-scale data sharing can be enabled in practice. Accordingly, systems, technology, and tools are required to manage the ethical, privacy, and legal issues involved in sharing data at scale. This is increasingly important as genomic data sets and clinical medicine begin to intersect.
Finally, and paradoxically, we will require increasingly large data set aggregation and analysis to enable the provision of precision medicine. While current genomic technologies are more robust than ever, achieving analytic and clinical validation is increasingly critical for the large-scale adoption of these technologies to provide clinical utility to patients and providers. Software to manage end-to-end processes, from sample to answer, is imperative for delivering genomic technology at scale.
Dr. Hoefkens: PerkinElmer is helping to address the challenges scientists face by leveraging big data with an integrated strategy that simultaneously manages the challenges related to data perception, flexibility, and agility. The technological underpinnings of this strategy are Tibco’s Spotfire®, which is software for data analysis, visualization, and visual exploration, and PerkinElmer’s Signals™, a platform for data storage, retrieval, and normalization.
We have created the PerkinElmer Signals platform as a cloud-based data warehouse system around two key components—a flexible and adaptable means of storing measurement data and associated metadata, and a scientifically aware entity domain model that supports the adaptation of the platform to specific scientific domains.
The high degree of flexibility and agility of the PerkinElmer Signals platform is based on a hybrid approach of combining a traditional SQL data model with a flexible NoSQL search engine. To provide scientists with an easy-to-use and intuitive query interface, the PerkinElmer Signals platform is tightly integrated with Tibco’s Spotfire software for data selection and downstream analysis.
Dr. Davis-Dusenbery: From our experience with national and international projects such as the Cancer Genomics Cloud and the Pan-Cancer Analysis of Whole Genomes, we have found that pairing cloud infrastructure with software that helps scientists with different expertise collaborate is most important. While cloud infrastructure offers limitless storage and as-you-need-it computation, it is even more important because it allows you to bring your biological questions to the data. This means when you have a hypothesis, you set up a task and run it within an hour instead of waiting for weeks to download the appropriate data set.
The ability to securely collaborate with other teams, regardless of where they are geographically, is also critical. In modern science, no one person has a monopoly on good ideas, and a system that allows researchers to share just the right amount of access with collaborators lets groups test multiple hypotheses at once and share results.
Dr. Greene: Researchers benefit from having an underlying platform to plug into instead of having to resort to one-off system development. We have used Apache’s ServiceMix bus successfully at the Centers for Disease Control, and we are exploring the open source Agave platform-as-a-service solution for hybrid cloud computing developed at the Texas Advanced Computing Center.
Keep it light and avoid heavy architectures. A hardware refresh is costly. Take advantage of the scalability of the cloud, which is unparalleled. Using the cloud for hybrid computing is now essential.
CSRA is working with the Institute for Systems Biology and Google to develop a National Cancer Institute Cancer Genomics Cloud pilot focused on the security required for a federal information system. CSRA can really add value here because many organizations need advanced, secure data management systems but lack the know-how.
Mr. King: Today, what we have is data analysis, not insights to guide decisions. To get great insights, you need to know the right questions to ask of the data. You need algorithmic technology. Assuming that you have the right data, and that it’s refreshed and integrated with existing data sets, you still need trained experts who can ask the right questions of the data.
The data needs to be linked at the very beginning with algorithmic linkage technology, where the life sciences “lens” drives the analytics to put the structured and unstructured data into context. The analysis layer needs that same algorithmic approach, which is where machine learning improves the data and the analytics over time to deliver those relevant insights better and faster. And, finally, visualization technology can give users at any level the flexibility to access the value extracted from the data to better manage research and/or market dynamics.
Data Resolution in Clinical Trials
Capturing adequate physiological data in late-stage clinical trials has long been difficult for pharmaceutical and biotech researchers, particularly in off-site settings. Until recently, securely and accurately capturing and transmitting this patient data has been impractical, leading to millions of data points simply being lost each year. The result: very low fidelity in clinical investigation records.
Today, promising advances in clinical technology are helping to address this shortfall using a novel, patient-centric BYOD approach leading to much higher clarity in results.
Aces Health™ has developed the first scalable mobile app to tackle the issue directly through their patent-pending recording algorithm. “The core of this new solution is our versatile, HIPAA-compliant continuous capture platform,” explains Jordan Spivack, director of product development at Aces. The software allows any number of Bluetooth-enabled medical devices to send patient data in real time to researchers, who can then analyze the relationships on a unified dashboard. “You can monitor aggregate data across identified trial parameters and track emerging patterns as they begin to develop to make timely assessments.”
Early-adopting customers are already encouraged by the results that continuous off-site capture offers, bolstered by the Aces app’s improvements to patient adherence and attrition rates—the latter traditionally as high as 30% industry-wide. “Having access to exponentially more data points per patient opens up a new metric for comparison that we call ‘Data Resolution’. The result is like watching high-definition television instead of viewing an old standard feed—with higher confidence analytics through greater information density, investigators are painting a clearer picture of a trial’s safety and efficacy earlier on.”