June 1, 2015 (Vol. 35, No. 11)
A Fresh Bioinformatics Breeze Is Blowing, Revealing Solid Disease Models and Clear Treatment Options
According to Illumina, next-generation sequencing data volume has doubled every year since 2007, representing over a 1,000-fold increase in the amount of data that needs to be processed. Add in proteomics, metabolomics, medical records, and other information, and it is obvious that Big Data is growing at an explosive rate.
But without the proper computational and informatics tools, more data won’t necessarily amount to more, or better, information. Ongoing governmental initiatives and commercial advances are shaping the way the scientific community addresses this challenge. Bioinformatics leaders recently convened at CHI’s Bioinformatics for Big Data: Converting Data into Information and Knowledge conference to discuss progress in the field.
The National Center for Multiscale Modeling of Biological Systems (MMBioS)—a collaborative effort between the University of Pittsburgh, Carnegie Mellon University, the Pittsburgh Supercomputing Center, and the Salk Institute for Biological Studies—was established in 2012, in the first round of Biomedical Technology Research Resources (BTRRs). Today, 35 NIH-funded BTRRs create and apply unique technology and methods in their respective fields while facilitating the research of NIH-funded laboratories.
The MMBioS Resource focuses on neurobiological as well as immunological applications. To gain deeper mechanistic understandings, MMBioS develops multiscale simulations to bridge molecular events and disease and organ functions. In particular, MMBioS sustains technology development projects in molecular modeling, cell modeling, and image processing.
These technology efforts are guided by biomedical projects that MMBioS conducts with research groups across the country. These projects focus on glutamate transport, synaptic signaling, dopamine transporter function, T-cell signaling, and neural circuits. Besides these driving biomedical projects, MMBioS engages in a large number of collaborations with experimental and computational research groups. In addition, information and technology are disseminated to the larger scientific community through MMBioS’ website, training workshops, and tutorials.
“This phenomenal type of joint effort is extremely useful,” said Ivet Bahar, Ph.D., distinguished professor and John K. Vries Chair, department of computational and systems biology, School of Medicine, University of Pittsburgh. “The problems we are dealing with are much more complicated than an individual laboratory can handle. Our role is to build the technology, which we devise in response to existing research needs and challenges.
“The computations we develop are very fast, efficient, and inexpensive, so in silico experiments can minimize the wet lab benchtop effort. Computations serve two important roles: they help interpret experimental data in the framework of well-defined quantitative models and methods, and they help build new hypotheses, which are then tested experimentally.”
Pittsburgh also has a new BD2K (Big Data to Knowledge) Center of Excellence. This BD2K project, called the Center for Causal Modeling and Discovery, is a collaboration between the University of Pittsburgh, Carnegie Mellon University, the Pittsburgh Supercomputing Center, and Yale University.
The project’s diverse participants include Carnegie Mellon’s philosophy department. Causality and logic models used for a variety of applications will be further expanded to biomedical Big Data to gain insight into mechanisms of function and to understand relationships important for therapy, especially personalized medicine. A short course for teaching causal modeling techniques is currently being organized.
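The core idea behind such causal modeling can be illustrated with a toy example. The sketch below is generic and hypothetical, not the Center’s actual software: in a causal chain X → Y → Z, X and Z are correlated, but the association vanishes once Y is conditioned on. Patterns of conditional independence like this are what constraint-based causal-discovery methods exploit to distinguish direct causes from indirect ones.

```python
import numpy as np

# Simulate a causal chain X -> Y -> Z (illustrative data, not biomedical).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)      # Y is caused by X
z = 1.5 * y + rng.normal(size=n)      # Z is caused by Y, not directly by X

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`."""
    ra = a - np.polyval(np.polyfit(given, a, 1), given)
    rb = b - np.polyval(np.polyfit(given, b, 1), given)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, z)[0, 1]          # strong: X and Z co-vary
conditional = partial_corr(x, z, given=y)   # near zero: Y screens X off from Z
```

The marginal correlation is large while the partial correlation given Y is near zero, the statistical signature that Y mediates the X-to-Z relationship.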
Large Systems Perspectives
Simple solutions do not solve problems in complex systems. At IPQ Analytics, disease-agnostic models take a large-system perspective and span the entire patient experience, from conditions preceding illness onset to symptom display, diagnosis, treatment decision, physician compliance, patient adherence, and outcome.
Because the basic disease process remains the same, many model elements are common to all diseases. Specific risk factors, however, may be weighted differently, and new elements may be added to customize and extend the general model. For example, in a rare pediatric disease, the model was extended to look at pregnancy history and in utero exposures, factors that are also relevant in breast cancer.
“We need to consider the complete system and think about the real-world problem before we can ask the right questions,” insisted Michael Liebman, Ph.D., managing director of IPQ Analytics. “Transitioning data to information to knowledge to clinical utility is difficult. We try to identify how large the gap is, what crucial issues need to be addressed, and what questions need to be answered.
“If the patient is at the top of a pyramid and you work to fill in only the pieces necessary to answer critical questions, then the pyramid remains stable. If you build from the bottom up, as each new technology develops a block, something will always be missing to complete the base, making the pyramid unstable.”
Some data may be expensive and hard to collect. Modeling enables evaluation of the impact of missing information, and it allows identification and prioritization of what the model needs to make it more precise in its predictions.
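One way to picture this prioritization is a leave-one-out comparison: fit a simple predictive model, drop each candidate measurement in turn, and rank measurements by how much error grows without them. The sketch below is hypothetical; the feature names and the linear model are illustrative assumptions, not IPQ Analytics’ actual method.

```python
import numpy as np

# Illustrative data: three candidate measurements, one of which is pure noise.
rng = np.random.default_rng(1)
n = 5_000
features = {
    "weight_change": rng.normal(size=n),   # strongly informative (by construction)
    "age_menarche":  rng.normal(size=n),   # weakly informative
    "noise_marker":  rng.normal(size=n),   # uninformative
}
risk = (1.5 * features["weight_change"]
        + 0.3 * features["age_menarche"]
        + rng.normal(size=n))

def fit_error(cols):
    """RMSE of a least-squares linear fit using the given feature columns."""
    X = np.column_stack([features[c] for c in cols] + [np.ones(n)])
    beta, *_ = np.linalg.lstsq(X, risk, rcond=None)
    return np.sqrt(np.mean((risk - X @ beta) ** 2))

full = fit_error(list(features))
# Impact of each measurement = error increase when it is left out.
impact = {c: fit_error([k for k in features if k != c]) - full
          for c in features}
```

Here `impact` ranks "weight_change" as the measurement most worth collecting, while the uninformative marker scores near zero, exactly the triage Liebman describes for expensive or hard-to-collect data.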
In the modeling of breast cancer risk, the personalized history of the patient, which may contain information such as changes in weight over the patient’s lifetime and time of menarche, will be scrutinized. The modeling will do so while recognizing breast cancer fundamentals, such as the concept that the breast undergoes developmental change throughout a woman’s lifetime, and the concept that hormonal changes, which produce long-term effects, are influenced by body fat and other factors.
These changes need to be appreciated if modeling is to capture the understanding that risk is not uniform over a woman’s lifetime but varies from stage to stage of personal development. More in-depth analysis of specific regulatory pathways and molecular processes at each stage of development may point to sources of risk and help identify better biomarkers and ways to manage or prevent the disease.
Making Data Accessible
Basic clinical and outcomes research data must be accessible, ideally not just to investigators within an institution, but also across institutions, which may include pharmaceutical companies. Such accessibility is the aim of SPIRIT (Software Platform for Integrated Research Information and Transformation), an integrated research information platform. SPIRIT is designed to enable the integration of in-house, open source, and commercial off-the-shelf applications for the City of Hope (COH).
“We wanted to develop the platform not just to integrate the data and serve the operational needs, but also as a springboard to put together proof of concepts and new applications, such as machine learning and biomedical natural language processing pipelines, which allow us to analyze data and provide results much faster,” explained Ajay Shah, Ph.D., director of research informatics and systems, COH National Medical Center.
Processing Data Faster
Scientific innovation is progressing at a faster pace than most organizations’ ability to refresh their IT infrastructure, and Big Data requires compute density. For example, rising usage and technical innovation continue to drive down sequencing costs, which in turn fuels informatics demand for higher throughput and accuracy.
The Cray Urika-XA analytics server contains 48 nodes, over 1,500 cores with 6 TB of RAM, a 38 TB solid-state drive, and a 120 TB POSIX-compliant parallel file system. The small-footprint server is preconfigured and delivered with Hadoop and Spark, is optimized for use at high density, and offers a lower total cost of ownership for a normal data center life cycle of three to five years.
Because the server has over 1,500 cores, it can run more than 1,500 compute events simultaneously, two to three times the density of other platforms.
“What makes Urika so cool is the compute density. This convergence of supercomputers and analytics allows scaling from proof of concept to production in the same environment,” stated David Anstey, global head of life sciences at Cray. “You can get more done faster. Think about the possibilities if there were no constraints. What would the impact be if you could ask tougher, more probing questions in an iterative way?”
A combination of technology and people’s ability to leverage that technology effectively, Anstey insisted, will determine how fast precision medicine evolves.
The Urika-XA can transition a data center from batch-mode processing to low-latency fast analytics. End users can run their own jobs while the software handles the workflow, simplifying the scientific analysis.
A large cancer group’s analysis of over 30,000 samples, where the goal was to look at the effect of genetic mutation on gene expression, previously took 6 minutes per sample to complete, almost 3,600 hours for the panel. Rerunning this analysis on Spark using Urika-XA decreased the analysis time to 20 minutes, demonstrating the effectiveness of using in-memory analytics across a significant amount of compute.
The storage capacity of the Urika-XA platform can allow data to be augmented with additional information, such as metabolic and lifestyle histories, and then reanalyzed without data movement, minimizing expense.