Next-generation sequencing (NGS) has given us the ability to produce maps of whole human genomes in less than a week for just a few thousand dollars. With this ability came a tsunami of data.
Because we cannot simply read the entire genome end to end, NGS generates a very large number of small reads from random locations in the genome. The reads are assembled into larger contigs, and the contigs into genomes. This process generates about 100 bytes of compressed data for each base pair and 100 GB for each human genome. Worldwide, the volume of sequencing data is rapidly approaching the exabyte (10^18 bytes), with an astounding 5x year-on-year growth rate.
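To put these figures in perspective, the short Python sketch below works through the arithmetic. It is an illustration only: the function names are hypothetical, and it assumes the quoted 100 GB per genome and 5x annual growth hold constant, which the article does not claim.

```python
import math

GB = 10**9   # decimal gigabyte, in bytes
EB = 10**18  # decimal exabyte, in bytes

BYTES_PER_GENOME = 100 * GB  # per-genome figure quoted in the article
GROWTH_PER_YEAR = 5          # 5x year-on-year growth, as quoted


def genomes_per_volume(volume_bytes, bytes_per_genome=BYTES_PER_GENOME):
    """Number of whole genomes that fit in a given data volume."""
    return volume_bytes // bytes_per_genome


def years_to_grow(factor, growth_per_year=GROWTH_PER_YEAR):
    """Years needed for the volume to grow by `factor` at a constant rate."""
    return math.log(factor) / math.log(growth_per_year)


def projected_volume(start_bytes, years, growth_per_year=GROWTH_PER_YEAR):
    """Data volume after a whole number of years of constant growth."""
    return start_bytes * growth_per_year ** years


if __name__ == "__main__":
    print(genomes_per_volume(EB))          # genomes in one exabyte
    print(round(years_to_grow(1000), 2))   # years for a 1000x increase
    print(projected_volume(EB, 2) // EB)   # exabytes after two more years
```

Under these assumptions, one exabyte corresponds to roughly ten million genomes at 100 GB each, and a further thousandfold increase would take a little over four years.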
The field clearly needs efficient and rapidly scalable methods for storing and analyzing data. And in order to use the data in a clinically meaningful way, we need methods for assessing the quality of the resulting sequences.
CHI’s “Next-Generation Sequencing Data Analysis” conference was dedicated to evaluating progress in the pipeline of computational technologies. The technologies selected for the conference and for this article highlight diverse approaches to dealing with big data.
“Next-generation sequencing instruments can generate up to two terabytes of data per run per sequencer,” said Sanjay Joshi, CTO life sciences, Isilon storage division, EMC. “The ability of sequencing technologies to deliver data drowns out our capability to process and store the raw data and results.”
The EMC storage solution, Isilon OneFS, enables users to grow storage capacity seamlessly, in full sync and linearly with data growth. “Isilon adds storage as simply as Lego blocks,” continued Joshi. “Adding more and more identical nodes to the existing architecture provides so-called scale-out solutions for life sciences computing, the field with practically unlimited needs.”
Isilon OneFS has multiple distinctive characteristics, according to Joshi. In addition to infinite scalability, it spreads the metadata that describes each datafile intelligently across all storage nodes in the system. Therefore, each node “knows” what the other nodes are engaged in, and that eliminates any individual points of failure within the storage cluster. In essence, Isilon OneFS is a self-healing system.
Its node-based structure is well suited to the massively parallel computing protocols that are indispensable when assembling millions of short DNA pieces. The storage needs supported simultaneously by the same Isilon framework can be rather diverse in nature.
At Harvard Medical School (HMS), Isilon OneFS is connected to a supercomputing center to serve genomics and image-analysis needs. At the same time, it stores learning courses with multimedia applications and HMS administrative workflows.
Isilon found another application at the Laboratory of Neuroimaging (LONI) at the University of California, Los Angeles, where it stores what is reportedly the largest collection of neuroimaging data in the world, exceeding 430 terabytes. The LONI brain scans represent a unique collection of 2D images that can be stacked to reconstruct 3D brain images.
“Computational challenges of image processing on this scale are daunting,” continued Joshi. “But storing and retrieving the 200 GB images is what often impeded the work of researchers around the world.”
Deployment of Isilon’s storage environment enabled LONI to double the processing speed and reduce network bottlenecks. “Data security and availability are absolutely critical when dealing with potentially identifiable health information. Transfer of information over the internet would inherently be less secure than the Isilon solution within a private cloud context,” Joshi concluded.
No Need for Dedicated Hardware or Software
“Web-based platforms present an ideal infinitely scalable solution for handling big data,” countered Andreas Sundquist, Ph.D., CEO and co-founder, DNAnexus. “The data goes straight from the sequencers into the cloud over the secure protocol. Our customers can access all storage and data-visualization tools without investing in expensive hardware infrastructure.”
DNAnexus rents a segment of the Amazon cloud. Customers acquire services on demand, and the pricing mirrors the data usage. Because of this infinite elasticity, DNAnexus can support sequencing operations of virtually any size, from a single machine to a full-scale sequencing center.
“For DNA sequencing to have real application in healthcare, clinicians should be able to generate the data and receive the results even if they do not have access to a large computational center,” continued Dr. Sundquist. “The analysis will also have to be considerably simplified.”
The success of a 100% outsourcing approach is exemplified by the discovery of an unstable variant of dystonin, a cytoskeletal protein. This mutation causes hereditary loss of function in peripheral sensory nerves.
The mutation was identified by a small nonprofit organization, Bonei Olam, which does not have its own DNA assembly, analysis, or storage capabilities. Instead, the scientists contracted DNAnexus to provide the entire workflow from alignment of raw reads to graphical display of matches to the reference genome. In the future, DNAnexus plans to expand such clinically relevant workflows.
A collaboration with Geisinger Health System and the University of California, San Francisco, will enable DNAnexus to learn from clinical experts how to build innovative solutions for personalized healthcare. DNAnexus will soon be launching an instant genomic and data analysis center that enables collaboration in a unified environment with just a click of a button.
“The cloud is a powerful tool to build absolute best security,” continued Dr. Sundquist. DNAnexus emphasizes multiple approaches to protecting personal health information in the cloud environment, including physical protection of servers with round-the-clock surveillance, data encryption, and audit trails for data access.
“Just a few years ago pharmaceutical companies would not consider the cloud as even remotely possible,” continued Dr. Sundquist. “But now, enhanced data security allows for storing various sensitive data in the cloud, including employee records and financial data.”