Next-generation sequencing (NGS) has given us the ability to map a whole human genome in less than a week for just a few thousand dollars. With that ability came a tsunami of data.
Because we cannot simply read the entire genome end to end, NGS generates a very large number of short reads from random locations in the genome. The reads are assembled into larger contigs, and the contigs into genomes. This process generates about 100 bytes of compressed data for each base pair and roughly 100 GB for each human genome. Worldwide, the volume of sequencing data is rapidly approaching the exabyte (10¹⁸ bytes), with an astounding 5x year-on-year growth rate.
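To put those figures in perspective, the short sketch below runs the back-of-envelope arithmetic. It uses only the numbers quoted above (100 GB per genome, an exabyte as 10¹⁸ bytes, 5x annual growth); the starting worldwide volume is an assumed placeholder, not a figure from the article.

```python
# Back-of-envelope sketch of the storage figures quoted above.
# The per-genome size and growth rate come from the article;
# the starting worldwide volume is an assumed placeholder.

GB = 10**9          # decimal gigabyte, in bytes
EXABYTE = 10**18    # bytes

bytes_per_genome = 100 * GB

# How many 100 GB genomes fit into one exabyte of storage?
print(f"Genomes per exabyte: {EXABYTE / bytes_per_genome:,.0f}")  # 10,000,000

# With 5x year-on-year growth, volume doubles roughly every
# log(2)/log(5) ~ 0.43 years. Projecting from an assumed 100 PB today:
volume = 100 * 10**15       # assumption: 100 petabytes as a starting point
for year in range(1, 4):
    volume *= 5
    print(f"Year {year}: {volume / EXABYTE:.2f} EB")
```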
The field's future development clearly depends on highly efficient, rapidly scalable methods for data storage and analysis. And to use the data in a clinically meaningful way, we need methods for assessing the quality of the resulting sequences.
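One basic form of such quality assessment is per-read base quality. The sketch below is a generic illustration, not a method presented at the conference: it assumes a standard FASTQ file with Phred+33 quality encoding (the file name is a placeholder) and flags reads whose mean base quality falls below Q20.

```python
# Minimal sketch: mean Phred quality per read from a FASTQ file.
# Assumes the common Phred+33 (Sanger) quality encoding; "sample.fastq"
# is a placeholder file name.

def mean_read_qualities(path):
    """Yield (read_id, mean base quality) for each record in a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().strip()
            if not header:
                break                      # end of file
            fh.readline()                  # sequence line (unused here)
            fh.readline()                  # '+' separator line
            quals = fh.readline().strip()
            scores = [ord(c) - 33 for c in quals]   # Phred+33 decoding
            yield header[1:], sum(scores) / len(scores)

for read_id, q in mean_read_qualities("sample.fastq"):
    if q < 20:                             # flag reads below Q20
        print(f"{read_id}: mean quality {q:.1f}")
```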
CHI’s “Next-Generation Sequencing Data Analysis” conference was dedicated to evaluating progress across the pipeline of computational technologies. The technologies selected for the conference, and for this article, highlight diverse approaches to dealing with big data.
“Next-generation sequencing instruments can generate up to two terabytes of data per run per sequencer,” said Sanjay Joshi, CTO of life sciences for EMC’s Isilon storage division. “The ability of sequencing technologies to deliver data drowns out our capability to process and store the raw data and results.”
The EMC storage solution, Isilon OneFS, enables users to grow storage capacity seamlessly and linearly, in step with data growth. “Isilon adds storage as simply as Lego blocks,” continued Joshi. “Adding more and more identical nodes to the existing architecture provides a so-called scale-out solution for life sciences computing, a field with practically unlimited needs.”
Isilon OneFS has several distinctive characteristics, according to Joshi. In addition to its scalability, it intelligently spreads the metadata describing each data file across all storage nodes in the system. Each node therefore “knows” what the other nodes are engaged in, which eliminates individual points of failure within the storage cluster. In essence, Isilon OneFS is a self-healing system.
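OneFS itself is proprietary, but the general idea of spreading metadata so that no single node becomes a point of failure can be illustrated with a simple hashing-with-replication sketch. The cluster size, replica count, and file path below are all assumptions for illustration only, not Isilon's actual design.

```python
# Conceptual sketch only: placing copies of each file's metadata on several
# storage nodes so that losing any single node does not lose the metadata.
# This is a generic illustration, not Isilon's OneFS implementation.

import hashlib

NODES = [f"node-{i}" for i in range(8)]   # hypothetical 8-node cluster
REPLICAS = 3                              # copies of each metadata record

def metadata_nodes(file_path, nodes=NODES, replicas=REPLICAS):
    """Pick `replicas` distinct nodes to hold metadata for `file_path`."""
    digest = hashlib.sha256(file_path.encode()).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(metadata_nodes("/genomes/sample_001.bam"))
# e.g. ['node-5', 'node-6', 'node-7'] -- any one node can fail and the
# metadata is still reachable on the other two.
```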
Its node-based structure is ideally suited to the massively parallel computing protocols that are indispensable for assembling millions of short DNA reads. At the same time, the storage needs simultaneously supported by a single Isilon framework can be rather diverse.
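Why short-read work parallelizes so readily is easy to see in a small sketch. The example below distributes k-mer counting, a typical step that precedes assembly, across worker processes; the k-mer length, worker count, and toy reads are assumptions chosen for illustration, not any specific assembler's pipeline.

```python
# Minimal sketch of an embarrassingly parallel step in read processing:
# counting k-mers across chunks of reads in separate worker processes.
# Generic illustration only; parameters and reads are placeholders.

from collections import Counter
from multiprocessing import Pool

K = 21  # example k-mer length

def count_kmers(reads):
    """Count all k-mers of length K in a chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return counts

def parallel_kmer_counts(reads, workers=4):
    """Split reads into chunks, count k-mers per chunk, merge the results."""
    chunk = max(1, len(reads) // workers)
    chunks = [reads[i:i + chunk] for i in range(0, len(reads), chunk)]
    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.map(count_kmers, chunks):
            total.update(partial)
    return total

if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGTACGTACGT", "TTGCAACGTACGTACGTACGTACG"]
    print(parallel_kmer_counts(reads).most_common(3))
```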
At Harvard Medical School (HMS), Isilon OneFS is connected to a supercomputing center to serve genomics and image-analysis needs. At the same time, it stores multimedia learning courses and HMS administrative workflows.
Isilon found another application at the Laboratory of Neuroimaging (LONI) at the University of California, Los Angeles, where it stores what is reportedly the largest collection of neuroimaging data in the world, exceeding 430 terabytes. The LONI brain scans represent a unique collection of 2D images that can be stacked to reconstruct 3D brain images.
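The stacking step itself is conceptually straightforward. A minimal sketch is shown below, assuming the slices are already aligned and available as equally sized 2D NumPy arrays; the slice count and dimensions are placeholders, not LONI's actual data format.

```python
# Minimal sketch: stacking aligned 2D slices into a 3D brain volume.
# Assumes the slices are already registered and share the same shape;
# the synthetic arrays here stand in for real scanner output.

import numpy as np

def stack_slices(slices):
    """Stack a list of equally shaped 2D arrays into a 3D volume (z, y, x)."""
    return np.stack(slices, axis=0)

# Placeholder data: 180 slices of 256 x 256 voxels
slices = [np.random.rand(256, 256).astype(np.float32) for _ in range(180)]
volume = stack_slices(slices)
print(volume.shape)          # (180, 256, 256)
print(volume[:, 128, 128])   # one column through the stack, along z
```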
“Computational challenges of image processing on this scale are daunting,” continued Joshi. “But storing and retrieving the 200 GB images is what often impeded the work of researchers around the world.”
Deployment of Isilon’s storage environment enabled LONI to double the processing speed and reduce network bottlenecks. “Data security and availability are absolutely critical when dealing with potentially identifiable health information. Transfer of information over the internet would inherently be less secure than the Isilon solution within a private cloud context,” Joshi concluded.