April 15, 2018 (Vol. 38, No. 8)
Genomic, Transcriptomic, and Metabolomic Data Can’t Be Free, But It Can Be “Free Range”
Don’t count your chickens before they hatch. Really? Whoever said that knew nothing of biodata storage. Biodata, after all, is accumulating so quickly that the real risk is that storage needs will be underestimated, not overestimated.
Much of this biodata derives from genome sequencing, particularly because sequencing costs are falling at a pace that invites comparisons to Moore’s Law. And the genome is hardly the only “ome” contributing to the explosion of biodata. Close behind are the transcriptome, the proteome, the metabolome, and so on. Also, vast amounts of image-based biodata are being collected by cell analyzer systems.
But let’s suppose that users of biodata—researchers, drug developers, and clinical scientists—are awakened to the biodata challenge. There is still the matter of how the challenge should be met. The usual response, the addition of raw storage capacity, would suffice only if biodata were content to be cooped up. But it isn’t. If biodata is to have value, it must be processed, analyzed, shared, and reanalyzed. It must move in and out of workflows, and quickly. Otherwise, basic experiments may be repeated needlessly, drug repurposing efforts may stall, and individualized diagnoses and treatment plans may be delayed.
Pools and Workflows
The biodata explosion, then, may be thought of as a workflow challenge. “In an environment that thrives on the collection, analysis, and distribution of data to researchers around the world, it’s essential to have a storage system in place that reliably supports a 24/7 streamlined workflow with extremely high computational power,” insists Ellis Wilson, Ph.D., a software architect at Panasas, a provider of network-attached storage solutions. “To facilitate highly effective collaboration, data accessibility has to be instant and intuitive. Researchers need the ability to tag, catalog, and search the metadata content of files via natural language, and be able to utilize files and content based on metadata fields instead of file names.”
Because biodata may be shared by scientists representing diverse disciplines, workflow characteristics may vary, tempting biodata managers to create discrete, purpose-tuned storage solutions. Dr. Wilson calls them “storage puddles.”
“These discrete storage puddles tend to deliver low capacity and performance utilization when considered in the aggregate, and they increase overall storage maintenance costs,” warns Dr. Wilson. “Moreover, data movement between storage puddles required for different stages in the data science workflow can be especially inefficient from a time and capacity perspective. Instead, use of a centralized parallel file system avoids the pitfalls of over-tuned and segmented storage while delivering high capacity and performance utilization.”
Ellis Wilson, Ph.D.
Software Architect, Panasas
Tiered Structures
Another challenge in maintaining efficient workflows is the need to move data between different tiers of storage. In general, the data that is used most frequently is stored most accessibly, and most expensively. “Economic concerns will increasingly push infrequently accessed data onto lower cost media tiers,” indicates a white paper from Spectra Logic, a company that builds backup and archive technology. “Just as water seeks its own level, data will seek its proper mix of access time and storage cost.”
Spectra Logic recognizes four tiers: tier 0, frequently accessed data on solid-state storage; tier 1, online data on magnetic disk; tier 2, nearline data on tape; and tier 3, offline data on tape. These tiers may apply to both enterprise- and cloud-based solutions.
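The “water seeks its own level” idea amounts to a policy that maps access patterns onto those tiers. The sketch below is a simplified illustration, not Spectra Logic’s software; the idle-time thresholds and file records are assumptions chosen only to show how data might drain from tier 0 toward tape as it cools.

```python
# Simplified sketch of a tiering policy: data migrates toward cheaper tiers
# as it goes unaccessed. The day thresholds are illustrative assumptions only.
from datetime import datetime, timedelta

TIERS = {
    0: "frequently accessed (solid state)",
    1: "online (magnetic disk)",
    2: "nearline (tape)",
    3: "offline (tape)",
}

def assign_tier(last_access: datetime, now: datetime) -> int:
    """Pick a tier from the number of days since the file was last touched."""
    idle = (now - last_access).days
    if idle <= 7:
        return 0        # hot data stays on solid state
    if idle <= 90:
        return 1        # warm data lives on disk
    if idle <= 365:
        return 2        # cool data moves to nearline tape
    return 3            # cold data is archived offline

now = datetime(2018, 4, 15)
for name, last in [("variants.vcf", now - timedelta(days=2)),
                   ("raw_reads.fastq", now - timedelta(days=400))]:
    tier = assign_tier(last, now)
    print(f"{name}: tier {tier} ({TIERS[tier]})")
```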
“If an organization stores raw data and believes re-compute is always possible, then the storage footprint could be relatively small as all the intermediary files and results can be thrown away,” notes Matt Starr, Spectra Logic’s chief technology officer. “But most biotech and bioinformatics organizations have found re-computing is not always possible, and the requirement to preserve those results is a mandate.
“So, the deployment of a tiered storage structure that enables fast access to data with the ability to move data off primary storage seamlessly to more economical (cost per gigabyte stored) data storage, such as scalable tape libraries, is optimal. This creates an automated archival system that can economically preserve scientific research and files for decades.”
Matt Starr
Chief Technology Officer, Spectra Logic
Society’s Genome
The need to accommodate the ceaseless movement of biodata along workflows and between storage tiers complicates data storage, but that movement is necessary if biodata is to deliver value. Another complication, beyond the scope of this brief article, is security: highly sensitive medical information, for example, must be protected. Yet another is biodata storage’s responsibility to preserve the part of the modern record for which the life sciences are responsible.
This last complication, in Mr. Starr’s words, is a matter of data storage genetic diversity. “Data storage genetic diversity requires that organizations store their irreplaceable data on two or more differing types of media in at least two geographical locations,” Mr. Starr maintains. “In this way, the genome of an organization can be preserved forever. The key to data’s survival, like the survival of a species, is diversity. Organizations need to be vigilant to enable data to persist: preserve a copy on tape, not just on disk, a copy far away, a copy online, and a copy offline.”
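Mr. Starr’s diversity rule reduces to a simple check on each dataset’s replicas: two or more media types, at least two sites, and both an online and an offline copy. The sketch below is a hypothetical illustration of that check; the replica records and field names are assumptions, not a description of any vendor’s product.

```python
# Hypothetical check of the "data storage genetic diversity" rule:
# irreplaceable data should sit on two or more media types, in at least
# two geographic locations, with both online and offline copies.
def is_diverse(replicas):
    media = {r["media"] for r in replicas}           # e.g., "disk", "tape"
    sites = {r["site"] for r in replicas}            # e.g., "Boulder", "Dublin"
    has_offline = any(not r["online"] for r in replicas)
    has_online = any(r["online"] for r in replicas)
    return len(media) >= 2 and len(sites) >= 2 and has_offline and has_online

replicas = [
    {"media": "disk", "site": "Boulder", "online": True},
    {"media": "tape", "site": "Dublin",  "online": False},
]
print(is_diverse(replicas))   # True: two media types, two sites, online and offline copies
```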