Back in 2008, Martin Wattenberg, a mathematician and computer scientist at IBM’s Watson Research Center, said that the biggest challenge of the Petabyte Age wouldn’t be storing all that data but figuring out how to make sense of it. At 200 terabytes, or the equivalent of 16 million file cabinets filled with text or more than 30,000 standard DVDs, the current 1000 Genomes project dataset is a prime example of this problem.
Datasets have become so massive that few researchers have the computing power to make best use of them. On March 29, Amazon Web Services and the NIH jointly announced that the 1000 Genomes project, the world’s largest set of data on human genetic variation, is now publicly available on the Amazon Web Services (AWS) computing cloud.
This public-private collaboration demonstrates the kind of solutions that may emerge from the $200 million Big Data Research and Development Initiative launched by President Barack Obama. AWS is hosting the 1000 Genomes project as a publicly available dataset for free, and researchers will pay for the computing services that they use.
Reducing Data Size
Big data, very loosely defined and depending on whom you ask, means the tools, processes, and procedures allowing an organization to create, manipulate, and manage very large datasets and storage facilities. Some datasets have grown so gigantic, on the order of terabytes, exabytes, and zettabytes, that they have created capture, storage, search, sharing, analytics, and visualization issues beyond the current capacity of any one organization to manage.
Part of the problem, some say, is that scientists are pack rats, unable to part with any data, ever. Novel tools and approaches are sorely needed to analyze all this stuff and make it useful. Because a single week-long sequencing run can now produce as much data as entire genome centers did a few years ago, processing terabytes of information has become a requirement for many labs engaged in genomic research, Illumina’s Chief Information Officer, Scott Kahn, noted in an article in Science.
“We at Illumina see genomics big data as more of a challenge than a problem,” Kahn told GEN. “We are pushed to create more sequences, more cheaply, more quickly, thereby producing more and more data to analyze.
“Our approach is to attack the issue on a couple of fronts. The first is to try to engineer more into our instruments so they can process but not remove any information content. We have moved from offline analysis to online analysis and now to offline analysis that can happen on the cloud.”
Illumina introduced a software capability called RTA two years ago. Instead of image data having to be brought off the machines, the bases, or reads, now come off the machine, Kahn explained. “This effected an order of magnitude of reduction in data, but the information remained intact; you didn’t lose anything.
“We then launched BaseSpace to use the network to do processing in the cloud and to aggregate the data centrally,” Kahn said. BaseSpace is Illumina’s next-generation sequencing cloud computing environment for analyzing, archiving, and sharing MiSeq data.
“Variants, for example, are 10 or 20 times smaller than the raw data,” Kahn continued. “So if you know what you are looking for, you can automatically deploy a workflow application that will reduce the size of the data and retain the information that you need.” For example, he said, if you just want to look at variations in the genome, and just store the variations, you reduce the size of the data. “If you want to go back to the raw data, you’d have to resequence. As the cost of resequencing gets lower and lower, it becomes cheaper to resequence rather than store data.”
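To make the idea concrete, here is a minimal Python sketch of storing only the variations. It is a toy illustration, not Illumina's workflow: the function names `call_variants` and `reconstruct` and the short sequences are hypothetical, and the sketch assumes equal-length sequences that differ only by single-base substitutions. Real variant calling works from aligned reads and also handles insertions and deletions.

```python
def call_variants(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def reconstruct(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the sample sequence by applying the stored variants to the reference."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Hypothetical toy data: 20 bases, two substitutions.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACGAACGTACGTACCT"

variants = call_variants(reference, sample)
print(variants)                                    # [(7, 'A'), (18, 'C')]
print(reconstruct(reference, variants) == sample)  # True
```

Two small records stand in for the 20-base sequence here. What the sketch cannot recover is anything that was never stored, such as the original reads behind each call, which is why going back to the raw data means resequencing.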
Illumina, he said, recognized that “where sequencing is going requires a large engagement of the community.” He explained that the company started a group called the Genome Informatics Alliance (GIA). “This alliance, by engaging the participation of a variety of informatics scientists, genomics scientists, and other interested parties, allows us to bring together an otherwise disparate community to discuss topics that they could contribute to and address. Over four years, it has worked out nicely. We can work on problems well in advance through anticipating them and planning for them a little better.”
Compressing as a Solution
The inability to digest big data may compromise its utility in medicine, commented David Haussler, Ph.D., director of the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz. “Data handling is now the bottleneck,” Dr. Haussler told The New York Times. “It costs more to analyze a genome than to sequence a genome.”
One approach to making the size of the data more manageable and amenable to analysis is to compress it. Dr. Haussler’s Santa Cruz group is participating in a study of different data-compression methodologies. Dr. Haussler noted, however, that “it’s clear that you can’t compress it without losing some information.
“It’s a matter of whether you really need that information or not. Maybe we don’t need to keep as much as we are keeping now, but we need to make sure our analyses work just as well on the compressed data.”
Limitations of the Cloud
The problem is that sequencing data is growing exponentially faster than computing power per dollar, Bonnie Berger, Ph.D., told GEN. “With the advent of next-gen sequencing, the size of the data is going up by a factor of 10 every year while processing power is going up by a factor of two every year, and computer power won’t keep pace with data size.”
The key thing, she said, is that “people suggest that cloud computing will solve the problem. However, it doesn’t change the problem that the data is increasing exponentially faster than computing power per dollar.”
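The arithmetic behind Dr. Berger's point can be sketched in a few lines of Python. The two growth factors come from her quote; treating them as constant from year to year, and the names `DATA_GROWTH`, `COMPUTE_GROWTH`, and `gap_after`, are assumptions made only for this illustration.

```python
# Growth factors per year, taken from Dr. Berger's quote.
# Holding them constant over time is an assumption of this sketch.
DATA_GROWTH = 10
COMPUTE_GROWTH = 2


def gap_after(years: int) -> float:
    """How many times further data has outgrown computing power after `years` years."""
    return DATA_GROWTH ** years / COMPUTE_GROWTH ** years


for year in range(6):
    print(
        f"year {year}: data x{DATA_GROWTH ** year:>6}, "
        f"compute x{COMPUTE_GROWTH ** year:>2}, gap x{gap_after(year):>6.0f}"
    )
```

Under these assumed rates the gap widens fivefold every year and reaches 3,125-fold after five years. Because the comparison is per dollar, moving the same work to rented machines does not by itself narrow it.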
Dr. Haussler also said that cloud tools need development, and currently “there’s no definitive large-scale plan. There are some obstacles in terms of patient privacy and policy issues and other issues of design. If data is spread over several locations and you want to pull thousands of genomes together to compare and analyze them, it’s too expensive to move them, and it’s difficult to run the comparison process when they are kept separate. We are all trying to discover the optimum ways to compare and analyze thousands of genomes.”
Improving Analytical Software
The only solution, according to Dr. Berger, is to “discover fundamentally better algorithms for data processing and sublinear algorithms that work faster and scale so their cost doesn’t explode as the size of the databases increase.”
Mere compression is not the answer, she added, because “eventually you have to look at the data. We need to develop better algorithms that allow us to use the natural redundancy in the data and can operate directly on the compressed data. Much of the new data is similar, and the question becomes how we can take advantage of its inherent redundancy.”
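A toy example may help show what operating directly on compressed data means. The Python sketch below uses run-length encoding, chosen here only because it is simple; it is not the method Dr. Berger's group uses, and the function names `rle_encode` and `count_base` and the sample sequence are hypothetical. The point is that a question about the data, in this case how often a base occurs, can be answered from the compressed form without expanding it.

```python
from itertools import groupby


def rle_encode(sequence: str) -> list[tuple[str, int]]:
    """Compress a sequence into (base, run_length) pairs."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(sequence)]


def count_base(encoded: list[tuple[str, int]], base: str) -> int:
    """Count occurrences of a base by visiting runs, not individual characters."""
    return sum(length for b, length in encoded if b == base)


# Hypothetical, highly repetitive toy sequence of 30 bases.
sequence = "A" * 8 + "C" * 4 + "G" * 12 + "T" * 2 + "A" * 4

encoded = rle_encode(sequence)
print(encoded)                   # [('A', 8), ('C', 4), ('G', 12), ('T', 2), ('A', 4)]
print(count_base(encoded, "A"))  # 12
```

The count touches five runs rather than 30 characters, so the work scales with the size of the compressed data. The more redundant the input, the larger that saving becomes.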
As NIH director Francis Collins said, “The explosion of biomedical data has already significantly advanced our understanding of health and disease. Now we want to find new and better ways to make the most of this data to speed discovery, innovation, and improvements in the nation’s health and economy.” But achieving all of that will take a lot of computing power, inventive software, and cooperation and coordination among multiple scientific disciplines. The availability of big data is only the beginning.