Reducing Data Size
Big data, very loosely defined and depending on whom you ask, means the tools, processes, and procedures allowing an organization to create, manipulate, and manage very large datasets and storage facilities. Some datasets have grown so gigantic, on the order of terabytes, exabytes, and zettabytes, that they have created capture, storage, search, sharing, analytics, and visualization issues beyond the current capacity of any one organization to manage.
Part of the problem, some say, is that scientists are pack rats, unable to part with any data, ever. Novel tools and approaches are sorely needed to analyze all this stuff and make it useful. Because a single week-long sequencing run can now produce as much data as entire genome centers did a few years ago, processing terabytes of information has become a requirement for many labs engaged in genomic research, Illumina’s Chief Information Officer, Scott Kahn, noted in an article in Science.
“We at Illumina see genomics big data as more of a challenge than a problem,” Kahn told GEN. “We are pushed to create more sequences, more cheaply, more quickly, thereby producing more and more data to analyze.
“Our approach is to attack the issue on a couple of fronts. The first is to try to engineer more into our instruments so they can process but not remove any information content. We have moved from offline analysis to online analysis and now to offline analysis that can happen on the cloud.”
Illumina introduced a software capability called RTA two years ago. It replaced the need to bring image data off the machines with having the bases or reads come off the machine directly, Kahn explained. “This effected an order of magnitude of reduction in data, but the information remained intact—you didn’t lose anything.
“We then launched BaseSpace to use the network to do processing in the cloud and to aggregate the data centrally,” Kahn said. BaseSpace is Illumina’s next-generation sequencing cloud computing environment for analyzing, archiving, and sharing MiSeq data.
“Variants, for example, are 10 or 20 times smaller than the raw data,” Kahn continued. “So if you know what you are looking for, you can automatically deploy a workflow application that will reduce the size of the data and retain the information that you need.” For example, he said, if you just want to look at variations in the genome, and just store the variations, you reduce the size of the data. “If you want to go back to the raw data, you’d have to resequence. As the cost of resequencing gets lower and lower, it becomes cheaper to resequence rather than store data.”
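The idea of storing only the variations can be illustrated with a minimal sketch. It is not Illumina's workflow or any BaseSpace application; the function names, the toy sequences, and the restriction to single-base substitutions in equal-length sequences are all assumptions made for illustration. Note that it recovers only the called sequence. As Kahn points out, the raw data itself is gone unless you resequence.

```python
from typing import List, Tuple

# Hypothetical representation: (0-based position, observed base).
Variant = Tuple[int, str]


def call_variants(reference: str, sample: str) -> List[Variant]:
    """Return the positions where the sample differs from the reference.

    Assumption: both sequences have the same length, so only single-base
    substitutions are handled (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def reconstruct(reference: str, variants: List[Variant]) -> str:
    """Rebuild the sample sequence from the reference plus stored variants."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


# Toy data, invented for the example.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACCTACGTACGTAAGT"

variants = call_variants(reference, sample)
print(variants)                                   # differing positions only
print(reconstruct(reference, variants) == sample) # called sequence is recoverable
```

Here the stored record is two small tuples instead of the full 20-base sequence. The saving in a real pipeline depends on how many variants there are and how they are encoded.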
Illumina, he said, recognized that “where sequencing is going requires a large engagement of the community.” He explained that the company started a group called the Genome Informatics Alliance (GIA). “This alliance, by engaging the participation of a variety of informatics scientists, genomics scientists, and other interested parties, allows us to bring together an otherwise disparate community to discuss topics that they could contribute to and address. Over four years, it has worked out nicely. We can work on problems well in advance through anticipating them and planning for them a little better.”