Next-generation sequencing arrives with an explosion in the volume of data and at the same time brings considerable challenges for information management. As Ron Ranauro, president and CEO of GenomeQuest, explained, we are currently witnessing “a cycle going on between biological science and computer science that’s unlike anything that came before.”
At the Cambridge Healthtech conference, GenomeQuest will feature a web-based informatics service that provides customers with large-scale computational resources and algorithms on demand. The platform will perform “all-against-all” exhaustive sequence comparisons between sequence reads and reference data, ultimately aiming to provide customers with “the most complete results and therefore the most trustable finds,” points out Ranauro, adding that such an extensive comparison “will, in essence, purify the sample in silico.”
Another important goal, continues Ranauro, is to perform this task “in a time period that is a fraction of the time it took to actually produce the data set in the first place.” Critical components of the platform will accomplish this by providing tools to analyze, share, and archive the vast amounts of data and to access them through a web browser, thereby combining the benefits of local access and central management.
The vast amount of information generated by next-generation sequencers also comes with a catch: a significant proportion of the output is useless, and distinguishing between good and bad data is not just an important task but a veritable challenge.
“How do you judge which reads are good, which reads are bad?” asks Anton Nekrutenko, Ph.D., associate professor at the Center for Comparative Genomics and Bioinformatics at Penn State University. In collaboration with James Taylor, Ph.D., from New York University, he developed GalaxySR, the first freely available open-source system for short reads, which he will present at the conference.
The software, which is free and requires only a web browser, is able to perform several quality-control steps even before the sequencing data are downloaded. This platform will make it “as simple as possible to go from the actual sequencing machine to some interpretable results,” emphasizes Dr. Nekrutenko.
In a recent experiment performed to validate this platform, he set out to determine whether two geographic locations can be told apart by collecting the flies that accumulated on his windshield and examining the nature and abundance of the short reads obtained from them with a 454 analyzer.
Reading counts “is a tricky thing,” warns Dr. Nekrutenko, because the counts depend on several variables, such as sample preparation and DNA concentration. Eukaryotic metagenomics is one immediate area for which he envisions exciting applications of GalaxySR.