A flood of RNA sequencing (RNA-seq) data is already overwhelming existing systems for data analysis. And the waters are bound to keep rising, now that RNA-seq, the primary means of measuring gene expression, is increasingly seen as a tool not only for basic researchers, but also for medical practitioners.
Particularly challenging is the comparison of multiple RNA-seq datasets, including archived datasets, to detect changes in gene expression over time, or differences in gene expression that occur when disease strikes. Such comparisons, however valuable scientifically or clinically, are extremely time consuming, particularly if they depend on frequent reanalysis to capture fluctuations in gene activity.
To facilitate the analysis (and reanalysis) of RNA-seq datasets, computer scientists have been trying various ways to wring as much performance as possible out of data-analysis platforms. And now, one group of computer scientists, from Carnegie Mellon University and the University of Maryland, reports that it has developed a new computational method that dramatically speeds up estimates of gene expression.
With the new method, dubbed Sailfish after the famously speedy fish, estimates of gene expression that previously took many hours can be completed in a few minutes, with accuracy that equals or exceeds that of previous methods. The researchers’ new method was presented online April 20 in the journal Nature Biotechnology, in an article entitled “Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.”
The article’s authors emphasized that gigantic repositories of RNA-seq data now exist, making it possible to reanalyze experiments in light of new discoveries. “But 15 hours a pop really starts to add up, particularly if you want to look at 100 experiments,” said Carl Kingsford, Ph.D., an associate professor in CMU’s Lane Center for Computational Biology. “With Sailfish, we can give researchers everything they got from previous methods, but faster.”
The RNA-seq process results in short sequences of RNA, called “reads.” In previous methods, the RNA molecules from which they originated could be identified and measured only by painstakingly mapping these reads to their original positions in the larger molecules.
But the Carnegie Mellon and University of Maryland researchers realized that the time-consuming mapping step could be eliminated. Instead, they found they could allocate parts of the reads to different types of RNA molecules, much as if each read acted as several votes for one molecule or another.
In their article, the researchers explained how their approach works in terms of k-mers, which are nucleotide sequences of length k. “A key technical contribution behind our approach is the observation that transcript coverage can be accurately estimated using counts of k-mers occurring in reads instead of alignments of reads,” the authors wrote.
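To make the idea of a k-mer concrete, the short Python sketch below counts every substring of length k across a handful of reads. It only illustrates k-mer counting in general and is not the Sailfish implementation; the function name `count_kmers`, the toy reads, and the choice of k = 4 are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, k=4)

for kmer, n in sorted(kmer_counts.items()):
    print(kmer, n)
```

Counting of this kind needs only a single pass over each read and a table lookup per k-mer, which is one reason it can be much cheaper than aligning each read to a reference.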
“By working with k-mers, we can replace computationally intensive read mapping with the much faster and simpler process of k-mer counting,” the authors continued. “One can view the k-mer counting mechanism as a proportional assignment of a read to a set of potential loci, with the strength of the assignment varying with the number of k-mers in the read that match the locus.”
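The “proportional assignment” the authors describe can be pictured with a small Python sketch. It splits a single read across candidate transcripts according to how many of the read’s k-mers each transcript contains. This is a deliberately simplified illustration under stated assumptions, not the published Sailfish algorithm: the helper names `kmers` and `assign_read`, the transcript names `tx_A` and `tx_B`, the toy sequences, and k = 4 are all hypothetical.

```python
def kmers(seq, k):
    """Return the list of overlapping length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def assign_read(read, transcripts, k):
    """Split one read across transcripts in proportion to matching k-mers.

    Each k-mer in the read casts one vote for every transcript that
    contains it; the votes are then normalized to fractions.
    """
    read_kmers = kmers(read, k)
    votes = {}
    for name, seq in transcripts.items():
        transcript_kmers = set(kmers(seq, k))
        votes[name] = sum(1 for km in read_kmers if km in transcript_kmers)
    total = sum(votes.values())
    if total == 0:
        # No k-mer in the read matches any transcript.
        return {name: 0.0 for name in transcripts}
    return {name: v / total for name, v in votes.items()}


# Toy transcripts and read, invented for illustration only.
transcripts = {"tx_A": "ACGTACGGA", "tx_B": "TTACGGATC"}
weights = assign_read("GTACGGA", transcripts, k=4)
print(weights)
```

In this toy case all four k-mers of the read occur in `tx_A` and three of them occur in `tx_B`, so the read is shared between the two rather than being forced onto a single location, as an alignment would do.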
By avoiding the time-consuming step of read mapping, the authors reported, Sailfish is able to provide quantification estimates 20–30 times faster than many current methods without loss of accuracy.
The researchers’ numerical approach might not be as intuitive to a biologist as a map, but it makes perfect sense to a computer scientist, declared Dr. Kingsford, who added that the Sailfish method is more robust, that is, better able to tolerate errors in the reads or differences between individuals’ genomes. These errors can prevent some reads from being mapped, he explained, but the Sailfish method can make use of all the RNA read “votes,” which improves the method’s accuracy.