In the days of Sanger sequencing, genome assemblies were very high quality but tremendously expensive to finish completely. With the proliferation of short-read sequencers, costs per base dropped precipitously, but assembly quality fell, too: the number of contigs increased significantly, and repeats, segmental duplications, and gene families became more difficult to assemble correctly. Finished genomes are important for a full understanding of an organism, for serving as a reliable reference genome, and for accurately comparing one organism to another. For example, the analysis and tracking of anthrax strains from the 2001 bioterrorist attack were aided by recognizing larger-scale insertions, deletions, and tandem repeat structures, not only by measuring single nucleotide variations.1
In several recent studies, teams of scientists set out to determine whether long reads generated by the PacBio RS instrument could bring back the days of gold-standard closed genomes without the prohibitive expense of Sanger sequencing. One such effort was a genome assembly project led by Adam Phillippy and Sergey Koren at the National Biodefense Analysis and Countermeasures Center and Michael Schatz at Cold Spring Harbor Laboratory.2 Koren et al. updated the Celera® Assembler program to work with the long reads produced by the PacBio RS and, in the process, realized that the long-range information in these reads would help them build higher-quality, cleaner genome assemblies.
The team's breakthrough is an error correction pipeline that uses high-accuracy short reads to correct the long reads and then runs the corrected reads through the updated Celera Assembler to generate a high-quality assembly. The paper reports that, with this pipeline, read accuracy is better than 99.9% and median contig sizes double compared to short-read assemblies. In two other publications, researchers used similar approaches to generate complete, finished genomes in a fully automated assembly pipeline.3,4
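The idea of correcting a long read with accurate short reads can be illustrated with a toy majority vote. This is only a minimal sketch, not the pipeline from the paper: it assumes the short reads have already been placed on the long read without gaps, and the function name `correct_long_read` and the `(offset, sequence)` placement format are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, placements):
    """Toy consensus correction (illustrative assumption, not the published method).

    long_read  -- the noisy long read as a string of bases
    placements -- list of (offset, short_read) pairs; each short read is
                  assumed to be placed on the long read without gaps
    """
    # One vote counter per position of the long read.
    votes = [Counter() for _ in long_read]
    for offset, short_read in placements:
        for i, base in enumerate(short_read):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original_base, counter in zip(long_read, votes):
        if counter:
            # Take the base most short reads agree on.
            corrected.append(counter.most_common(1)[0][0])
        else:
            # No short-read evidence: keep the original base.
            corrected.append(original_base)
    return "".join(corrected)


noisy = "ACGTTCGA"  # position 3 is wrong in this made-up example
placements = [(0, "ACGAT"), (2, "GATCG"), (3, "ATCGA")]
print(correct_long_read(noisy, placements))
```

A real corrector also has to handle insertions and deletions, which this ungapped sketch deliberately ignores.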
Koren et al. also evaluated which short-read platform worked best in conjunction with the long-read data, but they ended up without a strong preference. Whatever the platform, they recommend that users of the pipeline have 25x to 50x short-read coverage, and then add in “even moderate coverage” of PacBio RS long reads.
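As a rough guide to what 25x to 50x means in practice, coverage is simply the total number of sequenced bases divided by the genome size. The sketch below works through that arithmetic; the 5 Mb genome and 100 bp read length are made-up example values, and the helper names are hypothetical.

```python
import math


def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """Number of reads of a fixed length needed to reach a target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 5 Mb bacterial genome and 100 bp short reads.
GENOME_SIZE = 5_000_000
READ_LENGTH = 100

for target in (25, 50):
    n = reads_needed(target, GENOME_SIZE, READ_LENGTH)
    print(f"{target}x coverage needs about {n:,} reads of {READ_LENGTH} bp")
```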
One complex problem was aligning short reads when the long read consisted primarily of repetitive sequence. Copies of a repeat often share more than 99% sequence similarity, which makes it very difficult to decide which copy a short read truly belongs to. The team designed techniques to deal with this by evaluating the top alignment candidates for every short read, and then carefully assessing the alignment coverage to determine the best match.
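The article does not spell out how the candidates and coverage are weighed, so the following is only one plausible heuristic, shown to make the idea concrete. It assumes each short read comes with a list of `(target_id, identity)` candidates, keeps the best `top_k` by identity, and then prefers the candidate target that has received the fewest reads so far, so that near-identical repeat copies do not pile all reads onto a single copy. All names here are hypothetical.

```python
from collections import defaultdict


def assign_reads(candidates, top_k=3):
    """Pick one target per short read (illustrative heuristic only).

    candidates -- dict mapping read_id to a list of (target_id, identity)
    top_k      -- how many of the best-scoring candidates to consider
    """
    depth = defaultdict(int)  # reads assigned to each target so far
    assignment = {}
    for read_id in sorted(candidates):  # fixed order keeps results reproducible
        top = sorted(candidates[read_id], key=lambda c: c[1], reverse=True)[:top_k]
        if not top:
            continue  # read has no candidate alignment
        # Prefer the least-covered target; break ties by higher identity.
        target, _ = min(top, key=lambda c: (depth[c[0]], -c[1]))
        assignment[read_id] = target
        depth[target] += 1
    return assignment


example = {
    "r1": [("A", 0.995), ("B", 0.990)],
    "r2": [("A", 0.995), ("B", 0.990)],
    "r3": [("A", 0.999)],
}
print(assign_reads(example))
```

In this made-up example, `r1` and `r2` align almost equally well to both repeat copies, so the heuristic splits them between `A` and `B` instead of sending both to the marginally better-scoring copy.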
Koren et al. also showed that single-molecule sequencing has advantages beyond genome assembly, presenting a preliminary analysis of corn transcriptome data generated by the Joint Genome Institute. In that work, they demonstrate that alternative splicing can be read directly off the sequence data. The long PacBio RS reads therefore make possible several applications that would not otherwise be feasible.