One solution to the short-read problem is to produce longer DNA sequences. Third-generation sequencers can directly read a single DNA molecule and reportedly provide a clearer view of genomic organization and content. Although these instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly, the error rates of single-molecule reads are high, approaching 15%.
Novel approaches to producing more accurate genomes may include combining short- and long-read sequences. Pacific Biosciences’ PacBio® RS High-Resolution Genetic Analyzer, a third-generation sequencer, uses molecular sequencing techniques and advanced analytics, according to the company. The sequencer produces reads up to 100 times longer than those of other technologies, thus reportedly providing a more complete picture of genome structure than second-generation technology.
However, the longer sequences, while potentially useful in solving assembly and finishing problems, come from single-pass reads in which every eighth or ninth base is incorrect, an error rate of roughly 11% to 12.5%.
To get around the incorrect-base problem, Dr. Schatz, an assistant professor at Cold Spring Harbor Laboratory, and colleagues at the National Biodefense Analysis and Countermeasures Center and the University of Maryland worked with PacBio to develop a correction algorithm for the longer sequences generated by third-generation sequencers. Their assembly strategy uses short, high-fidelity sequences to correct the errors in single-molecule sequences, an approach they dubbed “hybrid error correction.”
Dr. Schatz explained, “The longer read lengths have fundamentally more information than the short reads: infinite coverage with short reads simply won’t be enough for resolving really complex regions, but just a few long reads in the right spot can solve them. The same is true for phasing haplotypes in the presence of heterozygosity or identifying proper transcript isoforms in the presence of alternative splicing.”
The scientists used sequences generated by more conventional Illumina technology to help correct the mistakes in the single-molecule reads. The result is “substantially better” than using Pacific Biosciences’ technology alone, Dr. Schatz said. “The data is basically perfect.”
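To make the idea of hybrid error correction concrete, here is a minimal Python sketch. It is not the researchers' algorithm: it assumes the short reads have already been aligned to the long read (the offsets are supplied by hand), it handles only substitution errors and ignores the insertions and deletions that a real pipeline must deal with, and the function name `correct_long_read` and the `min_coverage` parameter are hypothetical. It simply shows how accurate short reads can outvote an error-prone long read at each position.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads, min_coverage=2):
    """Correct a noisy long read by majority vote over aligned short reads.

    aligned_short_reads is a list of (offset, sequence) pairs, where offset
    is the assumed start position of the short read on the long read.
    Substitutions only; this toy model has no insertions or deletions.
    """
    # One vote counter per position of the long read.
    votes = [Counter() for _ in long_read]
    for offset, short_read in aligned_short_reads:
        for i, base in enumerate(short_read):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original_base, counter in zip(long_read, votes):
        if sum(counter.values()) >= min_coverage:
            # Enough short-read coverage: trust the most common base.
            corrected.append(counter.most_common(1)[0][0])
        else:
            # Too little coverage: keep the long read's own base call.
            corrected.append(original_base)
    return "".join(corrected)


if __name__ == "__main__":
    # Made-up data: the true sequence is ACGTACGTAC, and the long read
    # carries two substitution errors (positions 4 and 9).
    noisy_long_read = "ACGTTCGTAA"
    short_reads = [(0, "ACGTAC"), (2, "GTACGT"), (4, "ACGTAC"),
                   (0, "ACGTA"), (5, "CGTAC")]
    print(correct_long_read(noisy_long_read, short_reads))
```

Positions covered by fewer than `min_coverage` short reads are left untouched, which is one simple way to avoid "correcting" a base on the strength of a single read.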
The scientists showed that the approach could successfully be used on reads generated by a PacBio RS instrument from phage, prokaryotic, and eukaryotic whole genomes, including the previously unassembled genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. The scientists reported that their long-read correction achieved >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies. In the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies.
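For readers unfamiliar with contig statistics, the following Python sketch shows how a median contig size, and the related N50 value often quoted for assemblies, might be computed and compared between two assemblies. The contig lengths are invented for illustration and are not data from the study, and the function name `contig_stats` is hypothetical.

```python
from statistics import median


def contig_stats(contig_lengths):
    """Return the median contig length and the N50 of an assembly.

    N50 is the length of the contig at which the running total, taken
    from the longest contig downward, first reaches half of the total
    assembly length. Expects a non-empty list of lengths.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running >= half_total:
            n50 = length
            break
    return {"median": median(lengths), "n50": n50}


if __name__ == "__main__":
    # Invented contig lengths (in bases), for illustration only.
    short_read_assembly = [100, 200, 300, 400, 500]
    hybrid_assembly = [500, 1000, 1500, 2000, 2500]

    before = contig_stats(short_read_assembly)
    after = contig_stats(hybrid_assembly)
    print(before, after)
    print("median fold change:", after["median"] / before["median"])
```

With these made-up numbers the median contig length rises fivefold, which is the kind of comparison behind a statement that median contig size was quintupled.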
As highlighted in two additional papers using similar hybrid strategies, the long read lengths have made automated, single-contig bacterial chromosome assemblies a reality.