Combination of long and short reads corrects base-call errors and enables assembly of complex vertebrate genomes.
Two independent research teams report on the development of software they claim can correct the high rate of base-call inaccuracies that currently restrict the use of third-generation single-molecule sequencers such as the PacBio RS system, and enable, for the first time, the use of long-read sequences in the assembly of higher-organism genomes. The approach, reported in two separate papers in Nature Biotechnology, uses data from short, high-fidelity sequences generated by second-generation technologies to correct the 15% or so error rate in long reads, resulting in a read accuracy of over 99.9%.
One of the research teams, headed by Cold Spring Harbor Laboratory researcher Michael Schatz, Ph.D., and Sergey Koren, Ph.D., and Adam Phillippy, Ph.D., at the National Biodefense Analysis and Countermeasures Center, describes using the correction algorithm and assembly strategy to assemble phage, prokaryotic, and eukaryotic genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus. Their published paper is titled “Hybrid error correction and de novo assembly of single-molecule sequencing reads.”
Pacific BioSciences’ Eric Schadt, Ph.D., and colleagues at Mount Sinai School of Medicine describe the use of the hybrid correction approach to complete the genome of the cholera strain responsible for the 2010 Haitian outbreak. This paper is titled “A hybrid approach for the automated finishing of bacterial genomes.”
The process of constructing whole chromosomes or genomes by piecing together overlapping stretches of short DNA reads into longer contiguous sequences (contigs) becomes more efficient as individual read lengths increase. In theory, this process should be greatly enhanced by single-molecule sequencing technologies that can generate reads 100x longer. In practice, however, third-generation sequencing is hampered because up to 15% of the bases sequenced may be called incorrectly, making it far harder to determine whether two separate reads actually overlap.
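The effect of a 15% error rate on overlap detection can be illustrated with some simple arithmetic. The sketch below is not taken from either paper: it assumes base-call errors are independent and equally likely at every position, which is a simplification, and the function names and the 14-base seed length are hypothetical choices made only for illustration.

```python
def error_free_kmer_probability(error_rate: float, k: int) -> float:
    """Chance that k consecutive bases in one read are all called correctly,
    assuming independent errors at every position (a simplification)."""
    return (1.0 - error_rate) ** k


def expected_pairwise_agreement(error_rate: float) -> float:
    """Chance that a given base is called correctly in BOTH of two reads
    covering the same position, under the same independence assumption."""
    return (1.0 - error_rate) ** 2


if __name__ == "__main__":
    seed_length = 14  # hypothetical exact-match seed length, for illustration only
    for label, rate in [("raw long read (15% error)", 0.15),
                        ("corrected read (0.1% error)", 0.001)]:
        print(label)
        print("  error-free seed probability:",
              round(error_free_kmer_probability(rate, seed_length), 3))
        print("  agreement between two reads:",
              round(expected_pairwise_agreement(rate), 4))
```

Under these assumptions, two raw long reads covering the same stretch of genome agree at only about 72% of positions, and an exact 14-base match survives in a single raw read only about one time in ten, which is why overlaps between uncorrected long reads are hard to recognize.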
The algorithm developed by the CSHL-led investigators, which they called PBcR (PacBio-corrected Reads), addresses this by mapping much shorter but more accurate reads generated using second-generation technology to the error-prone long reads generated on the PacBio RS sequencer. The corrected long reads are then run through an assembly program such as the Celera Assembler. The result, the investigators claim, is greatly improved assembly quality when compared with first- or second-generation sequenced reads, and up to five times longer median contig size. Moreover, they suggest, if read lengths continue to increase with advancing technology, the application of error correction could make the prospect of single-contig bacterial chromosome assembly a reality.
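The general idea of correcting a long read with short reads can be sketched in a few lines. This is a toy illustration only and not the PBcR implementation: it assumes the long read contains substitution errors but no insertions or deletions, places each short read by brute-force ungapped comparison, and takes a simple majority vote per position. The function names and the mismatch threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def best_offset(long_read: str, short_read: str) -> Tuple[int, int]:
    """Return (offset, mismatches) for the ungapped placement of short_read
    on long_read that has the fewest mismatches (first one wins on ties)."""
    best = (0, len(short_read) + 1)
    for offset in range(len(long_read) - len(short_read) + 1):
        window = long_read[offset:offset + len(short_read)]
        mismatches = sum(a != b for a, b in zip(window, short_read))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


def correct_long_read(long_read: str, short_reads: List[str],
                      max_mismatch_fraction: float = 0.3) -> str:
    """Replace each base of long_read with the majority base among the short
    reads placed over it; positions with no coverage keep the original base."""
    votes = [Counter() for _ in long_read]
    for short_read in short_reads:
        if len(short_read) > len(long_read):
            continue  # cannot be placed without gaps
        offset, mismatches = best_offset(long_read, short_read)
        if mismatches > max_mismatch_fraction * len(short_read):
            continue  # placement too poor to trust (hypothetical threshold)
        for i, base in enumerate(short_read):
            votes[offset + i][base] += 1
    corrected = []
    for original_base, column in zip(long_read, votes):
        corrected.append(column.most_common(1)[0][0] if column else original_base)
    return "".join(corrected)


if __name__ == "__main__":
    truth = "ACGTTGCAAGGCTTACGATC"
    noisy = "ACGATGCAAGGCGTACGATC"  # substitutions at positions 3 and 12
    shorts = [truth[i:i + 8] for i in (0, 2, 4, 8, 10, 12)]
    print(correct_long_read(noisy, shorts) == truth)
```

A real pipeline has to handle insertions and deletions, repeats that attract short reads to the wrong location, and uneven coverage, so this sketch shows only the consensus step in its simplest form.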
In their published paper, Dr. Schatz et al. describe how hybrid error correction improved the accuracy of long-read data sets and led to substantially better assemblies of bacterial and eukaryotic genomes than any other sequencing strategy tested. The degree of improvement correlated with the median length of the corrected reads, such that the longer reads generated using third-generation sequencing improved assemblies to a greater degree than the shorter reads of older technologies.
“The observed gains are striking because they were entirely a result of resolving repeat structures rather than closing so-called sequencing gaps in the short-read coverage,” they write. “This was due to the PBcR reads’ unique ability to close difficult gaps left by second-generation technologies, such as interspersed, inverted, and complex tandem repeats, which can be difficult to assemble even with paired ends.”
The team then applied the PBcR approach to assemble the Melopsittacus undulatus parrot genome, which hadn’t previously been sequenced, including the regulatory regions of genes involved in vocal learning circuits. The team claims the assembly built from the hybrid reads represents the most complete assembled bird genome now available.
The ability to apply a hybrid de novo assembly procedure to resolve complex genome regions with repeats was separately demonstrated by the Pacific BioSciences work with the Haitian cholera genome. “Our hybrid de novo assembly protocol should be applicable for completing the genomes of currently incomplete bacterial genomes in GenBank, as well as for generating complete genomes of bacteria that have not yet been sequenced,” they write. “This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.”
Reporting all their experimental data in Nature Biotechnology, the CSHL researchers claim their results demonstrate that high error rates no longer need to represent a barrier to assembly. “High-error, long reads can be efficiently assembled in combination with complementary short-reads to produce assemblies not possible with any prior technology, bringing us one step closer to the goal of ‘one chromosome, one contig,’” they write. “If single-molecule technology continues to advance and reads begin to exceed the lengths of typical bacterial repeats (~6 Kbp) at reasonable cost and throughput, single-contig assemblies of some bacterial chromosomes will be possible without the need for expensive pair libraries. Additionally, we believe many long-sought capabilities will be enabled, such as haplotype separation in eukaryotes, accurate transcriptome annotation, and true comparative genomics that extends beyond an exon-centric view to include the whole genome.”