Novel approaches to getting to more accurate genomes may include combining short and long read sequences. [Andrzej - Fotolia.com]
With advances in second-generation sequencing technologies, genome studies have produced an explosion of sequence data at a fraction of earlier costs.
The lowest-cost technology can now generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consists of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences require de novo assembly before most genome analyses can begin. Scientists say that genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information.
Short sequences, researchers report, must be mapped to unique positions in reference genomes. Several factors complicate that mapping: reads often have sequencing errors, the reference genome has repetitive elements, and the orientation of a read relative to the reference genome is not known.
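To make those three complications concrete, here is a minimal Python sketch of a brute-force read mapper. It is an illustration only, not how production aligners work: the function names and the `max_mismatches` parameter are hypothetical, and only substitution errors are modeled. Because the read's orientation is unknown, both the read and its reverse complement are tried; a read that falls in a repeat returns more than one hit.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return every (position, strand, mismatches) hit for a read.

    Hypothetical toy mapper: slides the read, and then its reverse
    complement, along the reference and keeps ungapped placements
    with at most max_mismatches substitutions.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for pos in range(len(reference) - len(query) + 1):
            mm = hamming(query, reference[pos:pos + len(query)])
            if mm <= max_mismatches:
                hits.append((pos, strand, mm))
    return hits
```

A read with exactly one hit is uniquely mappable; a read drawn from a repeated element comes back with several equally good positions, which is the ambiguity described above.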
According to Michael Schatz, Ph.D., of Cold Spring Harbor Laboratory (CSHL), “Short-read sequencing is excellent for producing high-quality deep coverage of small to large genomes.” However, he said, “The short read length limits its capability to resolve complex regions with repetitive or heterozygous sequences.” As a result, important biological sequences like genes or promoter regions are often highly fragmented when assembled from short reads. “The short read length also makes other computations like sequencing entire RNA transcripts or entire 16S rRNA gene sequences in metagenomics projects difficult or impossible.”
One solution to the short-read problem is to produce longer DNA sequences. Third-generation sequencers, which can directly read a single DNA molecule, reportedly provide a clearer view of genomic organization and content. Although these instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly, error rates of single-molecule reads are high, approaching 15%.
Novel approaches to getting to more accurate genomes may include combining short- and long-read sequences. Pacific Biosciences’ PacBio® RS High-Resolution Genetic Analyzer, a third-generation sequencer, uses molecular sequencing techniques and advanced analytics, according to the company. The sequencer produces much longer reads than other technologies, up to 100 times longer, thus reportedly providing a more complete picture of genome structure than second-generation technology.
However, while the longer sequences are potentially useful in solving assembly and finishing problems, the single-pass reads have roughly every eighth or ninth base incorrect.
To get around the incorrect base problem, Dr. Schatz, an assistant professor at CSHL, and colleagues at the National Biodefense Analysis and Countermeasures Center and the University of Maryland worked with PacBio to develop a correction algorithm for the longer sequences generated by third-generation sequencers. Their assembly strategy uses short, high-fidelity sequences to correct the errors in single-molecule sequences, an approach they dubbed “hybrid error correction.”
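The general idea of correcting a noisy long read with accurate short reads can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the researchers' published algorithm: it handles substitution errors only, places each short read by ungapped best match rather than true alignment, and takes a simple majority vote per position. The names `best_offset`, `hybrid_correct`, and `max_mismatch_frac` are hypothetical.

```python
from collections import Counter


def best_offset(short, long_read):
    """Find the ungapped placement of short on long_read with fewest mismatches."""
    best_pos, best_mm = 0, len(short) + 1
    for pos in range(len(long_read) - len(short) + 1):
        window = long_read[pos:pos + len(short)]
        mm = sum(a != b for a, b in zip(short, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def hybrid_correct(long_read, short_reads, max_mismatch_frac=0.3):
    """Toy hybrid correction: short reads vote on each long-read base.

    Assumption: errors are substitutions only, so positions line up
    without gaps. Short reads that match too poorly are ignored.
    """
    votes = [Counter() for _ in long_read]
    for short in short_reads:
        pos, mm = best_offset(short, long_read)
        if mm > max_mismatch_frac * len(short):
            continue  # placement too unreliable to trust
        for i, base in enumerate(short):
            votes[pos + i][base] += 1
    corrected = []
    for original, column in zip(long_read, votes):
        # Keep the long-read base wherever no short read covers it.
        corrected.append(column.most_common(1)[0][0] if column else original)
    return "".join(corrected)
```

The long read supplies the layout, since it already spans the region, while the short reads supply the accuracy. Real single-molecule data also contain insertions and deletions, which this sketch does not attempt to handle.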
Dr. Schatz explained, “The longer read lengths have fundamentally more information than the short reads: infinite coverage with short reads simply won’t be enough for resolving really complex regions, but just a few long reads in the right spot can solve them. The same is true for phasing haplotypes in the presence of heterozygosity or identifying proper transcript isoforms in the presence of alternative splicing.”
The scientists combined sequences generated by more conventional technology made by Illumina to help correct the mistakes in the single-molecule method. The result is “substantially better” than using Pacific Biosciences’ technology alone, he said. “The data is basically perfect.”
The scientists showed that the approach could successfully be used on reads generated by a PacBio RS instrument from phage, prokaryotic, and eukaryotic whole genomes, including the previously unassembled genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. The scientists reported that their long read correction achieved >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies. In the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies.
As highlighted in two additional papers using similar hybrid strategies, the long read lengths have made automated, single-contig bacterial chromosome assemblies a reality.
Jonas Korlach, Ph.D., Pacific Biosciences’ CSO, told GEN that the company is working on getting around the need to combine sequences generated by both second- and third-generation sequencing platforms. “Going forward,” he said, “we don’t think this will remain the paradigm. We have had recent success with assembly algorithms that can take just these long reads from our machine and use a hierarchical assembly process to achieve a finished microbial genome. In a nutshell, the reads are already long and accurate enough to allow for de novo assembly from a single, long-insert DNA library.”
Dr. Korlach also explained that the company has developed an improved consensus algorithm called Quiver (currently available on its software sharing site) that can achieve a significant reduction in consensus error rate, “yielding a final sequencing result that is over 99.999% accurate at 20x sequencing coverage.”
GEN asked Dr. Schatz whether generation of longer sequences will eventually be able to address more complex genomes, such as human genomes.
“Absolutely,” he said. “In the paper we used PacBio long reads to improve the de novo assembly of the 1.2 Gbp parrot genome, and we are currently sequencing several species of rice and worm. In the near future, we have plans to sequence the human genome and the wheat genome with long reads. I expect we will see much more of this as the throughput and read lengths from the instruments improve.
“In the last year, PacBio has improved both read length and throughput by a factor of 3 or 4, and their roadmap shows this trend should continue into the next year.”
Earlier this month, PacBio announced that enhancements to its DNA sequencing system, its XL release featuring new chemistry and software, will allow read lengths averaging 5,000 bases. The company said the new chemistry includes a faster polymerase that reads more bases per second. This release also includes the Stage Start feature, which produces longer reads by enabling sequence data collection to begin when the polymerase is activated. Additionally, PacBio said it has increased the length of time the instrument can record data during the sequencing reaction, also contributing to an increase in read lengths.
The CSHL scientists who were trying to assemble the complex rice genome said the new chemistry produced 9x coverage with long reads—50% of the data came from reads 4,800 base pairs or longer.
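A figure of this kind, the length at or above which reads account for half of all sequenced bases, is commonly called the N50. A minimal Python sketch of how it can be computed from a list of read lengths follows; the function name `n50` and the sample lengths in the usage note are made up for illustration and are not data from the rice project.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases.

    Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For the made-up lengths `[5000, 4800, 3000, 1000, 600]`, the two longest reads already hold more than half of the 14,400 total bases, so `n50` returns 4800. Because the statistic is weighted by bases rather than by read count, a handful of very long reads can raise it substantially.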
“Adding the long reads from PacBio doubled the contig connectivity over the current state-of-the-art ALLPATHS-LG assembler and mate-pair recommendations,” Dr. Schatz added.
For assembling and finishing more complex genomes, then, it appears that a combination of longer and shorter reads works better.