Since the beginning of the modern genomic sequencing era, researchers have had to make a choice. On the one hand there are highly accurate, short reads—the bread and butter sequences that come from Illumina and Complete Genomics. And on the other hand there are long reads, from Pacific Biosciences (PacBio) and Oxford Nanopore, that have notoriously suffered in accuracy.
The holy grail has been a platform that yields long, accurate reads. Today, the team at PacBio has published tweaks to their existing SMRT sequencing technology that bring them one step closer to it.
The work appears in Nature Biotechnology in a paper titled “Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.”
“This is the first description of a method for generating read lengths that are both long and accurate,” notes Aaron Wenger, PhD, principal scientist in bioinformatics at PacBio and first author on the paper. The principle of PacBio’s circular consensus sequencing (CCS) has been around for a few years.
PacBio’s CCS system works by ligating hairpin adapters to each end of the linear DNA molecule to create a circular “SMRTbell” template. The polymerase, bound to the adapter, moves from the adapter across the DNA insert, adding bases and creating the read of the sequence. CCS produces HiFi (high fidelity) reads by going around the template multiple times, continuing until the polymerase “dies,” creating multiple passes of the same molecule. Typically, CCS is not considered a long-read technology. Wenger notes that the tradeoff for the platform’s high accuracy has been reads that are 1,000–2,000 bases long. But here, they have used the CCS approach to generate reads over 10,000 bases long.
How did they do it? Wenger says that one key was the technical innovation described in the paper known as “pre-extension”. Because PacBio sequencing relies on a camera taking frames, like in a movie, the polymerases are all independent of one another. They’ll keep adding nucleotides until they die and end the read.
The polymerases fall off for different reasons, commonly because the DNA is damaged. As a result, PacBio’s system is held captive to the quality of the DNA that is put on the instrument. So, they devised a way to minimize the amount of damaged DNA that gets placed on the instrument by selecting for undamaged DNA molecules. Wenger explains that they start the sequencing reaction before loading the DNA onto the instrument. After a few hours of extension, if the polymerase is still going, they conclude that the DNA is undamaged. Selectively loading that DNA onto the instrument is the key to getting long reads using their CCS method.
In addition to this, the researchers ensured that the selected DNA molecules are all about the same size by employing the SageELF instrument (made by Sage Science in Beverly, MA). Because they knew the size of the molecules, they knew the best duration for the pre-extension. This novel tweak was key to the process because it allowed the polymerase to go much farther once inside the sequencing instrument. This was “the last thing holding back the polymerase from stopping,” notes Wenger.
PacBio reads typically have a high error rate (~15%, compared with ~0.1% for Illumina). However, their errors tend to be random, so if the same region is sequenced several times, the errors average out, resulting in a “consensus” sequence. Shawn Baker, PhD, consultant at SanDiegOmics.com, explains it like this: “If you have a 1% error rate and you sequence to 100X depth, at each base you’ll get roughly 99 reads that look the same, say an ‘A’, and one base that is different, say a ‘G’. The consensus call would be to assume that the base is truly an ‘A’ and ignore the ‘G’ call.”
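Baker's example can be sketched as a simple majority vote. The short Python snippet below is only an illustration of the idea, not PacBio's consensus algorithm: the function names are made up for this sketch, and it assumes the reads are already aligned and of equal length, which real pipelines cannot take for granted.

```python
from collections import Counter


def consensus_base(calls):
    """Return the most frequent base among the calls at one position."""
    return Counter(calls).most_common(1)[0][0]


def consensus_sequence(reads):
    """Column-wise majority vote over equal-length, pre-aligned reads (an assumption)."""
    return "".join(consensus_base(column) for column in zip(*reads))


# Baker's example: 99 reads call 'A' at a position and one read calls 'G'.
pileup = ["A"] * 99 + ["G"]
print(consensus_base(pileup))

# Three toy reads; the third carries one random error at the third position.
print(consensus_sequence(["ACGT", "ACGT", "ACTT"]))
```

A plain vote like this only handles substitutions; insertions and deletions shift the alignment and need more than a column count.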
In this paper, PacBio showed that they can generate very high quality PacBio sequencing by reading the same molecule multiple times (~10 times on average) rather than comparing 10 separate reads together. This means that they end up with individual PacBio CCS reads with roughly the same error rate as Illumina reads, but which are much longer than Illumina reads.
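To get a feel for why repeated passes help, here is a deliberately crude back-of-the-envelope model in Python. It assumes every pass miscalls a base independently with the same probability, that all wrong calls agree with each other (a pessimistic simplification), and that the consensus is a strict majority vote. None of this is taken from the paper; the function name and the model are assumptions for illustration, and the real CCS algorithm is more sophisticated.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that a strict majority of independent passes is wrong.

    Toy binomial model only: it ignores indels, ties, and quality scores.
    """
    first_losing = passes // 2 + 1  # smallest number of wrong calls that outvotes the rest
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range(first_losing, passes + 1)
    )


# The ~15% raw error rate quoted above, with an increasing number of passes.
for passes in (1, 3, 5, 11):
    print(passes, majority_error(0.15, passes))
```

Under these assumptions the chance of a wrong consensus call drops quickly as passes are added, which is the intuition behind reading the same molecule many times.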
In doing this, they generated highly accurate (99.8%), long, HiFi reads with an average length of 13.5 kilobases (kb). They applied the approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels), and 95.99% for structural variants.
“The new CCS protocol together with the new 8M chips by PacBio are shown to be on par with short-read data from Illumina,” notes Albert Vilella, PhD, an independent consultant working with biotech companies in the U.K. Vilella tells GEN that “it is not yet a game-changer but an incremental improvement over the tools available to researchers and clinicians to assay genomes. It will allow assays to produce a more complete readout of all regions in the genome, including those with low mappability with short-reads technologies.”
Vilella continues that this “moves the field one step closer to being able to produce fully assembled complete individual genomes, for a similar cost as the current mapped-based short-read genomes.”
“It’s a big step forward,” notes Keith Robison, PhD, principal scientist at Ginkgo Bioworks, because they are “interrogating more of the genome and giving haplotypes.”
Deanna Church, PhD, mammalian applications at Inscripta, raises the question of “polishing” and whether it would still be necessary with this PacBio approach. Polishing is the term given to the combined use of PacBio long reads with Illumina short reads. Typically, the short Illumina sequences are overlaid on the long reads to polish them, that is, to figure out where the errors are.
But Wenger asserts that this method eliminates the need for polishing. “The actual raw accuracy of these reads averages 99.8% which is similar to the accuracy of short reads.” Wenger points to a recent bioRxiv preprint from the Eichler lab that looked at polishing HiFi reads with short reads and showed that polishing did not increase accuracy.
Wenger notes that although the rate of mistakes is similar, the types of mistakes in short reads and HiFi reads are different. Baker adds that, as far as mistakes are concerned, indels are where PacBio suffers, with indel errors more common than base substitution errors. It’s “in the nature of how they do their sequencing,” he notes. The PacBio system captures base additions as they happen, but they happen so quickly that the camera can miss them.
So, why wouldn’t everyone switch over to high-quality, long reads? Robison says that the main barrier is cost. “Each PacBio flowcell can deliver only so many reads, so you can choose to have lots of long, error-rich ones or fewer high-quality ones.” He notes that “if you are collecting 1/3 to 1/4 as many finished bases per flowcell but need 1/2 as many, you still need more flowcells to generate the data.” He adds that the question is to what degree labs will be willing to pay a premium and accept lower throughput for more variant information. Baker agrees that the cost per unit of data is going to go up. This method will “make PacBio more expensive per base generated, by reducing the data you get for the dollar.”
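Robison's arithmetic can be checked in a few lines of Python. This is only a sketch of the ratio he describes: the function name is invented here, and the fractions are the ones from his quote, not measured yields.

```python
from fractions import Fraction


def relative_flowcells(yield_fraction, bases_needed_fraction):
    """Flowcells needed relative to the baseline run.

    yield_fraction: finished bases per flowcell, as a fraction of the baseline.
    bases_needed_fraction: total bases required, as a fraction of the baseline.
    """
    return bases_needed_fraction / yield_fraction


half_as_many_bases = Fraction(1, 2)
for yield_fraction in (Fraction(1, 3), Fraction(1, 4)):
    print(yield_fraction, relative_flowcells(yield_fraction, half_as_many_bases))
```

With one third of the yield the ratio is 3/2, and with one quarter it is 2, so needing only half the data still means running 1.5 to 2 times as many flowcells.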
Who will use it, then? Wenger says that this is not going to be used for large-scale sequencing of populations, such as “sequencing everyone in Dubai.” Baker adds that people who choose this method will have to give up read length because “you won’t get that ultralong information that would help with, for example, phasing.”
So, it remains to be seen for which applications this method will be most useful. In general, Church thinks that the HiFi reads will be useful, but she also thinks that “you’ll still need some Oxford ultralong reads for good human assembly.” She also notes that “we need to get some good annotation data on these assemblies to know if you can do away with Illumina polishing.”
And the competition is “not sitting idle,” notes Vilella. He adds that we could see similar multi-pass solutions from competitors soon. All of this will enable scientists to pick and choose the combination of technologies that gives them the best bang for their buck, such as combining long reads with short-read polishing to produce high-quality, end-to-end assembled genomes.