DNA sequencing technologies have revolutionized research and the products resulting from R&D in any number of industries over the past couple of decades. Starting with Sanger sequencing, knowledge about the genetic makeup of an organism has fundamentally altered our approach to solving industrial challenges through biotechnological means. More recently, next-generation, short-read DNA sequencing technologies have increased overall sequencing throughput and become commonplace in certain areas, but this has come at a cost of reduced sequence read lengths. In many cases, scientists are now finding a critical need to supplement short-read sequence with long-read sequence, or in some cases replace it entirely with a single-molecule technology that can offer information beyond the A's, C's, T's, and G's of nucleic acids.
Single molecule, real-time (SMRT®) sequencing is enabling scientists to accomplish things that were not possible with traditional short-read sequencing platforms. These advances include a return to the gold standard of finished genomes as well as detecting and identifying chemical base modifications. These attributes make SMRT sequencing a good fit for a range of industrial applications, including applied microbiology, agricultural biotechnology, enzyme research and design, pathogen research and detection, and biofuels development, among many others.
The technology underlying SMRT sequencing, available through Pacific Biosciences' PacBio® RS High-Resolution Genetic Analyzer, takes a different approach from other next-generation sequencers on the market. SMRT technology harnesses the natural process of DNA replication, using a polymerase to move along the DNA strand and identify bases in real time as the strand is replicated. The interactions are imaged through a nanostructure called a zero-mode waveguide, a hole so tiny that background signal is suppressed to enable monitoring of individual polymerase molecules. Cameras focused on arrays of tens of thousands of these zero-mode waveguides (each array is called a SMRT Cell) record movies of the fluorescent light emitted as bases are incorporated, and the sequence is determined from those recordings. This translates to faster operation, currently generating ~30,000–50,000 sequence reads in as little as 60 minutes of instrument run time instead of multiple days on other platforms.
In addition to the ease, speed, and throughput of data generation, the read length and quality of the sequences determine their power in solving biotechnological problems. After generating sequence data, a typical sequencing project consists of two additional steps. The first is mapping, i.e., aligning the reads onto their appropriate location on a known reference genome. In cases in which the origin of the DNA is unknown, and therefore a reference is not available, this step instead comprises genome assembly—the task of overlapping the sequence reads relative to each other to generate a new reference. The second step involves generating the final sequencing result through consensus (averaging the information from overlapping sequence reads). The utility of sequence reads to generate meaningful results thereby critically depends on two factors: the ability to map reads onto a reference or facilitate genome assembly; and the lack of systematic bias in order to generate highly accurate final sequencing results.
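The consensus step described above can be illustrated with a minimal Python sketch. It assumes the reads have already been mapped to common coordinates and padded with '-' where they do not cover a position; the function name and the toy reads are hypothetical, and production consensus callers use considerably more sophisticated statistical models than a simple majority vote.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over reads aligned to the same coordinates.

    Every read is a string of equal length; '-' marks a position
    that the read does not cover.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        # Count the bases observed in column i, ignoring uncovered positions.
        counts = Counter(read[i] for read in aligned_reads if read[i] != "-")
        # 'N' marks a column that no read covers.
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


# Hypothetical toy reads: two of them carry one isolated error each.
reads = [
    "ACGTAC--",
    "ACCTACGT",
    "--GTACGT",
    "ACGTTCGT",
]
print(consensus(reads))
```

Because the two errors fall at different positions, each is outvoted by the other reads covering that column, which is the intuition behind the requirement that errors not be systematic.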
Because of the unique, engineered polymerase used in SMRT sequencing, the PacBio RS can produce reads greater than 3,000 bases in length on average, with some individual reads reaching 15,000 bases or longer, over an order of magnitude longer than those of short-read technologies. These longer reads allow more accurate mapping and genome assembly than is possible with reads of only a few hundred bases. While the single-pass read accuracy in SMRT sequencing is lower than in second-generation systems, the errors are distributed randomly and therefore wash out quickly during the second step, consensus generation. As a result, SMRT sequencing provides high sequencing accuracy of >99.9% at 10-fold read coverage and >99.999% at 20-fold read coverage.
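Why random errors wash out with coverage can be sketched with a simple binomial model in Python. This is a toy calculation under stated assumptions, not the instrument's actual consensus model: the 15% per-read error rate is a hypothetical value chosen for illustration, errors are assumed independent, and the bound pessimistically treats every erroneous read as agreeing on the same wrong base. It is not expected to reproduce the accuracy figures quoted above.

```python
from math import comb


def consensus_error_upper_bound(per_read_error, coverage):
    """Probability that at least half of the reads covering a base are wrong.

    Pessimistic bound: ties count as failures, and all errors are
    treated as if they agreed on the same incorrect base.
    """
    # Smallest number of wrong reads that can tie or outvote the correct ones.
    threshold = (coverage + 1) // 2
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


ASSUMED_ERROR = 0.15  # hypothetical single-pass error rate, not from the article
for cov in (1, 10, 20):
    bound = consensus_error_upper_bound(ASSUMED_ERROR, cov)
    print(f"coverage {cov:2d}x: consensus error bound = {bound:.2e}")
```

The bound shrinks rapidly as coverage grows, which holds only because the errors are modeled as random; a systematic error would recur in every read and would not be outvoted at any coverage.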
Because the sequencer examines DNA at the level of a single molecule, it can observe structural and cell-type variation not accessible with other technologies, and it enables correlating, or ‘phasing,’ such variation over long genetic distances. Also, the sequencer does not require polymerase chain reaction (PCR) amplification, thereby simplifying the sample preparation workflow and avoiding systematic amplification bias. The lack of amplification also means that the sequencer is capable of detecting the presence of chemical base modifications on the DNA, which are lost during the amplification step and therefore not measurable on other platforms. These modifications are detected through changes in the speed at which the DNA polymerase incorporates bases, thus highlighting methylation and other epigenetic events in the genome sequence.
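The idea of reading modifications from polymerase speed can be sketched in Python as a comparison of incorporation delays between a native sample and a modification-free control. Everything specific here is an assumption made for illustration: the function name, the data layout, the use of an amplified copy as the control, the arbitrary time units, and the ratio threshold of 2.0. Real analysis software applies proper statistical tests rather than a fixed cutoff.

```python
from statistics import mean


def flag_modified_positions(native_times, control_times, min_ratio=2.0):
    """Return (position, ratio) pairs where the polymerase is markedly
    slower on native DNA than on the control.

    native_times / control_times: dict mapping a genome position to a
    list of observed incorporation delays (arbitrary units).
    """
    flagged = []
    for pos in sorted(native_times):
        native = native_times[pos]
        control = control_times.get(pos)
        if not native or not control:
            continue  # no basis for comparison at this position
        control_mean = mean(control)
        if control_mean <= 0:
            continue  # avoid dividing by zero on degenerate data
        ratio = mean(native) / control_mean
        if ratio >= min_ratio:
            flagged.append((pos, ratio))
    return flagged


# Hypothetical measurements: position 11 shows a clear slowdown.
native = {10: [0.9, 1.1, 1.0], 11: [3.0, 2.8, 3.2], 12: [1.0, 1.2]}
control = {10: [1.0, 1.0, 1.0], 11: [1.0, 0.9, 1.1], 12: [1.1, 1.1]}
for pos, ratio in flag_modified_positions(native, control):
    print(f"position {pos}: polymerase about {ratio:.1f}x slower than control")
```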
In the days of Sanger sequencing, genome assemblies were very high quality but tremendously expensive to finish completely. With the proliferation of short-read sequencers, costs per base dropped precipitously, but assembly quality fell, too; the number of contigs increased significantly and repeats, segmental duplications, and gene families became more difficult to assemble correctly. Finished genomes are important for a full understanding of an organism, for serving as a reliable reference genome, and for accurately comparing one organism to another. For example, analyzing and tracking anthrax strains from the 2001 bioterrorist attack were aided by recognizing larger-scale insertions, deletions, and tandem repeat structures, in addition to just measuring single nucleotide variations.1
In several recent studies, teams of scientists set out to determine whether long reads generated by the PacBio RS instrument could bring back the days of gold-standard closed genomes without the prohibitive expense of Sanger sequencing. One such effort was a recent genome assembly project led by Adam Phillippy and Sergey Koren at the National Biodefense Analysis and Countermeasures Center and Michael Schatz at Cold Spring Harbor Laboratory.2 Koren et al. updated the Celera® Assembler program to work with the long reads specific to PacBio data and, in the process, realized that this information would help them build higher quality, cleaner genome assemblies.
The team's breakthrough is an error correction pipeline that takes advantage of the long-read data, mixes in high-accuracy short reads, and runs all of it through the updated Celera Assembler to generate a high-quality assembly. As the paper concludes, through this pipeline, read accuracy is better than 99.9% and median contig sizes double compared to short-read assemblies. In two other publications, researchers used similar approaches to generate complete, finished genomes in a fully automated assembly pipeline.3,4
Koren et al. also evaluated which short reads worked best in conjunction with the long read data, but they ended up without a strong preference. Whatever the platform, they recommend that users of the pipeline have 25x to 50x short read coverage, and then add in “even moderate coverage” of PacBio RS long reads.
Another complex problem was aligning short reads when the long read consisted primarily of repetitive sequence. Repeat regions are often seen with more than 99% similarity, which makes accurately calling an alignment very tricky. The team designed some techniques to deal with this by evaluating the top alignment candidates for every short read, and then carefully assessing the alignment coverage to determine the best match.
Koren et al. also noted that single-molecule sequencing has advantages beyond genome assembly, presenting a preliminary analysis of the corn transcriptome generated by the Joint Genome Institute. In that work, they demonstrate that alternative splicing can be read directly off the sequence data. Long PacBio RS reads therefore make possible several applications that would not otherwise be feasible.
Case Study: Streptomyces
As scientists begin to use long sequence reads to improve genome assemblies, new genomes are being closed that otherwise would not have been possible. One example of this comes from the Korea Polar Research Institute (KOPRI), where scientists completed an assembly of bacteria found in Antarctica. Recently, Hyun Park, a senior scientist and project leader at KOPRI, has been focusing on Cladonia borealis, the dominant species of lichens found in Antarctica. Understanding these organisms and their adaptation to the polar environment could help a range of industries, including biological engineering. For example, the ultra-low temperatures in the climate affect enzymatic reactions, increasing their specificity. It is possible that better understanding this process would help scientists to reduce known side effects of enzymatic reactions.
Streptomyces, the target of Park's genome project, is one of the bacterial strains found in Cladonia borealis. The bacterium is known for its very high GC content (71%), so Park and his colleagues were well aware that they would have to generate significant coverage to produce an assembly. But even with 200x coverage from a short-read sequencer, they only achieved 185 contigs, far too many to allow for a clear picture of the 7.6 Mb genome. Park and his team turned to SMRT sequencing, using high-accuracy circular consensus short reads and long continuous reads averaging ~1.5 kb. With 15x coverage, they arrived at just 26 contigs—the first useful assembly of the organism's genome, which is making it possible for them to continue their studies.
The consistent and uniform sequencing performance of the PacBio RS, irrespective of the DNA's GC content or sequence complexity, has been highlighted in several other recent publications, whether in closing gaps in unfinished microbial genomes or in sequencing regions of the human genome that were not amenable to any other sequencing technology, including Sanger sequencing.5,6
In addition to the sequence information provided by any type of sequencer, the PacBio RS simultaneously generates a second data set that reports on chemical modifications in the DNA. These DNA base modifications play crucial roles in regulating many fundamental biological processes, such as gene expression, cell cycle regulation, and DNA repair. Notably, base modifications such as methylation can also cause switching between distinct cell types in certain host-adapted bacterial pathogens, thus directly affecting their pathogenicity.7
Another area in which knowledge about methylation is very important relates to research aimed at transforming bacteria with foreign DNA for increasing their bioindustrial productivity. Such transformation can be severely hampered by the presence of restriction modification systems that are mediated by methylation. Thus, knowing about the methylation motifs in a host bacterium can provide the means to circumvent such problems.
Scientists at PacBio have demonstrated that the sequencer can accurately distinguish more than a dozen different types of base modifications.8,9 These data can be analyzed in conjunction with the DNA sequence to give a clearer view of an organism's biology.
In a recent paper from Nobel laureate Rich Roberts at New England Biolabs and collaborators, this unique capability was used to present the complete methylomes of six bacterial species.10 The study included several species of industrial interest, including Geobacter metallireducens, a bacterium capable of reducing iron, manganese, uranium and other metals and thus an interesting target for bioremediation of groundwater contaminants, and a Bacillus cereus strain that was originally isolated from spoiled cheese and belongs to the same genetic subgroup as Bacillus anthracis. Through SMRT sequencing of the genomic DNA and analysis of the polymerase kinetics, the paper not only shows which methylase genes are active, but also reveals their motif recognition sequences, over a dozen of which were new discoveries. The findings also included two non-specific methyltransferases that may play a protective role during phage infection.
Case Study: E. coli Outbreak
Another example of the value of this capability to distinguish different types of base modifications comes from two studies by Eric Schadt, Chair of the Department of Genetics and Genomic Sciences and Director of the Institute for Genomics and Multiscale Biology at Mount Sinai School of Medicine, who used the PacBio RS to assemble the strain of E. coli responsible for the 2011 outbreak in Germany (O104:H4 serotype). Schadt and collaborators first used SMRT sequencing to study the DNA sequence and prove for the first time that the severe outbreak was caused by an enteroaggregative strain of E. coli that had acquired enterohemorrhagic properties, including the insertion of a Shiga-toxin–encoding lambda-like prophage element, through horizontal gene transfer.11
Despite those remarkable findings, the DNA sequence alone did not fully explain the unusually high virulence seen in the outbreak. After the initial sequencing work was done, Schadt and colleagues went back to reanalyze the data, this time looking at chemical modifications to DNA bases. By analyzing the outbreak strain sequence for base modifications, in this case N6-methyladenine residues, the team discovered a series of methylase enzymes that appeared to target specific sequence motifs throughout the genome. For example, Dam methyltransferase targeted the A residue in the sequence motif GATC, while a methyltransferase found in the Shiga toxin region acted on the CTGCAG motif.12
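Linking detected modifications to a sequence motif such as GATC can be illustrated with a short Python sketch. It is a simplified illustration rather than the method used in the studies: the function names and the toy genome are hypothetical, only the forward strand is scanned, and the set of modified positions is assumed to come from an upstream kinetic analysis.

```python
def motif_sites(genome, motif):
    """Yield 0-based start positions of every motif occurrence,
    including overlapping ones, on the forward strand."""
    start = genome.find(motif)
    while start != -1:
        yield start
        start = genome.find(motif, start + 1)


def methylated_fraction(genome, motif, base_offset, modified_positions):
    """Fraction of motif occurrences whose target base (at base_offset
    within the motif) appears in modified_positions."""
    targets = [site + base_offset for site in motif_sites(genome, motif)]
    if not targets:
        return 0.0
    hits = sum(1 for target in targets if target in modified_positions)
    return hits / len(targets)


# Hypothetical toy genome with three GATC sites; the A is at offset 1.
genome = "TTGATCAAGATCCGGATCTT"
modified = {3, 15}  # assumed output of a modification-detection step
fraction = methylated_fraction(genome, "GATC", 1, modified)
print(f"{fraction:.0%} of GATC sites carry a modified A")
```

A motif whose target base is modified at nearly every occurrence suggests an active methyltransferase with that recognition sequence, while a low fraction points to background signal.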
The modifications were having a marked effect on gene transcription, and the targeted genes were enriched for pathways linked to horizontal gene transfer in the outbreak strain. Throughout the organism's genome, many pathways were changed by these methylases, including pathways linked to growth and other factors linked to virulence.
Conclusions and Outlook
There is no doubt that sequencing will play an increasingly central role in industrial biotechnology research and applications. From bioremediation and biological energy production to dairy product manufacturing, knowledge about the complete genome and epigenome of an organism under study will be a critical facilitator of a more complete understanding of its biology, and thus provide the means for more effective biotechnological manipulations. The novel approach of SMRT sequencing enables scientists to be truly comprehensive in the way they study organisms of interest. Long reads allow for finishing genomes to the gold standard established by Sanger sequencing, and also give researchers the ability to see distant but linked mutations on a single read. Because the PacBio RS analyzes single molecules, scientists can identify which reads are from which strand.
The instrument also provides high accuracy in discovering and validating SNPs and other variants. And because there is no need for amplification, the step on other sequencing platforms that strips away chemical modifications of bases, the PacBio RS platform can be used to detect more than a dozen different types of base modification. Taken together, these functions provide an in-depth view of the biological mechanisms operating in an organism.
With any new technology, significant opportunities to expand the system's capabilities exist, and SMRT sequencing is no exception. Upcoming improvements in polymerase engineering, sequencing chemistries, sample preparation, and hardware upgrades will allow for even longer reads at greater throughput, thereby enabling new application spaces and large-scale initiatives, such as the 100K Foodborne Pathogen Genome Project.12
Bioinformatic analysis tools will surely take advantage of such progress. We have had recent success with complete genome assemblies from unknown samples that do not require combining short-read and long-read sequence data, but instead use the long SMRT sequencing reads exclusively in a hierarchical genome assembly process. We are therefore close to reaching an era of a single-platform, fully automated paradigm of “one sequencing run equals one complete genome and epigenome.”