June 1, 2011 (Vol. 31, No. 11)
Toumy Guettouche, Ph.D.
Shawn Levy, Ph.D.
Stuart Lindsay, Ph.D.
Corey Nislow, Ph.D.
Aniko Sabo, Ph.D.
An Exclusive Q&A with Our Expert Panel
From the Editor in Chief
High-throughput, next-generation sequencing instruments have dramatically accelerated the sequencing of genomes. The growth of this technology serves as an important milepost for the biomedical research community, as the technology shows promise in answering long-standing scientific questions while simultaneously opening up a broad range of new challenges.
Sequencing of the human genome has demonstrated that unrelated individuals are more similar to each other than previously thought. In addition, the expanding number of single nucleotide polymorphisms and copy-number variants being discovered illustrates the need to sequence individual genomes more reliably and in a more time-efficient and cost-effective manner. Moreover, advances in genetics and genomics indicate that mutations in one of several individual genes are often linked to the same disease condition or group of conditions, and new associations are constantly being unveiled.
Next-gen sequencing technologies are still very much on the development curve, so improvements in accuracy and read length can be expected on a regular basis for years to come. Next-gen techniques also generate large amounts of data that require constant innovation in methodological approaches and in the development of new and improved tools and equipment.
GEN interviewed a number of leading experts in next-generation sequencing on specific issues, topics, and potential problems that can arise in the application of this technology. It is our hope that you will find the answers to some of your own questions here and also come up with new ideas for carrying out more successful next-gen sequencing.
—John Sterling
Ten Factors to Consider when Selecting a DNA Sequencer
What should lab managers look for when purchasing a DNA sequencing instrument? Compiling a list by surveying ten different end users, with ten unique workflows, would create dozens of “tips”, with some at cross-purposes. Instead we turned to über-end-user Doug Smith, Ph.D., who is director of bioinformatics technology at Beckman Coulter Genomics. Beckman is one of the leading genomic services companies, and Dr. Smith has hands-on experience with all major sequencing platforms.
Factor 1: Type of sequencing you expect to carry out
Depending on the type of sequencing you do—de novo sequencing, re-sequencing to identify genetic variations, large or small genomes, or RNA sequencing—one or another type of sequencer might be most appropriate. If the focus is de novo sequencing of small genomes, consider a platform that produces long reads, even if it does not provide the lowest cost per megabase. Examples include the Genome Sequencer FLX system from 454 Life Sciences, which reads 400 bases, or the desktop Ion Torrent instrument from Life Technologies.
If you’re working with large genomes, then it’s much more important to have the lowest cost for the data you’re generating, which means you may have to sacrifice read length. For re-sequencing applications, the short-read technologies such as the Solid™ sequencer from Applied Biosystems or Illumina’s Genome Analyzer IIx will work. Both provide high-quality data, provided you have sufficient coverage to accurately align the reads to a reference sequence and filter out errors.
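The "sufficient coverage" mentioned above can be estimated with the usual back-of-envelope relation: average coverage equals the number of reads times the read length, divided by the genome size. The Python sketch below is a rough planning aid only. It assumes reads are spread uniformly and all align, and the genome size and read counts are illustrative values, not figures from Dr. Smith.

```python
import math

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

# Illustrative only: a hypothetical 5-megabase bacterial genome.
bacterial_genome = 5_000_000
print("400-base reads needed for 30x:", reads_needed(30, 400, bacterial_genome))
print("coverage from 1M 50-base reads:", mean_coverage(1_000_000, 50, bacterial_genome))
```

Real projects usually need more reads than this estimate suggests, because coverage is uneven and some reads fail to align or are filtered out.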
Factor 2: Anticipated sequencing output per run
Some instruments generate very large quantities of data, some less. Even within a product family from one particular vendor, like Illumina, instruments may produce larger or smaller data streams. Higher-throughput instruments produce the most data, but if your projects do not need that much output, acquiring the highest-throughput instrument may not make sense. Instead, consider an instrument that perhaps generates longer reads.
On the other hand, if you’re sequencing large genomes de novo, or are running large numbers of individual samples with a significant amount of data per sample, you should probably get a high-throughput sequencer.
Factor 3: Cost per base analyzed
Buying a sequencer with the lowest cost per megabase or gigabase is important if you’re pushing the limits of data generation, but less so if you have very defined needs that can be met by a lower-throughput instrument. Again, cost varies quite a bit across the different platforms, even among sequencers from the same vendor.
Factor 4: Range of available read lengths and the option for generating paired-end or mate-pair reads
Most, but not all, platforms can generate paired-end reads, and the lengths of individual reads and paired ends vary across platforms. For example, the Illumina GAIIx sequencers currently generate read lengths of up to 150 bp; the HiSeq, Illumina’s highest-throughput platform, can go to 100 bp. Illumina has announced that it will upgrade this to 150 bp later this year. For Solid sequencers, read lengths are shorter, only 50 bases, but the quality of those reads is reputedly higher due to their di-base encoding scheme.
For 454 sequencers, the paired ends are produced within single reads. The single read length is currently 400 bases, which will increase to 700 bases this summer. But keep in mind that when you generate a paired-end read you will obtain a distribution of read lengths averaging approximately half of the single read length because the two mates are produced out of single reads. So, the average length for paired-end 454 reads will only be 200–350 bases for the foreseeable future.
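The 200–350 figure follows from simple halving, which the short Python sketch below makes explicit. The function name is hypothetical, and the sketch assumes the two mates split the read evenly on average and ignores any linker sequence between them, so it is slightly optimistic.

```python
def approx_mate_length(single_read_length: int) -> float:
    """Rough average length of each mate when both mates come out of one read.

    Assumption: the mates split the read evenly on average, and any linker
    sequence between them is ignored.
    """
    return single_read_length / 2

# Single read lengths quoted in the text: 400 bases now, 700 bases announced.
for single in (400, 700):
    print(single, "bases per read ->", approx_mate_length(single), "bases per mate")
```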
Factor 5: Ease of operation
This factor includes instrument ease of use as well as the level of automation and labor requirements for sample prep. Instruments employing emulsion PCR require additional steps, such as generating the emulsions, breaking them, and enriching the beads. Some vendors whose instruments use emulsion PCR, particularly Life Technologies and 454, provide some accessories that simplify sample prep, but you can’t get around the fact that additional steps are required.
Ion Torrent is now offering a small, desktop instrument that conducts the entire emulsion PCR process in automated fashion. Since labor is a major cost associated with running a sequencer, the fewer things you need to do, the simpler the sample preparation, the lower your overall costs.
Factor 6: Read quality and the error or quality model for the reads
Different technologies produce different quality data. One factor to consider in longer reads is the distribution of quality across the read, where early bases read will be of a higher quality than those at the end. That varies depending on the sequencing technology. Illumina reads have a very high quality at the beginning of the read, which is maintained out to a substantial length, but then it drops off at the end of the read in a characteristic manner.
On the other hand, sequencers from Pacific Biosciences produce very long reads, into the 1,000s of bases, with a wide distribution of read lengths. But these instruments have a fairly high error rate (~15%), which is distributed evenly throughout the read. So while they don’t have major differences in error rate from the beginning to the end of the read, they do have a higher overall error rate across the read.
These are factors that play into how you wish to use the reads: If long reads are absolutely critical, and you don’t care about the error rate, then PacBio is a good choice; if error rate is paramount, then you’ll want to stick with shorter-read technologies using read lengths in which the error rate is kept low.
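Per-base error rates are commonly expressed on the Phred scale, Q = -10 * log10(p), where p is the probability that a base call is wrong. The Python sketch below converts between the two and estimates the expected number of errors in a read. It assumes errors are independent and evenly distributed, as described above for PacBio; the 1,000-base read length is an illustrative value.

```python
import math

def phred_from_error(p_error: float) -> float:
    """Phred quality score for a given per-base error probability."""
    return -10 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Per-base error probability for a given Phred quality score."""
    return 10 ** (-q / 10)

def expected_errors(read_length: int, p_error: float) -> float:
    """Expected miscalled bases in a read, assuming a uniform error rate."""
    return read_length * p_error

print("Phred score at 15% error:", round(phred_from_error(0.15), 1))
print("Error probability at Q30:", error_from_phred(30))
print("Expected errors in a 1,000-base read at 15%:", expected_errors(1000, 0.15))
```

Seen this way, a 15% error rate corresponds to a quality score below Q10, which is why long, error-prone reads and short, accurate reads suit different applications.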
Factor 7: Ability to monitor quality parameters in real time during a run
Some runs, on some instruments, can take several days. In those situations it’s a great benefit to be able to monitor the quality of bases as they’re generated. If you can tell that something is going wrong early on, it’s possible to terminate the run and perhaps load a higher-quality sample.
Factor 8: Computing and network requirements
Some instruments generate huge quantities of data that require substantial computing and storage resources for data processing and management. In addition, if you’re planning to utilize cloud computing to process or analyze data, then you also need to consider the network infrastructure you have that gets the data to and from the cloud.
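A quick way to sanity-check network infrastructure is to estimate how long one run's data would take to upload. The Python sketch below is a rough estimate under stated assumptions: the data size, link speeds, and the 80% link efficiency are hypothetical placeholders, not measurements from any instrument or provider.

```python
def transfer_hours(data_gigabytes: float,
                   link_megabits_per_s: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move a dataset over a network link.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    bits = data_gigabytes * 8e9                      # gigabytes -> bits
    usable_bits_per_s = link_megabits_per_s * 1e6 * efficiency
    return bits / usable_bits_per_s / 3600           # seconds -> hours

# Hypothetical example: 500 GB of run data over two different links.
for link in (100, 1000):
    print(f"{link} Mbit/s link: {transfer_hours(500, link):.1f} hours")
```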
Factor 9: Availability of software for downstream analysis
Fewer software options are available for many newer sequencing platforms compared with older instruments. Older technologies, for example from Illumina and 454, are supported by a large base of robust, sophisticated, and in many instances free, community-developed software. Regardless of the platform you select, make sure that software exists that will support the kinds of workflows and analyses you anticipate.
Factor 10: Buy or outsource
Purchasers should always consider costs, not just of purchasing a sequencer, but also the substantial long-term staffing, training, computer infrastructure, and bioinformatics costs over the lifetime of the instrument, which might be three to five years. At moderate throughput levels, and for many project types, outsourcing might be far more sensible than an outright purchase. This is especially true if you have defined, specific sequencing needs but don’t envision using the full capacity of the sequencing and analysis infrastructure required to run a system successfully over a long period of time.
—Angelo DePalma, Ph.D.
Describe some promising target-enrichment strategies and discuss how they fit into next-gen sequencing operations.
There are two major types of enrichment strategies—capture using hybridization probes and enrichment using PCR amplification.
Hybridization-based methods generally allow a broader range of regions to be captured, from a couple of megabases (MB) to whole exomes. They can typically be used with anything from small to very large sample volumes.
Enrichment methods using PCR typically target smaller areas (up to 10 MB) and require a larger volume of samples to be cost effective. They also require significant investments in hardware.
In our core facility we run hybridization-based methods because they are cheaper and allow us to survey much larger areas for SNPs and mutations. I think PCR-based enrichment strategies are well suited for clinical labs that look at the same regions of the genome in large sample sets. One additional benefit is that PCR-based enrichment strategies allow better discrimination of gene isoforms and pseudogenes.
There are a number of complexity reduction strategies that provide efficient ways to characterize specific regions of the genome. Leveraging the capabilities to synthesize large numbers of oligonucleotides developed for the microarray industry, companies like Agilent, NimbleGen, and others used their in situ synthesized arrays to enable capture of specific regions, up to a few megabases of sequence.
The field shifted substantially after the publication from Andreas Gnirke at the Broad Institute demonstrating the use of in-solution hybridization-based capture. That method, commercialized by Agilent, greatly improved the throughput and capacity of sequence capture to allow the entire exome or more than 50 MB to be captured in an efficient and high-throughput manner.
Related products from NimbleGen, and most recently Illumina, have continued to expand the offerings. For smaller regions of interest, PCR-based approaches such as those from RainDance also offer unique capabilities for capturing specific regions of the genome. Each of the commercial offerings has strengths and weaknesses that should be weighed against the experimental design to determine the best approach.
Considerations like the size of the region of interest, sequencing technology that will be employed, the number of samples to be analyzed, and the budget are important factors. The cost per sample for both catalog products like whole exomes as well as custom designs has dropped dramatically but has not kept pace with reductions in sequencing costs. Therefore the capture costs will continue to be weighed against the costs of complete sequencing. Additional capture strategies such as using BAC clones to generate capture probes are also becoming popular as a means to enrich for regions of interest at low cost.
We are developing a sequence reading device based on reading the electron tunneling current as single-stranded DNA passes between a pair of very closely spaced electrodes in a nanopore. This technology reads epigenetic markings directly, uses no reagents, sequences single molecules (at speeds that may approach 100 bases a second), and has the potential for long reads (many tens of kilobases) with no ambiguity when reading homopolymer runs. Thus, single molecule sequencing with nanopores should not require target enrichment.
Effective target-enrichment strategies are a key element of the maturation of next-generation sequencing. Originally, target enrichment was considered synonymous with exome sequencing, and this is still an important element in large-scale genome projects. An early attraction to enrichment was cost, but the first commercial enrichment kits were expensive, and as a result exomes (representing 1% of the genome) could be sequenced for about 20% of the cost of a full genome, not 1%.
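The figures above imply a steep per-base premium for early exome capture. The Python sketch below works through that arithmetic with whole-genome cost normalized to 1. It uses only the 1% and 20% figures quoted in the text and ignores differences in sequencing depth between exome and whole-genome projects, so it is an illustration rather than a cost model.

```python
def relative_cost_per_base(fraction_of_genome: float, fraction_of_cost: float) -> float:
    """Cost per targeted base relative to cost per base of a whole genome.

    A value of 1.0 would mean enrichment adds no per-base premium.
    """
    return fraction_of_cost / fraction_of_genome

# Figures quoted in the text: 1% of the genome for about 20% of the cost.
premium = relative_cost_per_base(0.01, 0.20)
print(f"Each exome base cost roughly {premium:.0f} times a whole-genome base")
```

On these numbers each targeted base cost roughly 20 times as much as a base in a full genome, which is the gap that cheaper capture kits needed to close.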
Exome capture strategies have been the subject of an excellent comparison from the NHGRI. Both on-array capture and solution-based approaches (often derived from oligopools cleaved from arrays) are widely used, relying on either DNA-DNA or DNA-RNA hybridization.
Going forward, the most exciting developments will come from more targeted approaches where hundreds to a few thousand genes in a pathway or particular biology/disease state are the target. For these smaller-scale applications, the exome type strategies (array or solution hybridization) work well, but several other competitive approaches exist, including padlock probe capture (in which a long oligonucleotide is used to anneal to both ends of the target).
Other medium-throughput approaches include traditional PCR, which does not scale well as the degree of multiplexing increases, and microfluidic emulsion-based approaches where each target is amplified in an oil-in-water emulsion bubble.
As sequencing becomes cheaper and faster, good experimental design will remain a constant, and tailoring your experiment with specific targeting strategies will play a larger role. The technologies that ultimately thrive will be those that combine fast turnaround time (to accommodate the iterative nature of genomic studies) with scalability.
Capture techniques have clear relevance for bringing next-generation sequencing (NGS) to the clinic (aside from the isolated case of prescriptions being written for a patient’s full genome). However, the initial NGS tests will not push the limits of read depth or read length. Rather, they will likely resemble the first FDA-approved microarrays (e.g., the Affymetrix/Roche Cytochrome P450 array), which interrogate a few alleles but in a very robust, reproducible manner. A foolproof enrichment approach will be key for clinical NGS.
At the Human Genome Sequencing Center, we are using the Roche NimbleGen capture platform because of its target enrichment efficiency and flexibility of capture designs. It fits nicely with both of our production next-gen sequencing platforms, Illumina and Solid. We have successfully created and implemented NimbleGen designs of varying sizes—from designs that focus on small regions of the genome (~500 kb) to much larger, comprehensive whole-exome designs. We have several exome designs that combine established gene sets (CCDS, RefSeq, Vega) and can be expanded with predicted transcription factor sites, miRNAs, and gene-prediction models.
While the cost of whole-genome sequencing has plummeted, we still find that for many projects it is cost effective to perform a targeted capture experiment, whether because a smaller region of the genome has been implicated in previous studies, for example GWAS, or as a cost-effective way to interrogate protein-coding parts of the genome.
What are the advantages and disadvantages of the single-molecule (next-next generation) sequencing system? How does the cost compare with next-generation sequencing approaches?
Apart from the Helicos system, single-molecule sequencing systems (aka, next-next or third-generation) allow long reads but have greater challenges with accuracy than second-generation sequencing systems. They have a much lower throughput, 100 megabases to a couple of gigabases compared to hundreds of gigabases for second-generation systems. Turnaround times for third-generation systems are hours instead of days.
Currently, third-generation sequencing systems are well suited to sequence smaller genomes such as bacteria or viruses, especially because they allow for much faster time to result. This is important in a clinical setting. For some de novo sequencing applications they have advantages because long reads allow easier assembly of complex genomes. However, in terms of throughput and cost per base, second-generation systems are currently far superior.
At this point, third-generation sequencing systems appear to be well suited for niche applications but are not yet ready to compete with second-generation sequencing systems. However, there are some unique applications such as direct sequencing of RNA and direct detection of methylation sites on the horizon. I am looking forward to seeing how the single-molecule nanopore sequencing systems will perform compared to currently available technologies.
With the caveat that nanopore sequencing has yet to deliver, the advantages are: no libraries or reagent costs, long sequential reads, and possibly very fast and cheap reads.
To address the second question first, there is no field data regarding cost/sample on the PacBio instrument, which started shipping in May. One would expect sample prep costs to be low, as there is less sample prep for single-molecule approaches in general. The cost of the PacBio box is slightly higher than Illumina and Solid at ~$800–900K. Its biggest advantages are read lengths (exceptional), speed, and a promising development path, e.g., modified bases, non-DNA applications. Also, as there really is no simple solution to de novo sequencing, PacBio’s read lengths could have a big impact on these applications.
A disadvantage of PacBio is simply its newness. It will require a serious commitment in terms of troubleshooting, so it appears well suited for early adopters but not necessarily for smaller centers or individual labs.
Single-molecule approaches might be good for distinguishing very similar sequences, but low error rates are required as well. I’m not certain that the primary advantage of PacBio is its ability to sequence single molecules so much as its read length and speed.
A fundamental question is: When is single molecule a real advantage, versus a clonally amplified molecule? Case in point, Helicos. It competed with the short-read (clonal) technologies, but the value of a single-molecule method was not obvious, and although the number of reads for Helicos at its release was the highest of all the technologies, the clonal platforms have caught up.
Finally, single-molecule sequencing could be essential for certain clinical applications.
As an aside, nanopore sequencing was a hot topic at the “Next Generation Sequencing Congress” held in Boston recently, but no actual data is available yet.
What we are expecting to gain from the single-molecule sequencing platform is much longer reads, enabling us to create long haplotypes and de novo assemblies of human genomes. Also, once implemented in RNA-seq protocols, it will enable us to track and study individual alternative splice forms.
What are the best next-generation sequencing platforms for metagenomic analyses?
In the past, Sanger sequencing and pyrosequencing (e.g., 454) were the best platforms to carry out metagenomic analysis because of the read lengths that these technologies are able to achieve. Recently, short-read technologies (e.g., Illumina HiSeq2000, Life Technologies Solid) have produced read lengths of up to 150 bp with much higher throughput and significantly lower cost per base. Thus, there are an increasing number of publications using short-read technologies to carry out metagenomic analysis.
Choosing a single, best platform would be difficult and it would make assumptions about experimental design and complexity for a rapidly changing field. The challenges of metagenomic analysis, sample-preparation methods, sample number, sequencing conditions, and other factors would need to be weighed to determine which current platform would be advantageous.
Even if a single platform currently stood out among its peers, the rapidly evolving technology may result in a different choice being made in the near future. That said, platforms with high read output and longer read lengths would have advantages in the metagenomics field. Developing technologies with longer read lengths and instruments with single-molecule detection or efficient sample-preparation methods would also have potential advantages, depending on experimental needs.
If nanopore sequencing succeeds in generating very long (good fraction of a megabase) reads, many of the complications of assembling metagenomic sequences would go away.
Metagenomics and microbiome studies are intrinsically interesting to experts and nonexperts—inspired by our curiosity about what can survive in this or that extreme environment, and also, what is living on me?
The challenges of metagenomics are several-fold, including sufficient read density, read lengths, great dynamic range, and sensitive limits of detection. Furthermore, the bioinformatics required are staggering. In many ways, these challenges resemble those for most NGS applications, only more so.
It doesn’t hurt these efforts that the NIH launched a five-year, $140 million effort in 2007.
The field has not settled on a platform, most likely because the range of metagenomics studies is so broad, from small, focused sampling efforts to large-scale microbiome and evolutionary diversity studies. In the former case, traditional Sanger sequencing of 16S rRNA is still used, and pyrosequencing, which has a long history from the early days of metagenomics/microbiome studies, remains a strong player in this space.
Given that current and future efforts consider much larger samples, it is likely that the high read count platforms will become the standard. For example, the Beijing Genome Institute’s metagenomics services rely on Illumina’s HiSeq2000.
Of course, the platform choice will depend on the complexity of your sample; tens to hundreds of millions of reads can be overkill.
This would seem to be an excellent space for the medium-throughput instruments, including the aforementioned 454 instruments, the Ion Torrent PGM, and the coming MiSeq. Both the PGM and MiSeq offer speed advantages, which could be an important consideration if one were to eventually bring the sequencer to the genome, as opposed to now, where we bring the genome to the sequence lab.
How can next-generation sequencing overcome the multiple gene copy problem?
This should not be a problem with long sequential reads (if nanopores deliver).
One solution is high read depth, combined with paired-end reads. The 454 platform has been applied to this problem with success on a polyploid genome where 15 reads/gene provided 95% confidence, but there is no reason the multiple gene copy problem cannot be tackled on the shorter-read platforms; indeed, the greater read numbers will be an advantage, for example by providing greater confidence at similar depth of coverage. With respect to CNVs, paired-end and mate-pair strategies on the Illumina and SOLiD platforms, respectively, have proven effective.
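One way to see why read depth matters for multi-copy genes is a toy sampling model. If each read is equally likely to come from any of k copies (a simplifying assumption, and not a reconstruction of the analysis in the polyploid study mentioned above), the probability that n reads include every copy at least once follows from inclusion-exclusion. The Python sketch below computes it; the function name and the example values are illustrative.

```python
from math import comb

def prob_all_copies_seen(num_copies: int, num_reads: int) -> float:
    """P(every copy is hit by at least one read).

    Assumes each read independently samples one of the copies uniformly.
    Inclusion-exclusion over the number j of copies that are missed.
    """
    return sum(
        (-1) ** j * comb(num_copies, j) * (1 - j / num_copies) ** num_reads
        for j in range(num_copies + 1)
    )

# Illustrative values: how confidence grows with depth for four copies.
for reads in (5, 10, 15, 30):
    print(reads, "reads:", round(prob_all_copies_seen(4, reads), 4))
```

Under this toy model, 15 reads across four copies give a probability of about 0.95 of sampling every copy. Real data fall short of this because of sequencing errors, uneven representation of copies, and the need to see each copy more than once before trusting it.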
The answer will be longer reads and ultimately very long reads. For example, we are currently testing “strobe–reads” from the PacBio instrument. These reads contain multiple short sub-reads originating from one long fragment of DNA and can span large regions (5–10 kb) of the genome. This will enable us to effectively disentangle some of the duplicate genes and regions in the genome.
Which next-generation platforms are ideally suited to study rare genetic conditions?
To study rare genetic conditions, the sequencing platform needs to be accurate, cost effective, and have a high throughput. These attributes fit so-called short-read technologies from Illumina or Life Technologies best.
Any of the available platforms are ideally suited for the analysis of rare genetic conditions in that they all have the ability to detect rare variants in a single sample or group of samples compared to a reference genome. Experimental design, study size, disease of interest, and validation experiments are substantially more important than the specific methodology employed. The currently available platforms from Illumina and Life Technologies have been demonstrated to be powerful choices in the analysis of human genetic diversity and the detection of rare genetic conditions.
Accuracy is the key here, as well as low cost. Right now this is a difficult problem, as cross platform checks show. The base-calling accuracy for nanopore methods has yet to be established.
I would argue that it is not a question of the best platform, but of the best sample-preparation method.
How can chromosomal repeat regions be assessed by next-generation sequencing?
Chromosomal repeat regions can be assessed through a combination of paired-end sequencing, longer read lengths, and the use of sample-preparation methods that allow large insert libraries to be made. Techniques such as jumping libraries, mate-pair libraries, and related methods allow paired sequencing reads to be generated that were originally separated by several kilobases of sequence. Generating diverse libraries with these methods and combining the sequencing data with more standard sequencing sample-preparation methods allows a robust assembly of the genome to be performed using newer software tools such as ALLPATHS-LG from the Jaffe lab and colleagues at the Broad Institute. These assemblies should perform reasonably well across repeat regions, although certainly some larger repeat regions will always be problematic with short-read sequencing.
It’s the same answer as for repeated genes—long sequential reads should be made possible by nanopores.
At the end of the day, this is a hard computational problem. Several groups are actively working on this question. In addition, algorithms developed for detecting sequence errors and SNPs have been used to disentangle highly repetitive regions. So, regarding the best platform, it would be the one with the lowest error rate.
The bottom line is that no large sequencing center will have a single platform; it is too risky, and several of the platforms are complementary, e.g. Illumina/Solid plus PacBio. Furthermore, most centers are going to have one or more “development machines”, those instruments that are promising but not quite ready for prime time. The evolution of this field will likely include some dead ends and some dramatic transformations.
From my perspective, I am most excited about advances in sample prep, specifically, increases in speed and greater integration of automation.
Finally, one way to gauge/predict the success of a particular platform or methodology is to follow the publications!