Describe some promising target-enrichment strategies and discuss how they fit into next-gen sequencing operations.
There are two major types of enrichment strategies—capture using hybridization probes and enrichment using PCR amplification.
Hybridization-based methods generally allow a broader range of regions to be captured, from a couple of megabases (Mb) to whole exomes. They can typically be used with sample sets ranging from small to very large.
Enrichment methods using PCR typically target smaller regions (up to 10 Mb) and require a larger number of samples to be cost effective. They also require significant investments in hardware.
In our core facility we run hybridization-based methods because they are cheaper and allow us to survey much larger areas for SNPs and mutations. I think PCR-based enrichment strategies are well suited for clinical labs that look at the same regions of the genome in large sample sets. One additional benefit is that PCR-based enrichment strategies allow better discrimination of gene isoforms and pseudogenes.
There are a number of complexity reduction strategies that provide efficient ways to characterize specific regions of the genome. Leveraging the large-scale oligonucleotide synthesis capabilities developed for the microarray industry, companies like Agilent, NimbleGen, and others used their in situ synthesized arrays to enable capture of specific regions, up to a few megabases of sequence.
The field shifted substantially after the publication from Andreas Gnirke at the Broad Institute demonstrating the use of in-solution hybridization-based capture. That method, commercialized by Agilent, greatly improved the throughput and capacity of sequence capture, allowing the entire exome, or more than 50 Mb, to be captured in an efficient and high-throughput manner.
Related products from NimbleGen, and most recently Illumina, have continued to expand the offerings. For smaller regions of interest, PCR-based approaches such as those from RainDance also offer unique capabilities for capturing specific regions of the genome. Each of the commercial offerings has strengths and weaknesses that should be weighed against the experimental design to determine the best approach.
Considerations like the size of the region of interest, sequencing technology that will be employed, the number of samples to be analyzed, and the budget are important factors. The cost per sample for both catalog products like whole exomes as well as custom designs has dropped dramatically but has not kept pace with reductions in sequencing costs. Therefore the capture costs will continue to be weighed against the costs of complete sequencing. Additional capture strategies such as using BAC clones to generate capture probes are also becoming popular as a means to enrich for regions of interest at low cost.
We are developing a sequence reading device based on reading the electron tunneling current as single-stranded DNA passes between a pair of very closely spaced electrodes in a nanopore. This technology reads epigenetic markings directly, uses no reagents, sequences single molecules (at speeds that may approach 100 bases a second), and has the potential for long reads (many tens of kilobases) with no ambiguity when reading homopolymer runs. Thus, single-molecule sequencing with nanopores should not require target enrichment.
Effective target-enrichment strategies are a key element of the maturation of next-generation sequencing. Originally, target enrichment was considered synonymous with exome sequencing, and this is still an important element in large-scale genome projects. An early attraction to enrichment was cost, but the first commercial enrichment kits were expensive, and as a result exomes (representing 1% of the genome) could be sequenced for about 20% the cost of a full genome, not 1%.
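The arithmetic behind that 20%-not-1% observation is easy to reproduce: capture adds a fixed per-sample kit cost on top of the (small) sequencing cost of the target region. The dollar figures below are illustrative assumptions, not actual kit or run prices:

```python
# Back-of-envelope model (assumed numbers, not vendor pricing):
# targeted cost ~= capture kit cost + sequencing cost * target fraction.

def targeted_cost_fraction(capture_cost, genome_seq_cost, target_fraction):
    """Fraction of whole-genome cost paid when capturing a target region."""
    targeted_total = capture_cost + genome_seq_cost * target_fraction
    return targeted_total / genome_seq_cost

# Example: $10,000 whole genome, $1,900 capture kit, 1% target (exome-sized).
print(round(targeted_cost_fraction(1900, 10000, 0.01), 2))  # 0.2
```

With these assumed numbers, a region covering 1% of the genome still costs roughly 20% of a whole genome, because the kit cost dominates; as sequencing costs fall faster than capture costs, that fraction only grows.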
Exome capture strategies have been the subject of an excellent comparison from the NHGRI. Both on-array and solution-based approaches (often derived from oligo pools cleaved from arrays) are widely used, relying on either DNA-DNA or DNA-RNA hybridization.
Going forward, the most exciting developments will come from more targeted approaches where hundreds to a few thousand genes in a pathway or particular biology/disease state are the target. For these smaller-scale applications, the exome-type strategies (array or solution hybridization) work well, but several other competitive approaches exist, including padlock probe capture (in which a long oligonucleotide is used to anneal to both ends of the target).
Other medium-throughput approaches include traditional PCR, which does not scale well as the degree of multiplexing increases, and microfluidic emulsion-based approaches where each target is amplified in a water-in-oil emulsion droplet.
As sequencing becomes cheaper and faster, good experimental design will remain a constant, and tailoring your experiment with specific targeting strategies will play a larger role. The technologies that ultimately thrive will be those that combine fast turn-around time (to accommodate the iterative nature of genomic studies) and are scalable.
Capture techniques have clear relevance for bringing next-generation sequencing (NGS) to the clinic (aside from the isolated case of prescriptions being written for a patient’s full genome); however, the initial NGS tests will not push the limits of read depth or read length. Rather, they will likely resemble the first FDA-approved microarrays (e.g., the Affymetrix/Roche Cytochrome P450 array), which interrogate a few alleles but in a very robust, reproducible manner. A foolproof enrichment approach will be key for clinical NGS.
At the Human Genome Sequencing Center, we are using the Roche NimbleGen capture platform because of its target enrichment efficiency and flexibility of capture designs. It fits nicely with both of our production next-gen sequencing platforms, Illumina and SOLiD. We have successfully created and implemented NimbleGen designs of varying sizes—from designs that focus on small regions of the genome (~500 kb) to much larger, comprehensive whole-exome designs. We have several exome designs that combine established gene sets (CCDS, RefSeq, Vega) and can be expanded with predicted transcription factor sites, miRNAs, and gene-prediction models.
While the cost of whole-genome sequencing has plummeted, we still find that for many projects it is cost effective to perform a targeted capture experiment, whether because a smaller region of the genome has been implicated in previous studies, for example GWAS, or as a cost-effective way to interrogate protein-coding parts of the genome.
What are the advantages and disadvantages of the single-molecule (next-next generation) sequencing system? How does the cost compare with next-generation sequencing approaches?
Apart from the Helicos system, single-molecule sequencing systems (a.k.a. next-next or third-generation) allow long reads but have greater challenges with accuracy than second-generation sequencing systems. Their throughput is much lower: 100 megabases to a couple of gigabases, compared with hundreds of gigabases for second-generation systems. On the other hand, turnaround times for third-generation systems are hours instead of days.
Currently, third-generation sequencing systems are well suited to sequence smaller genomes such as bacteria or viruses, especially because they allow for much faster time to result. This is important in a clinical setting. For some de novo sequencing applications they have advantages because long reads allow easier assembly of complex genomes. However, in terms of throughput and cost per base, second-generation systems are currently far superior.
At this point, third-generation sequencing systems appear to be well suited for niche applications but are not yet ready to compete with second-generation sequencing systems. However, there are some unique applications such as direct sequencing of RNA and direct detection of methylation sites on the horizon. I am looking forward to seeing how the single-molecule nanopore sequencing systems will perform compared to currently available technologies.
With the caveat that nanopore sequencing has yet to deliver, the advantages are: no libraries or reagent costs, long sequential reads, and possibly very fast and cheap reads.
To address the second question first, there is no field data regarding cost/sample on the PacBio instrument, which started shipping in May. One would expect sample prep costs to be low, as there is less sample prep for single-molecule approaches in general. The cost of the PacBio box is slightly higher than Illumina and SOLiD, at ~$800–900K. Its biggest advantages are read lengths (exceptional), speed, and a promising development path, e.g., modified bases and non-DNA applications. Also, as there really is no simple solution to de novo sequencing, PacBio’s read lengths could have a big impact on these applications.
A disadvantage of PacBio is simply its newness. It will require a serious commitment in terms of trouble-shooting, although it appears well-suited for early adopters, not necessarily for smaller centers or individual labs.
Single-molecule approaches might be good for distinguishing very similar sequences, but low error rates are required as well. I’m not certain that the primary advantage of PacBio is its ability to sequence single molecules so much as its read length and speed.
A fundamental question is: When is a single molecule a real advantage over a clonally amplified molecule? Case in point: Helicos. It competed with the short-read (clonal) technologies, but the value of a single-molecule method was not obvious, and although the number of reads for Helicos at its release was the highest of all the technologies, the clonal platforms have caught up.
Finally, single-molecule sequencing could be essential for certain clinical applications.
As an aside, nanopore sequencing was a hot topic at the “Next Generation Sequencing Congress” held in Boston recently, but no actual data is available yet.
What we are expecting to gain from the single-molecule sequencing platform is much longer reads, enabling us to create long haplotypes and de novo assemblies of human genomes. Also, once implemented in RNA-seq protocols, it will enable us to track and study individual alternative splice forms.
What are the best next-generation sequencing platforms for metagenomic analyses?
In the past, Sanger sequencing and pyrosequencing (e.g., 454) were the best platforms to carry out metagenomic analysis because of the read lengths that these technologies are able to achieve. Recently, short-read technologies (e.g., Illumina HiSeq2000, Life Technologies SOLiD) have produced read lengths of up to 150 bp with much higher throughput and significantly lower cost per base. Thus, there are an increasing number of publications using short-read technologies to carry out metagenomic analysis.
Choosing a single, best platform would be difficult and it would make assumptions about experimental design and complexity for a rapidly changing field. The challenges of metagenomic analysis, sample-preparation methods, sample number, sequencing conditions, and other factors would need to be weighed to determine which current platform would be advantageous.
Even if a single platform currently stood out among its peers, the rapidly evolving technology may result in a different choice being made in the near future. That said, platforms with high read output and longer read lengths would have advantages in the metagenomics field. Developing technologies with longer read lengths and instruments with single-molecule detection or efficient sample-preparation methods would also have potential advantages, depending on experimental needs.
If nanopore sequencing succeeds in generating very long (good fraction of a megabase) reads, many of the complications of assembling metagenomic sequences would go away.
Metagenomics and microbiome studies are intrinsically interesting to experts and nonexperts—inspired by our curiosity about what can survive in this or that extreme environment, and also, what is living on me?
The challenges of metagenomics are severalfold, including sufficient read density, read lengths, wide dynamic range, and sensitive limits of detection. Furthermore, the bioinformatics required are staggering. In many ways, these challenges resemble those for most NGS applications, only more so.
It doesn’t hurt these efforts that the NIH launched a five-year, $140 million effort in 2007.
The field has not settled on a platform, most likely because the range of metagenomics studies is so broad, from small, focused sampling efforts to large-scale microbiome and evolutionary diversity studies. In the former case, traditional Sanger sequencing of 16S rRNA is still used, and pyrosequencing, which has a long history from the early days of metagenomics/microbiome studies, remains a strong player in this space.
Given that current and future efforts consider much larger samples, it is likely that the high read count platforms will become the standard. For example, the Beijing Genome Institute’s metagenomics services rely on Illumina’s HiSeq2000.
Of course, the platform choice will depend on the complexity of your sample; tens to hundreds of millions of reads can be overkill.
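One way to judge "overkill" is the detection limit implied by simple random sampling: with a taxon at relative abundance p, the chance that N reads include at least one read from it is 1 − (1 − p)^N. A minimal sketch, with hypothetical abundance and read-count figures:

```python
# Detection limit under simple random sampling of reads.
# A taxon at relative abundance `abundance` is hit by at least one of
# `n_reads` reads with probability 1 - (1 - abundance)^n_reads.

def p_detect(abundance, n_reads):
    return 1.0 - (1.0 - abundance) ** n_reads

# A taxon at 0.01% abundance is almost always detected with 100,000 reads,
# so beyond that point extra reads buy quantitative depth, not detection.
print(p_detect(1e-4, 100_000) > 0.99)  # True
```

For low-complexity communities, detection of even rare members saturates quickly, which is why medium-throughput instruments can be a better fit than the highest-output platforms.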
This would seem to be an excellent space for the medium-throughput instruments, including the aforementioned 454 instruments, the Ion Torrent PGM, and the coming MiSeq. Both the PGM and MiSeq offer speed advantages, which could be an important consideration if one were to eventually bring the sequencer to the genome, as opposed to now, where we bring the genome to the sequencing lab.
How can next-generation sequencing overcome the multiple gene copy problem?
This should not be a problem with long sequential reads (if nanopores deliver).
One solution is high read depth combined with paired-end reads. The 454 platform has been applied to this problem with success on a polyploid genome, where 15 reads/gene provided 95% confidence, but there is no reason this approach cannot work on the shorter-read platforms; indeed, the greater read numbers will be an advantage, for example by providing greater confidence at similar depth of coverage. With respect to CNVs, paired-end and mate-pair strategies on the Illumina and SOLiD platforms, respectively, have proven effective.
The answer will be longer reads and ultimately very long reads. For example, we are currently testing “strobe reads” from the PacBio instrument. These reads contain multiple short sub-reads originating from one long fragment of DNA and can span large regions (5–10 kb) of the genome. This will enable us to effectively disentangle some of the duplicate genes and regions in the genome.
Which next-generation platforms are ideally suited to study rare genetic conditions?
To study rare genetic conditions, the sequencing platform needs to be accurate, cost effective, and have a high throughput. These attributes fit so-called short-read technologies from Illumina or Life Technologies best.
Any of the available platforms are ideally suited for the analysis of rare genetic conditions in that they all have the ability to detect rare variants in a single sample or group of samples compared to a reference genome. Experimental design, study size, disease of interest, and validation experiments are substantially more important than the specific methodology employed. The currently available platforms from Illumina and Life Technologies have been demonstrated to be powerful choices in the analysis of human genetic diversity and the detection of rare genetic conditions.
Accuracy is the key here, as well as low cost. Right now this is a difficult problem, as cross-platform checks show. The base-calling accuracy for nanopore methods has yet to be established.
I would argue that it is not a question of the best platform, but of the best sample-preparation method.
How can chromosomal repeat regions be assessed by next-generation sequencing?
Chromosomal repeat regions can be assessed through a combination of paired-end sequencing, longer read lengths, and the use of sample-preparation methods that allow large insert libraries to be made. Techniques such as jumping libraries, mate-pair libraries, and related methods allow paired sequencing reads to be generated that were originally separated by several kilobases of sequence. Generating diverse libraries with these methods and combining the sequencing data with more standard sequencing sample-preparation methods allows a robust assembly of the genome to be performed using newer software tools such as ALLPATHS-LG from the Jaffe lab and colleagues at the Broad Institute. These assemblies should perform reasonably well across repeat regions, although certainly some larger repeat regions will always be problematic with short-read sequencing.
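A rough rule of thumb for when a library can "jump" a repeat: both reads of a pair must land in unique flanking sequence, so the insert size must exceed the repeat length plus roughly one read length on each side. A minimal sketch of that condition, with illustrative sizes:

```python
# Simplified bridging condition for mate-pair/jumping libraries.

def spans_repeat(insert_size, repeat_len, read_len):
    """True if a mate pair of this insert size can bridge the repeat,
    assuming both reads must lie entirely in unique flanking sequence."""
    return insert_size >= repeat_len + 2 * read_len

# A 3 kb jumping library bridges an Alu-sized (~300 bp) repeat...
print(spans_repeat(3000, 300, 100))   # True
# ...but not a 5 kb segmental duplication; that needs a larger-insert library.
print(spans_repeat(3000, 5000, 100))  # False
```

This is why assemblers like ALLPATHS-LG ask for a mix of insert sizes: each library resolves the repeat classes shorter than its span.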
It’s the same answer as for repeated genes—long sequential reads should be made possible by nanopores.
At the end of the day, this is a hard computational problem. Several groups are actively working on this question. In addition, algorithms developed for detecting sequence errors and SNPs have been used to disentangle highly repetitive regions. So, regarding the best platform, it would be the one with the lowest error rate.
The bottom line is that no large sequencing center will have a single platform; it is too risky, and several of the platforms are complementary, e.g., Illumina/SOLiD plus PacBio. Furthermore, most centers are going to have one or more “development machines”, those instruments that are promising but not quite ready for prime time. The evolution of this field will likely include some dead ends and some dramatic transformations.
From my perspective, I am most excited about advances in sample prep, specifically, increases in speed and greater integration of automation.
Finally, one way to gauge/predict the success of a particular platform or methodology is to follow the publications!