February 15, 2009 (Vol. 29, No. 4)

Researchers Revisit the Human Genome to Identify Disease-Associated Variants

Targeted resequencing is one of the fastest growing applications for next-generation sequencing technology. The ultimate goal is to look for causative mutations within discrete genomic loci in populations in order to enhance diagnostics and treatments. The availability of next-generation sequencing tools has dramatically increased sequencing throughput.

Some of the challenges and highlights in the field will be presented at CHI’s “Next-Generation Sequencing Conference” next month. Scientists will be reporting on cutting-edge technologies that include massively parallel analysis of DNA fragments linked to beads, microfluidics for droplet-based PCR, and microarray-focused enrichment of targeted sequences.

“Recent data suggests that disease may not be associated with a limited number of common variants, but more likely caused by a large number of rare variants,” says Jeff Olson, nucleic acid applications manager, RainDance Technologies (RDT). Targeted sequencing on next-generation sequencers is one of the technologies that can be applied to identify these rare variants in patient samples.

“We have created a whole new platform class using the simplicity and speed of droplet-based microfluidics to increase the efficiency of targeted sequencing,” Olson says. “Basically, we perform PCR in picoliter volume droplets, which avoid the challenges of alternative methods such as multiplexed PCR amplifications and hybridization.”

The technology provides a means for high-resolution analysis of genetic variation among individuals within populations. “The use of droplet-based microfluidics greatly improves uniformity and specificity of the targeted sequencing process, which increases the efficiency of next-generation sequencers.”

The key to accomplishing this process is, according to RainDance, its Sequence Enrichment solution, which begins with the design and generation of a PCR primer pair library in droplets capable of amplifying hundreds to thousands of genomic loci.

The RDT 1000 instrument merges these droplets containing the PCR primer pairs with droplets containing the genomic DNA sample and the PCR reaction mix. This forms a fully functional PCR reaction in the merged droplet. The RDT 1000 instrument can generate these PCR droplets at a rate of 10 million discrete droplets per hour, Olson says. The droplets are then thermocycled with an end result of thousands of PCR-amplified genomic loci in a single tube that can then be characterized by any of the next-generation sequencing platforms.

Scientists carrying out targeted resequencing perform a comparative analysis of candidate genes to search for rare single nucleotide polymorphisms and structural variants.

Identifying Sequence Variants

Ewen Kirkness, Ph.D., investigator in the genomic medicine group at the J. Craig Venter Institute, is using RainDance Technologies’ droplet-based platform.

“We are all on a learning curve with the new technologies for target preparation and sequencing. Until recently, Sanger sequencing of PCR products has been the gold standard. With high levels of sequence coverage, however, the newer sequencing technologies can reveal a more comprehensive collection of sequence variants in targeted regions of the genome. The challenge now is to optimize sample preparation for the most cost-effective use of the new sequencing technologies.”

Dr. Kirkness’ group is collaborating with Scripps Research Institute to identify genomic variants associated with healthy aging. “We are interested in the healthy elderly, and the specific genes that contribute to this condition. We’ve spent a few months considering the related literature, and have come up with an initial list of about 100 candidate genes. After comparing to matched controls, we hope to identify key genes that are involved in healthy longevity.”

According to RainDance Technologies, the RDT 1000 increases the efficiency of targeted sequencing applications with any next-generation sequencing platform.

Biochips for Automation

A microarray-based approach to targeted resequencing is offered by febit. Its HybSelect™ provides a method for capturing genetic regions of interest in high throughput. “One of the main bottlenecks in next-generation sequencing is that the capacity is not large enough for large eukaryotic (e.g., human) genomes,” Daniel Summerer, Ph.D., head of application development, enzyme-on-chip technologies, explains. “There needs to be a method to allow enrichment of desired sequences to enable studies targeted to specific biological questions. The use of microarray-based enrichment methods helps to solve that problem.”

According to Dr. Summerer, HybSelect employs target-specific DNA extraction using Geniom® biochips. The biochips are directly synthesized with up to 120,000 capture probes specific to the genomic region of interest. Next, the sample (about 1 µg of genomic DNA) is fragmented and adaptors are ligated to the fragments to prepare a DNA library. The library is hybridized to the biochip overnight with active motion enabled by the microfluidics of the Geniom RT analyzer instrument. Finally, the target genomic DNA is recovered and then ready for next-generation sequencing.

Dr. Summerer says that this technology offers several benefits over traditional DNA microarray approaches. “The biochips have flexible probe content, in that chips can be customized within a day to select any target sequence of interest. They also have a unique microfluidic architecture that minimizes the required sample amount and allows the process to be fully automated.”

The company recently employed the technology for a large-scale study to enable the discovery of cancer gene SNPs. “This allows us to identify novel SNPs and other genomic variations that are relevant to various cancer types. Overall, the technology is useful for many types of research such as analysis of various diseases, pathogens, or microbial diversity and may contribute to future diagnostic tools.” The technology is currently available for early access.

Traditional capillary electrophoresis provides a high level of accuracy but is primarily useful for analyzing a limited set of amplicons in a large number of patient samples. The method is both expensive and labor intensive for analyzing a large number of genes. Life Technologies’ SOLiD™ System is designed to tackle large-scale resequencing.

“The SOLiD™ System is a highly accurate, massively parallel genomic analysis platform based on sequential ligation and clonally amplified DNA fragments linked to beads,” says Michael Rhodes, Ph.D., senior manager of product applications for SOLiD. “The core technology utilizes two independent flow cells that provide researchers with the flexibility to run two completely independent experiments in a single run. In addition, the system is based on two-base encoding that offers greater than 99.94% base-calling accuracy. Combined, these features allow the SOLiD System to support a wide range of applications.”
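The two-base encoding Dr. Rhodes mentions can be illustrated with a short sketch. In the commonly described color-space scheme, each color call reflects a pair of adjacent bases rather than a single base, so a true single-base change alters two neighboring colors, while an isolated color change more likely reflects a measurement error. The Python below is an illustration only, not Life Technologies' software; the function names and the 2-bit base codes are assumptions chosen for the example.

```python
# Illustrative sketch of two-base ("color space") encoding.
# Assumption: bases are mapped to 2-bit codes so that the color for a
# pair of adjacent bases is the XOR of their codes (values 0-3).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Return the first base and one color (0-3) per adjacent base pair."""
    codes = [BASE_TO_BITS[base] for base in seq]
    colors = [left ^ right for left, right in zip(codes, codes[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # XOR with the color gives the next base's code
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)


def changed_color_positions(reference, sample):
    """List the color positions that differ between two equal-length reads."""
    _, ref_colors = encode_colors(reference)
    _, sample_colors = encode_colors(sample)
    return [i for i, (r, s) in enumerate(zip(ref_colors, sample_colors)) if r != s]


if __name__ == "__main__":
    first, colors = encode_colors("ATGGC")
    print(first, colors)
    print(decode_colors(first, colors))
    # A single-base substitution (G -> C at the third base) changes
    # two adjacent colors, not one.
    print(changed_color_positions("ATGGC", "ATCGC"))
```

In this toy model, a read whose colors differ from the reference at only one isolated position cannot be explained by a single-base substitution, which is the property that lets color-space analysis separate likely errors from likely variants.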

Enhancements to System

The company recently launched the SOLiD 3 System, which offers several enhancements including higher bead density, walk-away automation, data-analysis tools, and multiplex capability. 

Applications for the new system include genomic, transcriptomic, and epigenomic analysis. “For example, the platform can be used for genome-wide or targeted resequencing projects as well as de novo assembly of previously uncharacterized genomes where sequencing is done in the absence of a reference,” Dr. Rhodes explains. “For resequencing projects, researchers use either genomic DNA for whole genome projects or DNA regions selected by a variety of techniques, ranging from PCR (hundreds of bases) and long-range PCR (thousands of bases) to arrays (millions of bases).

“The goal of these projects is to identify causative mutations within a given sample population, such as structural variants including SNPs, copy-number variations (CNV), genomic rearrangements, and insertions and deletions both small (1 base) and large (several kilobases).”

The massive amounts of data generated, however, demand new means for analysis. “The biggest challenge is undoubtedly the bioinformatics,” according to Dr. Rhodes. “The challenge is caused by the amount of data that can be generated from a single run: 20 gigabases and more. Already, several consortia have found the most efficient way to transfer data has been physical transfer of hard drives between sites. The scale of data means that many existing tools either cannot process the data volume or produce graphical representations that are overwhelmed by data points.”

Population Variations

Next-generation sequencing technologies are also being applied to study evolutionary changes. “There has never been a better time to analyze molecular variation data from natural populations,” notes Paul Marjoram, Ph.D., assistant professor, preventative medicine, Keck School of Medicine, University of Southern California (Los Angeles).

“We are examining mutation and recombination rates between individuals. It is a rite of passage for a computational biologist to develop methods to determine the number of mutations in a data set and to then calculate the rate at which mutations happen. The ultimate goal is to design association studies.”

“Genome-wide association studies interrogate the genome in a set of individuals and look for polymorphisms that differentiate two populations, for example, those that have a disease and those that don’t,” explains Dr. Marjoram. “The problem is that often you can only derive partial information. There are holes and gaps in the coverage of the genome. These gaps can be filled by inferring sequences and imputing the missing data. This can be made easier by referring to an external library of data for related individuals in which you already know what falls in the missing regions.

“The key question to ask when starting association studies is ‘how big a sample do I need?’ So, the first step is to do a power calculation. There are a number of ways to do this, but the bottom line is that you have to divide the coverage across samples. We have found that it is better to divide coverage equally across individuals. But, even given that knowledge, you still need to decide whether to use, for example, 100 individuals and 20-fold coverage, or 500 individuals with fourfold coverage.”
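The trade-off Dr. Marjoram describes can be made concrete with a toy calculation. Both designs he mentions spend the same sequencing budget (100 × 20 = 500 × 4 = 2,000 individual-fold units). The Python sketch below is not the power calculation his group uses; it is a simplified variant-discovery model under stated assumptions: every individual gets exactly the nominal depth, a carrier is heterozygous so each read shows the variant allele with probability 0.5, a variant counts as detected when at least two reads show it, and the allele frequency of 0.005 is a hypothetical value picked for illustration.

```python
from math import comb


def detect_prob(depth, min_alt_reads=2):
    """P(at least min_alt_reads reads carry the variant at a heterozygous site).

    Assumes exactly `depth` reads cover the site and each read samples
    the variant allele with probability 0.5.
    """
    miss = sum(comb(depth, i) for i in range(min_alt_reads)) * 0.5 ** depth
    return 1.0 - miss


def discovery_prob(n_individuals, depth, allele_freq, min_alt_reads=2):
    """P(the variant is detected in at least one sequenced individual)."""
    carrier = 1.0 - (1.0 - allele_freq) ** 2  # carries at least one copy
    per_person = carrier * detect_prob(depth, min_alt_reads)
    return 1.0 - (1.0 - per_person) ** n_individuals


if __name__ == "__main__":
    allele_freq = 0.005  # hypothetical rare variant
    for n, depth in [(100, 20), (500, 4)]:
        budget = n * depth
        p = discovery_prob(n, depth, allele_freq)
        print(f"{n} individuals x {depth}-fold (budget {budget}): "
              f"P(discover) = {p:.3f}")
```

Under these assumptions, the wider, shallower design is more likely to find a rare variant at least once, while the deeper design detects it far more reliably in any one carrier. A real association study would also need to model genotype-calling error, effect size, and case-control sampling.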

Ultimately, it is like the race of the tortoise and the hare. The hare, in this case, will use inexact methods to more quickly produce a best guess, while the tortoise will perform slow, steady, and difficult annotation of genomic sequences. “Both approaches will produce results in their own time, but faster and more useful methods are available right now by using simplified models or summaries of the data,” concludes Dr. Marjoram.

The SOLiD 3 System was designed to tackle large-scale resequencing, according to Life Technologies.

Protein-Coding Genome

While the cost of resequencing a complete human genome has dropped from millions of dollars to about $25,000–$100,000, it is still a very expensive project, says Jay Shendure, assistant professor, genome sciences, University of Washington. “We are far away from the $1,000 genome, but we are making strides.”

Dr. Shendure’s group is examining how to more efficiently isolate and study complex genome subsets. “We are exploring several experimental strategies for massive multiplex capture of discontiguous genomic subsequences. This is a prerequisite for efficient sequencing of the protein coding genome (PCG), which amounts to about 1% of the entire human genome.”

The PCG is an important genomic subset that would allow more economical screening of many individuals instead of examining the whole genome of fewer individuals. “It all depends on where you think the relevant and interpretable variation will be. Protein-coding variation is more amenable to interpretation and follow-up than regulatory variation. For every complete human genome that you sequence, you could instead characterize 10–100 times as many PCGs.”
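One way to read the 10–100 range is as simple arithmetic on the 1% figure. If the PCG is about 1% of the genome and every read landed on target, the same sequencing effort would cover 100 PCGs. If only a fraction of reads land on target, the gain shrinks in proportion. The Python sketch below illustrates this reading; the on-target fractions are hypothetical values, not figures reported by Dr. Shendure, and the function name is an assumption for the example.

```python
def pcgs_per_genome(pcg_fraction=0.01, on_target_fraction=1.0):
    """Protein-coding genomes covered for the effort of one whole genome.

    pcg_fraction: share of the genome that is protein coding (about 1%).
    on_target_fraction: hypothetical share of reads that hit the targets.
    """
    return on_target_fraction / pcg_fraction


if __name__ == "__main__":
    for on_target in (1.0, 0.5, 0.1):  # hypothetical capture efficiencies
        fold = pcgs_per_genome(on_target_fraction=on_target)
        print(f"on-target {on_target:.0%}: about {fold:.0f} PCGs per genome")
```

This ignores the cost of the capture step itself and any extra depth needed to even out coverage across exons, both of which would reduce the gain further.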

Dr. Shendure is working to optimize a next-generation sequencing strategy that uses microarrays. “Microarrays are useful for direct capture-by-hybridization of sequences of interest, and we are pursuing this. We have also shown that a complex mixture of molecular inversion probes obtained by parallel synthesis on and release from the surface of a programmable microarray can capture approximately 50,000 exons in a single reaction. We’re currently evaluating which of several potential methods for PCG capture will be the most robust and scalable.”
