September 1, 2009 (Vol. 29, No. 15)
Reducing the Complexity of the Human Genome for Resequencing Applications
Thousands of genome-wide association studies (GWAS) for hundreds of diseases have identified candidate regions of interest, but have yet to identify the underlying causative genetic variants. What has hindered these projects is the lack of a rapid, affordable method that allows for 100 kilobases to 30 megabases of the human genome to be resequenced across a large-enough sample population to identify all genetic variations that may contribute to a specific disease.
Although whole human genome sequencing has been demonstrated using next-generation sequencing technologies, it is neither practical nor feasible for laboratories outside of major genome centers. Likewise, it is not practical to perform long-range PCR to target hundreds of kilobases, let alone megabase regions of the genome.
NimbleGen Sequence Capture arrays (Roche Applied Science), combined with 454 Life Sciences next-generation sequencing, simplify the resequencing workflow (Figure 1) while reducing overall project costs. With this combination of technologies, researchers can completely resequence candidate genes and regions previously determined through GWAS, identifying all genetic variants that may be present. These genetic variations include single nucleotide polymorphisms as well as insertions and deletions of nearly any size.
When using this sequence capture method to reduce the complexity of the genome, keep in mind that performance depends on the combination of technologies: the sequence-capture technology enriches the sample, and the sequencing platform serves as the detection device. As such, several important criteria must be evaluated:
- Amount of genomic material needed per enrichment: Some methods require more than 20 to 30 micrograms, while the optimized protocol using NimbleGen Sequence Capture arrays and the 454 Sequencing System requires only five micrograms of genomic DNA.
- Percent of targeted region that is detected using sequencing reads.
- Uniformity of sequencing coverage across the targeted region: The goal is to achieve the same level of enrichment across the region, minimizing regions that are overly represented and ensuring that there are as few gaps as possible.
- Required sequencing coverage to detect genetic variations: This refers to the number of sequencing reads that must span a specific position in order to determine whether a variation is present there.
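The last three criteria can be made concrete with a small calculation. The following Python sketch is illustrative only and is not part of the NimbleGen or 454 software: it assumes reads are given as half-open (start, end) intervals on a single target region, and the function names and the 20%-of-mean uniformity threshold are hypothetical choices made for this example.

```python
from statistics import mean


def per_base_depth(target_start, target_end, reads):
    """Return the read depth at each base of the half-open target
    [target_start, target_end). `reads` holds half-open (start, end)
    alignment intervals (a hypothetical, simplified representation)."""
    depth = [0] * (target_end - target_start)
    for start, end in reads:
        lo = max(start, target_start)
        hi = min(end, target_end)
        for pos in range(lo, hi):  # empty when the read misses the target
            depth[pos - target_start] += 1
    return depth


def coverage_summary(depth, min_depth=1, fraction_of_mean=0.2):
    """Summarize percent of target detected, mean depth, and uniformity.

    Uniformity here is the percent of bases whose depth is at least
    `fraction_of_mean` times the mean depth. The 0.2 default is an
    arbitrary illustrative threshold, not a vendor specification.
    """
    if not depth:
        raise ValueError("target region is empty")
    n = len(depth)
    avg = mean(depth)
    covered = sum(1 for d in depth if d >= min_depth)
    near_mean = sum(1 for d in depth if avg > 0 and d >= fraction_of_mean * avg)
    return {
        "percent_covered": 100.0 * covered / n,
        "mean_depth": avg,
        "percent_near_mean": 100.0 * near_mean / n,
    }


# Toy data: three overlapping reads on a 12-base target.
reads = [(0, 6), (2, 8), (4, 10)]
depth = per_base_depth(0, 12, reads)
summary = coverage_summary(depth)
print(depth)
print(summary)
```

In this toy case the last two bases of the target receive no reads, so they count as a gap in coverage and pull both the percent detected and the uniformity measure below 100.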
Two different NimbleGen Sequence Capture Custom Delivery arrays are available: a 385K probe array that can target 100 kilobases to 5 megabases of genomic regions, and a larger 2.1M feature array that targets 5 to 30 megabases of genomic DNA sequence.
The NimbleGen Sequence Capture 2.1M Human Exome array targets over 180,000 exons as defined by CCDS (Consensus Coding Sequence, build 36.2), and 551 microRNA genes. This exome sequence capture array has been used to support population studies, cancer disease models, and Mendelian disease studies; researchers can also use it to quickly identify most genetic variations present within the coding portion of the genome.
The NimbleGen Custom Delivery and 2.1M Human Exome arrays are generated using design algorithms that result in uniform binding properties, specificity to the target regions, and optimal number of probes per region for a homogeneous enrichment.
Analysis software is a key component of the NimbleGen and 454 Life Sciences optimized sequence-capture method. The GS Reference Mapper, a dedicated software application, automatically aligns the sequencing reads and provides a tabulated summary of the genetic variations identified, relative to a reference sequence; this analysis takes only a few hours.
Sample Preparation
The first step in the Titanium Optimized Exome Sequence Capture protocol (Figure 2) is the preparation of a 454 GS FLX Titanium series library from 5 µg of randomly fragmented genomic DNA. The GS FLX Titanium library is hybridized to the NimbleGen Sequence Capture 2.1M Human Exome array.
After hybridization, unwanted genomic DNA (the regions not targeted by the array) is washed away, and the captured DNA is eluted. This post-capture DNA library is quantitated to ensure that the capture process performed to a specific quality standard. The library is then incorporated into the standard Genome Sequencer FLX System workflow at the emPCR amplification step, followed by sequencing on the Genome Sequencer FLX Instrument.
Data Processing
The resulting sequence data is analyzed with the GS Reference Mapper software. For sequence capture, the software application requires two input reference files: the complete human genome reference (hg18 from the UCSC Genome Browser, University of California, Santa Cruz), and the .gff file that describes the targeted portion of the genome enriched by the array.
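To illustrate what a target description file contains, the Python sketch below reads target intervals from GFF-style lines and totals the targeted bases. It relies only on the general GFF layout (tab-separated columns, with the sequence name in column 1 and 1-based inclusive start and end in columns 4 and 5). The exact contents of the array's .gff file are not described here, so the example records, the "hypothetical" source field, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def parse_gff_targets(lines):
    """Collect (start, end) target intervals per sequence from GFF lines.

    Coordinates are kept as GFF gives them: 1-based and inclusive.
    Comment lines, blank lines, and short lines are skipped.
    """
    targets = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        targets[seqid].append((start, end))
    for seqid in targets:
        targets[seqid].sort()
    return dict(targets)


def total_target_bases(targets):
    """Count targeted bases, merging overlapping intervals so that no
    base is counted twice."""
    total = 0
    for intervals in targets.values():
        cur_start, cur_end = None, None
        for start, end in intervals:
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            total += cur_end - cur_start + 1
    return total


# Made-up records for illustration; not taken from any real design file.
example = [
    "##gff-version 3",
    "chr1\thypothetical\ttarget\t100\t199\t.\t+\t.\tID=t1",
    "chr1\thypothetical\ttarget\t150\t249\t.\t+\t.\tID=t2",
    "chr2\thypothetical\ttarget\t500\t599\t.\t+\t.\tID=t3",
]
targets = parse_gff_targets(example)
total = total_target_bases(targets)
print(targets)
print(total)
```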
The software maps all of the sequencing reads against the full human genome reference. Mapping against the full genome reference helps to eliminate false positives by determining if a sequencing read maps uniquely to the target region or if the sequencing read also maps to a region elsewhere in the genome.
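The reasoning behind mapping to the full genome can be sketched in a few lines of Python. This is not the GS Reference Mapper algorithm, whose internals are not described here; it is a simplified illustration that assumes each read comes with a list of candidate alignment locations, and the names `classify_read` and `overlaps_target` are hypothetical.

```python
def overlaps_target(hit, targets):
    """Return True if an alignment (chrom, start, end) overlaps any
    target interval on the same chromosome. Coordinates are inclusive."""
    chrom, start, end = hit
    return any(
        start <= t_end and end >= t_start
        for t_start, t_end in targets.get(chrom, [])
    )


def classify_read(hits, targets):
    """Classify a read from its list of genome-wide alignment locations.

    A read that aligns to more than one place is set aside as ambiguous,
    even if one of those places is inside the target region.
    """
    if not hits:
        return "unmapped"
    if len(hits) > 1:
        return "multi_mapped"
    if overlaps_target(hits[0], targets):
        return "unique_on_target"
    return "unique_off_target"


# Toy data for illustration only.
targets = {"chr1": [(100, 199)]}
reads = {
    "r1": [("chr1", 120, 380)],                        # one hit, on target
    "r2": [("chr1", 150, 400), ("chr7", 900, 1150)],   # second hit elsewhere
    "r3": [("chr3", 10, 250)],                         # one hit, off target
    "r4": [],                                          # no alignment found
}
labels = {name: classify_read(hits, targets) for name, hits in reads.items()}
print(labels)
```

Read r2 shows why the full reference matters: if it were aligned against the target region alone, it would look like a clean on-target read, and any mismatch it carried could be reported as a variant.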
To assess the reproducibility of the sequence capture and sequencing process, the entire process was repeated six times using the publicly available human HapMap NA11881 sample (Table). Over 99% of the reads mapped to the human genome, indicating a high degree of fidelity in the capture and sequencing processes. Of those mapped reads, ~70% mapped to the target region.
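Note that the two percentages have different denominators: the mapped fraction is relative to all reads, while the on-target fraction is relative to mapped reads only. The short Python sketch below makes that explicit. The read counts are invented round numbers chosen to give similar percentages; they are not the experimental data from the Table, and the function name is hypothetical.

```python
def capture_metrics(total_reads, mapped_reads, on_target_reads):
    """Compute percent mapped (of all reads) and percent on target
    (of mapped reads) from simple read counts."""
    if total_reads <= 0 or mapped_reads <= 0:
        raise ValueError("read counts must be positive")
    return {
        "percent_mapped": 100.0 * mapped_reads / total_reads,
        "percent_on_target": 100.0 * on_target_reads / mapped_reads,
    }


# Invented counts for illustration only, not results from the experiment.
metrics = capture_metrics(total_reads=1000, mapped_reads=990, on_target_reads=693)
print(metrics)
```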
The vast majority of the reads that did not map to the target region were within the introns bordering the targeted exons and were most likely captured by probes complementary to the ends of a given exon. This result presents an additional value to researchers interested in querying variation within the exon/intron boundary regions. For some variants, uniformity of coverage and adequate sequencing depth for detection were achieved with just 4.5x sequencing coverage.
Summary
The targeted resequencing of the human exome provides a fast, cost-effective, and bioinformatically tractable method of generating sequence data for the discovery of genetic variation within this important region. The combination of NimbleGen Exome Sequence Capture with 454 GS FLX Titanium Sequencing in an optimized protocol streamlines the workflow, reduces the amount of sample DNA required, and leads to a more uniform coverage across the captured region.
The long sequencing reads allow the discovery of SNPs and more complex variations such as large insertions and deletions. With one array and two sequencing runs, samples can be analyzed to their full potential, enabling a better understanding of genomic variations, which is key to the study of human diseases. Additionally, NimbleGen Sequence Capture Custom Delivery arrays provide an effective method for resequencing candidate genes and other regions previously identified in genome-wide association studies.
Clotilde Teiling is marketing manager, sequencing, at Roche Applied Science, and Thomas Jarvie, Ph.D., is technical application manager at 454 Life Sciences. Additional information about the Genome Sequencer FLX System and NimbleGen Microarrays is available at www.454.com and www.nimblegen.com. For life science research use only. Not for use in diagnostic procedures. 454, 454 Life Sciences, 454 Sequencing, EMPCR, GS FLX Titanium, and NimbleGen are trademarks of Roche. Other brands or product names are trademarks of their respective holders. License disclaimer information is available online (www.roche-applied-science.com).