October 15, 2012 (Vol. 32, No. 18)
Greg Crowther, Ph.D.
As next-generation sequencing (NGS) transforms biomedical research, industrial and academic scientists continue to improve the efficiency of their sample-preparation and data-filtering techniques.
One of the main challenges is that of target enrichment: the selective sequencing of genomic or transcriptomic regions. The polymerase chain reaction (PCR) can be considered the original target-enrichment technique and continues to be useful in contexts such as genome finishing. Cliff Han, Ph.D., computational finishing team leader at the Joint Genome Institute at Los Alamos National Laboratory, and colleagues are working on advancing the use of sequencing technologies with microbial genomes. Dr. Han’s group focuses on PCR-based target enrichment.
“There are two separate enrichment processes we were trying to develop for unfinished genomes,” said Dr. Han. “One target set is the unique gaps—the gaps in the unique sequence regions. Another is to enrich the repetitive sequences…ribosomal RNA regions, which together are about 5 kb or 6 kb.”
The unique-sequence gaps were targeted for PCR with 40-nucleotide primers complementary to sequences adjacent to the gaps. Unfortunately, this strategy did not yield the several-hundred-fold enrichment expected based on previously published work. “We got a maximum of 70-fold enrichment and generally in the dozens of fold of enrichment,” noted Dr. Han.
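To make a figure such as "70-fold enrichment" concrete, the short sketch below computes fold enrichment as the fraction of reads that land on target divided by the fraction of the genome the target occupies. That definition is a common convention and is assumed here; it is not necessarily the metric Dr. Han's group used. The function name and all numbers are hypothetical.

```python
def fold_enrichment(on_target_reads, total_reads, target_bp, genome_bp):
    """Observed on-target read fraction divided by the fraction expected by chance.

    Assumed definition: without enrichment, reads fall on the target in
    proportion to the target's share of the genome.
    """
    if total_reads <= 0 or target_bp <= 0 or genome_bp <= 0:
        raise ValueError("total_reads, target_bp and genome_bp must be positive")
    observed = on_target_reads / total_reads
    expected = target_bp / genome_bp
    return observed / expected


if __name__ == "__main__":
    # Hypothetical run: 5 kb of gap-flanking sequence in a 5 Mb genome,
    # with 70,000 of 1,000,000 reads mapping to the target.
    print(round(fold_enrichment(70_000, 1_000_000, 5_000, 5_000_000), 1))
```

With these made-up inputs the target makes up 0.1% of the genome but 7% of the reads, which works out to about 70-fold.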
Dr. Han was one of a number of scientists who made presentations regarding target enrichment at the “Sequencing, Finishing, and Analysis in the Future” (SFAF) conference in Santa Fe, which was co-sponsored by the Los Alamos National Laboratory and DOE Joint Genome Institute.
According to Dr. Han, for the repetitive regions, PCR primers are based on conserved sections of the genes for 16S and 23S ribosomal RNA, which appear in many locations of bacterial genomes.
“We enrich the genome, put the enriched fragments onto the Pacific Biosciences sequencer, and sequence the repeats,” continued Dr. Han. “In many parts of the sequence there will be a unique sequence anchored at one or both ends of it, and that will help us to link these scaffolds together.”
This work, while promising, will remain unpublished for now, as the Joint Genome Institute has shifted its resources to other projects.
As target-enrichment strategies go, PCR-based methods have both advantages and disadvantages relative to hybridization-based methods, in which “bait” sequences are used to capture the targets of interest from genomic libraries.
“Hybridization methods are flexible and have multiple stop-start sites, and you can capture very large sizes, but they require library prep,” said Jennifer Carter Jones, Ph.D., a genomics field applications scientist at Agilent. “With PCR-based methods, you have to design PCR primers and you’re doing multiplexed PCR, so it’s limited in the size that you can target. But the workflow is quick because there’s no library preparation; you’re just doing PCR.”
Given these considerations, the choice of method depends on a given project’s specific questions and equipment.
“If you’re just thinking about them roughly, the high-throughput sequencers like the Illumina HiSeq system or Life Technologies SOLiD 5500 system have tremendous capacity and can target very large capture sizes, so hybridization can be a good fit,” said Dr. Jones.
“And then there’s your desktop sequencers, which are limited in capacity, but you can get results very quickly, and so a PCR-based method may be advantageous.”
At the SFAF conference Dr. Jones focused on going beyond basic target enrichment and described new tools for more efficient NGS research. She discussed Agilent’s recently acquired HaloPlex technology, a hybrid system that includes both a hybridization step and a PCR step. Because no library preparation is required, sequencing results can be obtained in about six hours, making it suitable for clinical uses.
However, the hybridization step allows capture of targets of up to 5 megabases—longer than purely PCR-based methods can deliver.
The Agilent talk also provided details on the applications of SureSelect, the company’s hybridization technology, to Methyl-Seq and RNA-Seq research. With this technology, 120-mer baits hybridize to targets, then are pulled down with streptavidin-coated magnetic beads.
“The reason why a long bait is important,” said Dr. Jones, “is that it’s more tolerant of mismatches. So if you have large indels, we’ll still be able to capture and pull down your targets.”
Agilent’s SureSelectXT Human MethylSeq includes baits for capturing the 84 megabases of the genome, including 3.7 million CpGs, that are thought to be most important in determining a cell’s methylation state.
One of the basic clinical challenges in target enrichment is sequencing pathogen DNA amid an excess of host DNA. This was the topic of a presentation by Todd Lane, Ph.D., a research scientist in the systems biology department of Sandia National Laboratories.
For background, Dr. Lane reviewed how host DNA or cDNA is commonly suppressed in mixed samples.
“The DNA or cDNA sample is denatured by melting at high temperature and allowed to re-anneal over time,” he said. “The single-stranded DNA [ssDNA] species that are in highest abundance in the sample will find their opposite-stranded partners most rapidly and form double-stranded DNA [dsDNA].”
The ssDNA will be enriched for rare sequences such as those from the pathogen. This ssDNA can then be separated from the dsDNA with hydroxyapatite chromatography, which has different affinities for the two.
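The reasoning behind this suppression step can be sketched with the textbook second-order model of DNA reassociation, in which the fraction of a sequence still single-stranded after time t is 1 / (1 + k·C0·t), where C0 is that sequence's starting concentration. The sketch below is an idealization: it treats host and pathogen as one sequence species each, and the rate constant, concentrations, and function names are hypothetical rather than taken from the Sandia work.

```python
def fraction_single_stranded(c0, k, t):
    """Idealized second-order reassociation: fraction of strands still unpaired.

    c0: starting concentration of one sequence species (arbitrary units)
    k:  reassociation rate constant (hypothetical value, matching units)
    t:  re-annealing time
    """
    return 1.0 / (1.0 + k * c0 * t)


def pathogen_share_of_ssdna(host_c0, pathogen_c0, k, t):
    """Share of the remaining single-stranded pool that is pathogen-derived."""
    host_ss = host_c0 * fraction_single_stranded(host_c0, k, t)
    pathogen_ss = pathogen_c0 * fraction_single_stranded(pathogen_c0, k, t)
    return pathogen_ss / (host_ss + pathogen_ss)


if __name__ == "__main__":
    # Hypothetical sample: host sequence 1,000 times more abundant than pathogen.
    for t in (0.0, 0.1, 1.0, 10.0):
        share = pathogen_share_of_ssdna(host_c0=1000.0, pathogen_c0=1.0, k=1.0, t=t)
        print(f"t={t:>5}: pathogen share of ssDNA = {share:.4f}")
```

Because the abundant host sequence re-anneals far faster than the rare pathogen sequence, the pathogen's share of the single-stranded pool rises with time in this model, which is the pool that hydroxyapatite chromatography then isolates.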
“What our technology has accomplished is the automation of these methods,” Dr. Lane summarized.
Sandia’s new microfluidics platform and bioinformatic analysis of NGS data constitute a Rapid Threat Organism Recognition (RapTOR) system that will “greatly accelerate identification and characterization of novel pathogens,” according to Dr. Lane.
RapTOR is another example of an approach that meshes well with small desktop sequencers.
“By suppressing host sequences, one could obtain the same level of sequencing data on the pathogen using an Illumina MiSeq instead of a higher-throughput machine,” Dr. Lane commented. “This has the additional value of reducing the size of the overall dataset that must be run through bioinformatics analysis. It is much faster and easier to analyze the dataset from a MiSeq run versus that from a HiSeq.”
While RapTOR is not quite ready for clinical deployment, “In the next 1–2 years we would like to demonstrate the effectiveness of our technology in the clinical microbiology/virology arena,” said Dr. Lane.
“We believe that there will be more routine use of sequencing in the diagnosis of infection. Our technology will serve to reduce the barriers, in terms of instrumentation and data-analysis costs, to the adoption of sequencing by clinical labs.”
Short Tandem Repeats
Brian Young, Ph.D., Daniel Bornman, and Seth Faith, all from the Battelle Memorial Institute, took on the challenge of short tandem repeat (STR) analysis from short-read sequencing data.
An STR, also called a simple sequence repeat (SSR) or microsatellite, is a motif of 2 to 6 base pairs that is repeated in tandem a variable number of times, depending on the allele. Thirteen highly polymorphic STRs are used as genetic markers in the Combined DNA Index System (CODIS), a database maintained by the FBI for use in identifying crime suspects.
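As a small illustration of that definition, the sketch below scans a DNA string for the longest run in which a 2- to 6-base unit is repeated back to back and reports the unit and its copy number. It is a plain regular-expression search written for this example, not the Battelle method; the function name, thresholds, and test sequence are hypothetical.

```python
import re


def longest_tandem_repeat(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (unit, copies, start) for the longest tandem repeat, or None.

    A unit of min_unit..max_unit bases must occur at least min_copies times
    in a row. Ties in total length go to the shortest unit. Homopolymer runs
    are not excluded in this simple sketch.
    """
    best = None
    best_span = 0
    for unit_len in range(min_unit, max_unit + 1):
        # Group 1 is the candidate unit; \1{n,} demands n further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            span = match.end() - match.start()
            if span > best_span:
                best_span = span
                best = (match.group(1), span // unit_len, match.start())
    return best


if __name__ == "__main__":
    # Hypothetical locus: flanks CCTG and GGCA around five copies of AGAT.
    print(longest_tandem_repeat("CCTG" + "AGAT" * 5 + "GGCA"))
```

The copy number returned here is what distinguishes one allele from another at a given locus.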
According to Dr. Young, who serves as technical director of identity management at Battelle, there are two main challenges in using NGS data to determine STR genotypes.
“First, the sequence reads must be long enough to completely span all of the possible alleles at the STR locus,” he said. “Overlapping reads (either multiple single-end or mate-paired) do not work since the overlapping portions would fall within the polymorphic tandem repeat region, providing no unique signature to anchor an overlapping match. The second challenge is accurately calling the one (homozygous) or two (heterozygous) alleles represented in the sequencer reads.”
In facing these challenges, Dr. Young noted, “We are not aligning and assembling entire genomes in order to make allelotyping calls. Rather, we are developing workflows that use computationally efficient filters to first classify sequencer reads into informative and noninformative groups. Then allelotyping is performed using just the reads informative for a particular multilocus genotype.”
Regarding the filtering step, he elaborated, “Our filter uses wavelet packet decomposition to rapidly classify reads into two bins: those that contain repeat sequences and those that do not. This approach is faster than alternative character-based classification procedures.”
The sequences that contain repeats are then aligned against a custom-made reference sequence consisting only of the targeted STR loci and their flanking regions. The Bowtie alignment software performs the alignment using this reference sequence in place of a full reference genome.
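The two challenges Dr. Young describes, reads that span the whole repeat and a call of one or two alleles, can be illustrated with a deliberately simplified sketch. It swaps in exact string matching on the flanking sequences where the Battelle workflow uses wavelet-based filtering and alignment, so it should be read as a toy model only. The function name, flank sequences, motif, and support thresholds are all hypothetical.

```python
from collections import Counter


def call_str_alleles(reads, left_flank, right_flank, motif,
                     min_reads=2, min_fraction=0.2):
    """Call up to two alleles (repeat counts) from reads spanning an STR locus.

    Only reads containing both flanks are informative, mirroring the
    requirement that a read span the entire repeat. Exact matching stands in
    for real alignment here, so sequencing errors are not tolerated.
    """
    counts = Counter()
    for read in reads:
        start = read.find(left_flank)
        if start == -1:
            continue  # read does not reach the left flank
        repeat_start = start + len(left_flank)
        end = read.find(right_flank, repeat_start)
        if end == -1:
            continue  # read ends inside the repeat: noninformative
        region = read[repeat_start:end]
        copies, remainder = divmod(len(region), len(motif))
        if remainder == 0 and region == motif * copies:
            counts[copies] += 1

    total = sum(counts.values())
    supported = [n for n, c in counts.most_common()
                 if c >= min_reads and c / total >= min_fraction]
    return sorted(supported[:2])  # one value: homozygous; two: heterozygous


if __name__ == "__main__":
    left, right, motif = "CCTG", "GGCA", "AGAT"
    reads = (
        [left + motif * 5 + right] * 3
        + [left + motif * 7 + right] * 3
        + [left + motif * 6 + right]  # single read, below the support threshold
        + [left + motif * 4]          # truncated read, ignored
    )
    print(call_str_alleles(reads, left, right, motif))
```

Because only a handful of short reference loci are involved, this kind of analysis stays cheap, which is consistent with the point about modest computing requirements.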
A major advantage of this very focused analysis of heavily filtered sequence information is that it does not require high-performance computing power.
“Our approach can provide same-day analysis using ordinary PCs running in nonlaboratory environments such as police stations or in deployed military scenarios,” said Dr. Young. “The same concept is applicable to sequencer-based medical devices running in medical clinics.”