January 15, 2017 (Vol. 37, No. 2)

Reveal Hidden Biology by Combining Long Fragment Target Capture with SMRT Sequencing

While the power of whole human genome sequencing to fuel discovery is unparalleled, there are many research questions that can be answered without interrogating the entire genome. Indeed, it can be beneficial to focus available sequencing and analysis resources on specific regions of the genome. For example, when combined with enrichment and multiplexing methods, targeted sequencing can be used to survey disease-specific gene panels across large cohorts, sequence deeply for rare variants or isoforms, or phase highly polymorphic regions of the genome. In these examples, targeted sequencing can deliver high-confidence answers and novel genetic discoveries at a fraction of the cost of whole-genome sequencing.

The Need for Long Reads

Most researchers are familiar with short-read targeted sequencing applications. PCR or probe-based hybridization is used to enrich genomic or transcriptomic regions of interest, generating templates a few hundred base pairs in length. While this approach can robustly detect individual SNPs, short reads cannot span larger types of genomic variation and can be difficult to assemble into the larger contigs required for haplotyping or unambiguous identification of mRNA isoforms. This is particularly true if the targeted genomic regions include repeat-rich areas, as is often the case when studying disease-causative structural variants.

Pairing target capture with PacBio single molecule, real-time (SMRT®) sequencing provides a more comprehensive view of genomic regions or transcripts of interest. Since PacBio reads have an average length of over 10 kilobases (kb), they easily cover multi-kilobase PCR amplicons or probe-captured fragments from end to end, with no assembly required. The resulting data can be used to accurately phase both SNPs and structural variants across entire genes without GC bias, pinpoint the exact breakpoints of structural variants, and reveal the hidden diversity of transcript isoforms.
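To make the phasing argument concrete, here is a minimal Python sketch of read-backed phasing under simplifying assumptions. The read representation (a dictionary mapping a genomic position to the base observed there), the positions, and the alleles are all hypothetical toy values, not data from any study. The point is only that a read can link two variant sites if, and only if, it covers both.

```python
from collections import Counter


def phase_pair(reads, pos_a, pos_b):
    """Count allele pairs observed together on single reads.

    `reads` is a list of dicts mapping position -> observed base
    (a hypothetical, simplified stand-in for aligned reads).
    Only reads covering BOTH sites carry phase information.
    """
    counts = Counter()
    for read in reads:
        if pos_a in read and pos_b in read:
            counts[(read[pos_a], read[pos_b])] += 1
    return counts


if __name__ == "__main__":
    # Toy long reads: each covers two heterozygous sites 5 kb apart.
    long_reads = [
        {100: "A", 5100: "G"},
        {100: "A", 5100: "G"},
        {100: "C", 5100: "T"},
    ]
    # Toy short reads: each covers only one of the two sites.
    short_reads = [{100: "A"}, {100: "C"}, {5100: "G"}, {5100: "T"}]

    print(phase_pair(long_reads, 100, 5100))   # two haplotypes: A-G and C-T
    print(phase_pair(short_reads, 100, 5100))  # empty: no read links the sites
```

Real phasing tools also account for sequencing errors and mapping quality; this sketch ignores both.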

Probe-Based Target Capture Meets Long Reads

Probe-based target capture is a well-established technology that has now been validated for capturing fragments as long as 6 kb for use with PacBio long-read sequencing (Figure 1). A key advantage of capturing longer fragments is the recovery of undefined regions flanking targeted sites. For example, a probe set that targets only exons will capture proximal introns as well, often allowing for phasing across entire genes. Similarly, novel isoforms or the unknown partners of commonly fused genes can be discovered de novo. When looking at structural variants, targeted regions can be confidently placed in their genomic contexts when neighboring unique sequences are captured as well.
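The effect of fragment length on flanking coverage can be illustrated with a short Python sketch. It assumes a simplified model in which a captured fragment must overlap a targeted exon by at least `min_overlap` bases, so it can reach at most `fragment_len - min_overlap` bases beyond either exon edge. The exon coordinates and the 120-base overlap are hypothetical values chosen for illustration, and the result is an upper bound on reach, not a prediction of actual coverage.

```python
def reachable_intervals(exons, fragment_len, min_overlap=120):
    """Merge the regions that exon-anchored fragments could reach.

    exons        -- list of (start, end) half-open intervals
    fragment_len -- length of captured fragments, in bases
    min_overlap  -- assumed minimum overlap between fragment and exon
    Returns a sorted list of merged (start, end) intervals.
    """
    flank = max(fragment_len - min_overlap, 0)
    extended = sorted((max(start - flank, 0), end + flank) for start, end in exons)
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


if __name__ == "__main__":
    # Hypothetical three-exon gene with multi-kilobase introns.
    toy_exons = [(10000, 10200), (14000, 14150), (21000, 21300)]

    print(reachable_intervals(toy_exons, 300))   # three separate islands
    print(reachable_intervals(toy_exons, 6000))  # [(4120, 27180)], one block
```

In this toy gene, 300-base fragments leave three disconnected islands around the exons, while 6 kb fragments can bridge the introns into a single contiguous block, which is the precondition for phasing across the whole gene.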

Below we share two examples of researchers who paired these two powerful technologies to generate continuous, uniform sequencing coverage across targeted gDNA or cDNA—painting a more complete picture of the wide range of genomic variation beyond SNPs that impacts human health.

Figure 1

Resolve Disease-Linked Structural Variants

Unable to fully characterize the genomic rearrangements that cause Potocki-Lupski syndrome (PTLS) with other technologies, researchers at Baylor College of Medicine realized they needed a long-read approach. They developed PacBio-LITS, which enables the capture of multi-kilobase fragments with the SeqCap EZ kit from Roche-NimbleGen, followed by sequencing on the PacBio system.

The workflow includes these steps:

  1. Shear genomic DNA with a g-TUBE
  2. Clean and concentrate sample with AMPure beads
  3. Size select with BluePippin
  4. Prepare the library with KAPA Hyper Prep Kit
  5. Amplify the library with Takara LA Taq DNA Polymerase Hot-Start Version
  6. Clean up with AMPure beads
  7. Hybridize to probes with SeqCap EZ kit
  8. Wash and recover captured DNA
  9. Amplify targets with Takara LA Taq DNA Polymerase Hot-Start Version
  10. Clean up with AMPure beads
  11. Construct the PacBio library 

Sequencing data was binned and haplotyped with SAMtools and polished to very high consensus accuracy with Quiver.

The goal of the study was to define structural rearrangements associated with PTLS, a rare disease caused by a duplication on chromosome 17 that can range from 400 kb to 14 Mb. Chromosome 17 has an unusual density of highly homologous low-copy repeats (LCRs) and segmental duplications, making it a difficult part of the genome to resolve. Adding to the challenge, up to 30% of PTLS patients have nonrecurrent genomic rearrangement events that include multiple breakpoint junctions. Resolving these junctions can provide insights into how PTLS arises, and add to our knowledge of causative haplotypes.

In this work, the scientists conducted targeted sequencing of three PTLS patients. Earlier work had not yielded all the relevant breakpoints, but long-fragment target capture followed by PacBio sequencing revealed both known and previously undefined breakpoints. Unsurprisingly, three of the five breakpoints were located within LCRs, which are known to present serious challenges for alternative detection methods.

Breakpoints within LCRs result in large uncertainty regions when characterized with aCGH and ambiguous mapping when sequenced with short reads, thwarting resolution of the underlying genomic structure. With the new breakpoint information, the authors proposed a specific mechanism for PTLS-linked genome rearrangements. They further suggest that PacBio-LITS would be useful for characterizing structural variants linked to other genomic disorders in repeat-dense regions.
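The mapping ambiguity described above can be reproduced in a few lines of Python. This is a toy sketch: the reference is a made-up 48-base sequence containing two identical copies of a 12-base repeat, and exact string matching stands in for a real aligner. None of the sequences come from chromosome 17 or from the study.

```python
def exact_hits(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    hits = []
    pos = reference.find(read)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(read, pos + 1)
    return hits


# Hypothetical toy reference: unique sequence interrupted by two repeat copies.
UNIQUE_A = "ACGTACGA"
REPEAT = "TTAGGCATCCGA"
UNIQUE_B = "CCATGGTC"
UNIQUE_C = "GTGTCAAT"
REFERENCE = UNIQUE_A + REPEAT + UNIQUE_B + REPEAT + UNIQUE_C

# A short read drawn entirely from inside the repeat.
SHORT_READ = REPEAT[2:10]
# A longer read spanning the first repeat copy plus unique flanks on both sides.
LONG_READ = REFERENCE[4:24]

if __name__ == "__main__":
    print(exact_hits(REFERENCE, SHORT_READ))  # [10, 30]: two equally good placements
    print(exact_hits(REFERENCE, LONG_READ))   # [4]: anchored by unique flanks
```

The short read fits both repeat copies equally well, so its origin cannot be determined. The longer read carries unique sequence on either side of the repeat and has a single placement.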

Discover Novel, Biologically Relevant Isoforms

Long-read targeted sequencing can also be applied to drive discoveries in transcriptomics (Figure 2). In another recent publication, a team of researchers led by Colleen Nelson, Ph.D., at the Australian Prostate Cancer Research Centre used Roche NimbleGen SeqCap EZ to target relaxin hormone transcripts in tumor and normal prostate cell lines. The sample-preparation workflow was as described above, except that long cDNA, prepared with the Clontech SMARTer PCR Synthesis Kit, was hybridized with NimbleGen probes instead of gDNA. Full-length isoforms were identified from circular consensus sequencing (CCS) reads.

Using a long-read approach, the Australian team uncovered previously unrecognized relaxin biology. In men, relaxin-2 (RLN2) is produced by the prostate and enhances sperm motility. In the context of prostate cancer, it has been shown to promote tumor progression. Remarkably, the researchers discovered an unknown isoform of RLN2 that is fused with the neighboring paralog relaxin-1 (RLN1). The fusion isoform lacks the signal sequence that enables RLN2 to be secreted through the ER-Golgi apparatus, suggesting there may be differing biological impacts of the inversely androgen-regulated isoforms.

A key insight from uncovering the RLN1-RLN2 fusion isoform was that previous studies on relaxin had relied on qPCR primers that could not distinguish the fusion from RLN1 or RLN2 transcripts. Retesting of tumor and normal prostate tissue and several commonly used prostate cell lines with redesigned primers revealed that only one cell line, LNCaP, recapitulates the relaxin isoform-expression patterns found in tissues. Using a targeted long-read approach that enabled the resolution of complete isoforms unveiled a previously unappreciated level of complexity in relaxin expression, and revealed important differences among prostate cancer models.
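Why a primer pair can fail to separate a fusion from its parent transcripts is easy to see in a toy model. The Python sketch below treats each transcript as an ordered list of exon labels and assumes a primer pair amplifies any transcript that contains the forward primer's exon at or upstream of the reverse primer's exon. The gene names, exon labels, and fusion structure are hypothetical placeholders and do not represent the actual RLN1 or RLN2 exon structure.

```python
# Hypothetical transcripts: two neighboring genes and a fusion that joins
# the first exon of geneB to the last exon of geneA.
TOY_TRANSCRIPTS = {
    "geneA": ["A1", "A2"],
    "geneB": ["B1", "B2"],
    "fusion": ["B1", "A2"],
}


def amplified(transcripts, fwd_exon, rev_exon):
    """Return names of transcripts a primer pair would amplify.

    A transcript is amplified if it contains `fwd_exon` at or upstream
    of `rev_exon` (both primers may sit in the same exon).
    """
    hits = []
    for name, exons in transcripts.items():
        if fwd_exon in exons and rev_exon in exons:
            if exons.index(fwd_exon) <= exons.index(rev_exon):
                hits.append(name)
    return hits


if __name__ == "__main__":
    # Both primers inside a shared exon: parent and fusion are conflated.
    print(amplified(TOY_TRANSCRIPTS, "A2", "A2"))  # ['geneA', 'fusion']
    # Primers placed on either side of the fusion junction: fusion only.
    print(amplified(TOY_TRANSCRIPTS, "B1", "A2"))  # ['fusion']
    # Primers spanning the parent's own exon junction: parent only.
    print(amplified(TOY_TRANSCRIPTS, "A1", "A2"))  # ['geneA']
```

Designing junction-spanning primers of this kind requires knowing the full exon chain of each isoform, which is the information full-length reads supply.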

Figure 2


Pairing NimbleGen SeqCap EZ enrichment technology with SMRT sequencing from PacBio allows scientists to recover long genomic or transcriptomic regions for targeted, cost-effective analysis. This protocol makes it straightforward to phase SNPs and indels across entire genes or polymorphic regions, sequence full-length transcripts, or delineate the exact breakpoints of previously uncharacterized structural variants. With continuous sequencing coverage of fragments as long as 6 kb, researchers can develop a clear and complete view of even complex structural variants, enhancing their understanding of how genomic and transcriptomic variation at all size scales drives phenotypes in health and disease.

Meredith Ashby, Ph.D., is a staff scientist, Jenny Ekholm, Ph.D., is a senior scientist, and Luke Hickey is senior director, human biomedical sequencing, at PacBio.
