February 1, 2017 (Vol. 37, No. 3)
Shawn C. Baker, Ph.D., co-founder and CSO, AllSeq
NGS Growing By Leaps and Bounds, Problems Arise
Over the past 10 years, next-generation sequencing (NGS) has grown by leaps and bounds. Outputs have gone up, and costs have come down—both by orders of magnitude. The NIH graph showing this progress is so overused that its main utility now is to help bored conference attendees fill in their “buzzword bingo” cards.
With well over 10,000 instruments installed around the world, we face a paradox: the current generation and the next generation are one and the same. “Next,” in the context of sequencing, has almost completely lost its meaning. We might as well accept that “next-generation sequencing” is now just “sequencing.”
The major platform companies have spent the past couple of years focusing on improving ease-of-use. Illumina's newer desktop systems, such as the NextSeq, MiSeq, and MiniSeq, all use reagent cartridges, reducing the number of manipulations and the amount of hands-on time.
The Ion Torrent platforms from Thermo Fisher Scientific have historically been more difficult to use than the Illumina platforms. However, Thermo’s most recent system, the Ion S5, was specifically engineered to simplify the entire workflow, from library prep through data generation.
After hearing about sequencing’s many improvements—greater output, lower costs, and better ease of use—the casual observer may imagine that all of the hard work has been done and that all the barriers to progress have been removed. But the hard work has just started, and many challenges remain.
One of the first areas where problems can creep in is often the most overlooked—sample quality. Although platforms are often tested and compared using highly curated samples (such as the reference material from the Genome in a Bottle Consortium), real-world samples often present much more of a challenge.
For human sequencing, one of the most popular sample types is FFPE (formalin-fixed paraffin-embedded). FFPE is popular for a variety of reasons, not the least of which is the sheer abundance of FFPE samples. According to some estimates, over a billion FFPE samples are archived around the world. This number will continue to grow now that the storage of clinical samples in FFPE blocks has become an industry-wide standard practice.
Besides being widely available, FFPE samples often contain incredibly useful phenotypic information. For example, FFPE samples are often associated with medical treatment and clinical outcome data.
The problem with FFPE samples is that both the process of fixation and the storage conditions can cause extensive DNA damage. “In evaluating over 1,000 samples on BioCule’s QC platform, we’ve seen tremendous variability in the amount and types of damage in sample DNA, such as inter- and intrastrand crosslinks, accumulation of single-stranded DNA, and single-strand DNA breaks,” says Hans G. Thormar, Ph.D., co-founder and CEO of BioCule.
The variable amounts and types of damage, if ignored, can negatively affect the final results. “The impact on downstream applications such as sequencing can be profound: from simple library failures to libraries that produce spurious data, leading to misinterpretation of the results,” continues Dr. Thormar. Therefore, it is critical to properly assess the quality of each sample at the beginning of the sequencing project.
Although the major sequencing platform companies have spent years bringing down the cost of generating raw sequence, the same has not been true for library prep. Library prep for human whole-genome sequencing, at about $50 per sample, is still a relatively minor part of the total cost. But for other applications, such as sequencing bacterial genomes or low-depth RNA sequencing (RNA-seq), it can account for the majority of the cost.
Several groups are working on multiplexed homebrew solutions to bring the effective costs down, but there haven’t been many developments on the commercial front. One bright spot is in the development of single-cell sequencing solutions, such as the Chromium™ system from 10X Genomics, which uses a bead-based system for processing hundreds to thousands of samples in parallel.
“We see single-cell RNA-seq as the right way to do gene expression analysis,” insists Serge Saxonov, Ph.D., co-founder and CEO of 10X Genomics. “Over the next several years, much of the world will transition to single-cell resolution for RNA experiments, and we are excited for our platform to lead the way there.” For large projects, such as those required for single-cell RNA-seq, highly multiplexed solutions will be critical in keeping per-sample costs reasonably low.
Short Reads vs. Long Reads
Illumina’s dominance of the sequencing market has meant that the vast majority of the data generated so far is based on short reads. Having a large number of short reads is a good fit for applications such as detecting single-nucleotide polymorphisms in genomic DNA and counting RNA transcripts. However, short reads alone are insufficient for other applications, such as reading through highly repetitive regions of the genome and determining long-range structures.
Long-read platforms, such as the RSII and Sequel from Pacific Biosciences and the MinION from Oxford Nanopore Technologies, are routinely able to generate reads in the 15–20 kilobase (kb) range, with individual reads of over 100 kb having been reported. Such platforms have earned the respect of scientists such as Charles Gasser, Ph.D., professor of molecular and cellular biology at the University of California, Davis.
“I am impressed with the success people have had with using the long-read methods for de novo genome assembly, especially in hybrid assemblies when combined with short-read higher fidelity data,” comments Dr. Gasser. “This combination of technologies makes it possible for a single investigator with a very small group and a minimal budget to produce a useable assembly from a new organism’s genome.”
To get the most out of these long-read platforms, however, it is necessary to use new methods for the preparation of DNA samples. Standard molecular biology methods haven’t been optimized for isolating ultra-long DNA fragments, so special care must be taken when preparing long-read libraries.
For example, vendors have created special “high molecular weight” kits for the isolation of DNA fragments >100 kb, and targeted DNA protocols have been modified to selectively enrich for large fragments of DNA. These new methods and techniques need to be mastered to ensure maximum long-read yield.
As an alternative to true long reads, some are turning to a specialized form of short reads called linked-reads, such as those from 10X Genomics. Linked-reads are generated by adding a unique barcode to each short read generated from a single long DNA fragment, which is generally >100 kb. The unique barcodes are used to link together the individual short reads during the analysis process. This provides long-range genomic information, enabling the construction of large haplotype blocks and elucidation of complex structural information.
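The linked-read idea described above can be sketched in a few lines: because every short read carries the barcode of the long fragment it came from, grouping aligned reads by barcode recovers the long-range context. The reads and coordinates below are invented for illustration; real linked-read data would come from a barcode-aware aligner such as 10X Genomics' own pipeline.

```python
from collections import defaultdict

# Toy reads as (barcode, aligned position) pairs. Reads sharing a barcode
# originate from the same long (>100 kb) input fragment.
reads = [
    ("AACG", 101_000), ("AACG", 152_500), ("AACG", 187_200),  # fragment 1
    ("TTGC", 640_300), ("TTGC", 701_900),                     # fragment 2
]

# Group short reads by their fragment barcode.
fragments = defaultdict(list)
for barcode, pos in reads:
    fragments[barcode].append(pos)

# The span of each barcode group approximates the original fragment,
# providing long-range information from short reads alone.
for barcode, positions in fragments.items():
    span = max(positions) - min(positions)
    print(f"barcode {barcode}: {len(positions)} reads spanning ~{span:,} bp")
```

In practice the analysis is far more involved (barcode error correction, fragment reconstruction, phasing), but the grouping step above is the conceptual core.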
“Short-read sequencing, while immensely powerful because of high accuracy and throughput, can only access a fraction of genomic content,” advises Dr. Saxonov. “This is because genomes are substantially repetitive and much of the information in the genome is encoded at long scales.”
Another challenge facing researchers is the sheer amount of data being generated. The BAM file (a semicompressed alignment file) for a single 30X human whole-genome sample is about 90 GB. A relatively modest project of 100 samples would generate 9 TB of BAM files.
With a single Illumina HiSeq X instrument capable of generating over 130 TB of data per year, storage can quickly become a concern. For example, the Broad Institute is generating sequencing data at the rate of one 30X genome every 12 minutes—nearly 4,000 TB worth of BAM files every year.
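The storage figures above follow from simple arithmetic, sketched below using the article's own numbers (about 90 GB of BAM per 30X genome; one genome every 12 minutes). Actual file sizes vary with compression settings and coverage.

```python
# Back-of-the-envelope sequencing storage math.
GB_PER_30X_BAM = 90  # ~90 GB BAM file per 30X human whole genome

# A relatively modest 100-sample project:
project_tb = 100 * GB_PER_30X_BAM / 1000
print(f"100-sample project: {project_tb:.0f} TB of BAM files")  # → 9 TB

# The Broad Institute's stated rate of one 30X genome every 12 minutes:
genomes_per_year = 365 * 24 * 60 // 12  # 43,800 genomes
broad_tb_per_year = genomes_per_year * GB_PER_30X_BAM / 1000
print(f"{genomes_per_year:,} genomes/year ≈ {broad_tb_per_year:,.0f} TB of BAMs")
# → 43,800 genomes/year ≈ 3,942 TB of BAMs, i.e., nearly 4,000 TB
```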
BAM files may be converted into VCF (variant call format) files, which contain information only on those bases that differ from the reference sequence. Although the VCF files are much smaller and easier to work with, it is still necessary to retain the raw sequence files if the researcher is to reprocess the data in the future.
As the cost of sequencing has come down, some have come to the conclusion that resequencing samples for which there is abundant material is easier and possibly even cheaper. And when it comes to analyzing this large amount of data, researchers are spoiled for choice. In fact, with well over 3,000 sequencing analysis tools listed at OMICtools (a directory operated by omicX), researchers can easily be overwhelmed when trying to find the best option.
Clinical Interpretation and Reimbursement
Finally, for clinical samples, there remains the challenge of delivering a consistent, reliable interpretation of the sequencing variants, especially as it pertains to patient care. A typical exome sample will have between 10,000 and 20,000 variants, whereas a whole-genome sample will generally have more than 3 million. To make things more manageable, the variants are often filtered based on their likelihood of causing disease.
To help guide clinicians, the American College of Medical Genetics and Genomics, the Association for Molecular Pathology, and the College of American Pathologists have created a system for classifying variants. Categories include pathogenic, likely pathogenic, uncertain significance (which currently makes up the vast majority in exome and whole-genome samples), likely benign, and benign.
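The filtering step that makes variant lists manageable amounts to triaging by classification. The sketch below is purely illustrative: the gene names and class assignments are invented, and a real pipeline would read classifications from an annotated VCF rather than a hand-built list.

```python
# Hypothetical triage of classified variants; the five class labels follow
# the ACMG/AMP/CAP categories, but the variants themselves are made up.
variants = [
    {"gene": "BRCA1", "class": "pathogenic"},
    {"gene": "TP53",  "class": "likely pathogenic"},
    {"gene": "MYH7",  "class": "uncertain significance"},
    {"gene": "APOB",  "class": "likely benign"},
    {"gene": "PCSK9", "class": "benign"},
]

# Keep only the classes most likely to be clinically actionable.
REPORTABLE = {"pathogenic", "likely pathogenic"}
to_review = [v for v in variants if v["class"] in REPORTABLE]
print([v["gene"] for v in to_review])  # → ['BRCA1', 'TP53']
```

Because variants of uncertain significance dominate real exomes and genomes, this single filter typically removes the bulk of the list, which is exactly why disagreements over classification matter so much.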
Such schemes, however, have their limitations. Even when a common classification scheme is used on identical datasets, different groups may come up with different interpretations. In a pilot study under the new system, the participating clinical laboratories agreed on their classifications only about 34% of the time.
In cases where there is disagreement or additional analysis is needed to interpret the results, the problem of reimbursement becomes the roadblock. Reimbursement of NGS-based tests can be a major challenge, but reimbursement for interpretation is nearly impossible.
“There’s no way for laboratories to bill for interpretation,” argues Jennifer Friedman, M.D., clinical investigator at Rady Children’s Institute for Genomic Medicine. “It’s a very valuable service that could be available, but nobody is really in that space.
“There’s no way to bill for it—insurance companies won’t pay for it. Despite increasing focus on precision medicine, whether interpretation is by the clinician or by the lab, this most important aspect is not recognized or valued by the healthcare payers.”
Until this changes, the analysis of these patient samples essentially has to be treated as a research project, an option generally available only in a research hospital setting, and only for a limited number of patients.
As much advancement as there has been over the past several years, many challenges remain across the entire NGS workflow, from sample prep through data analysis. And as new advancements are made in the underlying technologies, new challenges will continue to emerge. Rising to these challenges will be critical to ensuring the wide adoption of these genomic technologies and to maximizing their impact on human health.
The Long and Short of Structural Variants
Although next-generation sequencing has contributed to rapid progress in our ability to detect single-base genetic variation, another entire category of variants has been left out of the picture due to the nature of the short-read sequences produced by these platforms. These variants are too small to detect with cytogenetic methods, but too large to reliably discover with short-read sequencing. This is no trivial matter: each human genome contains about 20,000 structural variants, and many have been shown to cause disease.
Single-molecule, real-time (SMRT) sequencing technology is solving the challenge of identifying these structural variants with high sensitivity, in part due to the fundamentally long reads it produces. SMRT sequencing produces reads that are many kilobases long—compared to 200 or 300 bases for short-read sequencers—so they can fully resolve most structural variants such as insertions, deletions, duplications, inversions, repeat expansions, and more.
Many studies are now using long-read SMRT-sequence data for structural variant discovery. In a project presented last year at the American Society of Human Genetics, the NA12878 human sample was sequenced to 10-fold coverage on Pacific Biosciences’ Sequel System, and structural variants were called with the Baylor College of Medicine’s PBHoney tool.
This approach found nearly 90% of structural variants in the genome, based on a comparison to a Genome in a Bottle truth set. Furthermore, long-read coverage identified thousands of novel variants not found in short-read datasets, most of which were confirmed by de novo assembly.
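Sensitivity figures like the one above come from matching a call set against a truth set. The toy calculation below shows the idea: a truth variant counts as found if a call of the same type falls within a breakpoint tolerance. All coordinates and calls are invented, and real benchmarking against Genome in a Bottle uses far more careful matching rules (reciprocal overlap, size similarity, genotype checks).

```python
# Invented structural-variant truth set and call set, as (type, position).
truth = [("DEL", 10_500), ("INS", 48_200), ("DUP", 91_000), ("INV", 130_750)]
calls = [("DEL", 10_480), ("INS", 48_600), ("INV", 130_700)]

TOL = 500  # bp tolerance when matching breakpoints

# A truth variant is "found" if any call of the same type is within TOL bp.
found = sum(
    any(ct == tt and abs(cp - tp) <= TOL for ct, cp in calls)
    for tt, tp in truth
)
sensitivity = found / len(truth)
print(f"sensitivity: {sensitivity:.0%}")  # → 75%
```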
As efforts turn to analysis of structural variants in large cohorts, it is important to strike a balance between sensitivity and cost. Low-fold SMRT-sequencing coverage has the potential to be an effective and affordable solution for structural variant discovery in human genomes, and the benefits apply to other complex genomes as well.