Dan Koboldt
Like the hydra of myth, as soon as the head of one NGS problem is cut off, new ones grow in its place.
I first started MassGenomics in the early days of next-gen sequencing, when Illumina was called “Solexa” and came in fragment-end, 35-bp reads. Even so, the unprecedented throughput of NGS and the nature of the sequencing technology brought a whole host of difficulties to overcome, notably:
- Bioinformatics algorithms developed for capillary-based sequencing didn’t scale.
- Sequencing reads were shorter and more error-prone.
- The instruments were expensive, limiting access to the technology.
- Most of the genetics/genomics/clinical community had no experience with NGS.
All of these are essentially solved problems: new bioinformatics tools and algorithms were developed, the reads became longer and more accurate, benchtop sequencers and sequencing service providers hit the market, and NGS was widely adopted by the research community. Mission accomplished!
Yet these victories were short-lived, because we find ourselves facing new challenges. Harder challenges. Here are a few of them.
1. Data Storage
You’ve probably seen the plot of Moore’s Law compared to sequencing throughput. In short, the cost of DNA sequencing has plummeted much faster than the cost of disk storage and CPU. A single run on the Illumina HiSeq 2000 provides enough capacity for about 48 human exomes. Even if you don’t keep the images, each exome requires about 10 gigabytes of disk space to store the bases, qualities, and alignments in compressed (BAM) format. At three runs a month, each instrument generates roughly 1.4 terabytes of data files every month. It adds up quickly.
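To put those figures in perspective, here is a back-of-the-envelope sketch of my own using the rough numbers above (about 48 exomes per HiSeq 2000 run, about 10 GB per exome in compressed BAM, three runs a month); these are estimates, not instrument specifications:

```python
# Back-of-the-envelope storage growth for a single HiSeq 2000 instrument.
# All figures are rough estimates taken from the text, not specifications.

EXOMES_PER_RUN = 48     # approximate exome capacity of one run
GB_PER_EXOME = 10       # compressed BAM: bases, qualities, alignments
RUNS_PER_MONTH = 3

gb_per_run = EXOMES_PER_RUN * GB_PER_EXOME            # ~480 GB per run
tb_per_month = gb_per_run * RUNS_PER_MONTH / 1000.0   # ~1.4 TB per month
tb_per_year = tb_per_month * 12                       # ~17 TB per year

print(f"Per run:   {gb_per_run:.0f} GB")
print(f"Per month: {tb_per_month:.1f} TB")
print(f"Per year:  {tb_per_year:.1f} TB")
```

And that is a single instrument, before counting any downstream analysis files.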
Analysis of sequencing data—variant calling, annotation, expression analysis, genetic analysis—also requires disk space. Most non-BGI research budgets are finite, so investigators must choose between (1) deleting data, (2) spending money, or (3) holding up data production/analysis. None of those sound very appealing, do they?
2. Achieving Statistical Significance
NGS is no longer an exploratory tool, and descriptive studies reporting a dozen or a couple hundred genomes/exomes are harder and harder to publish. This is particularly true for common diseases, in which large numbers of samples are typically required to achieve statistical significance. A cohort of 10,000 samples has been discussed as an appropriate target. Even if that many samples could be found, the cost of sequencing so many is substantial. If you had an Illumina HiSeq X Ten system and could do whole genomes for $1,000 each (that only covers reagents, by the way), it’s still ten million dollars. That’s probably over budget for most groups, so they’ll have to take another tack:
- Sequencing fewer samples, which will make the work harder to fund/publish
- Combining some sequencing with follow-up genotyping, which limits the discovery power
- Collaborating with other labs/consortia, whose sample populations, phenotypes, or study designs may vary
How many of your project planning meetings have ended with someone saying, “Well, maybe we’ll get lucky”?
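Part of the reason those sample sizes balloon is the genome-wide significance threshold. Here is a rough power sketch of my own, a two-proportion z-test approximation for an allelic case-control test; the allele frequency (20%), odds ratio (1.2), and α = 5×10⁻⁸ are chosen purely for illustration:

```python
# Rough power sketch for a case-control allelic association test.
# Two-proportion z-test approximation: an illustration only, not a
# substitute for a proper power calculation.
from math import sqrt
from scipy.stats import norm

def allelic_power(p_controls, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power to detect a risk allele at significance level alpha."""
    # Risk-allele frequency in cases implied by the odds ratio
    odds = p_controls / (1 - p_controls) * odds_ratio
    p_cases = odds / (1 + odds)

    # Allele counts (two alleles per diploid individual)
    m_cases, m_controls = 2 * n_cases, 2 * n_controls
    p_pooled = (p_cases * m_cases + p_controls * m_controls) / (m_cases + m_controls)

    se_null = sqrt(p_pooled * (1 - p_pooled) * (1 / m_cases + 1 / m_controls))
    se_alt = sqrt(p_cases * (1 - p_cases) / m_cases
                  + p_controls * (1 - p_controls) / m_controls)

    z_crit = norm.isf(alpha / 2)              # two-sided critical value
    effect = abs(p_cases - p_controls)
    return norm.cdf((effect - z_crit * se_null) / se_alt)

# A modest effect (OR ~1.2) on a 20% allele needs sample sizes in the thousands
for n in (1000, 2500, 5000, 10000):
    power = allelic_power(p_controls=0.20, odds_ratio=1.2, n_cases=n, n_controls=n)
    print(f"{n:>6} cases + {n:>6} controls: power ~ {power:.2f}")
```

The exact numbers depend on allele frequency, effect size, and study design, but the shape of that curve is why “maybe we’ll get lucky” keeps coming up.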
3. Finding Samples
Getting access to large sample cohorts is another challenge. As I’ve written before, given the widespread availability of exome and genome sequencing, samples are the new commodity. High-quality DNA samples from informative sources—tumor tissue, diabetes patients, families with rare disorders, even healthy members of minority populations—are increasingly valuable. Why should an investigator collaborate with you when they could simply send the samples off for sequencing on their own?
Sequencing samples with public funds (i.e., NIH grants) adds another layer of difficulty: all sequencing data must be submitted to public repositories. This means that the volunteer must have given informed consent not just for the study itself but also for data sharing. Local IRBs need to sign off as well. The net result is that many of the samples that come to us for sequencing don’t meet the criteria and must be returned.
4. Privacy
Even if you have an outstanding, comprehensive informed consent document, it might be difficult to get volunteers to sign it. There’s growing public concern about the privacy of genetic information. As Yaniv Erlich demonstrated by hacking the identities of CEPH sample contributors, genetic profiles obtained from SNP arrays, exome, or genome sequencing can be used to identify individual people. They also contain some very private details—like ancestry and disease risk alleles—that might be exploited, made public, or used for discrimination.
How long is it before genetic profiling replaces Google-stalking as a screening tool for job candidates or romantic interests? Thanks for coming in, Mr. Johnson. All we need now is your Facebook password and a cheek swab.
5. Functional Validation of Genomic Findings
Numerous research groups have demonstrated the immense discovery power of NGS. The mere fact that dbSNP—the NCBI database of human sequence variation—has swelled to more than 50 million distinct variants tells us something about what pervasive genome sequencing capabilities might uncover. And yet, the variants implicated in sequencing-based studies of human disease are increasingly difficult to “sell” to peer reviewers on genetic information alone. Our inability to predict the phenotypic impact of genetic variants lurks beneath the veneer of genetic discoveries like a shark following a deep-sea trawler.
Referees of most high-impact journals want to see some form of functional validation of genomic discoveries. That’s a daunting challenge for many of us accustomed to the rapid turnaround, high-throughput nature of NGS. Most functional validation experiments are slow and laborious by comparison.
6. Translation of NGS to the Clinic
We all know that NGS is destined for the clinic. Targeted sequencing panels are already in routine use at many cancer centers; in time, this will likely become exome/genome sequencing, possibly with transcriptome (RNA-Seq) and methylome (Methyl-Seq) profiling as well. Undiagnosed inherited diseases and rare genetic disorders whose genetic cause is unknown are two other common-sense applications. Still, there are many hurdles to overcome in applying a new technology to patient care; CLIA/CAP certification alone is a complex, expensive, and time-consuming process.
The reporting is more difficult, too. Unlike the research setting in which most NGS results have arisen, a clinical setting requires very high confidence in order to report anything back to the patient or treating physician. This is a good thing, since patient care decisions might be made based on genomic findings. Yet it means that we have a considerable amount of work ahead to ensure that genomic discoveries are followed up, replicated, and otherwise vetted to the point where they can be of clinical use.
This article previously appeared on Dan Koboldt’s MassGenomics blog. Dan leads the human genetics analysis group of The Genome Institute at Washington University. He started the MassGenomics blog in 2008 to write about next-generation sequencing and medical genomics in the post-genome era. Website: www.massgenomics.org.