October 15, 2015 (Vol. 35, No. 18)
When people talk about the $1,000 genome, they are not speaking about the whole genome, but the exons, the so-called coding regions of the genome. “Six years ago, I was spending $15,000 per exome sequence,” says Gholson Lyon, M.D., Ph.D., a genomic scientist working for the Cold Spring Harbor Laboratory. “Now that costs about $700.”
Whole genome sequencing is more expensive. “We are still not at the $1,000 genome in my opinion,” Dr. Lyon continues. “Almost everyone I’ve talked to is charging $1,500–2,000, and we pay $3,000 because that gets us 60× coverage of the genome, which we have shown is very important to recover small insertions and deletions in the genome ranging in size from 5 to 50 base pairs.”
Dr. Lyon, who studies rare but heritable medical diseases such as Ogden syndrome and TAF1 syndrome, believes that advances in next-generation sequencing technology—better software algorithms, improved methodologies, and lower costs—accelerate his work and the work of others conducting clinical research.
The standard advocated by Illumina, the industry giant, and other sequencing companies is a 30× genome, which means sequencing the genome enough to generate on average 30 reads aligned at each base pair. But according to Dr. Lyon, the 30× genome does not capture all the insertions and deletions.
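The arithmetic behind an average-coverage figure such as 30× or 60× can be sketched roughly as follows. This is an illustration only, assuming the usual mean-depth relation (number of reads times read length, divided by genome size), a human genome of about 3.1 billion base pairs, and 150-base-pair reads; the function names and constants are hypothetical and do not come from any sequencing vendor.

```python
import math

# Assumed values for illustration only (not taken from the article's sources).
GENOME_SIZE_BP = 3_100_000_000  # approximate human genome size
READ_LENGTH_BP = 150            # upper end of the short-read range quoted in the article


def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Average number of reads covering each base, assuming every read aligns."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Minimum number of reads needed to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


reads_30x = reads_needed(30, READ_LENGTH_BP, GENOME_SIZE_BP)
reads_60x = reads_needed(60, READ_LENGTH_BP, GENOME_SIZE_BP)

print(f"Reads for 30x: {reads_30x:,}")
print(f"Reads for 60x: {reads_60x:,}")
```

Under these assumptions, doubling the target coverage doubles the number of reads required, which is consistent with the roughly doubled price Dr. Lyon describes for a 60× genome. Real coverage is uneven across the genome, so an average of 30× does not guarantee 30 reads at every position.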
“We have a paper in Genome Medicine where we showed that an average of 60× or more provides a very good chance that you’re going to find greater than 95% of all the insertions and deletions,” notes Dr. Lyon. Although this change would double the price of sequencing, to roughly $3,000 using current standard technologies, “it is very good at finding insertions and deletions, which have not been as systematically studied as they should be.”
Other technologies have been in rapid development to combat these deficits. “Single-molecule technology allows for long reads [meaning reads that are over 10,000 base pairs],” he adds. “With Illumina, we’re doing 100–150 base pair reads.”
Specifically, Dr. Lyon is excited about single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio). He is, however, concerned with the cost, which must come down considerably to be useful in his research. “The expense is just enormous,” he complains. “Yet there is material in the genome that PacBio sequencing is helping to find that has not been found by other methods.”
Illumina technology provides high-throughput short-read sequencing. This approach is optimized for detecting single-nucleotide variants, commonly referred to as single-nucleotide polymorphisms (SNPs).
“Illumina has focused on the throughput,” says Jonas Korlach, Ph.D., CSO of PacBio. He contends that this focus “came at a price of having short read lengths, bias with respect to GC content, and sequence complexity that no longer allows you to sequence all of the DNA that is part of your genome.” Therefore, Dr. Korlach continues, “we wanted to build something that gives you the best performance in all four areas that are relevant to the performance of sequencing.”
According to Dr. Korlach, Illumina and PacBio have dramatically different approaches to sequencing. For example, with Illumina, the GC-rich regions implicated in many diseases, such as Fragile X syndrome, can’t be sequenced accurately, he argues. “[But] we have very little trouble and certainly the least bias of any technology with regards to which type of DNA you’d like to sequence.” PacBio’s SMRT sequencing technology, Dr. Korlach insists, is better at detecting repeats, insertions, and deletions due to its long-read capability.
This “third-generation” technology emerged from a research project initiated in 1997 at Cornell University. This project not only combined semiconductor processing, optics, and biotechnology, it also brought together two graduate students who would ultimately become senior managers at PacBio. One of these two was none other than Jonas Korlach. The other one was Stephen Turner, who earned his doctorate in 2000 and founded PacBio (originally named Nanofluidics) in 2004. The fledgling company hired Dr. Korlach as its eighth employee.
“I had an idea about how to watch polymerases in real time,” Dr. Korlach recalls. “It didn’t take long to recognize that if one could follow a DNA polymerase molecule in real time and detect and identify which of the four bases the polymerase is incorporating while making a new strand, one might have a powerful DNA sequencing method.”
With Dr. Turner and a team of organic chemists, physicists, and biologists at their burgeoning startup, Dr. Korlach helped develop the two main components of SMRT sequencing technology: 1) fluorescently labeled nucleotides and 2) zero-mode waveguides (ZMWs).
“[Incorporating ZMWs] was the brainchild of Steve Turner, whose background is in physics,” Dr. Korlach points out. “If you want to observe the activity of DNA polymerase using fluorescent nucleotides, you are going to have to somehow suppress all the other signal.”
A ZMW can isolate fluorescent emissions to a small region near an attached DNA polymerase. It does so by utilizing a sophisticated understanding of the physical laws of light propagation, allowing for detection of only polymerase-bound nucleotides. “Even though all four nucleotides are in solution and diffusing freely,” explains Dr. Korlach, “they don’t cause a background because the illuminated area is so tiny.”
To differentiate itself in the market, PacBio intends to leverage its technology’s unique advantages. For example, says Dr. Korlach, “PacBio technology is now widely regarded as the new gold standard in microbial sequencing. You really need the complete genome sequenced to understand how infections spread in the hospital, how transmission happens, and how the bacteria mutate.”
Optimizing clinical diagnostics is important for the widespread adoption of next-generation sequencing. Organizations such as the Genome in a Bottle Consortium are focused on providing resources to clinical laboratory clients to reduce ambiguity in sequence analysis.
“A clinical laboratory will often need to establish the accuracy of their sequencing and analysis methods,” says Justin Zook, Ph.D., a founding member of the Genome in a Bottle Consortium and a researcher at the National Institute of Standards and Technology (NIST). At NIST, Dr. Zook is part of the genome-scale measurements group.
According to Dr. Zook, one of the important metrics that clinical laboratories want to generate is high accuracy in reported variant calls: “Clinical laboratories aim to have close to zero false positives or false negatives in their region of interest.” Dr. Zook’s group helps with this part of the sequencing analysis by providing what it calls genomic reference materials.
“The first step,” details Dr. Zook, “is mapping the reads to the reference genome and then calling variants that you think are in the sequence.” Typically, this is accomplished by aligning the sequencing reads to a nearly complete dataset often called the “reference genome.” Working with the Coriell Institute for Medical Research, Genome in a Bottle generated a large batch of human cells, “bottled” DNA from these cells, and distributed this DNA so that it could serve as genomic reference material. Genome in a Bottle then sequenced the samples using multiple methods.
“The advantage of us using these different platforms,” he asserts, “is that if any particular sequencing technology has some type of systematic error or bias at a location, we can use other technologies’ variant calls at that location.
“If a laboratory takes this same DNA, our reference material DNA, and sequences it, they can compare their variant calls to our high-confidence ones and find both false positives and false negatives with respect to ours.”
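The comparison Dr. Zook describes can be sketched in a few lines. This is a simplified illustration, not the consortium's actual benchmarking procedure: it assumes each variant call is already normalized to a (chromosome, position, reference allele, alternate allele) tuple and that both call sets are restricted to the same regions. The function name and the example variants are hypothetical.

```python
def compare_calls(lab_calls, reference_calls):
    """Compare a lab's variant calls against high-confidence reference calls.

    Each call is a (chrom, pos, ref_allele, alt_allele) tuple. Calls are matched
    by exact equality, which assumes both sets use the same variant normalization.
    """
    lab = set(lab_calls)
    ref = set(reference_calls)

    true_positives = lab & ref   # called by the lab and present in the reference set
    false_positives = lab - ref  # called by the lab but absent from the reference set
    false_negatives = ref - lab  # in the reference set but missed by the lab

    precision = len(true_positives) / len(lab) if lab else 0.0
    recall = len(true_positives) / len(ref) if ref else 0.0

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
    }


# Made-up example data, for illustration only.
reference = [
    ("chr1", 10177, "A", "AC"),
    ("chr1", 15820, "G", "T"),
    ("chr2", 40512, "C", "T"),
]
lab = [
    ("chr1", 15820, "G", "T"),
    ("chr2", 40512, "C", "T"),
    ("chr2", 90233, "T", "G"),
]

result = compare_calls(lab, reference)
print("False positives:", sorted(result["false_positives"]))
print("False negatives:", sorted(result["false_negatives"]))
```

In this toy example the missed call is a small insertion, the variant class that the article notes is hardest to recover. Exact tuple matching is the main simplification here, since the same insertion or deletion can be written in more than one equivalent way.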
The Commercial Genetics Laboratory
Sequencing analysis has been broken out into primary, secondary, and tertiary analysis by many groups. Despite increased accuracy in primary sequence analysis, variation still exists in secondary analysis and tertiary analysis, the annotation and interpretation stages.
To improve sequencing analysis, commercial genetics laboratories such as Invitae are increasingly focused on annotation and interpretation. “Our goal,” states Invitae president and COO Sean George, Ph.D., “is to bring an individual’s genetic information into their mainstream medical care.”
According to Dr. George, who runs a CLIA-certified production and testing facility for Invitae, incorporating advanced next-generation sequencing technologies will allow the company to “disrupt the industry pricing structure, consolidate volume, and expand market because of additional demand and unmet need.” He adds, however, that the real future of clinical genomics lies in informatics.
“Most of the technology we use is actually software,” explains Dr. George. “We integrate it and put it into the clinical diagnostic business.” A good proportion of Invitae’s work is dedicated to building the clinical reporting infrastructure to deliver the standard of care in any disease area. “For any test that a doctor is ordering today,” asserts Dr. George, “the trick of translation is really to take that technology and put the informatics pipeline around it.”
If clinical sequencing is to embrace its future as an informational science, it must overcome combinatorial challenges. This means developing algorithms to combine next-generation DNA sequencing results with other “omics” datasets.
Such algorithms are being developed by STATegra, a project funded by the European Union. “The basic idea of the project was to put together data that gets information on different layers of molecular organization,” says Ana Conesa, Ph.D., a STATegra participant and a professor of bioinformatics at the University of Florida in Gainesville. Dr. Conesa works with 11 research groups from across Europe and the United States to modify preexisting software to enhance the integration of datasets for clinical, research, and technical utility.
In one case, Dr. Conesa will analyze TEDDY (The Environmental Determinants of Diabetes in Youth) data to study type 1 diabetes in multiple modalities. Some of this data, including traditional and next-generation genomic measurements, has been published individually or for localized associations over the past 10 years. The time has come, insists Dr. Conesa, to put all this data together: “We will combine all these different types of information—genome, transcriptome, metabolome, and microbiome data, together with clinical and demographic information—to see why some children get the disease, and some do not.”
Measuring CRISPR Efficiency
Although NGS is an important tool for verifying CRISPR gene editing events, it’s considered too costly on a large scale. Instead, a number of researchers choose to combine standard PCR with high-throughput fragment analysis to quickly assess gene editing to determine which samples to sequence for conclusive confirmation.
Since the initial proofs of concept in 2012, the CRISPR-Cas9 system has quickly surpassed its predecessors as a powerful technology for cost-effective, precise, and reliable genome engineering. However, screening for editing events is a critical component of any CRISPR workflow. One company, Integrated DNA Technologies, evaluates the functional efficacy of its CRISPR gBlocks® Gene Fragments using a PCR screening method.
gBlocks Gene Fragments are synthetic DNA that replace cloned single guide RNA (sgRNA) plasmid vectors required to position the Cas9 endonuclease on the target gene. Cas9 then induces double-strand breaks in the DNA, ultimately resulting in random or specific insertions, deletions, or mutations.
To identify the occurrence of gene edits, the company isolates DNA from treated cells and amplifies the targeted gene via PCR. The PCR products are treated with cleavage enzymes that recognize mismatched DNA. The cleaved PCR products are then processed with Advanced Analytical Technologies’ Fragment Analyzer™, which can identify and quantify the cleaved DNA, using capillary gel electrophoresis, to indicate the presence of editing.
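One way the quantified fragments might be turned into an editing estimate is sketched below. This is an assumption for illustration, not the company's documented calculation: it uses a formula commonly applied to mismatch-cleavage assays, in which the fraction of cleaved signal is converted to an estimated percentage of edited alleles as 100 × (1 − √(1 − fraction cleaved)). The function names and signal values are hypothetical.

```python
import math


def fraction_cleaved(uncut_signal, cleaved_signals):
    """Fraction of total PCR product signal found in the cleaved fragments."""
    cleaved_total = sum(cleaved_signals)
    total = uncut_signal + cleaved_total
    if total <= 0:
        raise ValueError("Total signal must be positive")
    return cleaved_total / total


def estimated_editing_percent(frac_cleaved):
    """Estimate percent edited alleles from the cleaved fraction.

    Uses the commonly cited mismatch-cleavage estimate
    100 * (1 - sqrt(1 - fraction_cleaved)); this is an assumed model.
    """
    if not 0.0 <= frac_cleaved <= 1.0:
        raise ValueError("Cleaved fraction must be between 0 and 1")
    return 100.0 * (1.0 - math.sqrt(1.0 - frac_cleaved))


# Made-up peak signals (arbitrary units) for three samples.
samples = {
    "sample_A": (75.0, [15.0, 10.0]),
    "sample_B": (98.0, [1.0, 1.0]),
    "sample_C": (100.0, []),
}

for name, (uncut, cleaved) in samples.items():
    frac = fraction_cleaved(uncut, cleaved)
    print(f"{name}: cleaved fraction {frac:.2f}, "
          f"estimated editing {estimated_editing_percent(frac):.1f}%")
```

Samples with a higher estimate would be the natural candidates to send on for sequencing, which remains the conclusive confirmation step.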
“This automated, high-throughput Fragment Analyzer workflow has been ideal for the quantitative analysis of gene editing events while requiring one-tenth of the sample material required for agarose gels,” says Ashley Jacobi, research scientist at Integrated DNA Technologies.
Ian Clift, Ph.D., is a scientific communications consultant at Biomedical Associates and a clinical assistant professor at Indiana University.