September 1, 2008 (Vol. 28, No. 15)
Richard A. Stein, M.D., Ph.D.
With the Advent of Sophisticated Platforms, Genomes Can Be Sequenced in Record Time
About three decades ago, the Sanger and Maxam-Gilbert methods revolutionized the biomedical world. Sequencing of the 5,386-base bacteriophage PhiX174 in 1977 represented a memorable moment in science.
About one thousand base pairs could be sequenced annually at the time, and relying on the same approach, completion of the Escherichia coli chromosome was estimated to require a thousand years and that of the human genome, a million years.
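As a rough check on these estimates, the arithmetic can be sketched as follows. The genome sizes used here (about 4.6 million base pairs for E. coli and about 3.1 billion for the haploid human genome) are approximate figures that the article does not state, and the function name is purely illustrative.

```python
# Back-of-envelope check of the sequencing-time estimates.
# Assumption: the E. coli and human genome sizes are approximate values
# supplied for illustration; they are not taken from the article.
BASES_PER_YEAR = 1_000  # annual throughput quoted for the late 1970s

GENOME_SIZES_BP = {
    "bacteriophage PhiX174": 5_386,
    "Escherichia coli (approx.)": 4_600_000,
    "human, haploid (approx.)": 3_100_000_000,
}


def years_to_sequence(genome_size_bp, bases_per_year=BASES_PER_YEAR):
    """Return the years needed to read a genome once at a constant rate."""
    return genome_size_bp / bases_per_year


if __name__ == "__main__":
    for name, size in GENOME_SIZES_BP.items():
        print(f"{name}: about {years_to_sequence(size):,.0f} years")
```

At that constant rate the results fall in the thousands of years for E. coli and the millions of years for the human genome, the same orders of magnitude as the estimates quoted above.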
Recent biotechnological advances changed that perspective, and the advent of next-generation platforms enables a genome to be sequenced within hours to days.
In 2004, the 454 FLX Pyrosequencer from Roche Applied Science became the first next-generation sequencer to be commercially available. The Solexa 1G Genetic Analyzer from Illumina was commercialized in the second quarter of 2006, and the SOLiD (Supported Oligonucleotide Ligation and Detection) System from Applied Biosystems launched in 2007.
Most recently, the HeliScope from Helicos BioSciences started shipping in early 2008. VisiGen is developing a nano-sequencing platform to be released by late 2009, and Pacific Biosciences anticipates commercializing a next-generation sequencer by 2010 and having a 15-minute human genome by 2013.
These latest platforms rely on single-molecule analysis and are sometimes referred to as “next-next generation” or third-generation sequencers.
While next-generation technologies usher us into a new scientific era, the short read lengths generated by certain platforms raise challenges during some applications such as genome assembly.
In late September, Roche anticipates a major product launch to coincide with the Cambridge Healthtech Institute’s “Exploring Next Generation Sequencing” conference in Providence, RI. A reagent and software upgrade for its 454 FLX Genome Sequencer will improve read length from 250 to 400–500 base pairs.
“This becomes significant,” says Timothy Harkins, marketing manager at Roche Applied Science, “because there are biological questions that you just cannot answer with short reads.”
While costs will remain relatively unchanged for customers, data throughput will quintuple to over 500 million bases per instrument run. The upgrade, called Titanium, consists of a thin metal coating that is applied to the PicoTiterPlate walls and eliminates crosstalk between individual wells, thus improving the signal-to-noise ratio. Moreover, the upgrade will allow a higher density of wells to be loaded.
If we think about new perspectives that next-generation sequencing brings to address fundamental biological questions, we cannot forget microbial genomes. At the same CHI meeting, Roger Bumgarner, Ph.D., associate professor at the University of Washington (www.washington.edu) and director of the Center for Expression Arrays, will talk about work he performs in collaboration with Casey Chen, Ph.D., from the University of Southern California (www.usc.edu).
Drs. Bumgarner and Chen sequenced seven Actinobacillus actinomycetemcomitans strains and anticipate completing seven more over the next month. Their results reveal a fascinating concept with broad applicability in microbiology: Every newly sequenced genome harbors about 10% new genes, a finding that casts a shadow of doubt on the usefulness of reference genomes when exploring new strains.
An immediate and obvious implication is that microarrays based on a previously sequenced genome will always miss about 10% of the genes in any single strain. “There are going to be a lot of things that are missing in the reference genome,” says Dr. Bumgarner, adding that, for investigators who use reference genomes for sequence assembly, this becomes “one of the big issues.”
The extent and the implications of this finding become even more obvious if we recall that a group B streptococcal pilus-like structure involved in pathogenesis went undetected for decades and was identified only after multiple genomic sequences were screened.
As we increasingly learn about newly sequenced microorganisms, the microbial pan-genome emerges as a fundamental concept with multidisciplinary ramifications.
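The pan-genome idea can be illustrated with a minimal sketch that tracks how many genes each newly sequenced strain adds to the running total. The strain names and gene identifiers below are invented for the example and are not data from the work described above; the function names are likewise hypothetical.

```python
# Toy illustration of pan-genome growth. The strains and gene identifiers
# are invented for this example; they are not data from the study above.
def pan_genome_growth(strain_genes):
    """For each strain, in order, report how many of its genes are new.

    strain_genes: list of (strain_name, set_of_gene_ids) pairs.
    Returns a list of (strain_name, new_gene_count, fraction_new, pan_size).
    """
    pan_genome = set()
    report = []
    for name, genes in strain_genes:
        new_genes = genes - pan_genome  # genes not seen in earlier strains
        fraction_new = len(new_genes) / len(genes) if genes else 0.0
        pan_genome |= genes
        report.append((name, len(new_genes), fraction_new, len(pan_genome)))
    return report


def core_genome(strain_genes):
    """Return the genes shared by every strain."""
    gene_sets = [genes for _, genes in strain_genes]
    return set.intersection(*gene_sets) if gene_sets else set()


if __name__ == "__main__":
    shared = {f"g{i}" for i in range(9)}  # nine genes common to all strains
    strains = [
        ("reference", shared | {"r1"}),
        ("strain_A", shared | {"a1"}),
        ("strain_B", shared | {"b1"}),
    ]
    for name, n_new, frac, pan_size in pan_genome_growth(strains):
        print(f"{name}: {n_new} new genes ({frac:.0%}), pan-genome = {pan_size}")
    print("core genome size:", len(core_genome(strains)))
```

In this toy data set each strain after the first contributes one gene in ten that the reference lacks, so the pan-genome keeps growing while the core genome stays fixed. A microarray designed from the reference alone would miss those strain-specific genes.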
While the ability to sequence entire chromosomes is often desirable, certain applications such as genotype-phenotype analyses or the scanning of chromosomal hot spots and disease-related genes require only certain chromosomal regions. A revolutionary approach called targeted re-sequencing, also known as genome partitioning or DNA capture, was developed for this goal and allows the enrichment and capture of relevant genomic regions, which are then made available for next-generation sequencing platforms, while other sequences can be discarded.
At the Cambridge Healthtech meeting, Agilent Technologies will present two targeted re-sequencing products from its genome partitioning portfolio.
“Customized products are one of our strengths,” maintains Fred Ernani, Ph.D., senior product manager of emerging genomics applications at Agilent. Both products will provide customers with flexibility in study design. One of them, designed for in-solution targeted re-sequencing in collaboration with Chad Nusbaum, Ph.D., co-director of the genome sequencing and analysis program at the Broad Institute, starts with as little as 3 µg of DNA and is able to capture 5 to 30 Mb of DNA.
The other product was developed through a collaboration with Greg Hannon, Ph.D., from the Cold Spring Harbor Laboratory (www.cshl.edu) for on-array genome partitioning, and requires larger amounts of DNA, in the range of 20 µg.
Each application promises to address a distinct set of customer needs. The array-based application will be the most cost-effective and flexible product for smaller studies requiring a considerable amount of design iteration. The in-solution product will be highly scalable, automatable, and cost-efficient for medium- to large-scale experiments, and will give customers increased flexibility in their work. It will “address the critical needs of researchers using next-generation sequencing for re-sequencing applications coming off of whole genome association studies and the entire spectrum of sample processing throughput,” states Emily LeProust, Ph.D., Agilent’s R&D chemistry and genome partitioning program manager.
Next-generation sequencing arrives with an explosion in the volume of data and at the same time brings considerable challenges for information management. As Ron Ranauro, president and CEO of GenomeQuest, explained, we are currently witnessing “a cycle going on between biological science and computer science that’s unlike anything that came before.”
At the Cambridge Healthtech conference, GenomeQuest will feature a web-based informatics service that provides customers with large-scale computational resources and algorithms on demand. The platform will perform “all-against-all” exhaustive sequence comparisons between sequence reads and reference data, ultimately aiming to provide customers with “the most complete results and therefore the most trustable finds,” points out Ranauro, adding that such an extensive comparison “will, in essence, purify the sample in silico.”
Another important goal, continues Ranauro, is to perform this task “in a time period that is a fraction of the time it took to actually produce the data set in the first place.” Critical components of the platform will accomplish this by providing solutions to analyze, share, and archive the vast amounts of data as well as access them through a web browser, thus integrating the benefits of local access and central management.
The vast amount of information generated by next-generation sequencers also comes with a catch: a significant proportion of the output is useless, and distinguishing good data from bad is not just an important task but a genuine challenge.
“How do you judge which reads are good, which reads are bad?” asks Anton Nekrutenko, Ph.D., associate professor at the Center for Comparative Genomics and Bioinformatics at Penn State University. In collaboration with James Taylor, Ph.D., from New York University, he developed GalaxySR, the first freely available open-source system for short reads, which he will present at the conference.
The software, which is free and requires only a web browser, is able to perform several quality-control steps even before the sequencing data are downloaded. This platform will make it “as simple as possible to go from the actual sequencing machine to some interpretable results,” emphasizes Dr. Nekrutenko.
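One common way to separate good reads from bad is to filter on per-base quality scores. The sketch below is a generic illustration of that idea and is not GalaxySR's implementation. It assumes qualities arrive as Phred+33 ASCII strings, and the thresholds and function names are arbitrary choices made for the example.

```python
# Generic read-quality filter; this is NOT the GalaxySR implementation.
# Assumptions: quality strings use Phred+33 ASCII encoding, and the default
# thresholds below are arbitrary values chosen for illustration.
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(sequence, quality_string, min_mean_quality=20, min_length=30):
    """Return True if a read is long enough and its mean quality is high enough."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    if not sequence or len(sequence) < min_length:
        return False
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_quality


def filter_reads(reads, **thresholds):
    """Keep the reads that pass. reads: iterable of (read_id, sequence, quality)."""
    return [read for read in reads if passes_filter(read[1], read[2], **thresholds)]


if __name__ == "__main__":
    reads = [
        ("good", "ACGT" * 10, "I" * 40),   # 'I' decodes to Q40
        ("noisy", "ACGT" * 10, "#" * 40),  # '#' decodes to Q2
        ("short", "ACGT", "IIII"),         # high quality but too short
    ]
    for read_id, _, _ in filter_reads(reads):
        print("kept:", read_id)
```

A mean-quality cutoff is only one possible criterion. Trimming low-quality read ends or discarding reads with ambiguous bases are alternatives that a real pipeline might apply instead or in addition.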
In a recent experiment performed to validate this platform, he set out to determine whether one can tell two geographic locations apart, after collecting flies that accumulated on his windshield and examining the nature and abundance of short reads with a 454 analyzer.
Reading counts “is a tricky thing,” warns Dr. Nekrutenko, because the result depends on several variables such as sample preparation and DNA concentration. Eukaryotic metagenomics is one immediate area for which he envisions exciting applications of GalaxySR.
Next-generation sequencing increasingly emerges as the ideal methodology to answer a thought-provoking question. Do we really know our invisible neighbors? Genomics advances make it increasingly clear that we grossly underestimated the extent of microbial diversity on our planet.
Over 99% of microbial species are thought to be unculturable, and therefore, unavailable to be studied by classical methods, suggesting that new approaches will unveil new species.
When hundreds of liters of water from the Sargasso Sea were filtered and the microorganisms captured and sequenced, 148 novel bacterial phylotypes were identified. Metagenomics, fueled by next-generation sequencing platforms, promises to shed light even on these unculturable species and to reveal the abundance of bacterial inhabitants on Earth in their true diversity.
Finally, as we approach the $1,000 genome landmark, next-generation sequencing methods forecast changes in medicine that even in the recent past seemed unimaginable. But is a double-edged sword about to be born?
Genomic polymorphisms were proposed to hold answers to the inter-individual variation of medication responses and the unpredictability of side effects, and genomics promises to catalyze the transition from the one-drug-fits-all therapeutic approach to the era of individualized therapy.
At the same time, insights into the human genome generate fervent ethical and legal debates. Over 1,000 genetic tests are commercially available, and amid estimates that we are all carriers of at least five recessive alleles, the question might soon shift from whether a person carries recessive traits to what those traits are.
Unanswered questions still linger about how genetic information is defined and handled for healthcare and life insurance underwriting, and its impact on the patient-physician relationship, employment, and public health promises profound medical, social, ethical, and legal ramifications.
Much-awaited legislation will be instrumental in shaping strategies to ensure that we take full advantage of the arduously earned knowledge and use it to benefit humankind while actively minimizing potential misuse.
Richard A. Stein, M.D., Ph.D., is an assistant research scientist at the New York University School of Medicine. Email: Richard.Stein@nyumc.org.