GEN UPDATES in biotechnology:
Next-Generation Sequencing
Enabling Routine Sequencing of Individual Genomes
Jan Berka
The biotechnology boom of the late 90s and early 2000s and sequencing of the first human genome by the whole-genome shotgun method were in part enabled by technologies, tools, and processes developed over more than a decade. Thanks to these developments, DNA sequencing costs have fallen more than 50-fold. However, it still costs about $10 million to sequence 3 billion base pairs—the amount of DNA found in the genomes of humans and other mammals.
In 2004, the National Human Genome Research Institute (NHGRI), part of the NIH, awarded more than $38 million in grants to spur the development of innovative technologies that would dramatically reduce the cost of DNA sequencing. The near-term goal was to lower the cost of sequencing a mammalian-sized genome to $100,000 and the ultimate goal was to cut the cost of whole genome sequencing to $1,000 or less, which would enable routine sequencing of individual genomes as part of medical care. In 2005, the NHGRI followed with a second round of five-year sequencing technology grants totaling $32 million.
In October 2005, 454 Life Sciences and Roche Applied Science announced a commercial launch of the Genome Sequencer 20 System (GS-20) and reagents, the first DNA sequencing system on the market aimed at achieving the above goals.
The GS-20 System, developed by 454 Life Sciences, is an ultrahigh-throughput automated DNA-sequencing system, capable of carrying out and monitoring sequencing reactions in a massively parallel fashion. Since the GS-20 System provides a complete solution for ultrahigh-throughput DNA sequencing, an individual researcher can, for the first time, prepare samples, sequence reactions, generate sequence reads, and assemble genome sequence data within days. The system includes the Genome Sequencer 20 Instrument and accessories, software to generate basecalls and assemble or map the raw reads, and consumable kits required for library construction, clonal amplification, and sequencing.
The GS-20 sequencing chemistry utilizes the release of pyrophosphate (PPi) that occurs with each nucleotide addition during DNA-directed DNA synthesis to generate an amount of light commensurate with the amount of PPi released. This light is captured by a charge-coupled device camera and converted into a digital signal. The combination of signal intensity and positional information over the PicoTiterPlate™ device allows the Sequencer’s computer to determine the sequence of hundreds of thousands of individual reactions simultaneously, producing millions of nucleotides of sequence per hour (Figure 1).
Using the GS-20, the whole-genome sequencing workflow from sample input to data output can be performed by a single operator in a general laboratory setting. It consists of DNA library preparation, emulsion-based clonal PCR amplification, PicoTiterPlate device preparation and sequencing instrument run, and data analysis. The output of a single run is typically 20 × 10⁶ nucleotides or more (for the 70 × 75-mm PicoTiterPlate device), and multiple runs can be pooled for off-line assembly/mapping. The final consensus sequence is output as a FASTA file with an associated basecall quality score file.
Figure 1. 454 Life Sciences’ Genome Sequencer 20 Instrument’s main principle of operation. The sequencing reagents are pumped from the Reagents Cassette to the PicoTiterPlate Cartridge (reagent selection is done through a set of valves, not shown). Reagents flow across the surface of the PicoTiterPlate device into the reaction wells, and the spent reagents flow back to the waste container in the hollow Reagents Cassette. The light resulting from the sequencing reaction (white arrows) travels through the back of the PicoTiterPlate device (constructed of optical fibers) to reach the CCD camera. Inset: Each well contains no more than one large DNA bead (sample), which is surrounded by smaller beads that carry the enzymes required for the chemiluminescence reaction.
Software, Flowgrams, Individual Sequence Read Length, and Accuracy
The raw data resulting from a sequencing run consists of a series of digital images captured by the camera, where the images are a representation of the surface of the PicoTiterPlate device over which the sequencing reactions are taking place. Each image corresponds to one reagent flow over that surface, as defined by the run script. If the sample DNA fragment present in a given PicoTiterPlate well is extended during a nucleotide flow, light is emitted from the well and captured on the image corresponding to that flow. Furthermore, the amount of light emitted is proportional to the number of nucleotides extended. Knowledge of the nucleotide flowed while each image is being captured (from the run script), of the location on the PicoTiterPlate device where light is being emitted (coordinates of each pixel on the images), and of the amount of light emitted during each flow (brightness of the pixels in the corresponding images) allows the software to identify PicoTiterPlate wells that contain a DNA library fragment and to determine the sequence of the DNA fragments present in each well.
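The decoding step described above can be illustrated with a minimal sketch. It assumes a cyclic flow order of T, A, C, G (in practice the order comes from the run script), signals already normalized so that one incorporated nucleotide gives a value near 1.0, and simple rounding to estimate homopolymer length. The function name and the example signal values are hypothetical, and the actual GS-20 signal processing is considerably more involved.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Translate normalized per-flow light signals from one well into bases.

    flow_order is an assumed cyclic nucleotide order, not taken from the article.
    """
    bases = []
    for i, signal in enumerate(flow_signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Light is proportional to the number of nucleotides incorporated,
        # so the nearest integer estimates the homopolymer length.
        n_incorporated = int(signal + 0.5) if signal > 0 else 0
        bases.append(nucleotide * n_incorporated)
    return "".join(bases)


# Hypothetical normalized signals for eight consecutive flows (T, A, C, G, T, A, C, G).
signals = [1.02, 0.05, 0.97, 2.10, 0.03, 1.95, 0.08, 1.01]
print(call_bases(signals))
```

A signal near 2 is read as a two-base repeat and a signal near 0 as no incorporation, which is why the signal thresholds for singlets and repeats matter for read accuracy.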
The GS-20 software package is used to process the data acquired by performing one or more sequencing runs on a Genome Sequencer 20 Instrument. Such data processing is divided into two main categories: run-time processing and post-run processing. Part of the run-time data processing can also be delayed and performed after the sequencing run completes, while the post-run data processing is always executed on a Linux-based computer, separate from the computer onboard the Genome Sequencer 20 Instrument.
Each step in the processing of sequencing data in the Genome Sequencer 20 System is governed by a specific application. The run-time applications include acquisition of the raw images during the sequencing run itself, image processing, and signal processing. These are always carried out during a sequencing experiment. One can choose between two post-run processing applications, depending on the experimental setup: de novo genome assembly, or resequencing by mapping to a reference genome. An additional application, Phoenix, is an interactive run browser/diagnostic tool that graphically displays the images, some intermediate data, and various output metrics from a sequencing run. An example of a single-well sequence read in the flowgram graphical representation is shown in Figure 2. The output of the run-time software pipeline is a set of filtered sequence reads with phred-equivalent basecall quality scores. Read accuracy for an individual sequence is typically 99% over the 100-base read length.
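For readers less familiar with phred-style scores, the relationship between a quality score Q and the basecall error probability P is Q = -10 log10(P). The short sketch below applies that standard definition; the function names are illustrative only and are not part of the GS-20 software.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred-style quality score."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Phred-style quality score implied by a per-base accuracy (0 < accuracy < 1)."""
    return -10 * math.log10(1 - accuracy)


# 99% per-base accuracy corresponds to about Q20;
# Q40 corresponds to one expected error in 10,000 bases (99.99% accuracy).
print(round(accuracy_to_phred(0.99)))
print(phred_to_error_prob(40))
```

On this scale, the 99% single-read accuracy quoted above sits near Q20, while the Q40+ consensus bases mentioned for the mapping and assembly applications correspond to 99.99% accuracy.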
The applications of the post-run phase all use the trimmed flowgram information from the sequence-producing wells of the sequencing run (or of a pool of runs) as input. The Assembly application generates a consensus sequence of the whole DNA sample by assembly of the reads into contigs (de novo shotgun assembly).
The Mapping application generates the consensus DNA sequence by mapping (aligning) the reads to a reference sequence, and also produces a list of high-confidence mutations. The current version of the GS-20 software can analyze genomes up to 50 Mbp in size at 15–25X depth of coverage. The Mapping application will typically achieve ≥99.99% accuracy (Q40+ bases) over 95% of the nonrepeat parts of the genome when the average genome coverage is at least 15X.
The Assembler application will yield an N50 contig size of ≥10 kb (the N50 is the contig length at or above which the contigs together contain 50% of the assembled bases), with ≥99.99% accuracy (Q40+ bases) over 95% of the nonrepeat parts of the genome, when the average genome coverage is at least 25X. Examples of several bacterial genome assemblies are shown in the Table.
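The N50 statistic quoted here can be computed from a list of contig lengths as in the following sketch. It uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together hold at least half of the assembled bases); the contig lengths in the example are made up for illustration and do not come from any GS-20 assembly.

```python
def n50(contig_lengths):
    """Return the N50 contig size for a list of contig lengths (in bases)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths in bases.
print(n50([100_000, 80_000, 60_000, 40_000, 20_000]))
```

Because a few long contigs dominate the sum, N50 is usually larger than the average contig size, as the Table's figures also show.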
Since the Genome Sequencer 20 System utilizes neither cloning in bacteria nor electrophoretic separation, sequence coverage biases normally associated with these techniques are eliminated. We have confirmed the lack of sequence coverage bias by sequencing several tens of bacterial genomes in-house. The remaining gaps in assembled genome sequences are largely due to the presence of sequence repeats longer than ~75 bp.
Figure 2. An example well flowgram showing the four-base key sequence and signal thresholds for singlets, two-, three-, and four-base repeats.
Figure 3. Paired-end library preparation scheme. Genomic DNA is fragmented to yield an average fragment size of around 2.5 kb. The fragmented genomic DNA is methylated with EcoR I methylase to protect the EcoR I restriction sites. The ends of the fragments are polished to make them blunt, and an oligonucleotide adaptor is blunt-end ligated onto both ends of the DNA fragments. Subsequent digestion with EcoR I restriction enzyme cleaves a portion of the adaptor DNA, leaving sticky ends. The fragments are circularized and ligated, resulting in 2.5-kb circular fragments. The adaptor DNA contains two Mme I restriction sites, and after treatment with Mme I, the circularized DNA is cleaved 20 nucleotides away from the restriction sites in the adaptor DNA. This digestion generates small DNA fragments that have the adaptor DNA in the middle and, on each end, 20 nucleotides of genomic DNA that were once approximately 2.5 kb apart. These small, biotinylated DNA fragments are purified from the rest of the genomic DNA by streptavidin beads.
Paired-End Libraries and Genome Assembly Data
454 Life Sciences has developed a new protocol to generate a library of paired-end fragments that are used to determine the orientation and relative positions of contigs produced by the de novo shotgun sequencing and assembly. The paired-end library DNA fragments are 84 bp long and contain a 44-mer adaptor sequence in the middle, flanked by a 20-mer sequence on each side. The two flanking 20-mers are segments of DNA that were originally located approximately 2.5 kb apart in the genome of interest.
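The fragment layout just described (20-mer, 44-mer adaptor, 20-mer) suggests how a paired-end read can be split into its two genomic tags. The sketch below is a simplification that assumes an error-free, full-length 84-base read and cuts purely by position; the function name and the placeholder read are hypothetical, and a real pipeline would more likely locate the adaptor by sequence matching and account for tag orientation.

```python
TAG_LEN = 20       # genomic 20-mer on each side (from the article)
ADAPTOR_LEN = 44   # adaptor in the middle (from the article)


def split_paired_end(read):
    """Split an 84-base paired-end read into (left tag, adaptor, right tag)."""
    expected = 2 * TAG_LEN + ADAPTOR_LEN
    if len(read) != expected:
        raise ValueError(f"expected a {expected}-base read, got {len(read)}")
    left = read[:TAG_LEN]
    adaptor = read[TAG_LEN:TAG_LEN + ADAPTOR_LEN]
    right = read[TAG_LEN + ADAPTOR_LEN:]
    return left, adaptor, right


# Placeholder read: the letters only mark the three regions, not a real sequence.
read = "A" * 20 + "N" * 44 + "C" * 20
left_tag, adaptor, right_tag = split_paired_end(read)
print(left_tag, right_tag)
```

Once each tag is placed on a contig, the known separation of roughly 2.5 kb between the two tags is what lets an assembler order and orient neighboring contigs.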
The paired-end library is generated by a simple and robust protocol. The purified paired-end fragments are processed through the normal library preparation protocol for the GS-20 (see Figure 3 for an outline of the process), followed by the standard emulsion PCR and sequencing steps of the GS-20 system. Sequence data obtained from the paired-end reads are combined with standard GS-20 whole-genome shotgun sequencing reads in a new version of the Assembler that aims at genomes up to 1 Gbp in size.
The benefits of combining the GS-20 shotgun sequence reads with the paired-end reads have been tested on several bacterial genomes and a Saccharomyces cerevisiae genome, previously sequenced at 454 Life Sciences. The 4.6-Mbp genome of the E. coli K12 strain was sequenced in three standard GS-20 runs to a depth of 22-fold. The assembly performed with the Newbler assembly software resulted in 140 unoriented contigs. An additional sequencing run of a paired-end library yielded approximately 112,000 reads. The paired-end data improved the genome assembly to 20 multicontig scaffolds, covering 98.6% of the genome. An illustration of this result is shown in Figure 4.
The genome of Bacillus licheniformis ATCC 14580 (DSM 13) (4.2 Mbp) was shotgun sequenced in three sequencing runs, yielding approximately 27X oversampling. The assembly performed with the Newbler assembly software resulted in 98 unoriented contigs. An additional sequencing run of a paired-end library yielded approximately 255,000 reads. The paired-end data improved the assembly to nine scaffolds, covering 99.2% of the genome.
The 12.2-Mbp genome of Saccharomyces cerevisiae S288C (16 haploid chromosomes and one 86-kbp mitochondrion) was shotgun sequenced in nine sequencing runs, yielding approximately 23X oversampling. The assembly performed with the Newbler assembler resulted in 821 unoriented contigs. Two additional sequencing runs of a paired-end library yielded approximately 395,000 reads. The paired-end data reduced the assembly to 153 scaffolds, covering 93.2% of the genome.
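As a rough cross-check of the three examples, the sequence yield per run implied by each can be estimated as genome size × depth ÷ number of shotgun runs. The sketch below uses only the rounded figures quoted in the text, so the results are approximate; the function and variable names are illustrative.

```python
def implied_yield_per_run(genome_size_bp, depth, runs):
    """Approximate bases sequenced per run, given genome size, depth, and run count."""
    return genome_size_bp * depth / runs


# (genome size in bp, fold coverage, shotgun runs), as quoted in the text.
examples = {
    "E. coli K12": (4.6e6, 22, 3),
    "B. licheniformis": (4.2e6, 27, 3),
    "S. cerevisiae S288C": (12.2e6, 23, 9),
}

for name, (size, depth, runs) in examples.items():
    megabases = implied_yield_per_run(size, depth, runs) / 1e6
    print(f"{name}: about {megabases:.1f} million bases per run")
```

The estimates come out at roughly 31 to 38 million bases per run, in line with the stated typical output of 20 million nucleotides or more per run.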
In the above examples, we have demonstrated how the ordering and orienting
of contigs, enabled by the paired-end library sequence data, generates
scaffolds that provide a high-quality draft sequence of the genome.
Table. Examples of Several Bacterial Genome Assemblies
| | M. genitalium | S. pneumoniae | E. coli | B. licheniformis |
| Genome Size [bp] | 580,069 | 2,014,239 | 4,639,675 | 4,222,645 |
| Number of GS-20 Runs | 0.5 | 2 | 3 | 3 |
| Assembly Contigs | 19 | 228 | 140 | 105 |
| Assembly Coverage | 96.66% | 92.46% | 97.46% | 98.62% |
| Overall Accuracy | 99.993% | 99.991% | 99.998% | 99.993% |
| Average Contig Size | 29.5 kb | 8.8 kb | 32.4 kb | 39.7 kb |
| N50 Contig Size | 41.0 kb | 14.0 kb | 67.2 kb | 74.3 kb |
| Largest Contig | 130 kb | 66 kb | 164 kb | 262 kb |
Ultradeep Sequencing of Amplicons
The GS-20 system is based on clonal amplification of single DNA fragment molecules in combination with a high-throughput sequencing chemistry. The characteristics of the resulting sequence reads, currently 100 bases long on average but obtainable at depths of tens of thousands-fold, open a unique opportunity to employ the Genome Sequencer in applications where detection of rare variants of a known sequence in complex mixtures of sequences is crucial.
Direct sequencing of mixed, nonclonal amplicons using Sanger dideoxy terminator chemistry is not sensitive enough to identify and quantitate many of the sequence variants present in biological specimens. Bacterial cloning of amplicons into a vector prior to traditional sequencing of individual clones will increase the sensitivity but not without a large increase in time and cost, thus making this approach uneconomical. 454 technology provides instant cloning of hundreds of thousands of molecules via the emulsion PCR step and highly accurate sequencing, since now each fragment can be sequenced hundred- or thousand-fold deep.
Although there are many potential uses for amplicon sequencing, the molecular biology and software developments at 454 Life Sciences have initially focused on oncology research applications, where early detection of mutations that cause drug resistance in tumor cells may be of paramount importance for the design of future diagnostic assays. The level of diversity within oncology-derived samples is low when compared to other types of samples, such as the variable regions within viruses, thereby simplifying the informatics analysis tools. Additionally, sample generation from human genomic DNA, due to the lack of highly variable regions, allows a straightforward design of amplification primers and is therefore well suited to the current 100-bp read lengths of the GS-20. Moreover, none of the existing high-throughput technologies, based on hybridization on microarrays or bead arrays, offers the possibility of novel variant detection.
To demonstrate the power of the GS-20 system, we chose as a model system previously described SNPs spanning the region from upstream of the HLA-DMA gene to the TAP2 gene in the class II region of the MHC. We were able to reproduce the published data using our system; allele frequencies down to 3% were easily detected, as shown in Figure 5.
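The idea behind detecting low-frequency alleles from deep amplicon reads can be sketched as a simple count of the bases observed at one aligned position. The example below is only an illustration: the function names, the 3% reporting cutoff used as a default, and the made-up column of 1,000 bases are assumptions, and the sketch does not model sequencing errors or any statistics the 454 software may apply.

```python
from collections import Counter


def allele_frequencies(column_bases):
    """Fraction of reads showing each base at one aligned position."""
    counts = Counter(column_bases)
    depth = sum(counts.values())
    return {base: n / depth for base, n in counts.items()}


def variants_above(column_bases, reference_base, min_fraction=0.03):
    """Non-reference bases whose frequency meets an assumed reporting cutoff."""
    freqs = allele_frequencies(column_bases)
    return {
        base: freq
        for base, freq in freqs.items()
        if base != reference_base and freq >= min_fraction
    }


# Hypothetical position covered by 1,000 reads: 960 show A, 40 show G.
column = "A" * 960 + "G" * 40
print(variants_above(column, reference_base="A"))
```

With a depth of 1,000 reads, an allele present at a few percent is supported by dozens of independent clonal reads, which is what makes such frequencies measurable without bacterial cloning.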
Figure 4. De novo assembly results for E. coli K12 aligned against a reference genome. The reference genome is represented by the top black line. The standard whole-genome shotgun sequence and assembly is represented by the pink bars. Repeat regions of the genome are represented by the green bars at the bottom. Spaces between the pink bars are typically a result of repeat regions that cannot be uniquely assigned to a region in the genome. With the addition of one run of paired-end reads, represented by the purple bars, the genome sequence becomes much more complete.
Conclusions
454 Life Sciences' GS-20 sequencing system is the first high-throughput, low-cost alternative to the current systems that are based on Sanger chemistry and electrophoretic separations.
The system opens the possibility of whole-genome sequencing to many laboratories that are not equipped with the infrastructure necessary to support the workflow required by capillary-based DNA sequencers. Among the major advances is instant cloning by emulsion PCR, a bias-free, source-sample-independent method that matches the throughput of the massively parallel sequencing chemistry and detection platform. High individual sequencing read accuracies combined with adequate read length, vast oversampling, and paired-end sequence information result in high-confidence genome assemblies and open up an array of future applications that will change the way we study individual organisms and biological systems.
Figure 5. Genotyping results of three SNPs in the HLA-DMA gene region (class II MHC). Base changes along the fragment sequence (x axis) are color coded and their positions shown as bars. The primary y axis denotes base change frequency; the secondary y axis and the black line above the mutation spectrogram represent sequencing coverage. Both high-frequency alleles (top panel) and low-frequency alleles (bottom panel) are shown.
Jan Berka, Ph.D., is director, molecular biology at 454 Life Sciences. Web: www.454.com. Phone: (203) 871-2318.

