April 15, 2012 (Vol. 32, No. 8)
Harry Gao, M.D., Ph.D. City of Hope Medical Center
Stefan J. Green, Ph.D. University of Illinois at Chicago
Nadereh Jafari, Ph.D. Northwestern University
Brewster Kingham University of Delaware
Andor Kiss, Ph.D. Miami University of Ohio
Robert Lyons, Ph.D. University of Michigan
W. Kelley Thomas, Ph.D. University of New Hampshire
An Exclusive Q&A with Our Expert Panel
As GEN reporter Greg Crowther, Ph.D., pointed out in his article on next-generation sequencing in our March 1 issue, the NGS field is rapidly expanding. In addition to keeping track of ongoing advances in instrumentation, scientists are increasingly interested in NGS practices related to sample prep and analysis. In other words, NGS is becoming a critical component of many investigators’ armamentarium of research tools.
Recognizing the growing importance of NGS, we created this special “Tech Tips” section on the topic for this issue of GEN. We interviewed leading NGS scientists and practitioners to gain their insights on how they best utilize this technique and how they get the most bang for their buck across a wide range of applications. The people interviewed for this special feature were Harry Gao, M.D., Ph.D., director, DNA Sequencing Laboratory, City of Hope Medical Center; Stefan J. Green, Ph.D., director, DNA Services Facility, University of Illinois at Chicago; Nadereh Jafari, Ph.D., research associate professor and director, Genomics Core Facility, Northwestern University; Brewster Kingham, director, DNA Sequencing & Genotyping Center, Delaware Biotechnology Institute, University of Delaware; Andor Kiss, Ph.D., supervisor, Center for Bioinformatics and Functional Genomics, Miami University of Ohio; Robert Lyons, Ph.D., director, DNA Sequencing Core, University of Michigan; and W. Kelley Thomas, Ph.D., director, Hubbard Center for Genome Studies, and professor, department of Molecular, Cellular and Biomedical Science, University of New Hampshire.
We specifically asked our interviewees to describe the types of instruments they currently use and the kinds of experiments they are carrying out. We also queried them on probably the greatest roadblock in biotech research today: data overload. How do they deal with it and, more importantly, what are some solutions for overcoming this problem?
These and other questions were designed to elicit responses from our interviewees with you, our readers, in mind. We believe that by tapping into the expertise of top NGS scientists and discovering their approaches to exploiting this sophisticated methodology, we can help shed light on NGS issues that you might be trying to work your way through while simultaneously providing you with some advice on ways to conduct your NGS experiments to garner better results.
What types of next-generation sequencing does your facility use, and for what types of experiments or samples?
Dr. Gao: We use the Illumina HiSeq 2000, GAIIx, and Roche 454 FLX. We sequence more than 90% of our samples on the HiSeq 2000. More than 50% of the samples we receive are for smRNA sequencing, 10% for ChIP-seq, 20% for targeted, exome, or whole-genome sequencing, and 15% for RNA-seq. The remaining 5% are for methylation and other projects.
Dr. Green: The DNA Services Facility at the University of Illinois at Chicago houses an Ion Torrent Personal Genome Machine and, in collaboration with our sister facility at the University of Illinois at Urbana-Champaign (UIUC), has access to Roche 454 Titanium and Illumina HiSeq 2000 platforms. Our facility has focused on instrumentation for library preparation, including a Covaris S2 acoustic shearing device for DNA shearing, a PippinPrep automated size-selection device, and an Ion Torrent OneTouch instrument for automated emulsion PCR. In addition, we have a Sequenom MassArray 4 for moderate-throughput, multiplexed genotyping (particularly single nucleotide polymorphism analysis).
Our facility accepts a wide variety of sequencing projects, including genome sequencing and resequencing (viral, bacterial, small eukaryote), metagenome sequencing (natural and contaminated soil and aquatic samples), genotyping and targeted resequencing, microbial ribosomal RNA gene sequencing, cancer panel amplicon sequencing, and RNA-seq. We perform library preparations directly from nucleic acids provided by customers, or from nucleic acids extracted in our facility.
Dr. Jafari: Our core facility has two Applied Biosystems SOLiD 5500xl instruments, and will soon install an Ion Torrent system. We provide ChIP-seq, RNA-seq, and targeted resequencing, including exome-seq.
Kingham: My facility provides Illumina SBS sequencing on the HiSeq 2000 platform, as well as single-molecule SMRT sequencing on the Pacific Biosciences RS platform. Many of the Illumina experiments have been small RNA and transcriptome sequencing, with some genomic and targeted sequencing experiments. We are still in the technology-assessment phase with our PacBio RS, so we have not officially launched this as a service. However, we are starting to see the capabilities of the RS influence our Illumina queue by increasing the number of long-read paired-end runs. Hybrid assembly, in which Illumina data are used to “polish” the lower-accuracy PacBio data, is becoming standard for those fortunate enough to have access to both platforms.
Dr. Kiss: We currently use a turnkey service at The Ohio State University for Roche 454 sequencing, as well as the Illumina HiSeq 2000 at the University of Cincinnati. Our projects mostly involve cDNA (EST libraries) and RNA-seq.
Dr. Lyons: The University of Michigan DNA Sequencing Core has most of the major commercial sequencing platforms (five HiSeqs, one GA, two SOLiD 4s, a 454 FLX, a PacBio RS, and an Ion Torrent PGM). The majority of our next-gen activities involve human whole-genome, shallow-draft sequencing on the HiSeq. However, we try to meet all the needs of our very diverse clients, so just about any service procedure is offered, with varying levels of support (ChIP-seq, RNA-seq, metagenomics, microbiomics, exomes, targeted capture, nonhuman genomes, etc.).
Dr. Thomas: At New Hampshire we use the 454 and Illumina platforms to conduct whole-genome shotgun, RNA-seq, BSSeq, metagenomics, metatranscriptome, and a lot of RefSeq. We do many small projects each month, such as partial 454 plates or single Illumina lanes.
Who are your customers? Which do they value more, cost or speed?
Dr. Gao: Most of our approximately 100 samples per week come from our internal university users. Fewer than 10% are from outside organizations. Cost, turnaround time, and data quality are all important for the researchers we serve.
Regarding speed, filling a flow cell for a particular run takes time. We can fill one flow cell easily for a short, single-read 40 bp run by combining miRNA-seq and ChIP-seq samples. It takes longer to obtain enough samples for a PE 2 x 100 bp run. It would be great to be able to run each lane independently, as is possible with the SOLiD 5500xl, for example.
The turnaround depends on the technology and application. For an Illumina PE 2 x 100 bp run, turnaround time is normally 8 days, or 11 days when we’re running two flow cells simultaneously. The Oxford Nanopore technology will change the field completely with fast turnaround (about 15 minutes per human genome), and the cost is expected to be less than $1,000. This system is supposed to be available by the end of this year.
Dr. Green: Our customers are largely affiliated with the University of Illinois and include both faculty and research physicians. In addition, external customers typically include microbial ecologists looking for amplicon sequencing, genome, metagenome, and metatranscriptome sequencing. Typically, but not always, cost is more important for my customers than speed. This varies from project to project, and some projects are highly time sensitive (particularly those supporting grants).
Because of the diverse types of samples and projects, it is hard to provide an approximate throughput as this varies significantly. We are currently a rather small facility, processing, for example, only a few Ion Torrent samples per week.
Our facility provides a variety of services to address a range of scientific endeavors—from medical to environmental research. By maintaining an Ion Torrent, we have the ability to rapidly produce sequence data for small to medium projects, and collaborate with other sequencing facilities for the largest projects. Through this approach, we are able to match the appropriate sequencing platform to meet the cost and time constraints of our customers.
Dr. Jafari: We mostly provide services to our internal users at Northwestern University and its affiliates. Both speed and cost are important to our investigators, but current budget issues have put more strain on researchers from the cost perspective.
Kingham: Our customers are largely University of Delaware investigators, or investigators with ties to the university. The rapidly growing field of translational research has connected us with many clinical researchers employed by regional healthcare systems. Most investigators are looking for a balance between cost, amount of data, data quality, and turnaround time. This balance can fluctuate based on the project. For example, turnaround time is critical for obtaining preliminary data for an upcoming grant, while data accuracy is more important for the targeted sequencing of an oncogene. Our HiSeq 2000 runs at about 75–80% capacity; we are still validating the PacBio RS.
Dr. Kiss: Our customers are members of the Miami University community. Cost and speed are both factors for them, but I would say cost is more of a concern. We process very few samples, as we do not currently have instrumentation on site. In addition to the two facilities mentioned earlier, for large genome sequencing projects we recommend Ohio State University’s Plant-Microbe Genomics Facility. Despite its name, this facility conducts all types of sequencing.
Dr. Lyons: We have roughly 100 distinct next-gen client laboratories, virtually all of whom are at our own university. We are willing to accept projects from outside users, but are required to add a surcharge to their recharge rates. In practice, few outsiders opt to send samples for next-generation sequencing here—unlike our Sanger services.
It is impossible to express throughput in terms of either samples or projects, due to the extreme diversity of the projects we handle. For one project, we’ve done over 1,000 human genomes in the past year, with 3,000 more to be completed by mid-2012. Other clients occupy a single lane, yet require disproportionately higher effort on our part. Sample counts are misleading, too. Sometimes a single sample will occupy numerous lanes, while other times a single lane could have up to 96 samples in it.
In my opinion, clients are somewhat more concerned with cost right now than with speed. A close third consideration, though, is flexibility and availability of options. We try to accommodate clients with urgent needs or nonstandard protocols when we can.
Dr. Thomas: We organize sequencing primarily for research groups at the University of New Hampshire. These researchers are primarily environmental biologists and microbiologists, with a few biochemists; both speed and cost are issues for them. We have wait times in excess of six months and sometimes as long as one year. This is specific to the Illumina platform, which is the most popular and cost-effective. But every run is two weeks long. This is a major issue as there is not sufficient infrastructure in the U.S. to meet sequencing research needs. It is also a problem for us because we have to go outside for this service.
How do you deal with data overload? How important is informatics technology to your workflow?
Dr. Gao: We are lucky to have great support from our institution. We have more than 600 terabytes of online storage and a tape-based data backup system in place. We also have a good bioinformatic core with several commercial and open-source software systems. But data analysis is still very challenging.
Our group works with the university’s bioinformatics core on data analysis, but the customer decides how the data will be analyzed. Software can do quite a bit, like aligning the data to references, mutation/SNP calling, deletions/insertions calling, gene expression, DNA binding site peak-finding, and others. We also rely on our own scripts for special applications. But no one software package is capable of doing everything for next-generation sequencing.
Dr. Green: This is definitely a major concern for us. Although the Ion Torrent files are small relative to those generated by Illumina and Roche 454, the data output is still impressively large. We are currently seeking solutions with our developing bioinformatics department.
Dr. Jafari: In the beginning we had many issues with the amount of data we generated. Currently, we are not retaining a large amount of those data and images. At the same time, we now have access to the Northwestern computing and storage capabilities, which has made our data management much easier. Our new data-retention policy will further help our data management and will prevent data overload.
Kingham: Data overload need not necessarily be a problem, but it must be handled properly or bad things will happen. As a shared resource facility we deal with this by implementing a data retention policy and seeing that investigators understand what this means for their data. Informatics technology has never been more important. Even with data overload, it is our responsibility to see that all the data has integrity, and is properly backed up.
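One common way to check that a backed-up file still matches its original is to compare cryptographic checksums. The sketch below is purely illustrative and is not a description of any panelist's system; the function names `sha256_of` and `backup_matches` are hypothetical, and it assumes both copies are readable as ordinary files.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks
    so that large sequencing files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path, backup_path):
    """True when the backup is byte-for-byte identical to the original
    (hypothetical helper; compares the two SHA-256 digests)."""
    return sha256_of(original_path) == sha256_of(backup_path)
```

A facility could record the digest when a run finishes and recompute it after each copy, so that silent corruption is caught before the original is deleted under a retention policy.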
Dr. Kiss: Illumina’s BaseSpace solution is very attractive to us and may be the tipping point in helping us decide whether to purchase an Ion Torrent or a MiSeq instrument. We are required by NIH and NSF to store all data for five years after the project is completed. We currently use CD backup, server backup, and external HDD backup.
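A retention rule like the five-year period Dr. Kiss describes can be expressed as a simple date calculation. This is a minimal sketch under that stated assumption; the function name `earliest_deletion_date` and the handling of a February 29 completion date are illustrative choices, not part of any funder's policy text.

```python
from datetime import date


def earliest_deletion_date(project_end, retention_years=5):
    """Return the first date on which data may be deleted, assuming data
    must be kept for `retention_years` after the project is completed."""
    target_year = project_end.year + retention_years
    try:
        return project_end.replace(year=target_year)
    except ValueError:
        # project_end is Feb 29 and the target year is not a leap year;
        # as an assumption, keep the data one extra day (until Mar 1).
        return date(target_year, 3, 1)


def may_delete(project_end, today, retention_years=5):
    """True once the retention period has fully elapsed."""
    return today >= earliest_deletion_date(project_end, retention_years)
```

For example, a project completed on April 15, 2012 would have to keep its data until April 15, 2017 under this rule.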
Dr. Lyons: Information overload is indeed a huge, huge problem. We are constantly struggling to manage, store, and deliver the data. We dedicate a significant amount of effort toward developing sample-tracking and project-tracking software specific to our unique needs, and toward increasing our basic storage infrastructure. We don’t even try to do the bioinformatics; that is the province of a separate core, but they are badly overloaded and it’s going to get worse!
Dr. Thomas: We deal with data poorly and ad hoc, but we’re working to provide centralized servers and storage. Informatics technology, both storage and software, is critical for our users who are not computer scientists, and who may be conducting their first-ever DNA sequencing experiment.
How can instrument makers improve instruments and streamline workflows?
Dr. Gao: We need faster turnaround. Eight or eleven days is too long for a sequencing run. Higher throughput and lower instrument costs are also important.
Dr. Green: The length of the workflow, particularly for the Ion Torrent, is a significant concern, and we have been looking for measures to address this. The total time itself is not so significant; the issue is that much of it is hands-on laboratory time, and some aspects of chip loading are highly sensitive to the experience of the user.
Dr. Jafari: Easier workflow and better on-instrument data-analysis software would be instrumental in streamlining NGS projects. By “on-instrument” I mean having a separate computer, not part of the instrument, holding the software. I think this should conduct basic sequencing analysis, just like they have for microarray analysis. Affymetrix and Illumina have basic tools that, if the user provides some basic information, will spit out fold changes, P-values, and basic stats. I think this can now be done, especially for RNA-seq and ChIP-seq.
Kingham: At the rate this field is advancing, this is a difficult question to answer. Maybe some of these instruments or workflows should be improved before they are commercialized. The level of variability seen on many next-gen platforms needs improvement.
Dr. Kiss: Making the bioinformatics pipeline fully automated and consistently improving this aspect of the post-run analysis would be a big improvement. The CLC Genomics Workbench software package appears to perform most of the automation a facility like ours is looking for, as well as satisfying much of our user base.
Dr. Lyons: I can’t really comment much on this topic. Manufacturers are stressed just as we are, trying to keep on top of a dramatically evolving field. While there are many things they could do to help us (improved software, flexibility of applications, better bioinformatics support), in reality, the manufacturers probably have their hands full just keeping their instruments competitive in terms of the most basic productivity measures.
Dr. Thomas: Instruments could be made cheaper and faster. One of my major issues is the lack of clarity with companies like 454 and Illumina with regard to their products’ details. Developing new protocols is almost impossible as it is extremely difficult to obtain detailed information for key things like linker sequences. This kills the distributed application development process.
Is your facility considering upgrading hardware, software, or some other aspect of your sequencing activities?
Dr. Gao: We are considering upgrading to the HiSeq 2500 to have a faster turnaround option. Oxford Nanopore technology is very promising for lower costs, faster turnaround, and longer sequence reads.
Dr. Green: We are investigating automated robotic library preparation to allow us to focus more on quality control of library preparation and increase sample throughput. In addition, we are planning on purchasing the next-generation Ion Torrent instrument—the Proton. This will eventually allow for whole human genome sequencing on a single chip.
Kingham: We are always considering upgrading. It’s important to stay on top of where the technology is going. Institutional investigators are really what drives the acquisition of new technology, so it’s important to effectively communicate what the future genomics landscape will look like.
Dr. Kiss: Yes, we are going to purchase either an Ion Torrent or a MiSeq instrument. Full genome sequencing is available to Miami University principal investigators via the Ohio State University Roche 454 sequencers. We cannot afford to duplicate this, and there is no reason to. But we could definitely afford $80,000–$120,000 for a benchtop sequencer with low acquisition and operating costs. We are also considering buying CLC Bio’s Genomics Workbench as site-licensed software for next-generation sequencing analysis. What is most attractive to us about this software is its cross-platform nature.
Dr. Jafari: We are planning to upgrade our 5500xl to the 5500W, which is supposed to eliminate the use of ePCR and double output. These upgrades will lower our prices significantly. We are looking forward to getting more affordable instruments that can handle whole-genome sequencing faster and at much lower cost. It is best to avoid having a few large centers running all the whole-genome sequencing projects.
Dr. Lyons: We are almost always planning expansions. Two more HiSeqs should arrive in the next couple of months. We may acquire an eighth HiSeq soon thereafter. Our newest sequencer, the PGM, needs to be provided with IT support and a technical team. Other instruments recently added include a Qiagen PyroMark and another Sanger sequencer.
A MiSeq is almost certainly in our future. The FLX should soon get the Plus upgrade. At least some of the HiSeqs will get upgraded to the 2500 model. Our LIMS system is undergoing a major upgrade to accommodate recent changes. This field is truly in a constant state of flux.
The reason we upgrade is simple. Newer instruments almost always provide significantly improved cost efficiency, improved production, and better data. Our clients benefit by staying in the forefront of their research field.
Dr. Thomas: Yes, due to overwhelming need we are now in the process of purchasing software for university-wide support. We’re looking for software that is user friendly, but the problem with such software is that it is not readily compatible with normal laboratory computers.