Two key sequencing techniques are no longer at odds, thanks to an international effort led by scientists at the University of California (UC) San Diego. The researchers have developed a reference database, called Greengenes2, which makes it possible to compare and combine microbiome data derived from either 16S ribosomal RNA gene amplicon (16S) or shotgun metagenomics sequencing techniques.

“This is a significant moment in microbiome research, as we’ve effectively rescued over a decade’s worth of 16S data that might have otherwise become obsolete in the modern world of shotgun sequencing,” said Rob Knight, PhD, professor in the departments of Pediatrics at UC San Diego School of Medicine and Bioengineering and Computer Science at UC San Diego Jacobs School of Engineering. “Standardizing results across these two methods will significantly improve our chances of discovering microbiome biomarkers for health and disease.” Knight is senior author of the team’s published paper in Nature Biotechnology, “Greengenes2 unifies microbial data in a single reference tree,” which introduces Greengenes2.

Microbiome studies depend on scientists’ ability to identify which microorganisms are present in a sample. To do this, they sequence the genetic information in the sample and compare it to reference databases that list which sequences belong to which organisms. 16S and shotgun sequencing are the two techniques most widely used in microbiome research, but they often yield different results. “… investigators using these different methods typically find their results hard to reconcile,” the authors wrote. “This lack of standardization across methods limits the utility of the microbiome for reproducible biomarker discovery … A key problem is that whole-genome resources and rRNA resources depend on different taxonomies and phylogenies.”

The original Greengenes database had been widely used in the microbiome field for well over a decade. It was the reference database used by notable projects including the National Institutes of Health Human Microbiome Project, the American Gut Project, the Earth Microbiome Project and many others.

However, one of its fundamental limitations was that it relied on the sequence of a single gene, 16S, to identify the organisms in a sample. This well-studied gene has long been used as a taxonomic marker, with each organism having its own 16S “barcode.” This method can describe the contents of a microbiome sample with genus-level resolution, but it cannot always identify specific species or strains of microbes, which is important for clinical work.

Modern microbiome studies have since transitioned to using shotgun sequencing, which looks at DNA from all over the organisms’ genomes, rather than focusing on only one gene. This powerful approach gives researchers more species-level specificity and also provides insight into the microbes’ function. “A key problem is that whole-genome resources and rRNA resources depend on different taxonomies and phylogenies,” the authors continued. “Microbiome science has been described as having a reproducibility crisis, but much of this problem stems from incompatible methods … For example, Web of Life (WoL) and the Genome Taxonomy Database (GTDB) provide whole-genome trees that cover only a small fraction of known bacteria and archaea, while SILVA and Greengenes are more comprehensive but are most often not linked to genome records.”

So while scientists may have attributed the discrepancies between the two techniques to differences in the way the samples are prepared in the lab, the new study demonstrates that incompatibilities between the two techniques arise from differences in computation.

A better reference database allows for the same conclusions to be drawn from both methods. This addresses an important issue in the reproducibility of microbiome research and allows the re-use of data from millions of samples in older studies. Knight and colleagues noted, “We reasoned that an iterative approach could yield a single massive reference tree that unifies these different data layers (for example, genome and 16S rRNA records), which we call Greengenes2.”

The Greengenes2 phylogeny can be used to identify microorganisms in either 16S or shotgun sequencing data. [UC San Diego Health Sciences]

In trying to resolve these incompatibilities, the researchers first expanded the Web of Life whole-genome database. They then used several new computational tools developed with co-author Siavash Mirarab, PhD, associate professor at UC San Diego Jacobs School of Engineering, to integrate existing high-quality full-length 16S sequences into the whole-genome phylogeny. With another machine learning tool developed by Mirarab’s group, they placed 16S fragments from over 300,000 microbiome samples. The result was an expansive reference database that both 16S and shotgun sequencing data could be mapped onto.

To confirm whether Greengenes2 would help standardize findings from either sequencing technique, the researchers acquired both 16S and shotgun sequencing data from the same human microbiome samples and analyzed them both against the backdrop of the Greengenes2 phylogeny. The results from both techniques showed highly correlated diversity assessments, taxonomic profiles and effect sizes, something researchers had not seen before. “By inserting sequences into a whole-genome phylogeny, we show that 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree,” the team stated.
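To give a rough sense of what a "correlated taxonomic profile" means in practice, the Python sketch below compares two abundance profiles from the same sample once both use the same taxon labels. It is only an illustration under stated assumptions: the counts are invented, the function names are hypothetical, and a plain Pearson correlation stands in for the more detailed analyses in the paper. It is not the authors' pipeline or the Greengenes2 software.

```python
from math import sqrt


def relative_abundance(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def profile_correlation(profile_a, profile_b):
    """Correlate two profiles over the union of their taxa.

    A taxon missing from one profile is treated as abundance 0. This only
    makes sense if both profiles use the same taxonomy for their labels.
    """
    taxa = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(t, 0.0) for t in taxa]
    b = [profile_b.get(t, 0.0) for t in taxa]
    return pearson(a, b)


# Invented genus-level counts for one sample, for illustration only.
amplicon_counts = {"Bacteroides": 500, "Faecalibacterium": 300, "Escherichia": 200}
shotgun_counts = {
    "Bacteroides": 5200,
    "Faecalibacterium": 2900,
    "Escherichia": 1700,
    "Akkermansia": 200,
}

amplicon_profile = relative_abundance(amplicon_counts)
shotgun_profile = relative_abundance(shotgun_counts)
r = profile_correlation(amplicon_profile, shotgun_profile)
print(f"Pearson r between 16S and shotgun profiles: {r:.3f}")
```

The comparison is meaningful only because both dictionaries share one set of labels. If each method named its taxa from a different reference, the profiles would share few keys and the correlation would reflect the mismatch in naming rather than the biology.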

“Through Greengenes2, a huge repository of 16S data can now be brought back into the fold and even combined with modern shotgun data in new meta-analyses,” said first author Daniel McDonald, PhD. “This is a major step forward in improving the reproducibility of microbiome studies and strengthening physicians’ ability to draw clinical conclusions from microbiome data.”

The authors further concluded, “Taken together, these results show that use of a consistent, integrated taxonomic resource dramatically improves the reproducibility of microbiome studies using different data types and allows variables of large versus small effect to be reliably recovered in different populations.”
