Scientists now have access to a new algorithm that can search through the mass spectra of potentially billions of microbial-derived compounds in the hunt for promising new drug candidates that haven’t previously been investigated. The algorithm, called Dereplicator+, can identify already known compounds within repositories and eliminate them from further analyses, so that only novel compounds are evaluated. “It's unbelievable how many times people have rediscovered penicillin,” says Hosein Mohimani, Ph.D., an assistant professor in Carnegie Mellon University’s (CMU) Computational Biology Department. And whereas previous approaches to evaluating mass spectra have been limited to searching for peptides, Dereplicator+ can evaluate the features of mass spectral data to identify peptides and other classes of natural products, including polyketides, terpenes, benzenoids, alkaloids, and flavonoids.

Dr. Mohimani, together with a research team headed by CMU scientists and colleagues at St. Petersburg State University in Russia, reports on initial studies using Dereplicator+ in a paper published in Nature Communications titled “Dereplication of microbial metabolites through database search of mass spectra.”

The Global Natural Products Social (GNPS) molecular networking project, a recent mass spectrometry data repository for natural products to which thousands of laboratories have contributed a billion mass spectra, represents a major opportunity to hunt for new drug compounds, the authors explain. But while analyzing the mass spectra of compounds is a relatively inexpensive way of identifying potential new drug candidates, existing techniques have been limited largely to searching for peptides, which have simple structures. Being constrained in this way to look only for peptides is effectively just “looking at the tip of the iceberg,” Dr. Mohimani points out. “… while spectra from GNPS represent a gold mine for future natural products discovery, their interpretation remains challenging,” the authors note. “The vast majority of GNPS spectra have evaded all attempts to interpret them, indicating that there exists a large dark matter of metabolomics.”

Another main challenge when searching through databases of natural products for new compounds is the high rate of rediscovery of known candidates, the authors continue. What is needed are approaches that can identify and eliminate these known compounds from new searches. “The process of using the information about the chemical structure of a known natural product to identify this compound in an experimental sample (without having to repeat the entire isolation and structure-determination process) is called dereplication,” they comment. And while early dereplication approaches were based on deriving the exact chemical formula and searching for compounds with that formula in chemical structure databases, this type of method often failed, at least in part because existing chemical databases contain many compounds with identical formulas.
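
The weakness of formula-only lookup can be illustrated with a minimal Python sketch. It is not taken from the paper: the tiny compound table and the helper names (index_by_formula, candidates) are assumptions made for illustration, and the formulas are those of well-known isomer pairs.

```python
from collections import defaultdict

# Illustrative table only (an assumption, not data from the paper).
# Leucine/isoleucine and glucose/fructose are isomers: different
# structures that share one molecular formula.
KNOWN_COMPOUNDS = {
    "leucine": "C6H13NO2",
    "isoleucine": "C6H13NO2",
    "glucose": "C6H12O6",
    "fructose": "C6H12O6",
    "caffeine": "C8H10N4O2",
}


def index_by_formula(compounds):
    """Group compound names by molecular formula."""
    index = defaultdict(list)
    for name, formula in compounds.items():
        index[formula].append(name)
    return index


def candidates(index, formula):
    """Return every known compound that matches the formula, sorted by name."""
    return sorted(index.get(formula, []))


FORMULA_INDEX = index_by_formula(KNOWN_COMPOUNDS)

# A formula derived from a spectrum can map to several structures,
# so the lookup alone cannot say which compound is in the sample.
print(candidates(FORMULA_INDEX, "C6H13NO2"))
print(candidates(FORMULA_INDEX, "C8H10N4O2"))
```

In this toy table the first lookup returns two candidates and the second returns one. In a real chemical structure database the ambiguous case is common, which is why information beyond the formula, such as the fragmentation pattern, is needed.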

The new Dereplicator+ algorithm developed by Dr. Mohimani and colleagues builds on their original Dereplicator algorithm for analyzing mass spec data, which could look for and dereplicate peptides, or peptidic natural products (PNPs), but couldn’t work with other classes of molecules that had more challenging structures. In contrast, Dereplicator+ can pick out different types of molecules that are more complex than simple peptides. It does so by generating and analyzing fragmentation graphs to predict how the mass spectrometer would break the molecules apart piece by piece.
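
The general idea of predicting fragments from a structure and comparing them with a spectrum can be sketched in a few lines of Python. This is a deliberately simplified toy, not the Dereplicator+ method: it assumes a molecule described as building blocks joined by bonds, cuts one bond at a time, ignores charge and rearrangements, and uses hypothetical block masses and peak values. The names fragment_masses and score are illustrative.

```python
from collections import defaultdict


def fragment_masses(node_mass, bonds):
    """Masses of the two pieces produced by cutting each non-ring bond once."""
    adjacency = defaultdict(set)
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)

    total = sum(node_mass.values())
    masses = set()
    for a, b in bonds:
        # Walk outward from a without crossing the bond being cut.
        seen = {a}
        stack = [a]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if {u, v} == {a, b} or v in seen:
                    continue
                seen.add(v)
                stack.append(v)
        if b in seen:
            # The bond sits in a ring, so one cut does not split the molecule.
            continue
        side = sum(node_mass[n] for n in seen)
        masses.add(round(side, 4))
        masses.add(round(total - side, 4))
    return masses


def score(peaks, masses, tol=0.01):
    """Count spectrum peaks that lie within tol of any predicted fragment mass."""
    return sum(any(abs(p - m) <= tol for m in masses) for p in peaks)


# Hypothetical linear molecule A-B-C-D with made-up block masses.
BLOCKS = {"A": 71.04, "B": 57.02, "C": 113.08, "D": 99.07}
BONDS = [("A", "B"), ("B", "C"), ("C", "D")]
PREDICTED = fragment_masses(BLOCKS, BONDS)

# Hypothetical observed peaks; the last one matches no predicted fragment.
OBSERVED = [71.04, 128.06, 241.14, 300.0]
print(score(OBSERVED, PREDICTED))
```

A candidate structure whose predicted fragments explain more of the observed peaks scores higher. In this toy, the score is what separates two database entries that share the same formula.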

The team used 5,000 known compounds and their mass spectra to train the model to predict how other compounds would break apart in the mass spectrometer. In their initial tests, it took just a week and 100 computers for the Dereplicator+ algorithm to evaluate 1 billion spectra in the GNPS molecular network and identify more than 5,000 promising, previously uninvestigated compounds. As well as identifying and eliminating known compounds, the algorithm can also identify variants of known compounds that might otherwise have gone undetected within a sample, the researchers point out.
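
For a rough sense of scale, the reported figures can be turned into a per-machine throughput. The short Python calculation below assumes that "a week" means exactly seven days of continuous running and that the work was spread evenly across the 100 computers. Neither detail is stated in the article, so the result is only a back-of-the-envelope estimate.

```python
SPECTRA = 1_000_000_000  # spectra evaluated, as reported
COMPUTERS = 100          # machines used, as reported
DAYS = 7                 # assumption: "a week" of continuous running


def spectra_per_computer_second(spectra, computers, days):
    """Average number of spectra each computer handles per second."""
    seconds = days * 24 * 60 * 60
    return spectra / (computers * seconds)


rate = spectra_per_computer_second(SPECTRA, COMPUTERS, DAYS)
print(f"{rate:.1f} spectra per computer per second")  # prints 16.5
```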

When compared with their original Dereplicator algorithm, Dereplicator+ was found to “identify 77% more compounds (consisting of nonpeptide metabolites and mixed peptide-PKs) that were missed by Dereplicator,” the authors state. “We show that Dereplicator+ can search all spectra in the recently launched GNPS molecular network and identify an order of magnitude more natural products than previous dereplication efforts … In contrast to existing database search tools in metabolomics, Dereplicator+ is the first database search tool for natural products that can search the entire GNPS molecular networking infrastructure against large databases of chemical structures, and identify variants of known metabolites using molecular networking.”

The researchers are making the algorithm available for use by any investigator to study different repositories.
