A stubborn, mountainous peak kept appearing in the lab’s mass spectrometry analyses. “We had absolutely no idea what it was,” recalled Stephen Barnes, Ph.D., director of the Targeted Metabolomics and Proteomics Laboratory at the University of Alabama at Birmingham. He suspected a contaminant. At the time in 2010, however, small molecule databases were quite thin for scientists running metabolomics experiments. So, his lab turned to America’s favorite search engine, Google. After some trial and error, the lab entered a highly accurate mass of the unknown molecule into the search field and got a hit, a journal article from Richard Caprioli’s lab at Vanderbilt University. The identity of the mystery peak was now known: dimethyl-octadecyl ammonium chloride, a molecule frequently found in cleaning products.
“I didn’t go to databases because at that point it wasn’t an obvious thing to do,” Barnes said.
Building a Treasure Trove
Today, that Achilles’ heel for metabolomics researchers no longer exists, as databases that aid in the identification of compounds for metabolomics research have become larger and more widely used. The databases come in many flavors: pay-walled, freely accessible, downloadable, cloud-based, limited user input, full user input, and so on. One database that has grown to become one of the largest is METLIN, a spectral database from Scripps Research that is freely available and lives in the cloud, making it accessible to virtually any researcher and requiring a relatively simple setup.
“The METLIN database is actually incredibly unique right now in terms of its size,” said Gary Siuzdak, Ph.D., senior director of the Scripps Center for Metabolomics and co-developer of METLIN.
First launched online in 2005, METLIN has grown rapidly over the course of the last year, from about 15,000 compounds to 150,000 compounds spanning more than 350 chemical classes, for which relevant fragmentation information is available. “This has been the result of a lot of things coming together over the last year to allow us to perform high-throughput analyses on a variety of different types of molecules that we’ve been able to get our hands on,” Siuzdak noted. By comparison, he said, the database from the government-funded National Institute of Standards and Technology, or NIST, is one order of magnitude smaller, in the tens of thousands.
Although METLIN may be one of the largest available metabolomics databases, it is far from complete. “We don’t have any comprehensive spectral databases out there,” said Lloyd Sumner, Ph.D., director of the Metabolomics Center at the University of Missouri. “We don’t even know what the size of the metabolome is.”
Data repositories and analytical tools that allow researchers to perform comparative analyses of their spectra are becoming more sophisticated and may interface with spectral databases. For example, METLIN is integrated with comparative analysis tool XCMS online. Released online in 2012, XCMS online was the first technology that allowed for nonlinear alignment of data from untargeted mass spectrometry experiments. “It really changed how we do untargeted metabolomics,” said Siuzdak. In untargeted metabolomic experiments, researchers can cast the widest net possible and don’t need to specify which metabolites they are looking for. “That’s why [XCMS online] has on the order of twenty-two hundred citations so far,” he asserted.
More recently, the Scripps suite of tools added another freely accessible, cloud-based platform in August: XCMS MRM and METLIN MRM. Unlike XCMS online and METLIN, this new platform is for targeted mass spectrometry. Targeted mass spectrometry is highly sensitive and is used when researchers know which metabolites they are looking for. “Almost every area of biological science uses this targeted technology to look at specific molecules that we’re interested in,” Siuzdak said. For example, when an Olympian is suspected of doping, targeted mass spectrometry is used to screen for the compound of interest. “What we wanted to do was create something that would allow you to use XCMS and also METLIN for these types of [targeted] analyses across all of these different applications.”
Despite the relative success of the Scripps suite of tools, there are downsides for users, such as the lack of user input and being based in the cloud. “The downside of the platform is that, in particular with the METLIN database, you can’t download it,” said Corey Broeckling, Ph.D., director of the Proteomics and Metabolomics Facility at Colorado State University. “Its value of being online is wonderful, but if you have the ability to do other things with the data, you can’t actually utilize their database, like some of the other database tools.”
Beyond data repositories and databases, computational tools have also emerged that take a user’s spectral data and predict the compound’s structure. Sumner’s group builds some of these computational tools, and while they can be helpful, he said that “typically there [are] still 50 to 90% of the compounds in our profiles that are unknown after searching those databases. Metabolite identification is the number one grand challenge that we face in metabolomics.”
Turning to the Crowds
As the metabolomics field strives to identify compounds accurately and quickly, the incompleteness and imposed limitations of the more traditional databases have led some researchers to take matters into their own hands. Groups have formed databases, along with analytical platforms, through crowdsourcing the metabolomics community. One platform in particular is the Global Natural Products Social Molecular Networking (GNPS), which is made up of a database, data repository, and analysis tools.
“The key reason for creating that infrastructure is so that we can capture knowledge and share knowledge with the community,” said co-developer of GNPS Pieter Dorrestein, Ph.D., director of the Collaborative Mass Spectrometry Innovation Center at the University of California San Diego. “I got really frustrated with databases where you just deposit data, and then you can’t do anything with it. We wanted to create a system where people actually upload data as they start to do the analysis, so you don’t have to redo it at the end when you go publish.”
Unlike platforms like XCMS and METLIN, crowdsourced tools allow any user to contribute and provide input. “Particularly in the natural products world, [allowing anyone to contribute] opens up a vast trove of authentic and purified compounds that are frequently not commercially available at all,” said Broeckling. “As such, crowdsourced databases are likely to have content that simply isn’t anywhere else.”
The GNPS platform allows users to freely annotate any piece of uploaded information and has a system much like Yelp where annotations are assigned a star ranking based on accuracy of identification. “What we’re seeing is actually continuous improvement of the knowledge that exists within the infrastructure,” said Dorrestein. “We also had an instance where people were going back and forth, correcting data points within the infrastructure, and it turns out that they were just using different names for the molecule. That only became visible once the dialogue started to happen.”
Although not a proponent of crowdsourcing, Siuzdak explained that cost is a big reason why some labs turn to crowdsourcing. “To buy these molecules can quickly lead to astronomical costs. It is completely impractical for most to pursue the analysis of standards. However, the human cost of using data from questionable sources is much higher.” Because anyone can upload their data from virtually any source, critics of crowdsourced platforms are concerned that quality of data is impaired and confidence of accurate metabolite identification is weakened. “The crowdsourcing is only good if there is a very strong filter for quality,” explained Barnes. “Otherwise you’re going to get drowned in noise of things that aren’t interesting.” If the data quality isn’t up to par, he warned, “people will chase shadows.”
Although crowdsourced tools may lack desired standardization to confidently identify one’s molecule, they have a place in the field. “It’s going to take a village to solve some of these grand challenges that we face, and to do that, we’re going to have to use whatever mechanisms that we have available,” Broeckling noted. “So I’m not opposed to crowdsourcing. If people are sharing their data, then you don’t need to crowdsource, and truthfully I think maybe 10% of people share their data.”
Setting Standards
Lack of standardization among incoming data extends well beyond just crowdsourced tools. It plagues all databases, data repositories, and comparative analysis tools in metabolomics because every lab has its own home-brew recipe for metabolomic experiments. Each lab has a different extraction protocol, profiling method, mass spectrometer, and the list goes on. This lack of standardization makes it difficult, if not impossible at times, to identify and confidently compare compounds. “The methods are never going to be as standardized as some of the other omics,” said Broeckling. “Standardization is much more difficult with metabolomics than, say, genomics, and the reason is metabolites are way more diverse than genes are, chemically speaking.”
Although a set of rigorous standards may seem like a logical next step, the field of metabolomics may not yet be ready. “It’s not a good idea yet to mandate that everybody do it the same at this point, because the technology is constantly evolving,” Sumner pointed out. “But if you want to be able to compare things, then it is critical that people do try to work together to adopt some commonality. Right now, we’re just trying to get people to report the fundamental information of how they generated the data so it can be reused appropriately, but many people do not even do that.”
Nearly a decade has passed since Barnes’ lab observed the dreaded “mystery peak.” Like the rest of the metabolomics community, he now has numerous database options on hand, with his preference being METLIN. However, “if I don’t go to METLIN, I just put the mass into Google, and sometimes it’s in Google,” he said. “I will use whatever resource I can find because none of the resources are complete. They can be enormous but not complete—and I may just have a compound that nobody has studied before.”
This article was originally published in the November/December 2018 issue of Clinical OMICs. For more content like this and details on how to get a free subscription, go to www.clinicalomics.com.