May 1, 2005 (Vol. 25, No. 9)
Growth in Public Protein Databases Improves Access to Raw Mass Spectrometry Data
Access to proteomic data lags “very much behind the DNA microarray field,” says Edward Marcotte, Ph.D., associate professor in the department of chemistry and biochemistry at the University of Texas, Austin (UTA). This paucity of large-scale, public-access repositories for protein data could slow basic proteomic and related functional genomic research and its applications to drug discovery.
Dr. Marcotte attributes the lag in access to mass spectrometry (MS) data to the newness of the technology (high throughput MS has not been around for more than three or four years) and to cultural differences that determine how different fields establish policies regarding data disclosure.
The problem is two-fold, according to Dr. Marcotte. The public databases in which to deposit MS data are few, and the ones that do exist are small and not growing as quickly as anticipated. Also, the authors of most published reports of MS data are not depositing their data in public repositories, limiting access by the scientific community.
In an April article, Dr. Marcotte and colleagues wrote, “Unfortunately, the field of proteomics has been slow to embrace one of the most important lessons from DNA microarrays: open availability of raw data.” They cite the ready availability of DNA microarray data together with public genome sequence data as a driving force behind computational research in functional genomics.
In response to this unmet need, the Institute for Cellular and Molecular Biology at UTA established the Open Proteomics Database (OPD; bioinformatics.icmb.utexas.edu/OPD), a public database for storing and disseminating protein expression data derived from MS experiments.
The database currently contains about 1.2 million spectra representing experiments from four organisms: Escherichia coli, Homo sapiens, Saccharomyces cerevisiae, and Mycobacterium smegmatis. Most of the data derive from experiments done at the Institute. Few outside sources have contributed to the database since its inception.
Demand for Data
Dr. Marcotte does, however, see growing interest in accessing the OPD. For example, the National Institute of Standards and Technology (NIST) recently used the data from OPD to decide whether peptide data are unique and reproducible enough to warrant inclusion of peptide-standard spectra in NIST repositories. Additionally, computational biologists have been using OPD in the development of benchmark algorithms for interpreting MS data.
Dr. Marcotte reports about equal usage of the database by scientists in academia and industry. Biotech and pharmaceutical companies are primarily using external datasets to refine their own internal tools for data interpretation and applying proteomic data to drug discovery.
In a sort of “Catch-22” conundrum, the relatively small amount of MS data available in the public domain limits the ability to develop good analytical software tools, which consequently compromises the quality and utility of the data that is available.
Although a typical protein expression experiment may generate thousands of peptide fragmentation spectra, “only 20% of those are ever interpreted,” says Dr. Marcotte. The other 80% are thrown out because currently available analytical algorithms are not able to interpret them, largely due to excess noise or post-translational modifications.
“We have interpreted less than 20% of the data in OPD datasets,” says Dr. Marcotte. That will be enough, he hopes, “to start bootstrapping our way up to better algorithms.” Then researchers can retroactively analyze the existing data to extract the rest of the information.
A Lack of Standards
One well-recognized obstacle to large-scale development and acceptance of protein expression databases is the lack of standardization in the industry. At present there are many competing approaches for collecting, reporting, and analyzing data, although an effort toward developing standards for distributing and publishing this data is under way.
Currently, the trade-off is between a large archive of data culled from many sources, of variable and poorly defined quality, and data compilations limited according to individualized standards.
Universal standards would help eliminate some of these issues and facilitate the expansion of existing databases.
Another challenge for database developers, according to Dr. Marcotte, is the need to store descriptions of the experimental techniques used to generate data in a way that is computer searchable.
Alexey Nesvizhskii, a research scientist at the Institute for Systems Biology (ISB) in Seattle, echoes the concern about a lack of standards in the field, noting that if care is not taken in filtering MS data, the result can be a high rate of false positives.
ISB’s PeptideAtlas database (www.peptideatlas.org) is a public access database that accommodates proteins from different organisms as well as various reference protein sequence sets as starting material. To ensure that consistent statistical criteria are applied to the data analysis, all raw data submitted to ISB’s PeptideAtlas database from external sources is processed through the internal pipeline.
A multicenter group of researchers recently co-authored a paper describing the PeptideAtlas (Genome Biology 2004;6:R9). The researchers are using the database, designed as an expandable repository for MS-derived proteomic information, to annotate the established human genome sequence.
At the end of 2004 the PeptideAtlas contained nearly 225,000 mass spectra from 52 human samples, representing almost 27,000 peptides. Those figures will be updated shortly.
The database was largely developed at ISB, where independent research groups were using high throughput MS to generate data for their individual projects. Recognizing that MS data could serve as a source of experimental evidence to confirm genomic information, ISB began compiling a database of its researchers’ published datasets.
To grow the database more rapidly the group asks researchers to contribute their large datasets. “There is an understanding among some that it is more valuable to make the data available for more than just its specific use,” says Nesvizhskii. In the future he would like to see the development of better data-mining tools for use in correlating global protein expression by tracking peptides across experiments and identifying proteins expressed in specific cells and tissues.
Nesvizhskii likens this strategy to mining EST databases to track gene expression. The tools needed to integrate PeptideAtlas with other existing databases are also evolving, as are strategies to overcome the challenges presented by differing formats and the need to exchange data in order to exploit synergies between diverse datasets.
Bioinformatic Tools
The early goal of applying bioinformatic tools to turn raw biological data into meaningful information that could drive decision-making in drug discovery turned out to be more complex than was perhaps anticipated. “There are still many ‘white spots’ in our understanding,” says Holger Karas, Ph.D., senior vp of business development and co-founder of Biobase (www.biobase.de).
Dr. Karas sees the immediate goal of bioinformatics tools as integrating signaling and metabolic networks, and in the long term, defining cell-to-cell networks “will add another order of magnitude to the complexity of databased systems.”
In the future, Dr. Karas envisions a growing effort to merge genomic and proteomic data with a need for tools able to interconnect gene regulation and protein expression databases.
Biobase’s Proteome BioKnowledge Library (www.proteome.com) offers access by subscription to six database volumes of information (obtained through the acquisition of Proteome, a former subsidiary of Incyte), with each volume focusing on a different organism.
Last fall the library added the BioKnowledge Interactions Module, which includes protein-protein interaction data for mammalian species and a visualization tool that allows users to view how proteins interact.
The library is available as an online version or an installed version. Installation enables integration and mining with a user’s internal tool set. Customers can subscribe to individual volumes or to the six linked volumes.
Muthu Meyyappan, Ph.D., vp of sales and marketing at Biobase, describes a broad range of customers who use the library to obtain species-specific annotations for proteins and diseases of interest, for high throughput data analysis of microarray and MS data, and as a “protein index, since we provide annotations only for validated public domain sequences.”
Deciphering Biological Pathways
The main benefit of protein databases is to allow the average biologist, whether in academia or industry, to work with genomic and proteomic data and to understand gene and protein function without the need for a bioinformatics person by their side, in the view of Michael Campbell, executive vp of business development at Applied Biosystems (www.appliedbiosystems.com).
Newly released enhancements to version 5.0 of the company’s Panther protein database include interactive resources for associating protein families with biological pathways and tools for analyzing gene expression data in relation to molecular functions, biological processes, and pathways.
Panther is a public access database available without restriction at panther.appliedbiosystems.com. Earlier this year Panther became part of the European Bioinformatics Institute’s InterPro database, which enables text- and sequence-based searching.
The current version of Panther includes a gene-classification system, 60 pathway diagrams, and an index ontology that comprises about 250 molecular function search terms and a similar number of biological process terms. Website tools used for microarray and SNP analysis have been adapted for use in protein expression analysis.
At present, Panther includes 36,000 hidden Markov models, 6,000 protein families, and 31,000 subfamilies. “Our goal is to increase the number of pathways to over 500,” with an emphasis on signaling pathways as well as metabolic pathways, says Campbell.
Early on, Panther was part of the Celera Discovery database. Celera researchers wanted to be able to query a gene based on its function, as well as its name and location, and to do gene expression analysis. The site became free to the public in the summer of 2004. Customers can incorporate their own data into the system and share search information with their coworkers.
Future Trends
Describing the evolution of the life sciences industry over the past 15 to 20 years, Nasri Nahas, CEO of Geneva Bioinformatics (www.genebio.com), states that “we are now more in a knowledge-limited rather than information-limited era,” marked by the need to create new ways of organizing information and generating knowledge.
“In my opinion, the tools available today do not yet properly address this challenge; the majority of the available tools are more targeted at information stacking rather than information integration, which calls for putting together the right information in a contextualized manner in order to allow research scientists to corroborate those separate and disparate pieces of information, and thus to generate pertinent biological knowledge,” says Nahas.
In March, GeneBio announced a joint venture with the Current Science Group (www.current-science-group.com), resulting in the formation of Current BioData, a company that will develop and distribute the ProXenter platform, readying it for launch into the marketplace.
ProXenter is GeneBio’s web-based Protein Exploration Center, a protein relational database that contains an annotated dataset for each of the specific groups of proteins of interest to the drug discovery industry. The platform also provides bioinformatics tools for browsing, visualizing, and analyzing data.
GeneBio was established to commercialize the products developed by the Swiss Institute of Bioinformatics, including the Swiss-Prot database, which is available via subscription to corporate customers and free of charge to academia.
Available products include Prosite, a protein domains database, Swiss-2DPAGE, Phenyx MS protein-identification software, and Melanie 2-D gel-analysis software. ProXenter will likely be made available on a yearly subscription model.
The philosophy underlying the development of ProXenter, according to Nahas, points to two desired characteristics of a protein database: “A product that any bench scientist can easily use without being an expert in bioinformatics and high-quality editorial content” that is defined, structured, and provided by experts in the field who also develop the tools needed to access and work with the data.
Given that the bottleneck has shifted from the information level to the interpretation or knowledge level, Nahas identifies goals in the evolution of proteomics databases:
Encompass heterogeneous but congruent types of data
Present this data to life scientists in an accessible and searchable manner
Provide the necessary tools to cross-reference information and sources to facilitate data interpretation
Be open to editorial content that increases the breadth of information while keeping it uniform
Nahas envisions an “intensive software publishing platform” in which highly specialized, editorially intensive databases collate publicly available data and organize it in a structure that is useful and understandable, adding analysis and commentary.
“Today we have thousands of journals sold via subscriptions; in 10 to 20 years there will be thousands of editorially intensive databases integrated in a single platform and probably sold on individual subscriptions,” says Nahas.
“In the future, protein database entries could replace traditional review articles,” with scientists submitting their data or knowledge about a particular protein family or molecular process and becoming an “author” of the database.
“The original raw data and basic annotation can be, or even must be, freely available to all, because science cannot progress without wide access to and sharing of basic research findings.” Editorial annotation and proteomics databases built from the raw data “should be payable,” in Nahas’ view.