Researchers headed by teams at the University of Oxford and the MRC Laboratory of Molecular Biology have created a new, publicly available database that they hope will shrink, not grow, over time. The database, dubbed the “unknome,” is effectively a compendium of the thousands of understudied proteins encoded by genes in the human genome, which are known to exist, but for which the function is mostly not known.

Led by Matthew Freeman, PhD, at the Dunn School of Pathology, University of Oxford, and Sean Munro, PhD, at the MRC Laboratory of Molecular Biology, the researchers carried out their own analysis of a subset of proteins in the database, which revealed that a majority contribute to important cellular functions, including development and resilience to stress. Freeman, Munro, and colleagues described their work in PLOS Biology, in a paper titled “Functional unknomics: Systematic screening of conserved genes of unknown function.” In their summary the team said, “We have developed an approach to tackle directly the huge but under-discussed issue of the large number of well-conserved genes that have no reliably known function, despite the likelihood that they participate in major and even possibly completely new areas of biological function … Our work illustrates the importance of poorly understood genes, provides a resource to accelerate future research, and highlights a need to support database curation to ensure that misannotation does not erode our awareness of our own ignorance.”

Since the release of the first draft of the human genome in 2000 it has become clear that our DNA encodes thousands of likely protein sequences, but the identities and functions of many are still unknown. “The human genome encodes approximately 20,000 proteins, many still uncharacterised,” the authors wrote. “The mystery and the potential biological significance of these unknown genes is enhanced by many of them being well conserved and often being unrelated to known proteins and thus lacking clues to their function.” There are multiple reasons for this lack of focus on the unknown genes and proteins, including the tendency to focus scarce research dollars on already-known targets, and the lack of tools, including antibodies, to interrogate cells about the function of these less understood proteins. There is also a tendency to focus on proteins that are abundant and widely expressed and so are likely to be present in cell lines and model organisms, the investigators acknowledged. “Whatever the reasons, this inadvertent neglect of the unknown is clear and does not appear to be diminishing.”

But the risks of ignoring these proteins are significant, the authors argue, since it is likely that some, perhaps many, play important roles in critical cell processes, and may provide both biological insight and targets for therapeutic intervention. Moreover, the investigators pointed out, evidence from studies of gene expression and genetic variation indicates that many of the poorly characterised proteins are linked to disease, “including those that are eminently druggable.”

To promote more rapid exploration of these proteins, the authors created the Unknome database, which assigns to every protein a “knownness” score, reflecting the information in the scientific literature about function, conservation across species, subcellular compartmentalization, and other elements. Based on this system, there are many thousands of proteins whose knownness is near-zero. Proteins from model organisms are included, along with those from the human genome. The database is open to all and is customizable, allowing users to apply their own weights to the different elements, thereby generating their own set of knownness scores to prioritize their own research.
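The idea of user-adjustable weights can be illustrated with a minimal sketch. It is not the database's actual scoring formula: the evidence categories, the weights, the protein names, and the simple weighted sum are all assumptions made for illustration.

```python
from typing import Dict, List

def knownness(evidence_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of evidence counts; categories with no weight contribute 0."""
    return sum(weights.get(kind, 0.0) * count
               for kind, count in evidence_counts.items())

def rank_least_known(proteins: Dict[str, Dict[str, int]],
                     weights: Dict[str, float]) -> List[str]:
    """Return protein names ordered from least known to best known."""
    return sorted(proteins, key=lambda name: knownness(proteins[name], weights))

# Hypothetical evidence categories and weights (not taken from the Unknome).
default_weights = {"function": 1.0, "localization": 0.5, "conservation": 0.25}

# Hypothetical proteins with made-up counts of supporting annotations.
proteins = {
    "PROT_A": {"function": 4, "localization": 2, "conservation": 4},
    "PROT_B": {"function": 0, "localization": 1, "conservation": 2},
}

print(rank_least_known(proteins, default_weights))
```

A user who cared only about localization evidence could pass a different dictionary, such as {"localization": 1.0}, and obtain a different set of scores from the same underlying annotations.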

Munro commented, “The role of thousands of human proteins remains unclear and yet research tends to focus on those that are already well understood. To help address this we created an Unknome database that ranks proteins based on how little is known about them, and then performed functional screens on a selection of these mystery proteins to demonstrate how ignorance can drive biological discovery.”

To test the utility of the database, the authors chose 260 genes in humans for which there were comparable genes in Drosophila flies, and which had knownness scores of one or less in both species, indicating that almost nothing was known about them. The scientists then used RNAi technology to knock down the target genes in the flies. “To assess the value of the Unknome as a foundation for experimental work, we selected a set of 260 Drosophila proteins of unknown function that are conserved in humans and used RNA interference (RNAi) to test their contribution to a wide range of biological processes,” they wrote. These experiments showed that for many of the proteins, a complete knockout of the gene was incompatible with life in the fly. But, as they noted, “Of course, there is more to life than being alive,” and partial or tissue-specific knockdowns led to the discovery that a large fraction contributed to essential functions. “Knockdown of some genes resulted in loss of viability, and functional screening of the rest revealed hits for fertility, development, locomotion, protein quality control, and resilience to stress,” the scientists wrote.
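The selection criterion described above, a knownness score of one or less in both species, amounts to a simple filter over pairs of comparable genes. The sketch below assumes a hypothetical record layout and uses made-up gene names and scores; it does not reflect how the authors actually assembled their set of 260.

```python
from typing import List, NamedTuple

class OrthologPair(NamedTuple):
    """Hypothetical record pairing a human gene with its fly counterpart."""
    human_gene: str
    fly_gene: str
    human_knownness: float
    fly_knownness: float

def select_unknown(pairs: List[OrthologPair],
                   threshold: float = 1.0) -> List[OrthologPair]:
    """Keep pairs whose knownness is at or below the threshold in BOTH species."""
    return [p for p in pairs
            if p.human_knownness <= threshold and p.fly_knownness <= threshold]

# Made-up example data for illustration only.
pairs = [
    OrthologPair("HUMAN_1", "FLY_1", 0.0, 1.0),   # unknown in both: kept
    OrthologPair("HUMAN_2", "FLY_2", 0.5, 3.0),   # studied in flies: dropped
    OrthologPair("HUMAN_3", "FLY_3", 7.0, 0.0),   # studied in humans: dropped
]

candidates = select_unknown(pairs)
print([p.fly_gene for p in candidates])
```

Requiring a low score in both species avoids selecting genes that look unknown in one organism only because the relevant work was done in the other.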

The results suggest that, despite decades of detailed study, there are thousands of fly genes that remain to be understood at even the most basic level, and the same is clearly true for the human genome. “These uncharacterized genes have not deserved their neglect,” Munro said. “Our database provides a powerful, versatile and efficient platform to identify and select important genes of unknown function for analysis, thereby accelerating the closure of the gap in biological knowledge that the Unknome represents.”

The authors commented in conclusion, “In practical terms, the Unknome database provides a resource for researchers who wish to exploit the opportunities associated with unstudied areas of biology … We hope that our work will inspire others to define and characterize further the Unknome and also to seek to ensure that gene annotation has the support and technology to preserve and recognize true ignorance.”
