Mining Published Works
“Could drug research be supported by mining the texts of knowledge repositories, such as PubMed? We believe that semantic technologies bring new e-biomarker discovery opportunities, in a manner which is 50% more cost-efficient than the traditional molecular biomarker discovery process,” asserts Paul Walti, Ph.D., CEO of InfoCodex, a software provider based in Buchs, Switzerland.
InfoCodex software distinguishes itself from more traditional software, which is based on natural language processing, by recognizing and “comprehending” the actual content of a large number of unstructured documents. By analyzing seemingly unrelated documents, publications, and reports, InfoCodex can categorize unstructured information and correlate small, seemingly unrelated facts.
The software is built around a very large thesaurus organized into a complex taxonomy of about 10,000 concepts. It applies information theory to transform documents into mathematical models, conduct unsupervised semantic clustering, and match multilingual documents according to meaning.
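The article does not describe how InfoCodex builds its document models, so the following is only a minimal sketch of the general idea of turning documents into vectors and grouping them without labels. It swaps in plain TF-IDF weighting, cosine similarity, and a greedy threshold grouping in place of the taxonomy-based models and semantic clustering mentioned above. The function names, the sample sentences, and the 0.2 threshold are illustrative assumptions, not part of the InfoCodex product.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse TF-IDF vector (dict: term -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_by_similarity(vectors, threshold=0.2):
    """Greedy grouping: a document joins the first cluster whose seed
    document is similar enough; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical sample texts, invented for illustration only.
docs = [
    "insulin resistance raises glucose levels",
    "glucose levels and insulin signaling",
    "tumor growth depends on angiogenesis",
    "angiogenesis inhibitors slow tumor growth",
]
clusters = group_by_similarity(tfidf_vectors(docs))
print(clusters)
```

A word-level model like this one only matches identical tokens. Matching by meaning, or across languages, would additionally need a concept mapping such as the thesaurus the article describes.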
“Our software is able to determine the meaning of unknown words and correlate them with words in the InfoCodex Linguistic Database, providing a cross-language content recognition,” says Dr. Walti.
As opposed to natural language processing, which recognizes relationships between facts only if they are already explicitly stated in the document, the InfoCodex semantic engine can find hidden correlations distributed over groups of documents. “Our engine is almost like a supreme human super-reader, except that no human team, however specialized, is able to create and maintain the overview of all publications, simply because of their sheer number and the rate of accumulation,” continues Dr. Walti.
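One simple way to picture a correlation that is "distributed over groups of documents" is bridge-concept linking, in the spirit of literature-based discovery: two concepts never appear in the same document, yet both co-occur with a shared intermediate concept elsewhere. The sketch below illustrates that idea only. It is not the InfoCodex engine, and the concept labels (`gene_x`, `disease_y`, and so on) are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations


def hidden_links(doc_concepts):
    """Find concept pairs that never co-occur in one document but are
    connected through at least one shared intermediate concept."""
    cooccur = defaultdict(set)
    for concepts in doc_concepts:
        for a, b in combinations(sorted(set(concepts)), 2):
            cooccur[a].add(b)
            cooccur[b].add(a)

    links = {}
    for a, c in combinations(sorted(cooccur), 2):
        if c in cooccur[a]:
            continue  # already stated together in some document
        bridges = cooccur[a] & cooccur[c]
        if bridges:
            links[(a, c)] = sorted(bridges)
    return links


# Hypothetical concept sets, one per document, invented for illustration.
documents = [
    {"gene_x", "inflammation"},
    {"inflammation", "disease_y"},
    {"gene_x", "pathway_z"},
    {"pathway_z", "disease_y"},
]
links = hidden_links(documents)
for pair, bridges in links.items():
    print(pair, "linked via", bridges)
```

In this toy input no single document mentions `gene_x` and `disease_y` together, but both co-occur with `inflammation` and `pathway_z`, so the pair surfaces as a candidate link. A real system would also need to score and filter such candidates.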
In a pilot experiment with Merck, InfoCodex took on the task of discovering new e-biomarkers from over 120,000 PubMed abstracts, clinical trial summaries, and internal Merck documents. Without any involvement of subject matter experts, InfoCodex was able to identify over 10,000 potential biomarker/phenotype candidates, which were further narrowed down to just 1,000 specific genes or proteins.
Next, Merck and Thomson Reuters scientists narrowed the list to about 20 novel, high-quality candidates prioritized by confidence scores. Some of these have since been validated in biological experiments.
Notably, none of the 20 abstracts contained the term “biomarker.” Despite the constraints of this first pilot experiment, the ability of automated data mining to identify new e-biomarker candidates has the potential to impact pharmaceutical research.