January 1, 2007 (Vol. 27, No. 1)
Darryl Leon Ph.D.
Exploiting Text-mining Technologies to Accelerate Disease-based Research
The pharmaceutical and biotechnology industries have faced increasing research and development costs while dealing with a limited number of new molecular targets. In addition, a key bottleneck in the drug discovery process is the validation that a target can be correlated to a given disease.
A simple approach for exploring whether a target or a compound has been related to a specific disease or gene is to review the published literature. To help scientists who are engaged in disease-based research, experts with text-mining experience have created approaches to improve the process of extracting information from existing scientific knowledge, i.e., biomedical literature.
With thousands of journals and over 15 million abstracts in Medline (and over 2,000 more added daily), it is physically impossible for any scientist to keep current with the research and to explore all of the areas that may provide further insights into their scientific discovery. Most scientists rely on simple information-retrieval techniques to obtain scientific articles pertaining to a topic of interest.
Typically, this type of search is performed using software that scans for terms identical to the query term. Although this approach is fast and returns many articles, the number of abstracts returned can be so large that it is difficult for a human to review them and pick out the useful and insightful ones.
Sophisticated software programs have been developed to try to understand how the words in a scientific abstract are used and how the words correspond to the query term provided by the user. Unfortunately, in a biomedical or biological abstract, it is very common for different words to represent the same biological entity (e.g., LASS1 and LAG1), for the same term to have different biological meanings (e.g., PAP is an alias for PAP, MRPS30, and PAPOLA), or for a single term to mean different things depending on the discipline (e.g., SCT represents either secretin or stem cell transplant).
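The naming problem described above can be made concrete with a small sketch. The lexicon below is a toy table built only from the examples in this article, and the function names are hypothetical; it is not how any particular product stores its dictionaries. It simply inverts an entity-to-names table so that each name lists every entity it might denote.

```python
from collections import defaultdict

# Toy lexicon (an assumption for illustration): entity -> names used for it.
LEXICON = {
    "LASS1": {"LASS1", "LAG1"},
    "PAP": {"PAP"},
    "MRPS30": {"MRPS30", "PAP"},
    "PAPOLA": {"PAPOLA", "PAP"},
    "secretin": {"secretin", "SCT"},
    "stem cell transplant": {"stem cell transplant", "SCT"},
}


def build_alias_index(lexicon):
    """Invert the lexicon so each name maps to every entity it may denote."""
    index = defaultdict(set)
    for entity, names in lexicon.items():
        for name in names:
            index[name.lower()].add(entity)
    return index


def resolve(term, index):
    """Return the sorted candidate entities for a term (empty list if unknown)."""
    return sorted(index.get(term.lower(), set()))


ALIAS_INDEX = build_alias_index(LEXICON)

for query in ("LAG1", "PAP", "SCT"):
    print(query, "->", resolve(query, ALIAS_INDEX))
```

A name that resolves to one entity is a synonym that can be normalized directly; a name that resolves to several is ambiguous and needs context from the surrounding text to settle.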
In the drug discovery arena, many researchers doing disease-related research do not want to become experts in text-mining techniques, but simply want to find critical, published information about a disease, drug, or gene.
Therefore, what is needed is a well-designed software program that can resolve typical language ambiguities, use that understanding to derive key scientific knowledge from biomedical abstracts, and make the text-mining results easy to navigate and visualize for scientific researchers rather than computer scientists.
AKS2 (AlmaKnowledgeServer 2) is a text-mining system that examines all abstracts added to Medline and then applies statistical and rules-based analyses so one can uncover relationships that exist in the scientific literature between diseases, symptoms, drugs, genes, proteins, and chemical compounds. The AKS2 pipeline is composed of four steps: capturing and integrating data, structuring and storing the acquired information, providing access to embedded knowledge in the data, and finally enabling the user to exploit the incorporated knowledge (Figure 1).
The first two steps in the AKS2 pipeline are transparent to the user. Because the steps are performed automatically and offline, the extracted knowledge can be accessed and exploited quickly by a scientist. The first step is focused on capturing and integrating diversely formatted data from many sources, including Medline and reference databases such as Entrez Gene, UMLS, and PubChem. Next, lexicons (dictionaries) are created from the integration of all the databases.
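As a rough illustration of the lexicon-building idea, the sketch below merges name records from several sources into one dictionary entry per entity. The record layout, the entity identifiers, and the rows themselves are invented for this example and do not reflect the actual content or export format of any database; the sketch also assumes the sources already share an identifier, which in practice is a large part of the integration work.

```python
from collections import defaultdict

# Invented rows in the shape (source, entity id, name); purely illustrative.
RECORDS = [
    ("Entrez Gene", "gene:LASS1", "LASS1"),
    ("Entrez Gene", "gene:LASS1", "LAG1"),
    ("UMLS", "gene:LASS1", "LAG1 "),  # duplicate name with stray whitespace
    ("UMLS", "disease:breast_cancer", "breast cancer"),
    ("UMLS", "disease:breast_cancer", "breast carcinoma"),
]


def build_lexicon(records):
    """Group names by entity id, remembering which sources contributed."""
    lexicon = defaultdict(lambda: {"names": set(), "sources": set()})
    for source, entity_id, name in records:
        entry = lexicon[entity_id]
        entry["names"].add(name.strip())  # normalize whitespace before merging
        entry["sources"].add(source)
    return dict(lexicon)


LEXICON = build_lexicon(RECORDS)

for entity_id, entry in sorted(LEXICON.items()):
    print(entity_id, sorted(entry["names"]), sorted(entry["sources"]))
```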
The second step is the extraction process, which starts with the detection and tagging of the bioentities mentioned in the text. Once the information has been organized, the system generates a knowledge base to store results (i.e., sentences and documents, the collection of bioentities attached to their synonyms and external database identifiers, and the annotations generated by the system).
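Detection and tagging of bioentities can be approximated with a plain dictionary lookup. The sketch below is a minimal version under stated assumptions: the lexicon, the identifiers, and the sample sentence are invented, matching is case-insensitive (a simplification that real gene-symbol taggers often avoid), and the statistical and rules-based analyses mentioned earlier are not modeled at all.

```python
import re

# Hypothetical lexicon: name in text -> (entity type, canonical identifier).
LEXICON = {
    "breast cancer": ("disease", "breast cancer"),
    "breast carcinoma": ("disease", "breast cancer"),
    "LASS1": ("gene", "LASS1"),
    "LAG1": ("gene", "LASS1"),
}


def compile_tagger(lexicon):
    """Build one regex for all names, longest first so longer names win."""
    names = sorted(lexicon, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_entities(text, lexicon, tagger):
    """Return (start, end, matched text, entity type, canonical id) tuples."""
    lookup = {name.lower(): info for name, info in lexicon.items()}
    tags = []
    for match in tagger.finditer(text):
        entity_type, canonical = lookup[match.group(0).lower()]
        tags.append((match.start(), match.end(), match.group(0),
                     entity_type, canonical))
    return tags


TAGGER = compile_tagger(LEXICON)
SENTENCE = "LAG1 expression was altered in breast carcinoma samples."
TAGS = tag_entities(SENTENCE, LEXICON, TAGGER)

for tag in TAGS:
    print(tag)
```

Storing the character offsets alongside the canonical identifier is what lets a knowledge base link a sentence back to both the synonym that appeared in print and the external database record it refers to.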
The last two steps give the end user access to a set of extraction, analysis, and visualization tools that are designed to retrieve and exploit all of the precalculated information from the AKS2 database. The AKS2 interface is a web application and can be accessed within an organization’s intranet. In addition to the various searching modes, the AKS2 system has an interactive graphical tool for visualizing literature relationships between bioentities.
AKS2 can be applied to disease-based research to aid in the finding of related drugs and genes in published abstracts, and it can help researchers explore and discover relationships in the literature that are not easily revealed with information retrieval approaches.
For instance, a researcher can type a disease name (e.g., breast cancer) in the Search Bioentities interface, and the system returns the disease and a list of similar diseases (e.g., breast carcinoma) that may be closely related to the user’s original query. After the user selects a concept, the system returns an interactive Summary page showing the relevance scores of related bioentities based on text analytics (Figure 2).
The Summary page also shows the most recent papers and a list of authors who publish frequently about the bioentity of interest. AKS2 understands the various naming ambiguities that exist for gene names and symbols. Hence, while a simple information retrieval system may require five to ten different searches, AKS2 can perform a search of similar scope with a single query.
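The difference between several exact-match searches and one synonym-aware query can be sketched as follows. The mini-corpus, its identifiers, and the synonym table are invented placeholders, and the functions are hypothetical; the sketch only shows the general idea of expanding a query with known synonyms, not the method any specific system uses.

```python
import re

# Invented placeholder abstracts keyed by made-up ids.
ABSTRACTS = {
    101: "We measured LASS1 expression in tissue samples.",
    102: "LAG1 was examined in the same cohort.",
    103: "Patients with breast carcinoma were enrolled.",
    104: "The breast cancer group was followed for two years.",
}

# Hypothetical synonym sets keyed by a preferred name.
SYNONYMS = {
    "LASS1": ["LASS1", "LAG1"],
    "breast cancer": ["breast cancer", "breast carcinoma"],
}


def exact_search(term, abstracts):
    """Return ids of abstracts containing the literal term (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(i for i, text in abstracts.items() if pattern.search(text))


def expanded_search(term, abstracts, synonyms):
    """Search for the term and each listed synonym, merging the hits."""
    hits = set()
    for name in synonyms.get(term, [term]):
        hits.update(exact_search(name, abstracts))
    return sorted(hits)


print(exact_search("LASS1", ABSTRACTS))
print(expanded_search("LASS1", ABSTRACTS, SYNONYMS))
print(expanded_search("breast cancer", ABSTRACTS, SYNONYMS))
```

With exact matching, the user would have to know and type each synonym separately; with expansion, the synonym table carries that knowledge and one query covers them all.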
Knowledge Visualization and Analysis
The interactive Summary page is a starting point for viewing additional knowledge-mining results and performing further analyses. The researcher can launch an interactive graphical viewer to visualize how their bioentity relates to other published bioentities. Moreover, the graphical viewer gives the user an option to explore how all of the bioentities relate to one another with a single mouse click (Figure 3). When an interesting connection is found between two bioentities, a new analysis can be launched to show which articles link the two bioentities.
Not only can one perform a search and visualize the results generated from a single search term or phrase, but the AKS2 interface gives the researcher the option to import a file containing a list of gene names or identifiers obtained from a microarray experiment. The system searches for all of the genes on the list and permits the researcher to build a graph showing how the genes are related to one another based on the literature.
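One simple way to relate a list of genes through the literature is to count how many abstracts mention each pair together. The sketch below assumes the abstracts have already been tagged with canonical gene symbols; the tagged data, the gene list, and the function name are invented for illustration, and plain co-occurrence counting stands in for whatever scoring a real system applies.

```python
from collections import Counter
from itertools import combinations

# Invented input: abstract id -> canonical gene symbols tagged in that abstract.
TAGGED = {
    1: ["LASS1", "MRPS30"],
    2: ["LASS1", "MRPS30", "PAPOLA"],
    3: ["PAPOLA"],
    4: ["SCT", "LASS1"],
}

# Hypothetical gene list, e.g. as imported from a microarray result file.
GENE_LIST = ["LASS1", "MRPS30", "PAPOLA"]


def cooccurrence_edges(gene_list, tagged_abstracts):
    """Count, for each pair of listed genes, the abstracts mentioning both."""
    wanted = set(gene_list)
    edges = Counter()
    for genes in tagged_abstracts.values():
        present = sorted(wanted & set(genes))
        for gene_a, gene_b in combinations(present, 2):
            edges[(gene_a, gene_b)] += 1
    return edges


EDGES = cooccurrence_edges(GENE_LIST, TAGGED)

for (gene_a, gene_b), weight in sorted(EDGES.items()):
    print(f"{gene_a} -- {gene_b}: {weight} abstract(s)")
```

Each counted pair becomes an edge in the graph, and the count can serve as an edge weight; genes outside the imported list (SCT here) are ignored even when they appear in the same abstract.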
In addition to the interactive graphical-visualization tool, the system lets scientists generate reports for printing or for sharing. The reports can be a list of articles, abstracts, or sentences that contain the query term. Furthermore, abstract reports are color-coded for easy identification and review. The Details page for genes and proteins provides a list of external databases, and it automatically provides links to external websites so more can be learned about the gene or protein in terms of sequence, pathway, structure, and function.
Understanding the mechanisms and related molecular players in a disease is a complex process that includes both experimental approaches and computational methodologies. Because the computational technology of automated text mining can help researchers be more knowledgeable about their disease areas and create more efficient drug discovery processes, it is essential that any text-mining software provide the scientist a powerful but easy-to-use system. AKS2 addresses both of these challenges, and is a practical tool for any research group carrying out disease-related research.