January 1, 2018 (Vol. 38, No. 1)

Linguamatics Brings Natural Language Processing to Non-Experts, Expediting Drug Development

When Eli Lilly needed to extract data from clinical trial reports, it used an artificial intelligence (AI) approach called natural language processing (NLP) to replace manual searches. Because approximately 80% of the world’s data is unstructured, this automated methodology let researchers mine a large volume of data more quickly and more easily—and less expensively—than would have been possible with manual methods. And the results were more accurate, as well as more comprehensive.

NLP technology involves retrieving hard-to-recover tidbits of information from electronic health records (EHRs), physicians’ notes, scientific papers, and other unstructured texts, as well as structured databases. When combined with other key data, the resulting insights can provide scientists, physicians, and regulators with more comprehensive and more accurate information. In other words, NLP can help investigators see the “whole picture.”

NLP is particularly helpful in drug development. “Roche used NLP to search Medline for abstracts mentioning breast cancer and human epidermal growth factor receptor 2,” notes Roger Hale, Ph.D., cofounder and COO of Linguamatics. The information helped scientists more accurately predict the success of a drug they were guiding into Phase II and III trials.

Other NLP users are extracting data from quality assurance documents and drafts of their planned regulatory submissions to compare information in text against that in tables and to identify errors in calculations and formatting. As a result, these NLP users are improving the quality of their regulatory submissions.

It Was the Worst (and Best) of Times…

When Linguamatics was formed in 2001, “Speech-processing technology was just becoming useable,” Dr. Hale recalls. The technology was so unwieldy that only a few organizations were even attempting to use it.

While working on NLP at SRI International’s Cambridge (U.K.) research lab, Dr. Hale and three colleagues saw the opportunity to develop a novel approach that would make data extraction more applicable. Their method held the promise of enabling users who weren’t NLP experts to extract precise results. “It used interactive information extraction (I2E),” Dr. Hale notes. “And it was based on the notion that text mining results can be returned and refined interactively and improved iteratively.”

2001 was also the year the dotcom bubble burst. The financial community was still reeling, and technical startups weren’t in vogue. Funding was hard to find. Rather than seek venture capital in a virtually closed market, “We grew organically from customer revenues,” Dr. Hale recalls. The company’s four cofounders experienced many small successes, but one in particular stood out.

“The EU Bioinformatics Institute ran a series of workshops on text mining, and we presented our technology there, directly to big pharma,” Dr. Hale remembers. “That was dream access!” The presentation was instrumental in netting early sales for the company. Linguamatics’ first client was AstraZeneca, which soon was followed by Johnson & Johnson. Now, its clients include 18 of the top 20 pharmaceutical companies in the world, as well as the FDA and NIH.

In 2014, in the U.K., Linguamatics won the prestigious Queen’s Award for Enterprise in the category of international trade. It recognized a five-year export growth rate for I2E that exceeded 300%. “We were quite honored,” Dr. Hale says, modestly.

Mining Is Harder Than It Seems

Extracting numerical information, even from structured data such as spreadsheets, is more challenging than it sounds. That’s because numbers can be expressed in many different ways. For example, the NLP technology needs to understand that “1 kg” is the same as both “one kilogram” and “1,000 grams.”
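To make the point concrete, here is a minimal Python sketch of that kind of normalization. It is not Linguamatics' implementation; the lookup tables, the pattern, and the function name mass_in_grams are all assumptions made for illustration, and a real system would cover far more units and number formats.

```python
import re

# Hypothetical lookup tables, chosen for illustration only.
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "three": 3.0, "ten": 10.0}
GRAMS_PER_UNIT = {
    "kg": 1000.0, "kilogram": 1000.0, "kilograms": 1000.0,
    "g": 1.0, "gram": 1.0, "grams": 1.0,
    "mg": 0.001, "milligram": 0.001, "milligrams": 0.001,
}

# A quantity (digits with optional commas/decimal point, or a word)
# followed by a unit name.
MASS_PATTERN = re.compile(
    r"^\s*([\d,]*\.?\d+|[a-z]+)\s*([a-z]+)\s*$", re.IGNORECASE
)


def mass_in_grams(text):
    """Return the mass expressed in grams, or None if the text is not recognized."""
    match = MASS_PATTERN.match(text)
    if match is None:
        return None
    quantity = match.group(1).lower()
    unit = match.group(2).lower()
    if unit not in GRAMS_PER_UNIT:
        return None
    if quantity in NUMBER_WORDS:
        value = NUMBER_WORDS[quantity]
    else:
        try:
            value = float(quantity.replace(",", ""))
        except ValueError:
            return None
    return value * GRAMS_PER_UNIT[unit]


if __name__ == "__main__":
    for phrase in ("1 kg", "one kilogram", "1,000 grams"):
        print(phrase, "->", mass_in_grams(phrase))
```

All three spellings resolve to the same value in grams, which is what allows them to be compared or aggregated once extracted.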

That challenge is only exacerbated by non-numerical data. Consider: mining thousands of patient records to determine patients’ histories of smoking. Such information typically is unstructured and nuanced by comments such as “doesn’t smoke,” “is in a smoking environment,” or “quit cigarettes.” Lack of structure and variable wording make keyword searches inadequate without manual curation.

While the NLP industry struggles with those and similar examples, Linguamatics, asserts Dr. Hale, sets itself apart with a platform that “pinpoints and extracts specific relationships from unstructured text and normalizes concepts from unstructured and structured text.” In other words, its technology understands that different terms can represent the same data and can recognize the difference between “quit smoking” and “smokes,” for example.
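The gap between a keyword search and a context-aware one can be illustrated with a small rule-based sketch in Python. This is a toy under stated assumptions, not a description of how I2E works: the rule list, the labels, and the function names smoking_status and keyword_hit are hypothetical, and real clinical text would need far richer handling of negation and context.

```python
import re

# Hypothetical rules, checked in order; the first matching pattern wins.
SMOKING_RULES = [
    ("non_smoker", re.compile(
        r"\b(doesn't|does not|never) smok(e|es|ed)\b|\bnon-?smoker\b")),
    ("former_smoker", re.compile(
        r"\b(quit|stopped|gave up) (smoking|cigarettes)\b|\b(ex|former)[- ]smoker\b")),
    ("passive_exposure", re.compile(
        r"\bsmoking environment\b|\bsecond-?hand smoke\b")),
    ("current_smoker", re.compile(r"\bsmok(es|er|ing)\b")),
]


def smoking_status(note):
    """Classify a free-text note into one of the illustrative labels."""
    # Lowercase and map the typographic apostrophe to a plain one.
    text = note.lower().replace("\u2019", "'")
    for label, pattern in SMOKING_RULES:
        if pattern.search(text):
            return label
    return "unknown"


def keyword_hit(note):
    """Naive keyword search: does the note mention 'smok' at all?"""
    return "smok" in note.lower()


if __name__ == "__main__":
    notes = ["doesn't smoke", "is in a smoking environment",
             "quit cigarettes", "smokes"]
    for note in notes:
        print(f"{note!r}: keyword={keyword_hit(note)}, status={smoking_status(note)}")
```

The keyword search flags “doesn’t smoke” and misses “quit cigarettes” entirely, while the ordered rules assign each phrase a distinct status. Rule order matters here: the more specific patterns are tried before the generic one for current smokers.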

The iterative nature of the technology ensures that results become more specific with each pass through the application, and that users can modify queries in real time.

Tip: Know What You Want

Researchers already using NLP can improve their results if they first understand exactly what results they want. “Focus is very important,” Dr. Hale emphasizes. “NLP users need a good understanding of the use case and the problem [they] want to solve.”

How to solve that problem may be up for discussion, too. The industry has developed multiple approaches to NLP. For example, statistical NLP finds patterns in data, but needs real-world examples to train. Rule-based systems are an alternative, but need a specialist to translate patterns into healthcare concepts. An agile NLP system—Linguamatics’ approach—combines statistics-driven and rules-based approaches to eliminate many of their limitations and provide faster, more comprehensive results.

One of the unique things about Linguamatics, Dr. Hale insists, is that its NLP application lets users search internal and external data sources and combine the results “in a completely unified way, so it’s not apparent you’re searching from two different databases.” Linguamatics makes this type of search available for published literature, too, to provide searches of full text articles.

To help companies get started, Linguamatics keeps the “OnDemand” evaluation system available in the cloud, supporting immediate access. “To search public data, connect to our cloud,” Dr. Hale suggests. “We have a range of applications.” To search their own private content, users need an on-premises installation.

Figure 1. An example I2E web portal. Web portals bring the power of natural language processing to a broad user base, by providing three search options (simple, advanced, and smart) for a range of text-mining needs.

On the Horizon: Broader Access

Currently, Linguamatics is focused on broadening the accessibility and usage of I2E (Figure 1). “We’re increasing the power, accuracy, precision, and recall to make NLP accessible to a broader range of users,” Dr. Hale details. For example, the company is building web portals and embedding I2E into production workflows to enable semantic enrichment of documents (Figure 2). Such enhancements let users mark up text for the various relationships within it, so that they can mine phenotypic data from EHRs to find patients with certain characteristics or to identify patients who need more or different types of care.

To develop its capabilities, Linguamatics works closely with a handful of experts in data mining. Collaboration with the developers of the open-source Konstanz Information Miner (KNIME) resulted in a more capable workflow engine for I2E, and work with Pentavere enhanced that company’s analytics platform for unstructured healthcare text. “We tend to partner synergistically, where combining our technology with a partner’s increases value to the customer,” Dr. Hale declares.

Figure 2. Linguamatics I2E extracts key cancer insights from pathology reports including histology grade and behavior, body site, and demographics to support disease registries, phenotypic analysis, and biospecimen databases. Tableau illustration shows results from analysis of The Cancer Genome Atlas (TCGA) pathology reports.

Linguamatics is also active in the NLP community, helping develop new technologies and new ways of using them. This includes work with machine learning, a subset of AI that increasingly is being integrated with NLP. Applications include analyzing clinical outcomes by treatment and searching for correlations among elements of complex problems (Figure 3).

Linguamatics uses machine learning to understand natural language. This helps it differentiate nouns from verbs and navigate other semantic and linguistic pitfalls that are easy for humans but tricky for machines.

“Machine learning has some drawbacks when used alone, though,” Dr. Hale admits. “Basically, you start machine learning using training data curated by experts, and the system learns from that. It’s very expensive to obtain that training data, and if needs change, you need a new dataset.” Linguamatics is working to overcome that hurdle. “Combining machine learning with I2E can produce the training data efficiently,” Dr. Hale observes.

Linguamatics’ mission hasn’t changed from its early days. It’s still focused on ways to extract the maximum amount of relevant information from scientific papers, EHRs, and other unstructured texts to inform decision making. As that becomes ever easier, researchers and clinicians finally can see the complete picture.

Figure 3. This figure illustrates the ability of I2E to pull out a variety of gene-disease annotations and relationships, in this case to find genes that have some biomarker-type relationship with breast cancer. The flexible nature of I2E means that as a user, one can decide on the relationships one needs. For example, clinicians need definitive relationships between genes and disease processes; however, if you work in discovery, then being able to search for gene-disease associations can be very valuable.


Location: 324 Cambridge Science Park, Milton Road, Cambridge, UK CB4 0WG
Phone: +44 1223 651910
Principal: Roger Hale, Ph.D., COO and Cofounder
Number of Employees: 100
Focus: Linguamatics is a pioneer in natural language processing (NLP), an artificial intelligence technique for text mining. The company mines both unstructured and structured data for knowledge discovery in the biotech, pharmaceutical, and other high-value industries.