Feature Articles : Jun 1, 2008 (Vol. 28, No. 11)

Digging for Insights through High-Content Data Mining

New Multiparametric Solutions & Techniques Offer Accuracy and Ease of Use
  • Gail Dutton

In the rush toward high-throughput screening, high-content data mining was pushed aside. Now, it’s racing to catch up, as researchers in disciplines throughout the world find meaningful ways to use data that’s already at their fingertips. What they’re realizing is that data mining for one or two parameters is no longer sufficient. Researchers may have teased out 20, 50, or more than 100 parameters for each compound in a screen but need to narrow their focus to the few, perhaps five, that are most relevant.

A number of companies as well as university researchers are developing high-content data-mining applications to resolve that challenge. In the process, many are going the extra mile to ensure that mining can be performed easily and accurately by scientists in their own labs, without the need to hand off the project to the bioinformatics department. Consequently, results arrive faster.

In any project, the challenge is to determine the right readout parameters to better evaluate the data. “That’s where the field is investing much of its energies,” notes Daniela Gabriel, Ph.D., associate director, center for proteomic chemistry, lead finding platform at Novartis Institutes for BioMedical Research (www.nibr.novartis.com). There’s still significant variation among data-mining tools, and “that influences the outcomes significantly.”

Dr. Gabriel found that the software normally used by her lab couldn’t evaluate very dense wells. That realization launched a project that ultimately compared four software analysis modules from different vendors to find the best method for analyzing neurite outgrowth as a measure of the neurotoxicity of small molecules. “We had nearly the whole well pretty much covered with primary neuronal cells,” she says. Of the four applications tested, each run with its optimum algorithm, one was superior in analyzing those images.

Beyond evaluating the dense wells, the data mining tools had to deal with 20 to 50 parameters for analysis. “Nuclear area, nuclear intensity, neurite length, neurite area, luminosity, and many other features that may not be directly linked were part of the analysis, which provided the opportunity to make correlations that otherwise may not have been possible,” Dr. Gabriel explains. From those parameters, “we reduce it to about five useful features.”
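One way to picture that kind of reduction is the minimal sketch below. It is not the method used at Novartis, which the article does not describe. It assumes per-well measurements for control and treated wells, ranks each parameter by how strongly it separates the two groups, and skips any parameter that is nearly redundant with one already chosen. The function names, the 0.9 correlation cutoff, and all values are invented for illustration.

```python
from statistics import mean, stdev

def effect_size(control, treated):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled = ((stdev(control) ** 2 + stdev(treated) ** 2) / 2) ** 0.5
    return abs(mean(treated) - mean(control)) / pooled if pooled else 0.0

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def select_features(control, treated, keep=5, max_corr=0.9):
    """Pick up to `keep` parameters: strongest separation first, skipping any
    parameter that correlates too closely (in treated wells) with a chosen one."""
    ranked = sorted(control,
                    key=lambda f: effect_size(control[f], treated[f]),
                    reverse=True)
    chosen = []
    for f in ranked:
        if all(abs(pearson(treated[f], treated[g])) < max_corr for g in chosen):
            chosen.append(f)
        if len(chosen) == keep:
            break
    return chosen

# Invented toy values: parameter name -> one measurement per well.
control = {
    "neurite_length":    [100, 102, 98, 101],
    "neurite_area":      [200, 204, 196, 202],
    "nuclear_area":      [50, 51, 49, 50],
    "nuclear_intensity": [10, 12, 9, 11],
}
treated = {
    "neurite_length":    [60, 64, 58, 62],
    "neurite_area":      [130, 138, 126, 134],
    "nuclear_area":      [50, 49, 51, 50],
    "nuclear_intensity": [14, 13, 16, 15],
}

top = select_features(control, treated, keep=2)
print(top)
```

In this toy data, neurite area tracks neurite length almost perfectly, so it is skipped even though it separates the groups well on its own.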

In analyzing the high-content data-mining software, Dr. Gabriel’s main concern was to ensure that the bitmap covered and detected all the cells. “Other evaluation parameters were ease-of-use of the software application and the speed of the algorithms in order to apply the application in secondary assays in drug discovery projects,” she says.

Combining Image Acquisition and Analysis with Data Mining

Multiparametric cellular data mining is being addressed by Molecular Devices (MDC; www.moldev.com) through its AcuityXpress® platform. It works with other MDC components including the MDCStore™ Database. This allows integration of image acquisition and image analysis data as well as the mining of high-content screening data. It provides results in days rather than the weeks traditionally required for comparable analyses, according to Pierre Turpin, Ph.D., product manager, cellular imaging-analysis software.

One benefit is that scientists get information quickly without needing to wait for the bioinformatics group to become involved, notes Dr. Turpin. Usually, he explains, people rely upon a mix of third-party applications and tools that typically weren’t designed for high-content analysis. That approach constrains the number of parameters that can be dealt with at a given time. By combining image acquisition, analysis, organization, and data mining into one package, “we provide not just the data but the right data,” Dr. Turpin states.

Because cell-by-cell multiparametric results are linked automatically with the original image, it is easy for researchers to drill down to that image and validate the results, he emphasizes. With the original image in hand, Dr. Turpin notes, you must ask whether the analysis makes sense. “You may be looking at the wrong compound or an artifact unless you can look at the original image,” he says. “Or you could miss a compound.”

AcuityXpress uses an open application programming interface to simplify the process of reading and writing to the database and to facilitate exporting and importing data. “The database is fully scalable and can be installed on a PC or on a server for wider access.”

Project Specific Technology and Development

There are nearly as many applications for high-content analysis as there are projects. MAIA Scientific (www.maia-scientific.com) is developing what it calls an intuitive data-mining application for use with its high-content fluorescence and bright-field imaging in high-throughput screening.

Researchers at the National Changhua University of Education in Taiwan, in another example, are combining multiparametric data mining with case-based reasoning to develop a system to diagnose and develop a prognosis for chronic diseases.

Off-the-shelf solutions aren’t necessarily optimal or available for all disciplines. Consequently, some researchers are building their own. Pfizer Research Technology Center (www.pfizerrtc.com) is using high-content data mining to predict drug-induced hepatotoxicity. Scientist Arthur Smith, Ph.D., and colleagues developed a database of marketed drugs that proved safe and of therapies that failed because of toxicity.

Then, using text mining, high-content biology, and primary cells, they developed a database of toxicological and pharmacokinetic content. Multivariate analysis was used to develop a decision-tree algorithm to identify toxic drugs. The result provides a highly accurate, early toxicological screen, according to Dr. Smith. Savings have been substantial enough for the program to be expanded to other areas.
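To give a sense of what a decision-tree classifier does with such data, here is a minimal sketch. It is not Pfizer's algorithm, whose inputs and design the article does not disclose. It assumes each compound is described by a short list of numeric scores and a known label, and it splits on whichever score and threshold best separate the labels, as measured by Gini impurity. The two feature columns, the scores, and the labels are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree of (feature_index, threshold, left, right) tuples;
    a leaf is simply the majority label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Invented toy data: two hypothetical assay scores per compound.
rows = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.2], [0.9, 0.7], [0.7, 0.9]]
labels = ["safe", "safe", "safe", "toxic", "toxic", "toxic"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [0.85, 0.4]))
```

A production screen would involve many more features, far more compounds, and validation on held-out drugs; the sketch only shows the splitting logic.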

Seth Harris, Ph.D., research scientist II at Roche (www.roche.com), is another case in point. He is developing a multistructure data-mining application for x-ray crystallography. Traditionally, he says, structural biology would provide one or two structures in an area. Now, it’s feasible to determine 100 or more structures of a target complexed with various small molecules.

His application is “somewhere between back of the envelope and preliminary implementation,” he reports. The focus right now is to understand what’s important in the structure. Currently, computational chemists and crystallographers get together and analyze the structure, identifying the properties that are important in a given development project. “I want the computer to further facilitate that.”

Dr. Harris’ intention is to push the conceptual framework beyond simple distance-based analysis so that it yields increasingly sophisticated metrics that can, for example, tabulate electrostatic interactions between the protein and the ligand. Because the significance of similar interactions varies according to the protein environment in which they occur, determining the most important parameters is difficult, he explains. Data like that “is hard to tabulate into numbers.”
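The simple distance-based starting point can be illustrated with a short sketch. It is not Dr. Harris’ application, which the article describes only in outline. It assumes each structure supplies named protein atom coordinates and ligand atom coordinates in angstroms, and it counts, across many structures, how often each protein atom lies within a cutoff distance of the ligand. The atom names, coordinates, and 4.0-angstrom cutoff are invented.

```python
from math import dist

def contacts(protein_atoms, ligand_atoms, cutoff=4.0):
    """Names of protein atoms within `cutoff` angstroms of any ligand atom."""
    return sorted(name for name, xyz in protein_atoms.items()
                  if any(dist(xyz, lig) <= cutoff for lig in ligand_atoms))

def tabulate(structures, cutoff=4.0):
    """Count, over all structures, how often each protein atom contacts the ligand."""
    counts = {}
    for protein_atoms, ligand_atoms in structures:
        for name in contacts(protein_atoms, ligand_atoms, cutoff):
            counts[name] = counts.get(name, 0) + 1
    return counts

# Invented toy data: the same two protein atoms seen with two different ligands.
protein = {"ASP86:OD1": (0.0, 0.0, 0.0), "LYS33:NZ": (10.0, 0.0, 0.0)}
structures = [
    (protein, [(3.0, 0.0, 0.0)]),                    # ligand A
    (protein, [(2.0, 0.0, 0.0), (8.0, 0.0, 0.0)]),   # ligand B
]

table = tabulate(structures)
print(table)
```

A table like this treats every close contact as equal, which is exactly the limitation described above: it says nothing about the protein environment that makes one interaction matter more than another.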

The application is conceived as a guide for chemists engaged in drug design, but it could also have merit as a data organizer. It is particularly advantageous for those who are new to a program or who work on multiple projects, helping them discover and track the most pertinent or novel structures.

Enterprise-Wide Informatics Platform

Of all these programs, Johnson & Johnson Pharmaceutical Research & Development’s (J&JPRD; www.jnj.com) endeavor is the most ambitious. The company is developing an enterprise-wide, overarching framework for discovery informatics. Called ABCD for Advanced Biological and Chemical Discovery, it unites multiple data systems across various cultures and many continents, providing a common ontology as well as a common way to describe, store, and interact with data.

Decision support was the first phase. It included a data warehouse that gathered biological and chemical information from multiple sources. The team built the Third Dimension Explorer in-house to mine the data, says Edward P. Jaeger, Ph.D., director of information technology for the research and early development division at J&JPRD. “We had a sense of the commercial offerings and we had a fair amount of technological expertise.”

The benefit is cohesiveness on a number of levels including user interactions and data interpretation. For example, because the same underlying technology was used to render chemical structures and support deeper analysis, there were fewer inconsistencies. The Third Dimension Explorer component can handle millions of rows and hundreds of columns of data, Dr. Jaeger continues. “We haven’t explored the practical upper limits.”

Dr. Jaeger and his team are in the process of building out the registration and transactional tools to streamline how data is registered and made available to discovery scientists. Many of the initial elements are available now, and others will be rolled out during the next 12 to 18 months.

An informatics project of this scope wasn’t just a matter of implementing technology; it required business process change. “That was the biggest value for us,” Dr. Jaeger notes. “Once we started to pull together data from throughout the world, we were driven to make the data more consistent.”

At first, systems were developed to curate the information. Eventually, however, the biologists and managers decided to establish an editorial board to develop the standards to which data must adhere before it may be added to the data warehouse.

The result is more usable, coherent, and understandable data. “We’re moving toward the point where we don’t have to do a lot of active curating,” Dr. Jaeger explains. The system will be maintained to accommodate new technologies and ontologies and thus evolve along with the science.

ABCD is having a positive effect, according to Dr. Jaeger. A growing user base bears this out. “We weren’t replacing anyone’s technology with ABCD,” Dr. Jaeger explains. “We competed in the marketplace for users based on capabilities.” Right now, ABCD has attracted 1,300 scientists, a major percentage, he notes. The ultimate goal is to complete ABCD’s penetration in the discovery science realm, adding tools and capabilities and enhancing the sophistication of data analysis. Eventually, it may roll out to other areas within Johnson & Johnson.