
Feature Articles : Jan 1, 2010 (Vol. 30, No. 1)

Bioinformatic Tools Drive High-Content Analysis

Robust Computational Resources Make It Possible for Scientists to Fully Utilize Experimental Data
  • Catherine Shaffer

Rapid advances in instrumentation and robotics have made high-content screening (HCS) faster than ever. The quantity of data has increased beyond all past superlatives, to the point that it can best be described as ridiculous. Historically, vendors have been behind the curve, offering relatively underpowered systems with inflexible data-analysis packages. As computational technology catches up with assay technology, scientists are embracing the open-source movement and polishing up their programming skills to supercharge their already fast screens.

At CHI’s “High Content Analysis” conference to be held later this month in San Francisco, leaders in the field will gather to share their data-finessing successes. One of the most intense areas of interest in this field is image analysis. When a thousand images may be taken of a single plate, and plates are processed in batches of hundreds, how to handle the data is a nontrivial problem.

Simply moving that volume of graphics files from one place to another can be a chore, never mind analyzing them. But this is exactly what John McLaughlin, Ph.D., a research fellow at Rigel Pharmaceuticals, does on a weekly basis.

In its pursuit of aurora kinase inhibitors, Rigel has developed a phenotypic screen using pattern recognition. This is the same technology being developed by law-enforcement agencies for screening video images for criminal suspects.

Pattern recognition identifies features in test images and then uses those features to train classifiers, which can then be used to mine large image datasets for patterns of interest. According to Dr. McLaughlin, “it’s a highly dimensional kind of data. In this case, there are 140 measurements for every cell. Not only does this technology help us quantify huge datasets more efficiently, it can also suggest potential mechanisms of action of our compounds.” Pattern recognition also has utility in determining the mode of action of a drug without spending a great deal of time and resources on secondary screens.
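To make the train-then-mine idea concrete, here is a minimal sketch of a nearest-centroid classifier working on per-cell feature vectors. It is not Rigel's method: the three features, the phenotype labels, and the function names `train_centroids` and `classify` are all hypothetical, and a real screen would use far more measurements per cell and would standardize the features first.

```python
from statistics import mean

def train_centroids(samples):
    """Average the feature vectors of each labeled class.

    samples: list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its centroid.
    """
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }

def classify(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical training cells: (area, DNA intensity, nuclei count).
training = [
    ([100.0, 1.0, 1.0], "normal"),
    ([110.0, 1.1, 1.0], "normal"),
    ([300.0, 2.1, 2.0], "polyploid"),
    ([320.0, 2.3, 3.0], "polyploid"),
]
centroids = train_centroids(training)

# Classify a new, unlabeled cell from a screening image.
print(classify(centroids, [290.0, 2.0, 2.0]))
```

Once trained on a small labeled set, the same `classify` call can be applied to every cell in a large image dataset, which is the "mining" step described above.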

The assay looks for proliferation of cells after treatment with a small molecule inhibitor compound. In order to analyze the data, Dr. McLaughlin uses a cluster built from 20 or 30 PCs. “Nowadays you can buy quad cores or dual-quad cores for not really all that much. If you’ve got a couple of those then you’ve got a small cluster.”

A typical screen will process for four days on this cluster—a testament to the mind-boggling size of the dataset. “A year ago, we had a backlog of things we needed to do, so I pulled in a number of computers from other groups at our company and used them at night when people went home.”

One challenge faced by Dr. McLaughlin and other scientists working with large sets of imaging data is the closed nature of software packages. “Most vendors have some level of customization built in, but I’ve found many times with industry-standard systems that they don’t provide nearly enough flexibility,” Dr. McLaughlin explains.

“I understand the reasoning for keeping code proprietary, but before I commit to an analysis system I often don’t know what my requirements are, I just know it’s inevitable that something they don’t allow for will arise later. I wouldn’t write my own code if I didn’t have to, so if they would open their code up fully or partially like Matlab, then I could spend my time doing science instead of building tools.”

Dr. McLaughlin’s lab has adopted an open-source program called CellProfiler, developed out of the Broad Institute, which is built in the Matlab programming language linked to a MySQL database platform.

OMERO

For scientists who are not programming experts, which is most of them, tools and applications are becoming available that bridge the gap for non-power users. The Open Microscopy Environment (OME), an international open-source consortium, has developed OME Remote Objects (OMERO), an open-source Java-based software suite that can import and analyze image data from just about any type of microscope or high-content analysis (HCA) system.

Unlike many other graphics export/import actions, OMERO captures not just the pixels in the image file, but the metadata associated with them. OMERO also interfaces with many popular analysis programs like Matlab and CellProfiler.

Jason Swedlow, Ph.D., professor at the University of Dundee and president of Glencoe Software, is one of the founders of OME and leads the team that developed OMERO. “We wanted to develop tools that would provide infrastructure for data management and enable interoperability between different types of image data and analysis tools. Moreover, our whole approach and philosophy is open—we are passionate about developing a community.”

OME’s viewpoint is that most labs are enterprise-data producers, and that the scale of data is comparable to that managed by banks and hedge funds. Therefore, biologists should have the same powerful tools that these industries use to manage their data. The concept seems to be catching on. Dr. Swedlow estimates that OMERO has been installed on roughly 1,100 servers around the world. “We have tens of terabytes of data under management on our own servers in our lab and know of many larger installations.”

Mega HCS

Robert F. Murphy, Ph.D., Lane professor of computational biology at Carnegie Mellon University, has embraced an approach that is orthogonal to the traditional compound-target drug screen.

Typically, cell-based assays are used to screen potential drug compounds against a disease target, and targets are chosen one at a time. If you have just one or two targets to screen, then it can be done quite efficiently, but when the group of potential targets is large, then the overall task of screening all available compounds against all targets becomes unmanageable, even with the highest possible throughput.

“The vision of doing a high-content screen to find a drug candidate is premised essentially on the notion that you only have to look at one target and one cell type (or a small number of cell types),” Dr. Murphy says.

But drugs may behave differently in other cell types or on other pathways within the cell. Dr. Murphy, a pioneer in the field of pattern recognition, has developed methods for probing the relationships of many compounds to many proteins in a single experiment. The method is based on the assumption (or hope) that clusters of proteins and clusters of drug compounds have similar behavior, so that the number of total experiments needed is reduced.

“That sounds a little bit like magic,” he concedes, “because you don’t necessarily know in advance what the right combinations are. There are methods that deal with that question.”

Assuming you know nothing, these methods enable you to learn about the dependencies that are in common between targets and compounds and thereby make it possible to measure a much smaller fraction of the total number of combinations.
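The saving from shared behavior can be illustrated with a deliberately simplified sketch. It assumes the cluster memberships of compounds and targets are already known, which is exactly what the methods described here must learn, so it shows only the fill-in step and is not Dr. Murphy's method. The function name `predict_unmeasured`, the compound and target IDs, and the activity values are all hypothetical.

```python
from statistics import mean

def predict_unmeasured(measured, compound_cluster, target_cluster):
    """Predict each unmeasured (compound, target) pair as the mean of the
    measured pairs in the same (compound cluster, target cluster) block."""
    blocks = {}
    for (compound, target), value in measured.items():
        key = (compound_cluster[compound], target_cluster[target])
        blocks.setdefault(key, []).append(value)

    predictions = {}
    for compound, c_cluster in compound_cluster.items():
        for target, t_cluster in target_cluster.items():
            if (compound, target) in measured:
                continue
            values = blocks.get((c_cluster, t_cluster))
            if values:  # a block with no measurements stays unpredicted
                predictions[(compound, target)] = mean(values)
    return predictions

# Assumed (hypothetical) cluster assignments.
compound_cluster = {"c1": "A", "c2": "A", "c3": "B"}
target_cluster = {"t1": "X", "t2": "X", "t3": "Y"}

# Only 4 of the 9 possible combinations are actually measured.
measured = {
    ("c1", "t1"): 0.9,
    ("c2", "t2"): 0.7,
    ("c3", "t3"): 0.1,
    ("c1", "t3"): 0.2,
}
predictions = predict_unmeasured(measured, compound_cluster, target_cluster)
for pair, value in sorted(predictions.items()):
    print(pair, round(value, 2))
```

Pairs that fall in a block with no measurements at all are left unpredicted, which hints at why the choice of which combinations to measure matters.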

Early results of simulations where the system must learn a known “correct answer” have been encouraging. This approach would allow scientists to do mega-type drug screens where many drugs, proteins, pathways, or cell types are addressed in a single experiment, a body of work that could take many years to complete the old-fashioned way.

Bioassay Ontology

Another developing concept in high-throughput and high-content screening is bioassay ontology. This project, inspired by the NIH Roadmap and funded by the National Human Genome Research Institute, seeks to create an ontology and software tools for searching, retrieving, and integrating small molecule high-throughput and high-content screen data.

This project addresses the problem of the vast amount of screening experiments that are publicly available, but which are described primarily in the form of free text. Developing an ontology for these experiments will make it easier for scientists to share, analyze, or mine data, without reinventing the wheel every time they do it.

An ontology facilitates searching and integration similar to the semantic web. Creating an ontology involves formalizing domain knowledge, using standardized vocabulary, into concepts and relationships with properties. The approach combines top-down development driven by domain experts with bottom-up development using automated text mining and natural language processing.
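As a rough illustration of why formal concepts and relationships help with searching, the sketch below stores "is a" relationships between assay concepts and uses them to retrieve assays tagged with any narrower concept. The `Ontology` class, the concept names, and the assay IDs are hypothetical and are not taken from the bioassay ontology project itself.

```python
from collections import defaultdict

class Ontology:
    """A toy ontology: concepts linked by "is a" relationships."""

    def __init__(self):
        self.parents = defaultdict(set)      # concept -> direct parents
        self.annotations = defaultdict(set)  # concept -> assay IDs tagged with it

    def add_is_a(self, child, parent):
        self.parents[child].add(parent)

    def annotate(self, assay_id, concept):
        self.annotations[concept].add(assay_id)

    def ancestors(self, concept):
        """All concepts reachable by following "is a" links upward."""
        seen = set()
        stack = [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def find_assays(self, concept):
        """Assays tagged with the concept or with any narrower concept."""
        return {
            assay
            for tagged, assays in self.annotations.items()
            if tagged == concept or concept in self.ancestors(tagged)
            for assay in assays
        }

onto = Ontology()
onto.add_is_a("luciferase reporter assay", "reporter gene assay")
onto.add_is_a("reporter gene assay", "cell-based assay")
onto.add_is_a("cell viability assay", "cell-based assay")

onto.annotate("assay-001", "luciferase reporter assay")
onto.annotate("assay-002", "cell viability assay")
onto.annotate("assay-003", "binding assay")

# A search on the broad term also finds assays tagged with narrower terms.
print(sorted(onto.find_assays("cell-based assay")))
```

A free-text search for "cell-based assay" would miss an experiment described only as a luciferase reporter assay; the explicit relationship is what lets the query find it.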

According to Stephan Schuerer, Ph.D., assistant professor of pharmacology at the University of Miami, creating a bioassay ontology not only gives a biologist easier access to the data, but also enhances the interpretation of the data. Being aware of relationships that may not be apparent, or giving a name to concepts, aids in the thinking process, leading to new ideas and theories.

Multivariate Analysis

In addition to producing a huge volume of data, image analysis for cell-based assays can have a high degree of complexity. Phenotypic changes in cells are often subtle and involve an overwhelming number of parameters. The application of multivariate analysis to this problem is being tackled by Jonathan Sexton, Ph.D., assistant professor at the Biomanufacturing Research Institute and Technology Enterprise at North Carolina Central University.

Dr. Sexton’s team has developed a cell-based assay for peroxisome biogenesis, which is of significance in type 2 diabetes. The assay used a fluorescent reporter to monitor changes in peroxisomes under the influence of drug compounds.

The resulting data had a low signal-to-noise ratio, and it was difficult to identify phenotypic changes in the peroxisomes. Using multivariate analysis, Dr. Sexton’s team was able to normalize the data for cell size and DNA content in an unbiased way.

“We identified as many parameters about this one object as we could: shape, size, total mass, how many vertices, nearest neighbor distances and on and on until we had 30 or 40 parameters,” Dr. Sexton explains. His group also included data for negative and positive controls, and ran a principal component analysis (PCA), which re-expresses the measured variables as uncorrelated components ordered by decreasing variance and is often used to create predictive models. “PCA allowed us to capture the relevant signal, reject the natural variability, and not interject any human bias in choosing what these parameters were.”
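For readers unfamiliar with PCA, the following minimal sketch finds the first principal component of a small table of per-object measurements using power iteration on the covariance matrix. The data and the function name `first_principal_component` are hypothetical, and this is not Dr. Sexton's pipeline; in practice a numerical library would be used, and the features would usually be scaled first.

```python
import math
from statistics import mean

def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it) for the first principal
    component, found by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [mean(column) for column in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading direction.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]

    variance = sum(
        vec[i] * cov[i][j] * vec[j] for i in range(d) for j in range(d)
    )
    return vec, variance

# Hypothetical objects: two strongly related measurements plus a noisy one.
rows = [
    [1.0, 2.0, 0.1],
    [2.0, 4.0, -0.1],
    [3.0, 6.0, 0.1],
    [4.0, 8.0, -0.1],
]
direction, variance = first_principal_component(rows)
print([round(x, 3) for x in direction], round(variance, 3))
```

Here the first component points almost entirely along the two related measurements and carries nearly all of the variance, while the noisy third measurement contributes very little. That is the sense in which PCA can separate a consistent signal from unstructured variability without anyone hand-picking parameters.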

High-content screening has benefited greatly from a close marriage with bioinformatics. Robust computational resources are allowing scientists to take advantage of the full range of data produced by experiments. As well, a new breed of programmer-scientist is adapting sophisticated modeling and statistical analysis methods for the life sciences.

These methods not only process data for completed experiments, but in true systems-biology fashion become a tool for hypothesis development and experimental design. Development of bioassay ontology and the semantic web helps to make the scientific literature smarter for the benefit of the whole community.