Rapid advances in instrumentation and robotics have made high-content screening (HCS) faster than ever. Quantities of data have grown beyond all past superlatives, to the point where "ridiculous" may be the best description. Historically, vendors have been behind the curve, offering relatively underpowered systems with inflexible data-analysis packages. As computational technology catches up with assay technology, scientists are embracing the open-source movement and polishing their programming skills to supercharge their already fast screens.
At CHI’s “High Content Analysis” conference to be held later this month in San Francisco, leaders in the field will gather to share their data-finessing successes. One of the most intense areas of interest in this field is image analysis. When a thousand images may be taken of a single plate, and plates are processed in batches of hundreds, how to handle the data is a nontrivial problem.
Simply moving that volume of graphics files from one place to another can be a chore, never mind analyzing them. But this is exactly what John McLaughlin, Ph.D., a research fellow at Rigel Pharmaceuticals, does on a weekly basis.
In its pursuit of aurora kinase inhibitors, Rigel has developed a phenotypic screen using pattern recognition. This is the same technology being developed by law-enforcement agencies for screening video images for criminal suspects.
Pattern-recognition software identifies features in test images and uses them to train classifiers, which can then mine large image datasets for patterns of interest. According to Dr. McLaughlin, "It's a highly dimensional kind of data. In this case, there are 140 measurements for every cell. Not only does this technology help us quantify huge datasets more efficiently, it can also suggest potential mechanisms of action of our compounds." Pattern recognition thus has utility in determining the mode of action of a drug without spending a great deal of time and resources on secondary screens.
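The workflow described above, training a classifier on per-cell feature vectors and then applying it to new cells, can be illustrated with a minimal sketch. The nearest-centroid classifier, the toy data, and the phenotype labels here are all illustrative assumptions, not Rigel's actual method; the only detail taken from the article is the 140 measurements per cell.

```python
import random

def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(labeled_cells):
    """labeled_cells maps a phenotype label to a list of feature vectors.
    Returns one centroid (prototype) per phenotype."""
    return {label: centroid(vecs) for label, vecs in labeled_cells.items()}

def classify(model, cell):
    """Assign a cell to the phenotype whose centroid is nearest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], cell))

# Toy training data: 140 measurements per cell, as in the Rigel screen.
# The two phenotype labels are hypothetical.
random.seed(0)
normal = [[random.gauss(0.0, 1.0) for _ in range(140)] for _ in range(50)]
arrested = [[random.gauss(2.0, 1.0) for _ in range(140)] for _ in range(50)]
model = train({"normal": normal, "arrested": arrested})
print(classify(model, [2.1] * 140))  # a cell resembling the "arrested" class
```

Once trained, such a model can be run over millions of cells far faster than any visual inspection, which is what makes mining whole screens for a phenotype practical.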
The assay looks for proliferation of cells after treatment with a small-molecule inhibitor. To analyze the data, Dr. McLaughlin uses a cluster built from 20 or 30 PCs. "Nowadays you can buy quad cores or dual-quad cores for not really all that much. If you've got a couple of those then you've got a small cluster."
A typical screen will process for four days on this cluster—a testament to the mind-boggling size of the dataset. “A year ago, we had a backlog of things we needed to do, so I pulled in a number of computers from other groups at our company and used them at night when people went home.”
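The reason a stack of multicore PCs works so well for this job is that per-image analysis is embarrassingly parallel: each image can be processed independently. A minimal single-machine sketch, using Python's standard multiprocessing pool; the file names and the stand-in analysis function are invented for illustration.

```python
from multiprocessing import Pool

def analyze_image(path):
    """Stand-in for per-image analysis. A real screen would run
    segmentation and feature extraction here; to keep the sketch
    self-contained, we derive a fake 'cell count' from the file name."""
    return path, sum(ord(c) for c in path) % 500

if __name__ == "__main__":
    # A single plate can yield a thousand images; fan them out over all cores.
    images = ["plate01_well%03d.tif" % w for w in range(1000)]
    with Pool() as pool:
        results = dict(pool.map(analyze_image, images))
    print(len(results))  # one result per image
```

Spreading the same map across many machines, as in Dr. McLaughlin's 20-to-30-PC cluster, is conceptually the same operation with a job scheduler in place of `Pool`.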
One challenge faced by Dr. McLaughlin and other scientists working with large sets of imaging data is the closed nature of software packages. “Most vendors have some level of customization built in, but I’ve found many times with industry-standard systems that they don’t provide nearly enough flexibility,” Dr. McLaughlin explains.
"I understand the reasoning for keeping code proprietary, but before I commit to an analysis system I often don't know what my requirements are; I just know it's inevitable that something they don't allow for will arise later. I wouldn't write my own code if I didn't have to, so if they would open their code up fully, or partially like Matlab, then I could spend my time doing science instead of building tools."
Dr. McLaughlin's lab has adopted CellProfiler, an open-source program developed at the Broad Institute, which is written in the Matlab programming language and linked to a MySQL database platform.
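Storing per-cell measurements in a relational database makes whole-screen questions a matter of a single query. A minimal sketch of that idea follows; it uses Python's built-in SQLite as a stand-in for the MySQL backend, and the table and column names are illustrative, not CellProfiler's actual export schema.

```python
import sqlite3

# In-memory stand-in for the MySQL measurement store.
# Schema is hypothetical: one row per cell, keyed by image.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE per_cell (image_id INTEGER, cell_id INTEGER, nucleus_area REAL)"
)
db.executemany(
    "INSERT INTO per_cell VALUES (?, ?, ?)",
    [(1, 1, 210.5), (1, 2, 198.0), (2, 1, 340.2)],
)

# A typical downstream question: mean measurement per image across the plate.
rows = db.execute(
    "SELECT image_id, AVG(nucleus_area) FROM per_cell GROUP BY image_id"
).fetchall()
print(rows)  # [(1, 204.25), (2, 340.2)]
```

The same aggregate query scales from this toy table to the millions of cell rows a full screen produces, which is the point of pairing the image-analysis pipeline with a database rather than flat files.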