Pharmaceutical and biomedical research is evolving to take advantage of bioinformatics research programs that incorporate data from new high-resolution assays and technologies, such as microarrays and fluorescence in situ hybridization.
Information supplied by these methods about the physiological functions of genes can provide a molecular understanding of disease mechanisms, which can in turn lead to effective therapies. It can also assist basic scientific research, for example by improving our understanding of genetic networks or embryological development.
The value added by these new assay technologies resides in the detailed insight provided by the high-resolution data they output. This blessing, however, is also a curse: the huge volume of data must be processed, assessed for quality, and have its relevant features extracted before any visualization or statistical analysis is possible.
The time when it was possible for a lab scientist to manually perform this processing has now passed; researchers would prefer to put their education and skills to use by concentrating on the science, rather than wasting their days repetitively annotating images or manipulating tables in a spreadsheet program. Manual processing is also error-prone and often subjective.
For these new technologies, then, automated processing is essential. Effective development of data-processing algorithms demands tools that enable rapid prototyping and implementation of these algorithms; ideally this will be the same environment that is subsequently used for visualization and statistical analysis of the processed data.
Automated processing is only a first step, however. When assay technologies are deployed in a high-throughput environment, even an automated data-processing step can become the bottleneck. In this case, researchers can take advantage of the increasing availability of computer clusters to parallelize their data processing, relieving the bottleneck so that research can proceed at the speed of science, not of analysis.
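Parallelization works well here because each image or sample can usually be processed independently of the others. The following is a minimal Python sketch of that map-over-independent-items pattern, not the MATLAB workflow discussed in this article. The function names and the toy per-image step (a mean intensity) are hypothetical, and a thread pool is assumed only so that the example runs anywhere; for CPU-bound work, a process pool or cluster scheduler exposing the same `map` interface would take its place.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def mean_intensity(image):
    """Hypothetical stand-in for a per-image processing step."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)


def process_all(images, executor: Executor):
    """Apply the per-image step to every image.

    Each image is handled independently, so the executor is free to
    run the calls concurrently. Results come back in input order.
    """
    return list(executor.map(mean_intensity, images))


# Three tiny "images" as nested lists of pixel intensities.
images = [
    [[0, 2], [4, 6]],
    [[10, 10], [10, 10]],
    [[1, 2], [3, 6]],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = process_all(images, pool)
```

Because `process_all` depends only on the generic executor interface, the choice of worker pool can change without touching the per-image code.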
We explore these issues further in a case study: the automated parallel processing, and subsequent visualization and analysis, of data from the FlyEx database.
The FlyEx database is a web repository of segmentation gene-expression images. It contains images of fruit fly (Drosophila melanogaster) embryos at various stages of development (cleavage cycles 10–14A) and quantitative data extracted from the images. Such work is invaluable in understanding the developmental processes in the embryo. Spatial and temporal patterns of gene expression are fundamental to these processes.
The images are created using immunofluorescent histochemistry. Embryos are stained with fluorescently tagged antibodies that bind to the product of an individual gene, staining only those parts of the embryo in which the gene is expressed. In this case, the genes studied are those involved in the segmentation of the embryo, such as the even_skipped, caudal, and bicoid genes. The embryo is then imaged using confocal microscopy.
The database currently contains more than two million data records and thousands of images of embryos. Clearly, the automated processing of these images is a necessary step in the construction of the database.
MathWorks has developed an algorithm for processing the images using MATLAB and Image Processing Toolbox.
MATLAB is an environment for technical computing, data analysis, and visualization. As an interactive tool, it enables researchers to prototype algorithms and analyses; its underlying programming language, optimized for handling large scientific datasets, allows these prototypes to be automated for high-throughput applications. Toolboxes provide application-specific add-on functionality, such as image and signal processing, multivariate statistics, and bioinformatics.
The image-processing algorithm involves a number of steps. The embryo image is first rotated into a standard configuration—centered, with the anterior-posterior axis horizontally oriented. The second step removes noise from the image and equalizes the brightness across the entire image using adaptive histogram equalization.
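The first two steps can be illustrated with a short Python sketch. It is not the MATLAB implementation described here, and it rests on several assumptions: that the embryo can be separated from the background by a simple threshold, that the standard orientation can be found from second-order image moments (the text does not say which method is used), and that plain global histogram equalization is an acceptable stand-in for the adaptive variant. All function names are hypothetical.

```python
import math


def centroid_and_orientation(image, threshold=0):
    """Return (row_center, col_center, angle_deg) of the foreground.

    Pixels with intensity above `threshold` are treated as embryo
    (an assumption). The angle is that of the major axis relative to
    the horizontal axis, in image coordinates (rows increase downward).
    """
    points = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value > threshold]
    if not points:
        raise ValueError("no foreground pixels above threshold")
    n = len(points)
    row_c = sum(r for r, _ in points) / n
    col_c = sum(c for _, c in points) / n
    # Second-order central moments of the foreground pixel positions.
    mu_cc = sum((c - col_c) ** 2 for _, c in points) / n
    mu_rr = sum((r - row_c) ** 2 for r, _ in points) / n
    mu_rc = sum((r - row_c) * (c - col_c) for r, c in points) / n
    angle = 0.5 * math.atan2(2 * mu_rc, mu_cc - mu_rr)
    return row_c, col_c, math.degrees(angle)


def equalize(image, levels=256):
    """Global histogram equalization for integer pixels in 0..levels-1.

    A simplified stand-in for adaptive equalization: one mapping is
    computed for the whole image rather than one per local region.
    """
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[v] - cdf_min) * scale) for v in row]
            for row in image]
```

Rotating the image about the returned centroid by the negative of the returned angle would bring the major axis horizontal. An adaptive scheme would compute the equalization mapping on local tiles and blend between them, so that brightness is evened out across the image rather than only stretched globally.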