Leading the Way in Life Science Technologies

GEN Exclusives

More »


More »
March 01, 2010 (Vol. 30, No. 5)

Quantitative HT Gene-Expression Analysis

Automating and Parallelizing the Process Can Help to Avoid an Analysis Bottleneck

  • Pharmaceutical and biomedical research is evolving to take advantage of the development of bioinformatic research programs, incorporating data from new high-resolution assays and technologies such as microarrays and fluorescent in situ hybridization.

    Information supplied by these methods concerning the physiological functions of genes can provide a molecular understanding of the mechanism of diseases, which can lead to effective therapies. It can also assist in basic scientific research, such as improving our understanding of genetic networks or embryological development.

    The value added by these new assay technologies resides in the detailed insight provided by the high-resolution data they output. This blessing, however, is also a curse: the huge volume of data must be processed, assessed for quality, and have its relevant features extracted before any visualization or statistical analysis is possible.

    The time when it was possible for a lab scientist to manually perform this processing has now passed; researchers would prefer to put their education and skills to use by concentrating on the science, rather than wasting their days repetitively annotating images or manipulating tables in a spreadsheet program. Manual processing is also error-prone and often subjective.

    For these new technologies, then, automated processing is essential. Effective development of data-processing algorithms demands tools that enable rapid prototyping and implementation of these algorithms; ideally this will be the same environment that is subsequently used for visualization and statistical analysis of the processed data.

    Automated processing is a first step only, however. The deployment of assay technologies in a high-throughput environment can mean that even an automated data-processing step will be the bottleneck. In this case, researchers can take advantage of the increasing availability of computer clusters to parallelize their data processing, releasing the bottleneck so that research can take place at the speed of science, not of analysis.

    We explore these issues further in a case study—the automated parallel-processing and subsequent visualization and analysis of data from the FlyEx database.

    The FlyEx database is a web repository of segmentation gene-expression images. It contains images of fruit fly (Drosophila melanogaster) embryos at various stages of development (cleavage cycles 10–14A) and quantitative data extracted from the images. Such work is invaluable in understanding the developmental processes in the embryo. Spatial and temporal patterns of gene expression are fundamental to these processes.

    The images are created using immunofluorescent histochemistry. Embryos are stained with fluorescently tagged antibodies that bind to the product of an individual gene, staining only those parts of the embryo in which the gene is expressed. In this case, the genes studied are those involved in the segmentation of the embryo, such as the even_skipped, caudal, and bicoid genes. The embryo is then imaged using confocal microscopy.

    The database currently contains more than two million data records and thousands of images of embryos. Clearly, the automated processing of these images is a necessary step in the construction of the database.

    Mathworks has developed an algorithm for processing the images using MATLAB and Image Processing Toolbox.

    MATLAB is an environment for technical computing, data analysis, and visualization. As an interactive tool, it enables researchers to prototype algorithms and analyses; its underlying programming language, optimized for handling large scientific datasets, allows these prototypes to be automated for high-throughput applications. Toolboxes provide application-specific add-on functionality, such as image and signal processing, multivariate statistics, and bioinformatics.

    The image-processing algorithm involves a number of steps. The embryo image is first rotated into a standard configuration—centered, with the anterior-posterior axis horizontally oriented. The second step removes noise from the image and equalizes the brightness across the entire image using adaptive histogram equalization.

  • Click Image To Enlarge +
    Figure 1. Results of the automated image-processing and statistical analysis of a fruit-fly embryo

    Finally, the boundaries of each cell in the embryo are found using a brightness thresholding step, and the pixels within each cell boundary are separated into their red, green, and blue channels. These red, green, and blue channels correspond to the level of expression of these three segmentation-related genes: even_skipped, caudal, and bicoid. Figure 1 shows a profile of the three expression levels along the anterior-posterior axis.

    The algorithm is available from MATLAB Central, an online community that features a newsgroup and downloadable code from users at the File Exchange. The use of prebuilt methods, such as brightness thresholding, histogram equalization, and median filtering for noise reduction (supplied as building-block algorithms in Image Processing Toolbox), together with the ability of MATLAB to treat images simply as numerical arrays, allow the prototype to be compact, readable, and rapidly constructed.

    The processed results can then be analyzed and visualized using the same MATLAB environment. This approach offers important advantages in reducing the number of tools necessary for the overall analysis—streamlining the process, reducing the training needs of scientists, and removing a source of error in the transfer of data to a separate statistical package.

    The results could be analyzed according to the nature of the experimental context. For example, you could use:

    • time-series methods to examine how the gene-expression profiles vary over time;
    • ANOVA to examine the effects of different treatments or disease conditions; or
    • multivariate statistical or machine-learning methods, such as Principal Component Analysis or Cluster Analysis, to examine complex relationships within the profiles and between samples.

    Analysis results can also be visualized with other data sources and annotations, such as publicly available information on the genes themselves, public repositories of genetic data such as GenBank, and genetic pathway visualizations.

Related content