Tutorials: Mar 1, 2010 (Vol. 30, No. 5)

Quantitative HT Gene-Expression Analysis

Automating and Parallelizing the Process Can Help to Avoid an Analysis Bottleneck


    Pharmaceutical and biomedical research is evolving to take advantage of bioinformatics research programs, incorporating data from new high-resolution assays and technologies such as microarrays and fluorescence in situ hybridization.

    Information supplied by these methods concerning the physiological functions of genes can provide a molecular understanding of the mechanism of diseases, which can lead to effective therapies. It can also assist in basic scientific research, such as improving our understanding of genetic networks or embryological development.

    The value added by these new assay technologies resides in the detailed insight provided by the high-resolution data they output. This blessing, however, is also a curse: the huge volume of data must be processed, assessed for quality, and have its relevant features extracted before any visualization or statistical analysis is possible.

    The time when it was possible for a lab scientist to manually perform this processing has now passed; researchers would prefer to put their education and skills to use by concentrating on the science, rather than wasting their days repetitively annotating images or manipulating tables in a spreadsheet program. Manual processing is also error-prone and often subjective.

    For these new technologies, then, automated processing is essential. Effective development of data-processing algorithms demands tools that enable rapid prototyping and implementation of these algorithms; ideally this will be the same environment that is subsequently used for visualization and statistical analysis of the processed data.

    Automated processing is only a first step, however. When assay technologies are deployed in a high-throughput environment, even an automated data-processing step can become the bottleneck. In this case, researchers can take advantage of the increasing availability of computer clusters to parallelize their data processing, relieving the bottleneck so that research can take place at the speed of science, not of analysis.
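
    As a rough illustration of the idea, the sketch below distributes a per-image function over a pool of workers. It is written in Python rather than the MATLAB environment the article describes, and the names process_image and process_batch are hypothetical. A local thread pool stands in for cluster workers here; on a real cluster a process-based or distributed executor would take its place.

```python
from concurrent.futures import ThreadPoolExecutor


def process_image(pixels):
    """Placeholder per-image step (assumption): mean pixel intensity."""
    flat = [value for row in pixels for value in row]
    return sum(flat) / len(flat)


def process_batch(images, max_workers=4):
    """Apply process_image to every image, keeping results in input order."""
    # Each image is independent of the others, so the work can be
    # handed out to as many workers as are available.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_image, images))


if __name__ == "__main__":
    batch = [[[0, 10], [20, 30]], [[5, 5], [5, 5]]]
    print(process_batch(batch))
```

    Because each embryo image is processed independently, the only coordination needed is collecting the results, which is what makes this kind of workload straightforward to parallelize.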

    We explore these issues further in a case study—the automated parallel-processing and subsequent visualization and analysis of data from the FlyEx database.

    The FlyEx database is a web repository of segmentation gene-expression images. It contains images of fruit fly (Drosophila melanogaster) embryos at various stages of development (cleavage cycles 10–14A) and quantitative data extracted from the images. Such work is invaluable in understanding the developmental processes in the embryo. Spatial and temporal patterns of gene expression are fundamental to these processes.

    The images are created using immunofluorescent histochemistry. Embryos are stained with fluorescently tagged antibodies that bind to the product of an individual gene, staining only those parts of the embryo in which the gene is expressed. In this case, the genes studied are those involved in the segmentation of the embryo, such as the even-skipped, caudal, and bicoid genes. The embryo is then imaged using confocal microscopy.

    The database currently contains more than two million data records and thousands of images of embryos. Clearly, the automated processing of these images is a necessary step in the construction of the database.

    MathWorks has developed an algorithm for processing the images using MATLAB and Image Processing Toolbox.

    MATLAB is an environment for technical computing, data analysis, and visualization. As an interactive tool, it enables researchers to prototype algorithms and analyses; its underlying programming language, optimized for handling large scientific datasets, allows these prototypes to be automated for high-throughput applications. Toolboxes provide application-specific add-on functionality, such as image and signal processing, multivariate statistics, and bioinformatics.

    The image-processing algorithm involves a number of steps. The embryo image is first rotated into a standard configuration—centered, with the anterior-posterior axis horizontally oriented. The second step removes noise from the image and equalizes the brightness across the entire image using adaptive histogram equalization.
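
    The noise-removal and brightness-equalization step can be sketched in a few lines. This is a Python illustration under stated assumptions, not the MATLAB implementation: the image is a list of rows of integer intensities, the function names are hypothetical, and the equalization shown is the simpler global variant rather than the adaptive one used in the article.

```python
from statistics import median_low


def median_filter3(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    height, width = len(img), len(img[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(height, r + 2))
                      for cc in range(max(0, c - 1), min(width, c + 2))]
            # median_low always returns a value present in the window
            row.append(median_low(window))
        out.append(row)
    return out


def equalize(img, levels=256):
    """Global histogram equalization for integer intensities in [0, levels)."""
    flat = [v for row in img for v in row]
    hist = [0] * levels
    for v in flat:
        hist[v] += 1
    # Cumulative distribution of intensities
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in img]
    scale = (levels - 1) / (n - cdf_min)
    return [[round((cdf[v] - cdf_min) * scale) for v in row] for row in img]
```

    Filtering before equalizing keeps isolated bright pixels from distorting the intensity histogram.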

    Figure 1. Results of the automated image-processing and statistical analysis of a fruit-fly embryo

    Finally, the boundaries of each cell in the embryo are found using a brightness-thresholding step, and the pixels within each cell boundary are separated into their red, green, and blue channels. Each channel corresponds to the level of expression of one of three segmentation-related genes: even-skipped, caudal, and bicoid. Figure 1 shows a profile of the three expression levels along the anterior-posterior axis.
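
    A simplified version of this last step might look like the following Python sketch. It assumes the image has already been rotated so that the anterior-posterior axis runs along the columns, represents each pixel as an (R, G, B) tuple, and averages over thresholded pixels per column rather than over individual cell boundaries as the article's algorithm does. The function name and the brightness measure (mean of the three channels) are illustrative assumptions.

```python
def expression_profile(rgb, threshold):
    """Mean (R, G, B) per column over pixels brighter than threshold.

    rgb is a list of rows, each row a list of (r, g, b) tuples.
    Columns with no foreground pixel yield None.
    """
    height, width = len(rgb), len(rgb[0])
    profile = []
    for x in range(width):
        column = [rgb[y][x] for y in range(height)]
        # Brightness threshold separates stained tissue from background
        foreground = [px for px in column if sum(px) / 3 > threshold]
        if not foreground:
            profile.append(None)
            continue
        n = len(foreground)
        profile.append(tuple(sum(px[ch] for px in foreground) / n
                             for ch in range(3)))
    return profile
```

    Each entry of the returned list gives one point of the three expression curves at that position along the axis.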

    The algorithm is available from MATLAB Central, an online community that features a newsgroup and, at the File Exchange, downloadable code contributed by users. The use of prebuilt methods such as brightness thresholding, histogram equalization, and median filtering for noise reduction (supplied as building-block algorithms in Image Processing Toolbox), together with the ability of MATLAB to treat images simply as numerical arrays, allows the prototype to be compact, readable, and rapidly constructed.

    The processed results can then be analyzed and visualized using the same MATLAB environment. This approach offers important advantages in reducing the number of tools necessary for the overall analysis—streamlining the process, reducing the training needs of scientists, and removing a source of error in the transfer of data to a separate statistical package.

    How the results are analyzed depends on the experimental context. For example, you could use:

    • time-series methods to examine how the gene-expression profiles vary over time;
    • ANOVA to examine the effects of different treatments or disease conditions; or
    • multivariate statistical or machine-learning methods, such as Principal Component Analysis or Cluster Analysis, to examine complex relationships within the profiles and between samples.
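
    To make the second option concrete, the sketch below computes the F statistic of a one-way ANOVA in plain Python. It is an illustration only: the function name is hypothetical, the input is assumed to be one list of expression measurements per treatment group, and a statistics package would normally also report the associated p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of sample groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Made-up numbers, for illustration only
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

    A large F value indicates that the differences between group means are large relative to the scatter within each group.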

    Analysis results can also be visualized with other data sources and annotations, such as publicly available information on the genes themselves, public repositories of genetic data such as GenBank, and genetic pathway visualizations.

