November 15, 2013 (Vol. 33, No. 20)
Richard A. A. Stein M.D., Ph.D.
A key prerequisite for dissecting biological processes that shape development, differentiation, and disease is the in-depth characterization of gene expression regulatory networks.
Understanding the central position of transcription factors in these networks is a key task, but also a challenging one, particularly given the plethora of data unveiled by the “omics” sciences and the advent of high-throughput approaches.
“The amount of available data is exploding,” says Lihua Julie Zhu, Ph.D., research associate professor at the University of Massachusetts Medical School. This vast amount of information has triggered challenges related to data storage, data integration, and the computational resources and tools that are required. The genome-wide identification and characterization of transcription factor binding sites rely on two widely used approaches: chromatin immunoprecipitation followed by either high-throughput sequencing (ChIP-seq) or genomic tiling microarray hybridization (ChIP-chip). Binding site identification constitutes only the initial step of the analysis.
According to Dr. Zhu, “One of the challenges is how to integrate data from different sources.” Dr. Zhu and colleagues recently presented a downstream workflow to integrate the analysis of ChIP-seq, ChIP-chip, and gene expression data, by using Bioconductor packages and publicly available data from the Gene Expression Omnibus.
“We want to carefully annotate and integrate as much information as possible, and make it readily usable by others in the community,” explains Dr. Zhu. Binding site annotation, motif discovery, and the construction of regulatory networks integrated with gene expression datasets and public databases are key steps toward defining regulatory pathways and networks. “The ultimate goal is to integrate the data and translate them scientifically into actionable information to benefit patients at the bedside,” adds Dr. Zhu.
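As a rough illustration of the binding site annotation step, the Python sketch below assigns each binding site to the nearest transcription start site (TSS) and attaches an expression change for that gene. It is not the Bioconductor workflow described by Dr. Zhu: the function names, the 5,000-base distance cutoff, and the toy coordinates and fold changes are all hypothetical, and a single chromosome is assumed.

```python
from bisect import bisect_left


def nearest_gene(peak_pos, tss_sorted):
    """Return (gene, distance) for the TSS closest to peak_pos.

    tss_sorted is a non-empty list of (position, gene) tuples on one
    chromosome, sorted by position (an assumption of this sketch).
    """
    positions = [pos for pos, _ in tss_sorted]
    i = bisect_left(positions, peak_pos)
    # Only the TSS just before and just after the peak can be the closest.
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, abs(pos - peak_pos)


def annotate_peaks(peaks, tss_sorted, log2_fold_change, max_distance=5000):
    """Pair each peak with its nearest gene and that gene's expression change.

    Peaks farther than max_distance (hypothetical cutoff) from any TSS are dropped.
    Genes without expression data get None.
    """
    rows = []
    for peak in peaks:
        gene, distance = nearest_gene(peak, tss_sorted)
        if distance <= max_distance:
            rows.append((peak, gene, distance, log2_fold_change.get(gene)))
    return rows


# Toy data, invented for illustration only.
tss = [(1000, "geneA"), (8000, "geneB"), (20000, "geneC")]
peaks = [1200, 7500, 14500]
expression_change = {"geneA": 1.8, "geneB": -0.4}

annotated = annotate_peaks(peaks, tss, expression_change)
for row in annotated:
    print(row)
```

In this toy example the third peak lies more than 5,000 bases from any start site and is left unassigned, which is the kind of decision a real annotation workflow has to make explicit.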
“We have adopted a simplified view of the gene world, one that does not address all the aspects related to isoforms and alternative isoform usage,” says Doron Betel, Ph.D., assistant professor of computational biomedicine at Weill Cornell Medical College.
Even basic questions, such as comparing the expression levels of two genes, are not always straightforward. Several methods have been developed for gene expression analysis, and RNA-Seq is among the most widely used. Most frequently, RNA-Seq is used to measure differential gene expression patterns across multiple samples, a process that involves comparing the number of fragments that map to a specific transcript.
This approach differs significantly from the one used for gene microarray analyses, which is one reason statistical algorithms play an instrumental role. While a number of specific algorithms and software programs exist, different groups rely on different tools for their analyses.
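To make the idea of comparing fragment counts concrete, here is a minimal Python sketch that scales raw counts to counts per million and computes a log2 fold change between two samples. It is a simplified illustration, not the algorithm of any package mentioned in this article; the function names, the pseudocount, and the toy counts are assumptions, and real tools additionally model variability across replicates.

```python
import math


def counts_per_million(counts):
    """Scale raw fragment counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(sample_a, sample_b, pseudocount=1.0):
    """Log2 ratio of sample_b over sample_a after depth normalization.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no mapped fragments.
    """
    cpm_a = counts_per_million(sample_a)
    cpm_b = counts_per_million(sample_b)
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Toy counts: sample_b was sequenced twice as deeply as sample_a.
sample_a = {"g1": 100, "g2": 900}
sample_b = {"g1": 400, "g2": 1600}

changes = log2_fold_change(sample_a, sample_b)
for gene, value in changes.items():
    print(gene, round(value, 2))
```

The raw count for g1 quadruples between the two samples, but because the second sample has twice as many fragments overall, the normalized change is only about twofold. This is why raw counts cannot be compared directly across samples.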
“From a scientific point of view, it is difficult to decide which method to use, because this would first require the identification of the genes that are differentially expressed, while the goal is to answer that very question,” says Dr. Betel.
Dr. Betel and colleagues compared several freely available RNA-Seq statistical analysis platforms on two datasets: the Sequencing Quality Control (SEQC) dataset and RNA-Seq data from biological replicates of three cell lines that were part of the ENCODE project. This work highlighted significant differences among the methods in their sensitivity and specificity in detecting differential gene expression.
“We were surprised to find that one of the most commonly used methods for array expression analysis, the limma package, which incorporated some slight adaptations for RNA-Seq while retaining the same algorithm, turned out to have a performance comparable to (and in some aspects even better than) other methods, which had been tailored to or specific for RNA-Seq data, and that was in many ways unexpected,” says Dr. Betel.
As part of this analysis, Dr. Betel and colleagues demonstrated that adding replicate samples improves detection power significantly more than increasing sequencing depth does. This pointed toward incorporating replicates as a preferred strategy over increasing the number of sequence reads.
“Differential expression is most difficult to detect for genes expressed at low levels, and for them the number of replicate samples is the predominant factor in gene expression analysis,” says Dr. Betel. So far these analyses have been performed only on mRNA molecules. “We are not certain whether these conclusions would apply in the same way for small RNA molecules, such as microRNAs, or RNA extracted from RNA-binding proteins, and it is worth revisiting these questions to see how these methods perform,” adds Dr. Betel.
A new approach has been proposed to address the inherent difficulty of RNA-Seq in reliably detecting whether a difference is real or due to noise, a capability that is particularly relevant for reads that occur in small numbers.
“The novel idea in our strategy was to combine two sources of information, RNA read counts and co-expression networks, to boost the predictive statistical power, and this allowed us to correct the bias that normally exists in RNA-Seq-based analysis against genes expressed at low levels,” says Tao Jiang, Ph.D., professor of computer science and engineering at the University of California, Riverside.
RNA-Seq facilitates analyses that previously were challenging or impossible to perform. One of these is the detection of isoforms generated by alternative splicing, a topic that has been lagging behind. Isoforms have different structures as a result of alternative splicing and perform different functions. “In many instances even the term ‘gene expression’ may be misleading, and the direction in gene expression analysis will shift toward isoform-specific analysis,” reports Dr. Jiang.
Many tools have been developed since 2010 to infer mRNA isoforms from RNA-Seq data. Of these, Cufflinks, Scripture, and IsoLasso, the last of which was developed by Dr. Jiang and colleagues, are among the most popular. Dr. Jiang and colleagues revealed that the sensitivity of these tools is generally below 25% when all expressed isoforms are considered, and increases to approximately 75% when isoforms expressed at low levels are excluded.
“This is clearly insufficient for differential expression analysis, and currently the main challenges are to improve the accuracy of isoform inference as well as to enhance the functional annotation of isoforms,” says Dr. Jiang. An additional challenge in gene expression analysis is the tissue specificity of certain isoforms. “In our analyses, we are using co-expression networks compiled from many different tissues, but all this network information should be tissue-specific, because different tissues have different regulatory networks,” explains Dr. Jiang.
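The sensitivity figures quoted above describe the fraction of truly expressed isoforms that a tool recovers. The Python sketch below shows how such a figure shifts when isoforms expressed at low levels are left out of the evaluation. The isoform names, expression values, and threshold are invented for illustration and do not reproduce the results of Dr. Jiang and colleagues.

```python
def sensitivity(true_isoforms, predicted_isoforms):
    """Fraction of true isoforms that appear among the predictions."""
    true_set = set(true_isoforms)
    if not true_set:
        raise ValueError("at least one true isoform is required")
    return len(true_set & set(predicted_isoforms)) / len(true_set)


def sensitivity_above(expression, predicted_isoforms, min_expression):
    """Sensitivity restricted to true isoforms at or above min_expression."""
    kept = [iso for iso, level in expression.items() if level >= min_expression]
    return sensitivity(kept, predicted_isoforms)


# Hypothetical ground truth: isoform -> expression level (arbitrary units).
expression = {"iso1": 50.0, "iso2": 30.0, "iso3": 0.5, "iso4": 0.2}
# Hypothetical tool output, including one isoform that is not real.
predicted = {"iso1", "isoX"}

overall = sensitivity(expression, predicted)
high_only = sensitivity_above(expression, predicted, min_expression=1.0)
print(overall, high_only)
```

Sensitivity alone says nothing about false predictions such as isoX in the toy output, so an evaluation would normally report it alongside a measure of precision.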
“For complex tissues, such as the brain, but for other tissues as well, single-cell profiling can be very useful in deciphering the cellular complexity,” says Constance L. Cepko, Ph.D., professor of genetics and neuroscience at Harvard Medical School.
While the biological relevance of intercellular differences has been increasingly appreciated in many tissues and in various contexts, it is still technically challenging to reliably profile gene expression in individual cells.
Several years ago, investigators in Dr. Cepko’s lab relied on PCR-based approaches to profile gene expression in single retinal cells. “But we were concerned about what can happen during many PCR amplification cycles, as well as the initial sampling step inherent in first-strand DNA synthesis,” says Dr. Cepko.
A challenging facet of single-cell experiments is obtaining an independent verification of the gene expression profiles obtained from microarray or RNA-Seq data. “We met that challenge by performing in situ hybridization on the associated cells, profiling gene expression programs in single cells at various stages of differentiation and development.”
“This gave us much more information and a great deal of resolution of the elements within a complex system,” says Dr. Cepko. Single-cell gene expression profiling also opens the path toward establishing a catalog of all the genes expressed in specific cell types, exploring their potential functions and probing for changes during specific states.
Studies of the function of many of those genes within the retina have taken another stride forward with a recent advance in Dr. Cepko’s lab: co-opting fluorescent proteins to build synthetic complexes with specific biological activities for regulatory purposes. In this approach, GFP-binding proteins, which are derived from camelid antibodies, are introduced into cells expressing GFP or one of its derivatives.
“We tested pairs of GFP-binding proteins to identify those that can co-occupy GFP molecules. After defining such pairs, we fused the GFP-binding domains to transcriptional activation or DNA binding domains, and we introduced the cDNAs encoding the fusions into cells,” says Dr. Cepko. This allowed the building of GFP-inducible systems in which GFP-binding proteins can be recruited onto a GFP scaffold to promote the GFP-dependent transcription of specific genes.
“With this system, we can express any gene specifically only in GFP+ cells,” adds Dr. Cepko. The collection of mouse or zebrafish strains with specific GFP expression patterns can now be exploited for functional studies.
“Very basic information on the topography and dynamics of epigenetic marks in neuronal tissue is still lacking, but this situation may soon change,” says Angel Barco, Ph.D., principal investigator at the Instituto de Neurociencias de Alicante, Spain. Investigators in Dr. Barco’s lab recently mapped four histone acetylation marks in the hippocampus, and examined the impact of histone deacetylase inhibitors (HDACIs) on genome-wide acetylation and gene expression profiles.
A key observation was the limited impact that hypoacetylation exerted on hippocampal gene expression, suggesting that gene expression changes triggered by HDACIs could be partially or totally histone-independent.
“Transcription factor-dependent effects, which are restricted to specific loci, appear, in fact, more suitable to explain transcriptional effects downstream of HDACIs than histone acetylation changes,” says Dr. Barco. This is additionally supported by observations that several transcription factors are acetylated in response to HDACIs, pointing to the need to examine post-translational modification in a broader context, one that is not restricted to histones.
“HDACIs are very attractive drugs in neuropsychiatry, but despite the reported beneficial effects, we still do not understand their mechanisms of action,” concedes Dr. Barco. In many studies performed to date, ChIP-Seq analyses used neural tissue chromatin extracts obtained from cultured neurons or dissected brain regions, preparations that are characterized by a marked cellular heterogeneity.
“These approaches are unveiling just the average tissue epigenome, but in the near future, nuclear sorting and effective ChIP-Seq protocols will provide an opportunity to profile specific neuronal populations, such as cells responding to a given stimulus, or even specific neurons,” adds Dr. Barco.
Both a technological tool and a scientific field, gene expression analysis is tackling scientific questions that merely a few years ago were unapproachable. It continues to mature into a dynamic and vibrant biomedical area, exemplified by single-cell analyses and systems approaches.