Whether you’re comparing disease to normal tissue, tumor samples with and without drug treatment, cultured cells before and after exposure to environmental stimuli, or any number of other experimental conditions, measured changes in the pattern of gene expression between experimental and control samples can provide a view into what is happening in the cell. But how do you make sense of the data, translating measured expression values into evidence for the biological mechanism at work?
There are two ways to approach analysis of differential gene expression: through traditional downstream analysis approaches and through the more recently described upstream analysis approach.
Since 2008, BIOBASE has been spreading the word about the added value offered by upstream analysis compared to traditional downstream analysis alone, but why is upstream analysis so important?
Traditional downstream analysis looks at enrichment of functional categories within differentially expressed gene sets. These categories include Gene Ontology’s molecular function, biological process and cellular component terms, disease-associated genes, biomarkers and therapeutic targets, and signaling pathways. Further methods search for network modules or cluster co-regulated genes over a time course.
Downstream methods rely solely on the subset of genes that are differentially expressed—genes that provide evidence of the effect, much like ripples in a lake provide evidence of the effect of the stone that penetrated the surface—but which do not themselves necessarily identify the cause of the differential gene expression.
What if the causal molecule, the stone, is not differentially expressed? What if, for example, increased activity of a growth factor sets off a signaling cascade, but the gene expression of the growth factor itself, or of components of its pathway, does not change? In such situations the causal signal can be completely lost when looking at differentially expressed genes only.
Upstream analysis, on the other hand, does not assume that causal molecules should undergo expression changes and finds the cause by applying biological principles. Upstream analysis uses the promoter sequences of upregulated genes to identify the pattern of transcription factors that are most likely to be responsible for the coordinated experimental observations. Once the unique set of responsible transcription factors has been identified, the entire network of signaling reactions is used to reveal causative upstream key nodes that have activated the important transcription factors (Figure 1).
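The second step described above, walking back from a set of transcription factors through a signaling network to shared upstream regulators, can be sketched in a few lines of Python. This is only an illustration of the general idea and makes no claim about how ExPlain implements it: the function name `find_key_nodes`, the `max_steps` parameter, and the toy network are all assumptions invented for the example.

```python
from collections import defaultdict, deque

def find_key_nodes(edges, transcription_factors, max_steps=3):
    """Rank upstream molecules by how many of the given TFs they can reach."""
    # Reverse the signaling graph so we can walk from each TF back to its activators.
    upstream_of = defaultdict(set)
    for source, target in edges:
        upstream_of[target].add(source)

    reached_tfs = defaultdict(set)  # candidate node -> TFs it can reach
    for tf in transcription_factors:
        seen = {tf}
        queue = deque([(tf, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist == max_steps:
                continue  # do not walk further upstream than max_steps
            for parent in upstream_of[node]:
                if parent not in seen:
                    seen.add(parent)
                    reached_tfs[parent].add(tf)
                    queue.append((parent, dist + 1))

    # Most TFs reached first; ties are broken alphabetically for stable output.
    return sorted(((node, len(tfs)) for node, tfs in reached_tfs.items()),
                  key=lambda item: (-item[1], item[0]))

# Toy network (hypothetical, heavily simplified): each edge is (activator, target).
TOY_EDGES = [
    ("TNF", "TNFR1"), ("TNFR1", "IKK"), ("IKK", "NFKB"),
    ("TNFR1", "JNK"), ("JNK", "AP1"),
    ("EGF", "EGFR"), ("EGFR", "ERK"), ("ERK", "AP1"),
]
TOY_TFS = {"NFKB", "AP1"}

for node, n_tfs in find_key_nodes(TOY_EDGES, TOY_TFS):
    print(f"{node}: reaches {n_tfs} of {len(TOY_TFS)} transcription factors")
```

In this toy network, molecules that sit upstream of both transcription factors rank above those that explain only one, even though none of them would need to appear in the list of differentially expressed genes. A real analysis would also need to weigh path length and network size, which this sketch ignores.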
There are, of course, times when downstream analysis itself, through characterization of measured expression changes, can provide clues as to the molecular cause behind experimental observations. But even in such ideal scenarios we find that the causal signal becomes less apparent with decreasing stringency in compiling the study gene list. In the worst case, the causal signal becomes unrecognizable.
In contrast, we find that upstream analysis stands up more robustly to changes at different steps of the workflow. The upstream approach provides a larger overlap with relevant signaling pathways and assigns greater significance to the key causal molecule than downstream analysis alone. Moreover, the more differentially expressed genes are added to the study gene list, the more information upstream analysis has to work with.
Take, for example, the case of TNF alpha. Viemann et al. investigated the effect on gene expression of treating microvascular and macrovascular endothelial cells with TNF alpha, a well-characterized cytokine known to be involved in many important processes and signaling networks.
Our challenge was to see, using the differentially expressed set of genes, whether we could re-identify TNF alpha as the causal agent of the observed expression changes. With plenty of functional properties and signaling connections described in the scientific literature, we would expect it to be easy to detect a strong association with TNF alpha-related signaling pathways in the dataset through downstream analysis alone. And we were in fact able to detect overlap with components of the TNF alpha pathway using the standard downstream analysis approach within ExPlain™.
However, only a small number of upregulated genes were mapped to the TNF alpha pathway, which comprises over 50 molecules. Even when less stringent fold change criteria were applied to the list of statistically significant genes, the overlap could not be improved. Instead, the already low statistical significance obtained with the most stringent setting was further eroded.
In contrast, using the upstream analysis approach within ExPlain, we found that the overlap with components of the TNF alpha pathway was not only more significant but also remained consistent as we increased the gene list size by applying less rigorous fold change cut-offs. This held particularly when considering entire networks that connected relevant transcription factors with their key nodes (Table 1). We could also widen the scope of transcription factors selected for further analysis, and the result remained robust (Table 2).
Why is such a difference in significance observed? Reconstructing causal agents through downstream analysis relies on there being an overlap between the upregulated genes and the TNF alpha pathway, an effect that becomes diluted when more upregulated genes are considered without a corresponding increase in the pathway overlap.
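The dilution effect can be made concrete with a standard over-representation test. The sketch below uses a one-sided hypergeometric test, a common choice for this kind of enrichment analysis, though the text does not say which statistic ExPlain applies. All numbers (a background of 20,000 genes, a 50-member pathway, an overlap fixed at 5 genes) are invented for illustration and are not taken from the Viemann et al. data.

```python
from math import comb

def enrichment_p_value(population, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(at least `overlap` pathway genes
    appear in a random gene list of size `list_size`)."""
    upper = min(pathway_size, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical values, chosen only to show the trend.
POPULATION = 20000    # background genes
PATHWAY_SIZE = 50     # pathway members
OVERLAP = 5           # pathway genes found in the list, held constant

# Relax the fold change cut-off: the gene list grows, the overlap does not.
for list_size in (100, 400, 1600):
    p = enrichment_p_value(POPULATION, PATHWAY_SIZE, list_size, OVERLAP)
    print(f"{list_size:>5} genes in list, {OVERLAP} in pathway: p = {p:.2e}")
```

With the overlap held constant, the p-value rises steadily as the list grows, because five pathway genes are ever less surprising in a larger random draw. This is the erosion of significance described above.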
By contrast, upstream analysis relies on TNF alpha being identified as a master controller of the transcription factors that bind to the promoter sequences upstream of the upregulated genes, an association that is not as easily diluted and, most importantly, is the biologically more plausible approach (Figure 2).
Clearly, the ability to identify the transcription factors most likely to bind the promoter sequences of the upregulated genes is critical, but not all tools use the same approach to transcription factor analysis. Other tools rely solely on predefined sets of connections between transcription factors and potential target genes. Such an approach, besides departing from current scientific practice, can easily break down.
Transcription factors can regulate vastly different sets of target genes depending on cell type or condition, leaving the scientist confronted with the same dilemma demonstrated for the downstream pathway analysis. Further, such an approach limits from the start the possibility of discovering novel transcription factor/gene associations.
In contrast, the upstream analysis of ExPlain applies biological principles and state-of-the-art scientific practice. It works with an exhaustive collection of binding motifs assembled from manually curated, peer-reviewed information about sequence-level transcription factor binding sites to identify the important transcriptional regulators of any novel combination of genes as well as in novel sequences.
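To illustrate what motif-based site prediction means in practice, here is a minimal sketch of scanning a sequence with a position weight matrix. It is a generic textbook-style approach and not ExPlain's algorithm, whose scoring scheme is not described here. The count matrix, the example sequence, the thresholds, and the helper names `log_odds_matrix` and `scan` are all made up for the example and do not come from any curated collection.

```python
import math

BASES = "ACGT"

def log_odds_matrix(count_matrix, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores against a
    uniform background."""
    matrix = []
    for counts in count_matrix:
        total = sum(counts[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (position, site, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

# Invented counts from 10 imaginary binding sites with consensus TGACTCA:
# 7 sites carry the consensus base at each position, 1 site each other base.
CONSENSUS = "TGACTCA"
TOY_COUNTS = [{b: (7 if b == c else 1) for b in BASES} for c in CONSENSUS]
TOY_MATRIX = log_odds_matrix(TOY_COUNTS)

# Made-up promoter fragment with one exact and one near match to the consensus.
TOY_PROMOTER = "CCGTGACTCAGGATTGACTAAC"

for position, site, score in scan(TOY_PROMOTER, TOY_MATRIX, threshold=7.0):
    print(position, site, score)
```

Because sites are predicted from the sequence itself, the scan works on any gene set or novel sequence and does not depend on a previously catalogued factor-to-gene link. The threshold controls the trade-off: lowering it admits weaker, near-consensus sites at the cost of more false positives.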
Through controlled algorithmic prediction, ExPlain fills in the significant gaps in knowledge left by an incomplete literature view of the gene-regulation landscape, giving users the widest and most realistic view possible.