Gene Expression’s Big Rethink

If the “One Gene, One Protein, One Function” Idea Were True, We Would Have Genomic Gridlock

One gene, one protein? No. One gene, one functional product? No, not that, either—even though saying “functional product” has the virtue of recognizing that a stretch of DNA may give rise to a protein or a noncoding RNA.

Whatever we may assume a gene will do, we should avoid perpetuating the idea that it will do just one thing—or that it will do one thing all by itself.

So, we should forget “one to one.” Instead, we should think “one to many” or even “many to many.” If we feel unequal to the task, we needn’t despair. We can always resort to bioinformatics.

We might have dispensed with one-to-one thinking a long time ago, at least as far back as the Human Genome Project. Back then, it was still surprising that the genome contained just 22,000 protein-encoding genes. Not only was this number smaller than scientists had expected, it corresponded to just 1.5% of the genome’s total content. The remaining 98.5% of the genome was sometimes called “junk.” It has come to be appraised more highly. It is now better understood that it contains stretches of DNA that encode RNA molecules that function not as templates for protein synthesis, but as regulatory elements.

Yet, even as we expand our concept of genomic function, we still need to guard against one-to-one thinking. Whether we are dealing with the genome’s protein-encoding elements or noncoding RNAs, we need to be aware, at a minimum, of variant forms—both protein isoforms and RNA isoforms, the latter of which may include isoforms of miRNAs (microRNAs), or isomiRs.

“Most of the downstream analyses have been based on the assumption that one gene makes one functional product,” says Ramana V. Davuluri, Ph.D., professor of preventive medicine at Northwestern University Feinberg School of Medicine. “This assumption, from a bioinformatics perspective, is too simplistic.”

Dr. Davuluri leads a bioinformatics group that is interrogating gene-expression signatures and developing diagnostic and prognostic tools. The group is well aware of recent findings that over half of the genes encoded in the human genome produce multiple protein isoforms with potentially varied functions. These findings cast doubt on the notion that the “gene” is the functional unit in a living cell.

“In mammalian cells,” notes Dr. Davuluri, “the total products of the transcriptome could be up to 200,000 if all variants and noncoding genes are included.”

Complementary Assays

Two technologies used to interrogate gene expression, RNA-Seq and microarray analysis, often return strongly correlated results. These technologies, however, have not been evaluated for their concordance at the isoform level.

To understand the correlation between RNA-Seq and exon-array platforms in detecting isoforms, Dr. Davuluri and colleagues compared gene- and isoform-level expression for glioblastoma multiforme transcripts from The Cancer Genome Atlas (TCGA). Glioblastoma multiforme is one of the three malignancies for which TCGA contains both RNA-Seq and exon-array data.

The investigation revealed that only about 36% of the differentially expressed isoforms identified by RNA-Seq were also classified as differentially expressed by exon arrays, and that about 70% of the ones classified as differentially expressed by exon arrays were also classified as such by RNA-Seq, indicating that isoform-level expression may be masked by gene-expression estimates.
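The two percentages above are directional: each is the shared set of calls divided by a different platform's total. A minimal Python sketch of that arithmetic follows. The isoform IDs and set sizes are invented for illustration and are not TCGA data, and the function name is hypothetical.

```python
def directional_overlap(calls_a, calls_b):
    """Fraction of A's calls confirmed by B, and of B's calls confirmed by A."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    frac_a = len(shared) / len(a) if a else 0.0
    frac_b = len(shared) / len(b) if b else 0.0
    return frac_a, frac_b


# Hypothetical isoform IDs, invented purely for illustration.
rnaseq_calls = {f"iso{i:02d}" for i in range(1, 11)}          # 10 calls
array_calls = {"iso01", "iso02", "iso03", "iso04", "iso11"}   # 5 calls

frac_rnaseq, frac_array = directional_overlap(rnaseq_calls, array_calls)
print(f"RNA-Seq calls confirmed by arrays: {frac_rnaseq:.0%}")
print(f"Array calls confirmed by RNA-Seq:  {frac_array:.0%}")
```

Because the shared set is the same in both directions, the platform that makes more calls will always show the lower confirmation rate.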

“Gene-expression arrays and RNA-Seq will be used in a complementary manner,” asserts Dr. Davuluri. “And if the costs of sequencing drop further, people will use sequencing more and more.”

While microarrays are more cost-effective, RNA-Seq provides several advantages, including single-nucleotide resolution and the possibility of performing analyses without prior knowledge about the targeted sequences.

To quantitatively compare gene-expression measurements between different analytical platforms and allow signatures to be transferred across them, Dr. Davuluri and colleagues made use of the PIGExClass (platform-independent isoform-level gene-expression-based classification) system. Using this computational tool, the investigators performed the first isoform-level assay for the molecular stratification of cancer.

Dr. Davuluri’s group examined exon-array and RNA-Seq isoform-level profiles for glioblastoma multiforme samples, and it illustrated the possibility of stratifying patients into one of the four molecular subgroups. As a result of the isoform-level analysis, the subgroup classification changed for 19% of the samples, leading to a different prognostic classification, a finding of critical therapeutic and prognostic relevance.

“The technology for the data-generating platforms moves fast,” comments Dr. Davuluri. “But the data that comes out from the platforms cannot be understood without informatics.”

miRNA Isoform Analysis

“We used to assume that disease was a matter of a certain number of regulatory molecules and a certain number of regulatory targets,” says Isidore Rigoutsos, Ph.D., professor of pathology, anatomy, and cell biology and director of the Computational Medicine Center at Thomas Jefferson University. “But we have shifted away from this abstract or reductionist view. We have developed an understanding of disease that is much more complex.”

Paralleling the conceptual shift that led to the departure from the one gene-one polypeptide hypothesis, Dr. Rigoutsos and colleagues showed that a similar reductionist view has shaped descriptions of nonprotein-encoding genomic loci. In a recent study, they sifted through TCGA data, catalogued miRNA isoforms that could be detected by RNA-Seq, and found that some miRNA loci produce several isoforms. In these data, each locus generated about five isomiRs on average. The distribution of isoforms among loci was uneven, however, and as many as a few dozen distinct isomiRs could be detected from an individual locus.
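An average can hide an uneven distribution of this kind. The sketch below, in Python, tallies distinct isomiRs per locus from a list of (locus, isomiR) observations and reports both the mean and the maximum. The locus names, isomiR labels, and counts are hypothetical and do not come from the study.

```python
from collections import defaultdict
from statistics import mean


def isomirs_per_locus(records):
    """Map each locus to the number of distinct isomiRs observed for it."""
    seen = defaultdict(set)
    for locus, isomir_id in records:
        seen[locus].add(isomir_id)  # a set ignores repeat sightings across samples
    return {locus: len(ids) for locus, ids in seen.items()}


# Hypothetical observations; labels such as "C.+2" are invented for illustration.
records = [
    ("locus_A", "A.0"),
    ("locus_B", "B.0"), ("locus_B", "B.+1"), ("locus_B", "B.0"),
    ("locus_C", "C.-2"), ("locus_C", "C.-1"), ("locus_C", "C.0"),
    ("locus_C", "C.+1"), ("locus_C", "C.+2"), ("locus_C", "C.+3"),
]

counts = isomirs_per_locus(records)
print("mean isomiRs per locus:", mean(counts.values()))
print("max isomiRs at one locus:", max(counts.values()))
```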

“If we include the isoforms, the same number of loci is found to encode more players, which have many more interactions with their own mRNA [messenger RNA] partners, and this provides many more opportunities to create therapeutic targets and approaches,” says Dr. Rigoutsos.

Experiments from Dr. Rigoutsos’ group support the involvement of different miRNA isoforms in shaping subgroup classification and therapeutic and prognostic outcomes. In a study that involved patients with triple-negative breast cancer, Dr. Rigoutsos found that several isoforms of miR-183-5p were upregulated in Caucasian but not in African-American women. Integrative analyses of miRNA/mRNA expression in luminal A and luminal B breast cancers also revealed that the isoforms’ putative interactions differed extensively between the two subtypes, presenting distinct therapeutic and prognostic challenges.

In cell-culture studies, Dr. Rigoutsos’ laboratory also found that different isomiRs from the same hairpin have distinct effects on mRNAs and the cellular transcriptome. For example, different isomiRs encoded by the miR-183-5p locus had different targetomes, and even a shift of two nucleotides with respect to the archetype miRNA markedly changed the effect of each individual isomiR on the transcriptome.
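One reason a small shift can matter is that miRNA targeting depends heavily on the seed, conventionally nucleotides 2 to 8 counted from the 5' end, so moving the 5' end moves the seed. The Python sketch below illustrates this under simplifying assumptions: the hairpin fragment is an invented sequence (not the miR-183 hairpin), every isomiR is given a fixed length of 22 nucleotides, and the function names are hypothetical.

```python
def isomir(hairpin, archetype_start, shift_5p=0, length=22):
    """Cut a mature sequence from the hairpin, shifting its 5' end by shift_5p."""
    begin = archetype_start + shift_5p
    return hairpin[begin:begin + length]


def seed(mature):
    """Nucleotides 2-8 from the 5' end (0-based slice 1:8)."""
    return mature[1:8]


def target_site(seed_seq):
    """Reverse complement of the seed: the mRNA motif it would pair with."""
    pair = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pair[nt] for nt in reversed(seed_seq))


# Invented hairpin fragment, for illustration only.
HAIRPIN = "ACGGUAUCCGAUUGCAGGCUAAGUCCAUGGAUCC"
ARCHETYPE_START = 4  # hypothetical 0-based start of the archetype miRNA

for shift in (-2, 0, 2):
    mature = isomir(HAIRPIN, ARCHETYPE_START, shift)
    print(f"shift {shift:+d}: seed {seed(mature)} pairs with {target_site(seed(mature))}")
```

In this toy case each of the three isomiRs has a different seed and therefore a different complementary motif, even though all three overlap almost entirely.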

Collectively, these analyses revealed that the multitude of miRNA isoforms produced from a miRNA locus provides a much more detailed understanding of the post-transcriptional processes that orchestrate the regulatory events in breast cancer, as compared to only the archetype miRNA produced by the respective locus.

While protein isoforms have been known for many years, the discovery of miRNA isoforms is much more recent. “We were able to use a lot of approaches and learn a lot about what protein isoforms do, but we did not have the same amount of time, and did not spend the same amount of effort, to understand what the different microRNAs from the same locus do,” says Dr. Rigoutsos.

Genetic Variation and Drug Response

Imagine taking a patient’s skin cells, using them to derive induced pluripotent stem cells (iPSCs), differentiating the stem cells to produce cells of a particular type, and then exposing the differentiated cells to drugs that the patient might be given, suggests Russ B. Altman, M.D., Ph.D., professor of bioengineering, genetics, medicine, and biomedical data science at Stanford University. Such procedures might detect the potential for drug-induced toxicity and reduce the incidence of serious side effects in the clinical setting.

The ability to predict adverse effects is particularly important for therapeutic agents that are associated with a high likelihood of failure or adverse effects. Predicting adverse effects could also help tailor treatments in a more rational manner.

An example of a drug with a challenging adverse effect profile is doxorubicin. This chemotherapeutic agent is known to be cardiotoxic in some patients, but predicting which patients are at risk is difficult. In fact, no reliable means of predicting doxorubicin-induced cardiotoxicity (DIC) exists, so the drug cannot be administered with confidence.

In a recent study conducted with Dr. Paul Burridge of Northwestern University School of Medicine, Dr. Joseph Wu of Stanford Cardiovascular Institute, and other colleagues, bioinformatics analyses performed by Dr. Altman’s group were critical in showing that patient-specific human induced pluripotent stem cell-derived cardiomyocytes can recapitulate, at the single-cell level, the predilection to develop doxorubicin-induced cardiotoxicity.

“It was pretty straightforward, on the informatics side, to show a correlation between the cellular responses and the clinical responses,” asserts Dr. Altman. “This correlation is incredibly exciting.”

Human iPSCs obtained from female patients with breast cancer and matched with healthy volunteers were differentiated into cardiomyocytes. RNA-Seq and microarray analyses were subsequently used to profile and compare gene-expression changes in the cardiomyocytes derived from the healthy volunteers and in those from the breast cancer patients with and without clinical DIC. Cells derived from patients presenting clinical DIC were more sensitive to therapy, exhibited increased metabolic stress and reactive oxygen species, and had impaired intracellular calcium signaling, as compared to cells derived from patients who did not show clinical DIC.

Using microarray analyses to examine gene-expression perturbations in response to various doxorubicin concentrations, this study revealed that in vitro, the cardiomyocytes recapitulated patients’ predilection to DIC. The study also indicated that genetic and molecular analyses could provide a powerful tool to predict clinical toxicity to therapeutic agents.

“The findings in the research setting are very intriguing,” comments Dr. Altman. “There is a lot of engineering to make them more reliable and reproducible.”

Even though stem cell studies have shown a lot of promise, reproducibility has been particularly challenging, and results from different labs may vary depending on multiple factors, including small differences in experimental protocols and the versions of the stem cells used by various labs, for which it is very difficult to show equivalency.

“The work is only half complete when the research is published,” Dr. Altman concludes. “Lots of details need to be addressed before this can be put into routine clinical use.”

Stanford and Northwestern scientists have shown that doxorubicin-induced cardiotoxicity (DIC) can be predicted for individual patients. After probing the transcriptomes associated with patient-specific responses, the scientists determined that iPSC-derived cardiomyocytes from patients with toxicity have lower basal metabolism and mitochondrial content. Cells from patients with or without DIC can be distinguished based on sarcomeric organization, as indicated by staining for α-actinin (red) and cardiac troponin T (green). [Russ B. Altman, M.D., Ph.D., Stanford University]

Comparative Analyses

“Right now, there is some confusion in the field about how to analyze RNA-Seq data,” says Avi Ma’ayan, Ph.D., professor of pharmacology and systems therapeutics at Mount Sinai School of Medicine. “But some order will emerge, and the right way to go about creating that order is to compare the various tools and pipelines for their quality and their ability to recover biological knowledge.”

To facilitate the integrative analysis of gene-expression signatures extracted from the Gene Expression Omnibus (GEO), a large repository of gene expression data generated and deposited by individual research groups, Dr. Ma’ayan and colleagues recently developed GEN3VA (gene expression and enrichment vector analyzer), a web-based software application that allows the multilevel analysis of microarray profiles.

“GEN3VA allows investigators to aggregate published studies and to extract and compare gene expression signatures,” explains Dr. Ma’ayan.

To validate the ability of GEN3VA to uncover novel information, Dr. Ma’ayan and colleagues carried out a case study designed to dissect pathway changes that occur during aging, comparing a collection of gene-expression signatures from old and young mammalian tissues. This analysis included 244 human, mouse, and rat genomic signatures that originated from 62 tissue and cell types across these three species.

“We wanted to collect as many signatures as possible, regardless of the tissue and organism, to find the most common genes that are up- and downregulated, and perform enrichment analysis on those common genes to find small molecules that can reverse or mimic the aging signature,” details Dr. Ma’ayan.
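The first step described here, finding genes that recur across many signatures, can be sketched as a simple tally. The Python example below is an illustration of that idea only, not GEN3VA's actual implementation. The gene names, the signatures, and the 75% recurrence threshold are all hypothetical.

```python
from collections import Counter


def consensus_genes(signatures, direction, min_fraction=0.75):
    """Genes listed under `direction` in at least min_fraction of signatures."""
    tally = Counter()
    for sig in signatures:
        tally.update(set(sig[direction]))  # count each gene once per signature
    needed = min_fraction * len(signatures)
    return sorted(gene for gene, n in tally.items() if n >= needed)


# Hypothetical old-versus-young signatures; gene names are placeholders.
signatures = [
    {"up": {"G1", "G2", "G3"}, "down": {"G7", "G8"}},
    {"up": {"G1", "G2"},       "down": {"G7"}},
    {"up": {"G1", "G4"},       "down": {"G7", "G9"}},
    {"up": {"G1", "G2", "G5"}, "down": {"G8"}},
]

common_up = consensus_genes(signatures, "up")
common_down = consensus_genes(signatures, "down")
print("consistently up in aging:  ", common_up)
print("consistently down in aging:", common_down)
```

The resulting consensus lists would then serve as the input to enrichment analysis, which is outside the scope of this sketch.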

This approach incorporates data collected for the Library of Integrated Network-based Cellular Signatures (LINCS) and has the power to discover new small molecules that can modulate gene expression. It can also assess reproducibility across datasets generated with different platforms, which is a topic that is attracting considerable interest.

“We identified a conserved set of genes,” reports Dr. Ma’ayan. “Some of these genes have been known for a while, but others are novel. Also, we found that NF-κB is a critical transcription factor that can regulate genes and increase their expression in aging.”

These results raise the possibility of using small molecules to modulate these pathways and potentially attenuate and even reverse aging. “The strength of this analysis,” insists Dr. Ma’ayan, “is that the data was sourced not from a single lab but from multiple labs. Also, the labs used different platforms.”

At Mount Sinai’s Icahn School of Medicine, the Ma’ayan Laboratory has developed an open-source bioinformatics pipeline to extract knowledge from typical RNA-Seq studies and generate interactive principal component analysis (PCA) plots. The PCA plot shown here was generated using Gene Expression Omnibus/Sequence Read Archive data, which represents ~55,000 RNA-Seq human samples. Colors reflect the results of text searches on the metadata associated with each sample. [Alexander Lachmann, Ph.D.]

Splice-Sensitive Sequencing

“Microarrays, RNA-Seq, epigenetic sequencing, and other types of sequencing will be major components of high-throughput analyses,” says Thomas C. Whisenant, Ph.D., research scientist in molecular and experimental medicine at The Scripps Research Institute. Collectively, these tools are ideally positioned to generate multi-omic panels of data related to nucleic acids.

“The data will then be used as input into a software that will generate a profile to help investigators direct their research toward the most interesting targets based on the output,” adds Dr. Whisenant.

One of the ongoing challenges in gene-expression analyses is that comparable approaches differ in their accuracy in capturing specific datasets. In a recent study, Dr. Whisenant and colleagues compared data generated on microarrays with data generated using next-generation sequencing to interrogate the same blood-based classifiers.

“While the end result of the analysis was comparable, the overlap was in the 50–60% range at the end of each analysis,” observes Dr. Whisenant. “That is remarkably discordant for an analysis that uses the same samples that have been treated roughly the same way.”

One potential cause of discordant results across experiments is that assays are variable at the technical level. “The process of acquiring nucleic acids, amplifying the fragments, ligating the adaptors, and completing various other steps before finally getting a readout is so variable, that it is difficult to get a good, repeatable estimate of expression from the same sample,” cautions Dr. Whisenant.

While the biological relevance of splicing has been increasingly appreciated in recent years, capturing splicing variants by sequencing is still challenging due to several technical considerations. “Looking at splicing is difficult in general,” says Dr. Whisenant. “The amount of material that is needed to get reliable, consistent data is greater.”

Dr. Whisenant and colleagues recently used RNA-Seq to examine gene expression and splicing changes that occur during T-cell activation. This study sought to identify the genes that are bound to the splicing factor U2AF2 during T-cell activation. Using splicing-sensitive microarrays, the investigators measured the impact on gene expression when some of these proteins were knocked down by means of RNA interference.

Another topic of interest in sequencing technologies revolves around the need to perform more sensitive types of sequencing, which will generate information not only about populations of cells, but also about groups of cells and individual cells in a population. “This approach,” asserts Dr. Whisenant, “will help investigators resolve single-cell levels of expression, single-cell copy number at the genome level, and compartment-level expression data—for example, expression only in the nucleus or only in the cytoplasm.”

Given the need to obtain reliable count data for every exon in a gene, the detection of splicing variants requires sequencing at higher depth and is associated with higher costs. Some microarrays contain probes that can hybridize to any of the isoforms in a sample, and their use is poised to decrease costs.

“But the problem with microarrays is that if the spliceoform of interest is not on the array, one would never detect it,” advises Dr. Whisenant. “This opens a discovery problem.”