Feature Articles : Apr 1, 2012 (Vol. 32, No. 7)

Exploiting Gene-Expression Data

  • Kathy Liszewski

The field of gene expression continues to grow and evolve. Commonly used technologies such as microarrays are packing higher densities in a smaller footprint, and qPCR is becoming more accurate and reproducible with new guidelines in place.

Additionally, the introduction of next-generation sequencing has revolutionized the study of transcriptomics by promoting RNA analysis via cDNA sequencing on a massive scale (RNA-seq). The latter eliminates the limited dynamic range of detection in microarrays but adds its own challenges of reproducibility and interpretation.

GEN spoke with several researchers who shared their insights on how they are utilizing gene-expression technologies, the challenges faced, and what they expect of the field for the future.

Transcriptome Analysis

Pathogens subvert the immune system by sensing and then responding to avert host protective responses. This is accomplished by activating and expressing the pathogen’s virulence genes. Jay Zhu, Ph.D., assistant professor of microbiology, University of Pennsylvania Perelman School of Medicine, suggests that pathogens may also specifically repress other sets of their genes to “outsmart” the innate immune response.

“We are studying Vibrio cholerae, the causative agent of cholera. These bacteria employ both positive and negative transcriptional regulation in order to colonize the host intestine and establish infection. Our goal is not only to better understand how the bacteria cause disease by manipulation of gene expression, but also to develop a therapeutic against its targets.”

Dr. Zhu says the organism’s entire genome has been sequenced, allowing for easier reading of gene-expression changes. To assess how V. cholerae behaves in the intestinal tract, his team first infects mice with the organism and then monitors intestinal responses.

“Although we could isolate RNA from the intestine, it is difficult to utilize a typical microarray to analyze gene-expression changes because one cannot get high-quality bacterial RNA from such a complex tissue. Therefore, we use other genetic tools such as transposon mutagenesis and RNA-seq, which allows sequencing of the entire transcriptome. We then confirm our findings for a few targets using RT-PCR.”

According to Dr. Zhu, these approaches established that components of flagellar biosynthesis also controlled so-called quorum sensing by regulating hapR expression.

“Quorum sensing refers to the ability of bacteria to communicate with each other to coordinate certain cellular processes across the whole population. Our studies identified components of flagellar biosynthesis that also participated in the control of quorum sensing so that V. cholerae can sense the ‘right environment’ (i.e., intestines) to activate virulence genes. Overall these data provided a link between regulation of motility and regulation of quorum sensing by V. cholerae during infection of hosts.”

Dr. Zhu says these studies provide a clearer picture of how the bacteria can access colonization sites and at the same time allow the natural expression of virulence genes.

In Search of Biomarkers

“We are now moving into an era of individualized medicine,” reports George Vasmatzis, Ph.D., assistant professor, department of laboratory medicine and pathology, Mayo Clinic. “The clinical dilemma is to predict which subsets of patients will respond most effectively to a given treatment and to develop specific tests for that. The goals of such molecularly targeted medicine depend on the identification of specific biomarkers that could stratify patient populations.”

Dr. Vasmatzis utilizes a combination of technologies. He first captures the specific cell populations of interest using laser capture microdissection (LCM) and then amplifies the genomic DNA. The amplified DNA from these samples is analyzed using next-generation sequencing to evaluate DNA changes. Finally, he validates his findings using microarrays in which RNA levels can be correlated with genetic expression.

“We find that the use of these technologies together provides a powerful means to profile as well as stratify patient populations. We are able to separate different grades and different types of tumors and then look for genetic changes. Next-gen sequencing is capable of sequencing both sides of DNA fragments and can do so for hundreds of millions of sequences. For a couple thousand dollars, one can virtually cover an individual’s entire genome.”

Looking next at RNA expression data from microarrays can provide a global look at what is upregulated or downregulated. For example, using this methodological approach, Dr. Vasmatzis and colleagues discovered recurrent translocations in the DUSP22 phosphatase gene on 6p25.3. “DUSP22 is an important prognostic biomarker in T-cell lymphomas. We hope to utilize this same approach to also work on other cancers such as lung, endometrial, and prostate cancers.”

Juvenile Idiopathic Arthritis Subtypes

Another example of the use of gene-expression profiling is for identifying patient subtypes in juvenile idiopathic arthritis (JIA). “We are studying gene-expression analysis in peripheral blood mononuclear cells (PBMC) in order to identify sets of genes that may help us better understand differences within the patient population,” says Michael G. Barnes, Ph.D., research associate, division of rheumatology, Cincinnati Children’s Hospital Medical Center, speaking on behalf of a large team of researchers involved in the project, which was supported by the NIH.

JIA encompasses the majority of childhood arthritis. Although seven subtypes have been described, there is increasing evidence for heterogeneity even within these types. “Use of genome-level technologies can provide a comprehensive determination of genetic and genomic biological signatures, giving an unprecedented opportunity to define JIA on the basis of molecular phenotypes and to understand disease mechanisms. This may ultimately help improve therapeutic approaches,” Dr. Barnes explains.

To begin the analysis, PBMC are first isolated using Ficoll gradient centrifugation. RNA is immediately stabilized and later isolated and purified. “We assess RNA quality with standard protocols and then label it using NuGEN Ovation (NuGEN Technologies). Next we hybridize the labeled samples to Affymetrix GeneChips (Affymetrix). This array has nearly 55,000 probe sets and can measure up to 47,000 transcripts.”

Processing the monumental amount of data generated into meaningful results requires the use of bioinformatic approaches. “To begin analysis, we import the data we generate into a program called GeneSpring GX (Agilent Technologies). We then adjust for batch-to-batch variation by a process called distance-weighted discrimination. Next we identify genes with different expression levels between groups. Finally, we perform a functional analysis of the data.”
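
The step of identifying genes whose levels differ between groups can be illustrated with a minimal Python sketch. It is not the GeneSpring GX procedure and it omits the batch adjustment entirely; the gene names, the log2 intensities, and the choice of a Welch t statistic are all assumptions made for illustration.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical log2 array intensities: (early-onset samples, late-onset samples).
expression = {
    "GENE_A": ([8.1, 8.3, 7.9, 8.2], [10.0, 10.4, 9.8, 10.1]),
    "GENE_B": ([6.5, 6.7, 6.4, 6.6], [6.6, 6.5, 6.7, 6.4]),
    "GENE_C": ([11.2, 11.0, 11.4, 11.1], [9.9, 10.1, 9.7, 10.0]),
}

def rank_genes(expression):
    """Return (gene, log2 fold change, t) rows, strongest differences first."""
    rows = []
    for gene, (early, late) in expression.items():
        # On a log2 scale, a difference of means is a log2 fold change.
        log2_fold_change = statistics.mean(late) - statistics.mean(early)
        rows.append((gene, log2_fold_change, welch_t(late, early)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)

ranked = rank_genes(expression)
for gene, log2_fc, t in ranked:
    print(f"{gene}: log2 fold change = {log2_fc:+.2f}, t = {t:+.2f}")
```

A real analysis would test tens of thousands of probe sets at once and would therefore also need a correction for multiple testing, which this sketch leaves out.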

Employing these approaches, Dr. Barnes and colleagues found substantial PBMC gene-expression differences in patients with early-onset JIA as compared to those with late-onset disease.

“Age of onset may be an important characteristic for classifying certain JIA patient subtypes. Today, differential diagnosis between the oligoarticular and polyarticular JIA subtypes is based, to a large extent, on how many joints are affected in patients. Utilizing molecular approaches in addition to other biologic markers like antinuclear antibodies (ANA) provides great potential to grasp pathologic mechanisms that may help explain the differences between patients with early and late disease onset. Understanding these processes ultimately may lead to better treatments for JIA.”

RNA-Seq

The emerging field of massively parallel cDNA sequencing, or RNA-seq, provides exciting potential to rapidly characterize and quantify transcriptomes. It is a young and evolving field, however, with challenges accompanying advances and opportunities.

“RNA-seq is currently in its early stages much like the way microarrays were 10 years ago,” notes Kellie J. Archer, Ph.D., associate professor, department of biostatistics, Virginia Commonwealth University.

“It is a wonderful tool with exciting possibilities. Aside from gene expression, we can look also at exon expression, identify microRNA precursors, etc. Even one run provides an enormous amount of such information. But we first must address a number of important issues.”

Dr. Archer says one such challenge is the issue of mapping. “How do we map RNA sequences to a reference genome?

“Mapping sequences in which introns have been removed by cis-splicing can be accomplished, but how do we effectively handle alternative splicing? How do we take quality of reads into account in downstream analyses? There is a lot of research as to the most efficient method to use for mapping, and there are many tools emerging. But it is not yet clear which is the best and most accurate.”

A second issue is how to perform statistical analysis.

“With RNA-seq, the assay returns number of reads per sequence, not a continuous variable reflecting relative abundance (as is the case with traditional gene-expression microarrays). If one merges data across samples, several sequences will have zero counts and the data range can be quite large, so the normal distribution no longer holds. Therefore, we can’t employ commonly used statistical tests such as t-tests.

“Earlier papers examining technical replicates used a Poisson distribution, but more recent studies involving biological replicates suggest a negative binomial model may handle the overdispersion more accurately.”
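
The distinction between the two count models can be seen in a small simulation. The Python sketch below is illustrative only: the mean read count, the dispersion value, and the sample size are assumed, and the negative binomial counts are generated as a gamma-Poisson mixture, one standard way to construct that distribution.

```python
import math
import random
import statistics

def poisson_sample(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def negative_binomial_sample(rng, mean, dispersion):
    """Gamma-Poisson mixture; variance = mean + dispersion * mean**2."""
    shape = 1.0 / dispersion
    lam = rng.gammavariate(shape, mean * dispersion)  # gamma with the requested mean
    return poisson_sample(rng, lam)

rng = random.Random(0)
MEAN_READS = 20.0   # assumed average read count for one gene
DISPERSION = 0.5    # assumed extra variability between biological replicates
N = 5000

technical = [poisson_sample(rng, MEAN_READS) for _ in range(N)]
biological = [negative_binomial_sample(rng, MEAN_READS, DISPERSION) for _ in range(N)]

# For Poisson counts the variance equals the mean, so this ratio sits near 1.
poisson_ratio = statistics.variance(technical) / statistics.mean(technical)
# Overdispersed counts have a variance well above the mean.
nb_ratio = statistics.variance(biological) / statistics.mean(biological)

print(f"variance/mean, Poisson: {poisson_ratio:.2f}")
print(f"variance/mean, negative binomial: {nb_ratio:.2f}")
```

A variance-to-mean ratio far above 1 is the overdispersion that a Poisson model cannot absorb, which is why it tends to understate the variability seen between biological replicates.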

A third issue is the presence of technical artifacts. “We don’t yet know how to address the fact that different RNA-seq technologies aren’t directly comparable. Initially it was expected that RNA-seq would reveal the truth about number of transcripts in a sample, but we see artifacts from different high-throughput sequencing technologies.”

Dr. Archer believes these problems will be solved, just as they were for microarrays. “The field is definitely moving from the traditional microarray platform in the direction of RNA-seq. Aside from the cost and instrumentation needed, the technical challenges will be solved as the field progresses. Companies are constantly improving their platforms and seeking to give the best sequencing performance at the lowest costs. This field will progressively provide a fuller and complete knowledge of both the qualitative and the quantitative aspects of RNA biology and thus gene expression.”

Diverse Techniques for Studying Gene Expression

In a recent study in Respiratory Research entitled “Systems-level comparison of host responses induced by pandemic and seasonal influenza A H1N1 viruses in primary human type I-like alveolar epithelial cells in vitro,” a research team from China and Canada utilized gene-expression analysis to compare transcriptional responses to infection with a seasonal H1N1 influenza virus or a pandemic H1N1 influenza virus isolated during the 2009 influenza pandemic.

Based on the published data, scientists at Ingenuity Systems tested the ability of a new web-based report to correctly identify expected results and gain biological insights. Using iReport, the Ingenuity Systems researchers not only validated the published results, they also identified additional genes, pathways, and processes involved in seasonal H1N1 influenza compared to pandemic influenza infection from the same data, said Megan Laurance, Ph.D., product manager at Ingenuity Systems.

“Forty-three other genes encoding zinc finger proteins as well as nine other genes encoding small nucleolar RNAs were observed to be downregulated,” explained Dr. Laurance. “In addition, cytosolic pattern-recognition receptors were activated in response to seasonal H1N1 infection.

“In less than two days, we were able to confirm the presence of particular pathways and processes using a single tool that tackles both the statistical and biological analyses of gene expression data. iReport expounded upon the findings presented by Lee et al. and provided additional genes of interest for future studies in the areas of transcription and mRNA transport, which are downregulated upon seasonal H1N1 infection but not pandemic H1N1 infection.”

GeneGo, a Thomson Reuters Business

Current approaches to deriving genomic biomarkers can produce reasonably accurate biomarkers, but these lack robustness and cannot generally be linked biologically to the endpoint, according to scientists at Thomson Reuters.

One barrier to the more extensive use of these genomic biomarkers is the difficulty in determining the biological relevance of the signatures from the classifying genes identified, limiting their utility for risk assessment, they said.

In a poster entitled “A Novel Method for Deriving Mechanistically-Anchored Gene Expression Biomarkers,” Richard J. Brennan, Ph.D., et al. at Thomson Reuters described the development of a new technique for deriving genomic signatures using discrete modules of genes representing a variety of biological pathways and functional categories. They also discussed a two-stage machine-learning approach that identifies individual modules with classification power, and combines them into a meta-signature to optimize predictive performance.

“These ‘functional descriptors’ have comparable performance to gene signatures for the same endpoint generated using other supervised machine-learning methods,” explained Dr. Brennan. “A functional descriptor predicting renal tubule injury was derived with an estimated sensitivity of 81.7% and specificity of 98.0%, comparable to the performance of a standard gene signature on the same training set (83% and 94%, respectively).”
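
For readers less familiar with the two measures, sensitivity is the fraction of truly injured samples that a classifier flags, and specificity is the fraction of uninjured samples that it correctly clears. The Python sketch below shows the arithmetic; the counts are hypothetical, chosen only so the results land near the quoted percentages, and are not the sample sizes from the poster.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = true_pos / (true_pos + false_neg)   # injured samples detected
    specificity = true_neg / (true_neg + false_pos)   # uninjured samples cleared
    return sensitivity, specificity

# Hypothetical counts, not taken from the study.
sens, spec = sensitivity_specificity(true_pos=49, false_neg=11,
                                     true_neg=196, false_pos=4)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```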

Functional descriptors also encompass information about the pathways and metabolic processes involved, leading to an understanding of the biological relevance of the signature, he continued. Classification of tubule toxicity was based on perturbation of pathways involved in cytoskeletal remodeling, lipid metabolism, vitamin D signaling, and amino acid metabolism among others.

“Functional descriptors therefore hold the promise of combining predictive and mechanistic systems toxicology,” added Dr. Brennan.

“The Functional Descriptor™ approach leverages a manually curated knowledge base of functional categories to derive a series of signatures for an endpoint using these predefined gene sets as features. Querying the descriptor set for a class prediction permits an investigation into the contributing biological properties.”
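
The general two-stage idea, scoring predefined gene modules individually and then combining the useful ones, can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the Thomson Reuters method: the module names, gene identifiers, expression values, threshold rule, accuracy cutoff, and majority vote are all assumptions made for illustration.

```python
import statistics

def module_score(sample, genes):
    """Average expression of one module's genes in one sample."""
    return statistics.mean(sample[g] for g in genes)

def fit_module(samples, labels, genes):
    """Stage 1: a midpoint-threshold rule for one module, with its training accuracy."""
    scores = [module_score(s, genes) for s in samples]
    pos = statistics.mean(sc for sc, y in zip(scores, labels) if y == 1)
    neg = statistics.mean(sc for sc, y in zip(scores, labels) if y == 0)
    rule = {"genes": genes, "threshold": (pos + neg) / 2,
            "direction": 1 if pos >= neg else -1}
    predictions = [apply_rule(rule, s) for s in samples]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return rule, accuracy

def apply_rule(rule, sample):
    """Return 1 if the module score falls on the positive-class side of the threshold."""
    score = module_score(sample, rule["genes"])
    return int(rule["direction"] * (score - rule["threshold"]) > 0)

def build_meta_signature(samples, labels, modules, min_accuracy=0.8):
    """Stage 2: keep only modules whose stage-1 accuracy passes the cutoff."""
    kept = {}
    for name, genes in modules.items():
        rule, accuracy = fit_module(samples, labels, genes)
        if accuracy >= min_accuracy:
            kept[name] = rule
    return kept

def predict(meta_signature, sample):
    """Majority vote over kept modules, plus each module's individual vote."""
    votes = {name: apply_rule(rule, sample) for name, rule in meta_signature.items()}
    return int(sum(votes.values()) * 2 > len(votes)), votes

# Hypothetical modules and training data (label 1 = injury, 0 = no injury).
modules = {"cytoskeletal_remodeling": ["g1", "g2"],
           "lipid_metabolism": ["g3", "g4"],
           "unrelated": ["g5", "g6"]}
rows = [(5.0, 5.2, 2.0, 2.2, 3.0, 3.1), (5.4, 5.1, 1.8, 2.1, 3.4, 2.9),
        (5.2, 5.3, 2.1, 1.9, 2.8, 3.2), (3.0, 3.1, 4.0, 4.2, 3.1, 3.0),
        (3.2, 2.9, 4.1, 3.9, 2.9, 3.3), (2.8, 3.0, 3.8, 4.1, 3.3, 3.0)]
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
samples = [dict(zip(genes, row)) for row in rows]
labels = [1, 1, 1, 0, 0, 0]

meta = build_meta_signature(samples, labels, modules)
new_sample = dict(zip(genes, (5.1, 5.0, 2.2, 2.0, 3.0, 3.0)))
label, votes = predict(meta, new_sample)
print(label, votes)
```

Because the prediction returns each module's vote alongside the overall class, one can see which biological categories drove a call, which is the kind of querying Dr. Brennan describes.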

Expression Analysis

Officials at Expression Analysis (EA) maintain that RNA-Seq has gained much interest due to the potential performance benefits relative to gene-expression microarrays. They cite a number of expected advantages, including unbiased content, more precise quantification, detection of novel isoforms, and detection of structural variation.

“Each of these measures is dependent on the read length, number of reads generated, and other factors that comprise the sequencing strategy,” they said.

“There has been some reluctance to switch from array to sequencing-based expression studies because of cost and the availability of bioinformatic tools to support RNA-Seq datasets,” explained Steve McPhail, CEO. “Most importantly, there has been no way to compare years of existing array-based datasets to RNA-Seq datasets.”

EA recently completed a performance comparison of various RNA-Seq strategies to microarrays in a real-world experimental scenario. The experiment consisted of 15 breast cancer cell lines (five unique lines representing each of three breast cancer subtypes). It revealed that at 12 million sequencing reads, 25–35% more genes were found to be differentially expressed than with microarrays.

The study reportedly also demonstrated a 50% increase in the detection of genes and a 500% increase in isoform detection. At 25 million sequencing reads, the numbers jumped to a 40–50% increase in the number of genes found to be differentially expressed, a 67% increase in the detection of genes, and a 550% increase in isoform detection.

“EA has developed a tool to map a portion of the sequencing data to Affymetrix probe sets thus enabling researchers to directly compare their existing array-based datasets to RNA-Seq data,” said McPhail. “This tool presents data in a CEL file format also allowing researchers to utilize their existing array-based data-analysis pipelines to become familiar with the power of RNA-Seq datasets.”

Illumina

Detailed in a poster entitled “Detection of cancer-associated mutations, rearrangements and gene-expression changes by targeted deep sequencing of FFPE RNA and DNA,” a research team described the development of FFPE-compatible targeted RNA sequencing and analysis methods for the study of over 200 cancer-associated genes.

Geoff Otto and colleagues from Foundation Medicine and the Albany Medical College used protocols, validated on cell lines where known mutations and gene fusions (e.g., BCR-ABL1) were detected, to characterize 49 FFPE non-small-cell lung cancer tumors.

Technical reproducibility in digital expression profiling was r > 0.99 for cell lines and r > 0.95 for FFPE RNA, according to the scientists, who added that RNA-seq provided evidence of alterations in the genome, including point mutations and novel rearrangements involving known oncogenes.
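
Reproducibility figures of this kind are typically correlation coefficients between replicate measurements of the same sample. As a minimal sketch, assuming Pearson's r on log2-transformed counts (the poster's exact metric and transformation are not stated here), the calculation looks like this in Python; the replicate read counts are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical read counts for the same eight genes in two technical replicates.
replicate_1 = [12, 150, 980, 45, 3, 2200, 610, 75]
replicate_2 = [15, 140, 1010, 50, 2, 2100, 640, 70]

# log2(count + 1) keeps a few highly expressed genes from dominating the result.
log_1 = [math.log2(c + 1) for c in replicate_1]
log_2 = [math.log2(c + 1) for c in replicate_2]

r = pearson_r(log_1, log_2)
print(f"r = {r:.4f}")
```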

Differential expression of oncogenes including EGFR, KIT, and RET was also revealed, ranging from 2- to 50-fold across different tumors. Combination of RNA and DNA sequencing data on identical FFPE samples corroborated functional consequences of genomic alterations. Examples included expression of mutated KRAS and TP53 alleles and reduced STK11 expression in a tumor that had a homozygous deletion at the DNA level.

Foundation Medicine used Illumina’s HiSeq 2000 system for detection of mutations in cancer samples. The scientists reported that they were able to achieve a high level of sensitivity for detecting mutations in cancer-related genes without any a priori assumptions about the specific mutations.

“Application of [Illumina’s] next-generation sequencing technologies to FFPE RNA and integration with extant DNA sequencing methods is anticipated to expand our understanding of clinically relevant cancer biology and improve patient care,” noted the researchers, who conclude that targeted RNA-seq of FFPE RNA is highly reproducible and preserves transcript abundance.

“RNA-seq results were highly concordant with DNA-seq, and 93% of somatic hotspot mutations were detected along with several gene fusions,” the scientists pointed out. “Integration with DNA-seq enabled comprehensive molecular profiling of an FFPE tumor sample with high sensitivity and specificity to mutations, copy number alterations, gene fusions and changes in gene expression.”