The complex and dynamic transcriptional patterns unveiled by the ENCyclopedia of DNA Elements (ENCODE) project, together with the finding that less than 2% of the transcriptional output of the human genome encodes proteins while approximately 98% consists of noncoding RNAs, are among the advances that reshaped the field and even required that we revisit the definition of the gene.
While insights into the genome have repeatedly been a source of thought-provoking findings, the transcriptome, with its unprecedented and unexpected levels of complexity, promises to be even more intriguing. The emergence of RNA-Seq allowed quantitative and high-throughput analyses of the transcriptome to be performed in different cell types and under various conditions, and with the massive amounts of data that have been generated, computational analysis is emerging as one of the most critical challenges.
“Two of the basic problems in transcriptome analysis are identifying the true sets of transcripts in a given tissue at a given time, and defining the dynamics of gene expression,” says Zhong Wang, Ph.D., staff scientist and group lead for genome analysis at the DOE Joint Genome Institute.
The superior sensitivity and accuracy of RNA-Seq, along with its ability to measure transcript isoform levels and to reconstruct transcriptomes even in the absence of a reference genome, have made it a method of choice for transcriptome analysis. However, sequence reads generated by existing platforms are often short, and this represents one of the challenges in the field.
“Pairwise gene expression analyses are routinely performed to compare genes or gene sets that are differentially expressed between cancer tissues and normal tissues,” says Dr. Wang. Among the numerous statistics and bioinformatics challenges, one relates to the difficulties that accompany choosing the most appropriate algorithms. “The statistics very much depend on the types of data-analysis software one wants to use,” he says.
There are situations when the expression level of most genes does not change between two conditions and, in this case, certain assumptions and specific types of statistical analyses are more applicable. In other instances, when the expression levels of most genes change, a different type of analysis might be recommended, and more biological replicates are needed to increase the statistical power.
“Different algorithms are needed for different datasets, and this requirement is best defined by the specific biological question that is being addressed. I don’t think there is a simple solution to address this challenge; it is more like an art,” explains Dr. Wang.
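To make the point about assumptions concrete, here is a minimal sketch of median-of-ratios normalization, a widely used scaling approach that relies on the assumption that most genes do not change between conditions. It is offered only as an illustration and is not presented as the method Dr. Wang's group uses; the function name `size_factors` and the toy counts in the usage note are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by the median-of-ratios approach.

    counts: dict mapping sample name -> list of raw read counts per gene,
            with genes in the same order for every sample (hypothetical layout).
    Assumes most genes are NOT differentially expressed, so the median ratio
    reflects sequencing depth rather than biology.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are skipped because their log is undefined.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Each sample's factor is the median of its per-gene ratios to that reference.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors
```

As a usage example, calling `size_factors({"normal": [100, 200, 300, 400, 50], "tumor": [200, 400, 600, 800, 1000]})` on these made-up counts yields a tumor factor twice the normal factor, because four of the five genes differ only by sequencing depth. Dividing each sample's counts by its factor then leaves only the fifth gene looking different. If most genes truly changed, the median ratio would no longer track depth alone, which is one reason a different analysis and more biological replicates may be called for in that situation.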
In many industrialized countries, cancer and heart disease are the two leading causes of death. Head and neck cancers constitute the sixth most frequent malignancy worldwide, and squamous cell carcinomas, which make up the vast majority of malignancies in this group, represent a significant concern, particularly due to their dire prognosis.
Most patients with head and neck cancers have a history of smoking and drinking. While the incidence of head and neck cancers at most sites has dramatically dropped in the U.S. since World War II, along with a decrease in smoking, oropharyngeal cancer appears to be an exception because its incidence has increased in people who are younger and lack a history of smoking and drinking.
“A potential explanation for this trend appears to be the infection with human papilloma viruses,” says David I. Smith, Ph.D., professor of laboratory medicine and pathology at Mayo Clinic. Human papilloma viruses have also been implicated in cervical cancer, where viral integration into the host genome represents an essential step during malignant transformation.
To better understand the involvement of human papilloma viruses in oropharyngeal cancer, Dr. Smith and colleagues used RNA-Seq in combination with exome sequencing to perform a whole-transcriptome analysis in oropharyngeal carcinoma patients, including current smokers, never smokers, and ex-smokers with at least 10–15 years of smoking cessation.
“RNA-Seq provides more information and, overall, is a much more comprehensive approach than microarrays to explore the transcriptome,” says Dr. Smith. The analysis revealed that certain genes are differentially expressed among the three groups, and the increased expression of genes involved in DNA repair in human papilloma virus-negative current smokers, as compared to the two other groups, emerged as a distinguishing feature.
While transcriptomics is increasingly becoming routine in the clinic, the bottleneck of data analysis is emerging as one of its most acute challenges. “In the next couple of years we will see an absolute revolution in understanding alterations that occur in cancer and, most importantly, we will be able to design therapies targeting those specific alterations,” says Dr. Smith.