Genome sequencing has enabled researchers to make huge strides in solving health and environmental challenges, from identifying causative disease variants to developing a disease-resistant, high-yielding rice plant. But genomes are merely blueprints: alternative splicing of the RNA transcribed from DNA produces multiple transcript isoforms per gene, dramatically increasing the protein-coding potential of the genome.
This means that reading the DNA of an organism cannot, by itself, tell us which proteins are encoded or with what regulatory elements. Furthermore, while the genomic blueprint is virtually identical in all cells, the transcript isoforms that are actually expressed vary widely by tissue type and disease state. Hence, we must look one step further, at the RNA isoforms that are produced and expressed, to uncover the mechanisms behind the traits we are interested in studying.
The need for full-length RNA sequencing
The study of alternative splicing is far from novel; short-read RNA-seq is a staple technique in labs for looking at gene expression patterns in various tissues, disease states, or myriad treatment-and-control experiments. However, since transcripts are usually 1–10 kb long and RNA-seq reads are only a few hundred bases, transcript sequencing is performed by fragmenting transcripts into smaller pieces and then reassembling them computationally. As the number of alternative splice forms increases for a gene, the task of assembling transcripts correctly becomes computationally challenging. For example, the gene Dscam (Down syndrome cell adhesion molecule) in the fruit fly Drosophila melanogaster is reported to have more than 30,000 isoforms. And in mammals, neurexin genes (presynaptic cell-adhesion molecules that are essential for synapse formation and synaptic transmission) have thousands of alternative isoforms. Failure to correctly represent the full complexity of transcript isoforms can lead to missing important isoforms that are indicative of disease states or contribute to agronomically important traits.
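The scale of the problem can be illustrated with a short Python sketch. It multiplies the sizes of the four alternative exon clusters commonly cited for Dscam (12, 48, 33, and 2; these figures are an outside assumption, not taken from this article). It then uses a hypothetical three-cluster toy gene to show that two different isoform mixtures give identical per-exon counts when reads are too short to link one cluster to the next.

```python
from collections import Counter
from itertools import product
from math import prod

# Commonly cited sizes of the four mutually exclusive exon clusters in
# Drosophila Dscam (assumption: these numbers are not from the article).
DSCAM_CLUSTER_SIZES = [12, 48, 33, 2]

# Hypothetical toy gene with three clusters of mutually exclusive exons.
TOY_CLUSTERS = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]


def count_isoforms(cluster_sizes):
    """Number of isoforms when exactly one exon is chosen per cluster."""
    return prod(cluster_sizes)


def exon_usage(mixture):
    """Per-exon counts, i.e. what reads that never span two clusters can see."""
    return Counter(exon for isoform in mixture for exon in isoform)


# Every possible isoform of the toy gene (2 * 2 * 2 combinations).
toy_isoforms = list(product(*TOY_CLUSTERS))

# Two different mixtures of expressed isoforms.
mixture_a = [("a1", "b1", "c1"), ("a2", "b2", "c2")]
mixture_b = [("a1", "b2", "c1"), ("a2", "b1", "c2")]

if __name__ == "__main__":
    print("Dscam combinations:", count_isoforms(DSCAM_CLUSTER_SIZES))
    print("Toy gene isoforms:", len(toy_isoforms))
    print("Same per-exon usage:", exon_usage(mixture_a) == exon_usage(mixture_b))
```

Both toy mixtures use every exon exactly once, so only a read that covers all three clusters, such as a full-length transcript read, can tell them apart.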
With the development of long-read RNA sequencing, it is now possible to sequence cDNA transcripts in their entirety—without assembly. These full-length sequences improve genome annotation and provide a way to look at gene expression data in an isoform- or allele-specific way. Here, we highlight a few applications of full-length RNA sequencing to better understand plant and animal genomes and the workflow for isoform sequencing methods (Figure 1).
Applications of full-length RNA sequencing
The first, and perhaps most obvious, use of full-length transcript sequences is for genome annotation. This becomes especially beneficial in plant and animal sciences where genomes are often far more complex than those of humans and a reference-quality genome assembly may still be prohibitively expensive.
One example of this is the transcriptome analysis of the bioenergy crop switchgrass (Panicum virgatum L.) performed by a group of Chinese and American scientists.1 After pooling RNA from six different tissues and sequencing with long reads, the scientists “identified 105,419 unique transcripts covering 43,570 known genes and 8,795 previously unknown genes.” In addition, they discovered more than 45,000 novel transcripts for known genes in their dataset. The long-read RNA sequencing provided a substantial amount of additional data for further investigation.
Full-length transcripts are also improving the annotation of genes in humans that previously were hard to characterize due to segmental duplications, as demonstrated by Dougherty et al.2 The researchers targeted 19 gene families expressed in the brain containing long segmental duplications and found that nearly 50% of the expressed gene duplicates had changed substantially from their ancestral models due to novel sites of transcription initiation, splicing, and polyadenylation.
Others have used full-length RNA sequencing to similarly characterize alternative splicing and polyadenylation profiles, but in this case between different species. In a publication from Cold Spring Harbor Laboratory, researchers used isoform sequencing to compare two major crops, maize and sorghum.3 Using full-length transcript sequences from 11 matched tissues, Wang et al. were able to identify alternative splice sites and polyadenylation patterns that were conserved between the species, as well as others that were species-specific. Armed with new information on what differentiates the two plant species, the researchers can now take a more targeted approach in studying species-specific genes.
Full transcripts captured in a single sequencing read are also valuable for studying gene fusions associated with disease. In work featured on the cover of Genome Research in August 2018, Nattestad et al. used full-length RNA sequencing in the breast cancer cell line SK-BR-3 (one of the most important models for HER2+ breast cancers) to identify novel gene fusions that led to RNA transcripts arising from sections of three different genes.4 By coupling the technique with genomic sequencing of the same sample, they were able to pinpoint these gene fusions to loci of nested genomic variants, shedding new light on the complex mechanisms of cancer evolution.
Lastly, isoform sequencing is now being used to explore allelic imbalances in isoform expression. Since these techniques produce one sequence per RNA molecule, allele-specific isoform expression can be detected by observing SNP differences between transcript sequences arising from different alleles of the same gene. At the 2019 International Plant & Animal Genome Conference, Elizabeth Tseng, PhD, developer of the IsoPhase tool for detecting allele-specific expression in full-length transcripts, presented a poster demonstrating the power of this method using hybrid F1 crosses of maize.5
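The counting idea behind allele-specific isoform expression can be sketched in a few lines of Python. This is a simplified illustration, not the IsoPhase method: it assumes each full-length read has already been assigned to an isoform and that the base at one known heterozygous SNP has been extracted. The function names, isoform labels, and read data are all hypothetical.

```python
from collections import defaultdict


def allele_counts_by_isoform(reads, ref_base, alt_base):
    """Tally reads per isoform by the allele they carry at one SNP.

    `reads` is an iterable of (isoform_id, base_at_snp) pairs. Bases that
    match neither allele (for example, sequencing errors) are ignored.
    """
    counts = defaultdict(lambda: {"ref": 0, "alt": 0})
    for isoform_id, base in reads:
        if base == ref_base:
            counts[isoform_id]["ref"] += 1
        elif base == alt_base:
            counts[isoform_id]["alt"] += 1
    return dict(counts)


def alt_fraction(allele_counts):
    """Fraction of informative reads carrying the alternate allele."""
    total = allele_counts["ref"] + allele_counts["alt"]
    return allele_counts["alt"] / total if total else None


# Hypothetical reads from an F1 hybrid: (isoform, base at the SNP).
example_reads = [
    ("isoform_1", "A"), ("isoform_1", "A"), ("isoform_1", "G"),
    ("isoform_2", "G"), ("isoform_2", "G"), ("isoform_2", "N"),
]

if __name__ == "__main__":
    counts = allele_counts_by_isoform(example_reads, ref_base="A", alt_base="G")
    for isoform_id, c in counts.items():
        print(isoform_id, c, alt_fraction(c))
```

In this toy data, isoform_2 is seen only with the alternate allele, which is the kind of imbalance one would then test statistically with real read depths.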
Isoform sequencing workflow
Isoform sequencing starts from 300 ng of total RNA. After poly-A RNA is selected, it is converted to cDNA for library construction. Up to 12 samples can also be multiplexed for a streamlined, cost-effective approach.
Once the sequencing data is generated, reads are classified as full-length by selecting reads flanked by cDNA primers and poly-A tails. These reads can then be clustered at the transcript-isoform level to generate a consensus sequence for each isoform.
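The full-length classification step can be pictured with a minimal Python sketch. It is an illustration only, not the Iso-Seq implementation: the primer sequences and the minimum poly-A length are hypothetical, reads are assumed to be in sense orientation, and matching is exact, whereas production tools tolerate sequencing errors and handle both strands.

```python
import re

# Hypothetical primer sequences and threshold, chosen for illustration only.
FIVE_PRIME_PRIMER = "ACGTACGT"
THREE_PRIME_PRIMER = "TGCATGCA"
MIN_POLY_A = 8


def extract_full_length(read, five=FIVE_PRIME_PRIMER, three=THREE_PRIME_PRIMER,
                        min_poly_a=MIN_POLY_A):
    """Return the transcript sequence if the read looks full-length, else None.

    A read counts as full-length here when it starts with the 5' primer,
    ends with the 3' primer, and carries a poly-A tail immediately before
    the 3' primer. Primers and the tail are trimmed from the result.
    """
    if len(read) < len(five) + len(three):
        return None
    if not (read.startswith(five) and read.endswith(three)):
        return None
    insert = read[len(five):len(read) - len(three)]
    tail = re.search(r"A{%d,}$" % min_poly_a, insert)
    if tail is None:
        return None
    # Any A bases at the very end of the transcript are trimmed with the tail.
    return insert[:tail.start()]


if __name__ == "__main__":
    full = FIVE_PRIME_PRIMER + "GATTACAGGC" + "A" * 12 + THREE_PRIME_PRIMER
    no_tail = FIVE_PRIME_PRIMER + "GATTACAGGC" + THREE_PRIME_PRIMER
    print(extract_full_length(full))
    print(extract_full_length(no_tail))
```

Reads that pass a filter like this are the input to the clustering step, since each one is expected to represent a complete transcript from its 5' end to its poly-A tail.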
For a reference-based approach, the isoforms can then be mapped to a reference genome, and tools such as SQANTI and MAKER can be used to annotate the isoforms. For a de novo approach, Iso-Seq analysis from PacBio includes push-button bioinformatics workflows to process the isoform data without requiring reference genomes or annotations.
Whichever method you use, long-read RNA sequencing with isoform-level resolution addresses the insufficient connectivity of short reads and the resulting uncertainty about which splice events occur together in a transcript (Figure 2).
Long-read RNA sequencing can generate full-length cDNA sequences—no assembly required—to characterize transcript isoforms across an entire transcriptome or within targeted genes, enabling researchers to discover new genes, transcripts, and alternative splicing events; improve genome annotation to identify gene structure, regulatory elements, and coding regions; and increase the accuracy of RNA-seq quantification with isoform-level resolution.
1. Zuo, Chunman et al. Revealing the transcriptomic complexity of switchgrass by PacBio long-read sequencing. Biotechnol. for Biofuels. 2018; 11: 170. doi.org/10.1186/s13068-018-1167-z
2. Dougherty, Max L. et al. Transcriptional fates of human-specific segmental duplications in brain. Genome Res. 2018; 28: 1566–1576.
3. Wang, Bo et al. A comparative transcriptional landscape of maize and sorghum obtained by single-molecule sequencing. Genome Res. 2018; 28: 921–932.
4. Nattestad, Maria et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 2018; 28: 1126–1135.
Michelle Vierra is manager, plant and animal sciences, at Pacific Biosciences.