Scientists at Children’s Hospital of Philadelphia (CHOP) say they have developed a computational tool offering researchers a new technique for detecting the different ways RNA is spliced when copied from DNA. Because variations in how RNA is spliced play crucial roles in many diseases, this new analytical tool will provide greater capabilities for discovering disease biomarkers and therapeutic targets, even from RNA-sequencing data sets with modest coverage, according to the researchers.
Study leader Yi Xing, PhD, director of the Center for Computational and Genomic Medicine at CHOP, and first authors and PhD students Zijun Zhang and Zhicheng Pan report on their DARTS framework in Nature Methods (“Deep-learning augmented RNA-seq analysis of transcript splicing”). DARTS (Deep-learning Augmented RNA-seq analysis of Transcript Splicing) uses deep-learning-based predictions to harness the wealth of information available in public RNA-seq datasets, thus allowing for new insights into alternative splicing, explained Xing.
“The conceptual innovation of DARTS is it provides a bridge from big data in the public domain to smaller data sets in focused studies with individual investigators,” said Xing. “DARTS offers the ability to transform massive amounts of public RNA-seq data into a knowledge base, represented as a deep neural network, of how splicing is regulated. Using this computational framework, we can push that into any individual lab. This could be really useful and increase the efficiency of the experiment and enable new discoveries. With just 20 or 30 million RNA-seq reads, you can make educated guesses and inferences on things you were never able to see in the past.”
“A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. We report DARTS, a computational framework that integrates deep-learning-based predictions with empirical RNA-seq evidence to infer differential alternative splicing between biological samples,” write the investigators. “DARTS leverages public RNA-seq big data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets even with modest coverage.”
Massively parallel RNA sequencing is now the standard technology researchers use to investigate alternative splicing. However, to measure alternative splicing accurately, RNA sequencing experiments have to go deep. The consensus view is that more than 100 million reads are needed to analyze alternative splicing, but because of the high cost, most researchers cannot afford to sequence this deeply. Moreover, many medically important genes are not expressed at high levels. Even a deep RNA sequencing experiment cannot generate enough coverage of such genes, making it virtually impossible to measure their alternative splicing patterns.
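A back-of-the-envelope illustration of why coverage matters: splicing of an exon is commonly summarized as the fraction of reads supporting its inclusion (often called "percent spliced in," a term the article itself does not use). The minimal sketch below treats that fraction as a simple binomial proportion, which is an assumption made here for illustration and not the DARTS model; it also ignores read-length normalization. The function name and the read counts are hypothetical.

```python
import math


def inclusion_fraction_with_error(inclusion_reads, skipping_reads):
    """Return (fraction of reads supporting exon inclusion, approximate standard error).

    Illustrative only: treats each read as an independent binomial draw.
    """
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no reads cover this splicing event")
    fraction = inclusion_reads / total
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / n)
    std_error = math.sqrt(fraction * (1.0 - fraction) / total)
    return fraction, std_error


# Hypothetical counts: the same 3:1 inclusion ratio at low and high coverage.
low_cov = inclusion_fraction_with_error(3, 1)
high_cov = inclusion_fraction_with_error(300, 100)
print("low coverage: ", low_cov)
print("high coverage:", high_cov)
```

Both calls give the same estimated fraction, but under this approximation the uncertainty shrinks only with the square root of the read count, so a gene covered by a handful of reads yields a far noisier estimate than one covered by hundreds.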
In the current study, Xing’s team first drew on large-scale public-domain RNA sequencing data from sources such as the ENCODE Consortium, the international program launched by the National Human Genome Research Institute to identify all the functional elements in the genome, including those acting at the level of RNA. Using these massive data sets, DARTS trains a deep neural network to predict changes in alternative splicing. The model incorporates messenger RNA (mRNA) levels of 1,500 RNA binding proteins and 3,000 sequence features.
To allow researchers to use the deep-learning model in their own studies, DARTS uses a statistical framework called Bayesian hypothesis testing to combine the deep neural network's predictions with actual RNA sequencing data generated from specific biological samples. Researchers can use this information in their individual labs to better characterize alternative splicing across different biological conditions.
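The general idea of letting a prediction act as a prior that read counts then update can be sketched in a few lines. What follows is a deliberately simplified stand-in, not the statistical model published with DARTS: it compares two hypotheses (the two samples share one inclusion level, or each has its own) using beta-binomial marginal likelihoods with uniform priors, and treats the network's output as the prior probability that splicing differs. All function names, the uniform priors, and the read counts are assumptions made for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence_same(inc1, skip1, inc2, skip2):
    # Hypothesis "no change": both samples share one inclusion level ~ Uniform(0, 1).
    # Binomial coefficients are omitted because they cancel between hypotheses.
    return log_beta(1 + inc1 + inc2, 1 + skip1 + skip2)


def log_evidence_different(inc1, skip1, inc2, skip2):
    # Hypothesis "change": each sample has its own inclusion level ~ Uniform(0, 1).
    return log_beta(1 + inc1, 1 + skip1) + log_beta(1 + inc2, 1 + skip2)


def posterior_differential(prior_prob, inc1, skip1, inc2, skip2):
    """Posterior probability that splicing differs between the two samples.

    prior_prob stands in for a model-predicted probability of differential
    splicing and must lie strictly between 0 and 1.
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must be strictly between 0 and 1")
    log_diff = math.log(prior_prob) + log_evidence_different(inc1, skip1, inc2, skip2)
    log_same = math.log(1.0 - prior_prob) + log_evidence_same(inc1, skip1, inc2, skip2)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_diff, log_same)
    w_diff = math.exp(log_diff - top)
    w_same = math.exp(log_same - top)
    return w_diff / (w_diff + w_same)


# Hypothetical low-coverage event: 3 vs 1 inclusion reads, 1 vs 3 skipping reads.
print(posterior_differential(0.1, 3, 1, 1, 3))  # skeptical prior
print(posterior_differential(0.9, 3, 1, 1, 3))  # confident prior
```

In this toy setup, the prior dominates when only a few reads are available, while with deep coverage the read counts largely determine the posterior, which mirrors the motivation described above for augmenting modest-coverage data with predictions.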
The researchers applied DARTS to lung and prostate cancer cell lines to test its ability to predict splicing patterns in the cells. These cell lines are models for the transition from epithelial to mesenchymal cells—an important process in both embryonic development and cancer metastasis. By leveraging the deep learning predictions, DARTS discovered changes in alternative splicing patterns in numerous genes that escaped detection by conventional computational tools because these genes were expressed at low levels in the cells. The study team then performed experiments to validate these novel predictions. These new discoveries may allow scientists to better identify biomarkers and therapeutic targets of diseases.
“DARTS offers an exciting conceptual framework that we could adapt to other uses,” added Xing. “For example, we might create a version that predicts alternative splicing in specific patient tissues.” This could potentially improve the diagnosis of rare diseases from a tissue biopsy, a useful technique for pediatric centers such as CHOP that often evaluate children with puzzling, undiagnosed disorders.
DARTS, Xing concluded, could enable scientists to discover more about the contributions of understudied genes that may not be expressed at high levels, but have important impacts on health and disease. “DARTS offers a new window into the dark matter of the transcriptome,” he said.