Scientists from the University of North Carolina (UNC) at Charlotte, led by Weijun Luo, PhD, and Cory Brouwer, PhD, report the development of an artificial intelligence algorithm to “clean” noisy single-cell RNA sequencing (scRNA-Seq) data. The team’s study (“A Universal Deep Neural Network for In-Depth Cleaning of Single-Cell RNA-Seq Data”) appears in Nature Communications.
From identifying the specific genes associated with sickle cell anemia and breast cancer to creating the mRNA vaccines used in the ongoing COVID-19 pandemic, researchers have been delving deeply into genomes since the Human Genome Project of the 1990s. Technology has evolved from those early days of batching thousands of cells together to decode the millions of base pairs that make up genetic information. In 2009, researchers created scRNA-Seq, now used widely in biomedical research, which sequences only the transcriptome, the expressed portion of the genome, in a single cell of a living organism.
Unfortunately, scRNA-Seq data is “noisy” and has plenty of errors and quality issues. Sequencing a single cell rather than many cells results in frequent dropouts (missing genes in the data). A single cell, like a single person, may have its own health issues or be at an awkward stage in its life cycle—it may have just divided, or be on its way to cell death, which can create more errors or technical variations in the scRNA-Seq data.
In addition to single-cell specific problems, genomic profiling usually comes with “normal” issues of sequencing errors. All these errors need to be cleaned from the data before it can be used or interpreted, which is where the new AI algorithm comes in.
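The dropouts described above can be illustrated with a small simulation. This is only a toy sketch, not the study's noise model: the count matrix, the `dropout_rate` value, and the helper names `simulate_dropouts` and `zero_fraction` are all hypothetical and chosen for illustration.

```python
import random

def simulate_dropouts(counts, dropout_rate, seed=0):
    """Return a copy of `counts` (cells x genes) where each nonzero
    entry is replaced by 0 with probability `dropout_rate`."""
    rng = random.Random(seed)
    noisy = []
    for cell in counts:
        noisy.append(
            [0 if (c > 0 and rng.random() < dropout_rate) else c for c in cell]
        )
    return noisy

def zero_fraction(counts):
    """Fraction of entries in the matrix that are exactly zero."""
    values = [c for cell in counts for c in cell]
    return sum(1 for c in values if c == 0) / len(values)

# Made-up "true" expression counts: 4 cells x 5 genes.
true_counts = [
    [5, 0, 12, 3, 7],
    [4, 1, 10, 0, 8],
    [6, 0, 11, 2, 9],
    [5, 2, 13, 4, 6],
]

# Assumed dropout rate; real rates vary by protocol and gene.
observed = simulate_dropouts(true_counts, dropout_rate=0.5)

print("zeros before:", zero_fraction(true_counts))
print("zeros after: ", zero_fraction(observed))
```

The observed matrix mixes genuine zeros (genes that are not expressed) with zeros introduced by dropout, which is what makes the cleaning step difficult.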
The algorithm, called AutoClass, appears to be a step up from existing statistical methods. Most existing methods assume that errors (or noise) follow a predefined distribution, which specifies how likely the errors are to occur and how large they can be. As a result, these methods are often unable to fully clean the data to reveal biological signals, and may even introduce new errors when their assumptions about the data distribution do not hold.
In contrast, AutoClass makes no distributional assumptions and can therefore correct a wide range of noise and technical variation, according to the research team.
“scRNA-Seq is being widely used in biomedical research and generated enormous volume and diversity of data. The raw data contain multiple types of noise and technical artifacts, which need thorough cleaning. Existing denoising and imputation methods largely focus on a single type of noise (i.e., dropouts) and have strong distribution assumptions which greatly limit their performance and application,” the investigators wrote.
“Here we design and develop the AutoClass model, integrating two deep neural network components, an autoencoder, and a classifier, to maximize both noise removal and signal retention. AutoClass is distribution agnostic as it makes no assumption on specific data distributions, hence can effectively clean a wide range of noise and artifacts. AutoClass outperforms the state-of-art methods in multiple types of scRNA-Seq data analyses, including data recovery, differential expression analysis, clustering analysis, and batch effect removal.”
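The combination of an autoencoder and a classifier described in the quote can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the AutoClass implementation: the layer sizes, the single shared hidden layer, the random placeholder labels, and the loss weight `w` are all assumptions that the article does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_genes, n_hidden, n_classes, scale=0.1):
    """Random weights for a shared encoder, a decoder, and a classifier head."""
    return {
        "W_enc": rng.normal(0.0, scale, (n_genes, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, scale, (n_hidden, n_genes)),
        "b_dec": np.zeros(n_genes),
        "W_cls": rng.normal(0.0, scale, (n_hidden, n_classes)),
        "b_cls": np.zeros(n_classes),
    }

def forward(params, x):
    # Shared bottleneck representation of each cell.
    code = relu(x @ params["W_enc"] + params["b_enc"])
    # Autoencoder branch: reconstruct the expression profile.
    recon = code @ params["W_dec"] + params["b_dec"]
    # Classifier branch: predict a class for each cell from the same code.
    probs = softmax(code @ params["W_cls"] + params["b_cls"])
    return recon, probs

def combined_loss(x, labels, recon, probs, w=0.5):
    """Weighted sum of reconstruction (MSE) and classification (cross-entropy)
    losses. The weighting scheme is an assumption made for this sketch."""
    recon_loss = np.mean((recon - x) ** 2)
    picked = probs[np.arange(len(labels)), labels]
    class_loss = -np.mean(np.log(picked + 1e-12))
    return (1.0 - w) * recon_loss + w * class_loss

# Toy data: 8 cells x 20 genes of log-transformed Poisson counts.
x = np.log1p(rng.poisson(3.0, size=(8, 20)).astype(float))
labels = rng.integers(0, 2, size=8)  # placeholder class labels

params = init_params(n_genes=20, n_hidden=5, n_classes=2)
recon, probs = forward(params, x)
loss = combined_loss(x, labels, recon, probs)
print("reconstruction shape:", recon.shape, "loss:", float(loss))
```

Because both branches read from the same bottleneck, a training procedure that minimizes this combined loss would push the encoder to keep information that helps both reconstruction and classification, which is one plausible reading of "noise removal and signal retention."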
“AutoClass is an AI algorithm based on a special deep neural network designed to maximize both noise removal and signal retention,” said Luo, who currently works at Novant Health as senior director of data science and AI. “The AI teaches itself to differentiate signal vs. noise in the data by seeing enough data. Usually the more data it sees, the better it performs.”
Luo noted that in the study, he and his team demonstrated that AutoClass can reconstruct high-quality scRNA-Seq data and improve several aspects of downstream analysis. In addition, AutoClass is robust and performs well across various scRNA-Seq data types and conditions, he added.
AutoClass is highly efficient and scalable: it works well with data across a wide range of sample sizes and feature sizes, and it runs smoothly even on a regular PC or laptop, the scientists said. AutoClass is open source and available online.
Brouwer is a professor of bioinformatics and genomics and director of bioinformatics services at UNC Charlotte.