Computer scientists at Carnegie Mellon University (CMU) report that neural networks and supervised machine learning techniques can efficiently characterize cells that have been studied using single cell RNA-sequencing (scRNA-seq). This finding could help researchers identify new cell subtypes and differentiate between healthy and diseased cells, according to the team.

Rather than rely on marker genes, which are not available for all cell types, the new automated method analyzes all of the scRNA-seq data and selects just those parameters that can differentiate one cell from another. This enables the analysis of all cell types and provides a method for comparative analysis of those cells, according to the researchers from CMU’s computational biology department, whose study (“A web server for comparative analysis of single-cell RNA-seq data”) appears in Nature Communications. They say a web server called scQuery makes the method usable by all researchers.

“Single cell RNA-Seq (scRNA-seq) studies profile thousands of cells in heterogeneous environments. Current methods for characterizing cells perform unsupervised analysis followed by assignment using a small set of known marker genes. Such approaches are limited to a few, well-characterized cell types. We developed an automated pipeline to download, process, and annotate publicly available scRNA-seq datasets to enable large-scale supervised characterization,” write the investigators.

“We extend supervised neural networks to obtain efficient and accurate representations for scRNA-seq data. We apply our pipeline to analyze data from over 500 different studies with over 300 unique cell types and show that supervised methods outperform unsupervised methods for cell type identification. A case study highlights the usefulness of these methods for comparing cell type distributions in healthy and diseased mice. Finally, we present scQuery, a web server which uses our neural networks and fast matching methods to determine cell types, key genes, and more.”

Over the past five years, single cell sequencing has become a major tool for cell researchers. In the past, scientists could only obtain DNA or RNA sequence information by processing batches of cells, providing results that only reflected average values of the cells. Analyzing cells one at a time, by contrast, enables researchers to identify subtypes of cells, or to see how a healthy cell differs from a diseased cell, or how a young cell differs from an aged cell.

This type of sequencing will support the National Institutes of Health’s new Human BioMolecular Atlas Program (HuBMAP), which is building a 3D map of the human body that shows how tissues differ on a cellular level, according to Ziv Bar-Joseph, Ph.D., professor of computational biology and machine learning and a co-author of today’s paper, who leads a CMU-based center contributing computational tools to that project.

“With each experiment yielding hundreds of thousands of data points, this is becoming a Big Data problem,” said Amir Alavi, a Ph.D. student in computational biology who was co-lead author of the paper with post-doctoral researcher Matthew Ruffalo, Ph.D. “Traditional analysis methods are insufficient for such large scales.”

Alavi, Ruffalo, and their colleagues developed an automated pipeline that attempts to download all public scRNA-seq data available for mice from the largest data repositories, including the NIH’s Gene Expression Omnibus (GEO), and to identify the genes and proteins expressed in each cell. The cells were then labeled by type and processed via a neural network, a computing system loosely modeled on the human brain. By comparing all of the cells with each other, the neural net identified the parameters that make each cell distinct.
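
To make the idea of supervised learning on labeled cells concrete, here is a minimal sketch in Python. It is not the CMU team's network: it swaps in a single-layer softmax classifier as a simple stand-in, and the expression values, gene count, and function names (`train_classifier`, `predict`) are all invented for illustration. The point is only that, given cells labeled by type, training assigns large weights to the genes that separate the types.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_classifier(cells, labels, n_types, epochs=200, lr=0.5, seed=0):
    """Fit one weight per (cell type, gene) with stochastic gradient descent."""
    rng = random.Random(seed)
    n_genes = len(cells[0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_genes)]
               for _ in range(n_types)]
    for _ in range(epochs):
        for x, y in zip(cells, labels):
            scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
            probs = softmax(scores)
            for t in range(n_types):
                # Gradient of the cross-entropy loss for one labeled cell.
                grad = probs[t] - (1.0 if t == y else 0.0)
                for g in range(n_genes):
                    weights[t][g] -= lr * grad * x[g]
    return weights


def predict(weights, x):
    """Return the index of the highest-scoring cell type."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented toy data: 4 cells x 3 genes; type 0 expresses gene 0, type 1 gene 1.
cells = [[2.0, 0.1, 1.0], [1.8, 0.0, 1.1], [0.1, 2.2, 0.9], [0.0, 1.9, 1.0]]
labels = [0, 0, 1, 1]
weights = train_classifier(cells, labels, n_types=2)
predictions = [predict(weights, x) for x in cells]
```

In this toy setup the third gene is expressed at nearly the same level in both types, so it contributes little to telling them apart, while the first two genes carry the signal.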

The researchers tested this model using scRNA-seq data from a mouse study of a disease similar to Alzheimer’s. The analysis showed similar levels of brain cells in both healthy and diseased mice, while the diseased mice had substantially more immune cells, such as macrophages, generated in response to the disease.
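
A comparison of this kind comes down to counting predicted cell types in each sample and comparing the proportions. The sketch below shows that arithmetic in Python. The cell labels and counts are invented placeholders, not data from the study, and the helper names (`cell_type_fractions`, `fraction_shift`) are hypothetical.

```python
from collections import Counter


def cell_type_fractions(predicted_types):
    """Fraction of cells assigned to each type in one sample."""
    counts = Counter(predicted_types)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def fraction_shift(healthy_types, diseased_types):
    """Diseased fraction minus healthy fraction, per cell type."""
    healthy = cell_type_fractions(healthy_types)
    diseased = cell_type_fractions(diseased_types)
    all_types = set(healthy) | set(diseased)
    return {t: diseased.get(t, 0.0) - healthy.get(t, 0.0) for t in all_types}


# Invented labels for illustration only; not results from the paper.
healthy_sample = ["neuron"] * 6 + ["astrocyte"] * 3 + ["macrophage"] * 1
diseased_sample = ["neuron"] * 6 + ["astrocyte"] * 2 + ["macrophage"] * 4

shift = fraction_shift(healthy_sample, diseased_sample)
```

A positive entry in `shift` means the cell type makes up a larger share of the diseased sample than of the healthy one.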

The researchers used their pipeline and methods to create scQuery, a web server that can speed comparative analysis of new scRNA-seq data. Once a researcher submits a single cell experiment to the server, the group’s neural networks and matching methods can quickly identify related cell subtypes and identify earlier studies of similar cells.
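
As a rough illustration of what matching a submitted cell against earlier data can look like, the Python sketch below ranks reference cells by cosine similarity and takes a majority vote among the closest ones. This is a generic nearest-neighbor lookup, not scQuery's actual matching method, and the two-dimensional embeddings, cell type labels, and study identifiers are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_cell(query, reference, k=3):
    """Return the majority cell type and the studies of the k closest cells.

    `reference` is a list of (embedding, cell_type, study_id) tuples.
    """
    ranked = sorted(reference,
                    key=lambda entry: cosine_similarity(query, entry[0]),
                    reverse=True)
    nearest = ranked[:k]
    votes = Counter(cell_type for _, cell_type, _ in nearest)
    best_type = votes.most_common(1)[0][0]
    studies = sorted({study_id for _, _, study_id in nearest})
    return best_type, studies


# Invented reference set; the study identifiers are hypothetical.
reference = [
    ([0.9, 0.1], "neuron", "study_A"),
    ([0.8, 0.2], "neuron", "study_B"),
    ([0.1, 0.9], "macrophage", "study_C"),
    ([0.2, 0.8], "macrophage", "study_A"),
]

cell_type, related_studies = match_cell([0.85, 0.15], reference, k=3)
```

A linear scan like this is only practical for small reference sets; a server handling many studies would need an indexed or approximate search instead.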