Researchers in France report that they have developed a machine learning toolbox that can read and analyze protein sequences. Their study (“Learning protein constitutive motifs from sequence data”) appears in eLife.
The research demonstrates that, when trained to read sequence data, artificial neural networks called Restricted Boltzmann Machines (RBM) can provide information on protein structure, function, and evolutionary features. It is believed to be the first method that can extract this level of detail from sequence data alone.
“Statistical analysis of evolutionary-related protein sequences provides insights about their structure, function, and history. We show that Restricted Boltzmann Machines, designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to twenty protein families, and present detailed results for two short protein domains, Kunitz and WW, one long chaperone protein, Hsp70, and synthetic lattice proteins for benchmarking,” the investigators wrote.
“The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended secondary motifs (α-helix and β-sheet), and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and turning up or down the different modes at will. Our work, therefore, shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.”
A key issue is understanding which parts of a protein sequence are responsible for which properties, according to paper co-author Jérôme Tubiana, a former PhD student in the physics laboratory at the École Normale Supérieure (ENS) in Paris. “Answering this question could have significant implications for pharmaceutical development,” explained Tubiana. “For example, it could help with the design of new proteins that have desired functions, or with predicting the future sequence evolution of proteins in living organisms, such as pathogens, and identifying appropriate drug targets.”
To explore this question, Tubiana and his collaborators applied RBM to 20 protein families and presented detailed results for four of them: the Kunitz and WW domains, the chaperone protein Hsp70, and synthetic lattice proteins used for benchmarking. They discovered that, after training, the connections between the artificial neurons in the RBM are interpretable and relate to the protein’s structure, function (such as activity), or phylogeny. Additionally, the team found that they could use RBM to design new protein sequences by combining the different artificial neural units and turning them up or down at will.
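To make the idea of "turning up or down" a neural unit concrete, here is a minimal Python sketch of an RBM over aligned protein sequences. It is an illustration under stated assumptions, not the authors' code: the class name `ToyRBM`, the 21-letter alphabet, the binary hidden units, and the random, untrained weights are all choices made for this example, and the study's actual model and training procedure may differ.

```python
import numpy as np

# Assumed alphabet for this sketch: 20 amino acids plus the alignment gap.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"
Q = len(AMINO_ACIDS)


def one_hot(seq):
    """Encode an aligned sequence as an array of shape (length, Q)."""
    idx = np.array([AMINO_ACIDS.index(a) for a in seq])
    return np.eye(Q)[idx]


class ToyRBM:
    """Hypothetical, untrained RBM: categorical visible units, binary hidden units."""

    def __init__(self, length, n_hidden, rng):
        self.rng = rng
        # W[m, i, a]: coupling between hidden unit m and residue a at site i.
        self.W = 0.1 * rng.standard_normal((n_hidden, length, Q))
        self.g = np.zeros((length, Q))  # per-site residue fields
        self.b = np.zeros(n_hidden)     # hidden-unit biases

    def hidden_inputs(self, v):
        """Input each hidden unit receives from a one-hot sequence v."""
        return np.einsum("mlq,lq->m", self.W, v)

    def hidden_probs(self, v):
        """Probability that each binary hidden unit is on, given v."""
        return 1.0 / (1.0 + np.exp(-(self.hidden_inputs(v) + self.b)))

    def sample_visible(self, h):
        """Sample a sequence given fixed hidden activities h."""
        logits = self.g + np.einsum("m,mlq->lq", h, self.W)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        idx = [self.rng.choice(Q, p=row) for row in p]
        return "".join(AMINO_ACIDS[i] for i in idx)


rbm = ToyRBM(length=5, n_hidden=3, rng=np.random.default_rng(0))
print(rbm.hidden_probs(one_hot("ACD-E")))              # how strongly each unit responds
print(rbm.sample_visible(np.array([1.0, 0.0, 0.0])))  # unit 0 "turned up"
```

In this sketch, each hidden unit's weight slice `W[m]` plays the role of a learned sequence motif, and fixing an entry of `h` before calling `sample_visible` biases the generated sequence toward or away from that motif.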
“Our RBM model shows how machine learning techniques can solve complex data recognition and draw conclusions from data in an interpretable way,” said co-author Simona Cocco, PhD, CNRS director of research at the ENS physics laboratory. “This runs counter to the more complex, black-box models that are traditionally used in data science, as statistical analyses provided by these tools are largely uninterpretable. The interpretability of our method is a major benefit to scientists; it bears the promise of allowing them to generate proteins with desired functions in a controlled way.”