Just as strings of words may impart meaning, sequences of amino acids may confer definite three-dimensional structures and desirable chemical and biological properties. The key word here is “may.” In synthetic proteins, amino acid sequences may end up making sense or giving rise to gibberish. How might a sequence’s “meaning” be known in advance? This question has long vexed protein engineers, who seek eloquent (er, elegant) solutions to biomanufacturing problems. Fortunately, an answer may be at hand. It’s called unified representation, or UniRep, a machine learning approach.
UniRep comes from Harvard’s Wyss Institute for Biologically Inspired Engineering, where a research team led by George Church, PhD, has used deep learning, a kind of artificial intelligence, to distill the fundamental features of proteins directly from their amino acid sequences. According to Church and colleagues, the approach needs no additional information, and it moves a lot of laborious laboratory experiments to the computer.
The researchers’ deep learning approach was introduced October 21 in Nature Methods, in an article titled, “Unified rational protein engineering with sequence-based deep representation learning.” The article indicates that UniRep allows the construction of statistical models that are broadly applicable and generalize to unseen regions of sequence space. Also, the article maintains that the statistical models are “semantically rich and structurally, evolutionarily, and biophysically grounded.”
“Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods,” the article’s authors wrote. “UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task.”
More conventional approaches to protein engineering include directed evolution and rational design. In directed evolution, protein engineers randomly vary the linear sequence of amino acid building blocks that makes up a natural protein and screen the resulting variants for the desired activity. In rational design, protein engineers model proteins based on their actual 3D structures to identify amino acids likely to affect protein function.
Directed evolution can cover only a small part of the enormous space of possible protein sequences. Rational design approaches are limited by the relative scarcity of painstakingly resolved 3D protein structures. UniRep, however, promises a more holistic understanding of protein function.
“Instead of extensively characterizing proteins to understand their design principles, we used a neural network to learn those rules in an unbiased way, by systematically looking for patterns in a vast trove of raw protein sequences in public databases,” said Surojit Biswas, a graduate student in Church’s group and one of the three co-first authors on the Nature Methods paper. “The neural network learned a lot of the rules that we as humans have previously come to know through many painstaking studies, and beyond that, it also discovered new features in proteins.”
The neural network approach can be likened to learning a language, where the learner builds a semantic understanding of how complex sentences are constructed from strings of letters and words. In the language of proteins, UniRep was trained to predict the next amino acid in a protein sequence, starting from the first one, across the vast collection of protein sequences contained in public databases.
As it proceeds through the remainder of the protein, one amino acid at a time, UniRep builds and draws on an internal “summary” of the sequence it has seen so far, which the team calls its “hidden state,” to account for the protein’s individual sequence and structural features. By feeding that information, along with results from many other proteins, back into its algorithm, UniRep gradually revises the way it constructs hidden states, improving its predictive capabilities over time.
In the language analogy, the learner will be able to predict the next word of a sentence they are reading with increasing likelihood, based on a constantly improving understanding of syntax and choice of words.
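The next-residue prediction loop described above can be sketched with a toy recurrent network. This is an illustrative stand-in, not the authors’ model: UniRep is a multiplicative LSTM with 1,900 hidden units trained on millions of sequences, whereas the weights below are random and untrained, and the helper names (`one_hot`, `step`) are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
V = len(AMINO_ACIDS)
H = 16  # toy hidden-state size (UniRep's mLSTM uses 1,900 units)

rng = np.random.default_rng(0)
# Toy parameters; in the real model these are learned from raw sequences.
W_xh = rng.normal(scale=0.1, size=(H, V))   # current residue -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))   # previous hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden -> next-residue scores

def one_hot(aa):
    x = np.zeros(V)
    x[AMINO_ACIDS.index(aa)] = 1.0
    return x

def step(h, aa):
    """Consume one residue and update the running 'summary' (hidden state)."""
    h = np.tanh(W_xh @ one_hot(aa) + W_hh @ h)
    logits = W_hy @ h                       # scores for the next residue
    probs = np.exp(logits - logits.max())   # softmax over the 20 residues
    return h, probs / probs.sum()

h = np.zeros(H)                             # empty summary before any residue
for aa in "MKTAYIAKQR":                     # walk a sequence left to right
    h, next_probs = step(h, aa)

# 'h' is now a fixed-length vector summarizing the whole prefix; hidden
# states like this are what UniRep exposes as features for downstream tasks.
print(h.shape, next_probs.sum())
```

Training would adjust `W_xh`, `W_hh`, and `W_hy` so that `next_probs` assigns high probability to the residue that actually comes next, which is how the network is forced to internalize the “grammar” of protein sequences.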
“We trained UniRep on about 24 million protein sequences for roughly three weeks to enable it to predict sequences and their relationship to features like protein stability, secondary structure, and accessibility of internal sequences to surrounding solvents within proteins it had never seen before,” said Grigory Khimulya, who was a student at Harvard College and is also a co-first author along with Biswas and Ethan C. Alley. “UniRep accurately described these features in proteins from very different protein families whose structures had been well characterized in previous studies, even in synthetic proteins that don’t have a counterpart in nature.”
The team took UniRep a step further and used it as a tool to predict how single amino acid substitutions impact the function of proteins. Think Mad Libs, but for proteins.
The neural network robustly quantified the effects of single amino acid mutations in eight different proteins with diverse biological functions, including enzyme catalysis, DNA binding, and molecular sensing. In addition, using the Aequorea victoria green fluorescent protein (GFP) as a model, the team tasked UniRep with analyzing 64,800 variants of the protein, each carrying 1–12 mutations. UniRep accurately anticipated how the distribution and relative burden of mutations changed the protein’s brightness.
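Once each variant has a fixed-length representation, predicting a property such as GFP brightness reduces to fitting a simple regression “top model” on those features. The sketch below illustrates the idea only, under loudly stated assumptions: `X` and `y` are random stand-ins for real UniRep embeddings and measured brightness values, and ridge regression is one reasonable choice of top model, not necessarily the paper’s exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 GFP variants, each summarized by a 64-dimensional
# representation (real UniRep vectors are much larger). These values are
# random placeholders, not results from the study.
n_variants, dim = 200, 64
X = rng.normal(size=(n_variants, dim))  # one representation per variant
y = rng.normal(size=n_variants)         # stand-in for measured brightness

# Ridge regression in closed form: w = (X^T X + lam*I)^-1 X^T y.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)

predicted = X @ w  # predicted brightness for each variant
print(predicted.shape)
```

The payoff of this setup is that the expensive part, learning the representation, is done once on unlabeled sequences; each new engineering task then needs only a small labeled dataset to fit the lightweight model on top.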
“Compared to other strategies, our data-driven approach reaches state-of-the-art or superior performance in predicting multiple properties of proteins at costs much lower than other methods,” said Church. “This makes it a truly empowering tool for protein engineers in many areas.”