By bridging the conceptual divide between human language and viral evolution, MIT researchers have developed a powerful new computational tool for predicting the mutations that allow viruses to “escape” human immunity or vaccines. Its use could negate the need for high-throughput experimental techniques that are currently employed to identify potential mutations that could allow a virus to escape recognition. The computational model, based on models that were originally developed to analyze language, can predict which sections of viral surface proteins are more likely to mutate in a way that would enable viral escape, and it can also identify sections that are less likely to mutate, which would represent good targets for new vaccines.

“Viral escape is a big problem,” said Bonnie Berger, PhD, the Simons Professor of Mathematics and head of the Computation and Biology group at the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory. “Viral escape of the surface protein of influenza and the envelope surface protein of HIV are both highly responsible for the fact that we don’t have a universal flu vaccine, nor do we have a vaccine for HIV, both of which cause hundreds of thousands of deaths a year.”

Berger and colleagues report in Science (“Learning the language of viral evolution and escape”) on their development and use of the computational model to identify possible targets for vaccines against influenza, HIV, and SARS-CoV-2. Berger and Bryan Bryson, PhD, an assistant professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, are senior authors of the paper, and the lead author is MIT graduate student Brian Hie.

One reason it’s so difficult to produce effective vaccines against some viruses, including influenza and HIV, is that these viruses mutate very rapidly, and this “viral escape” mechanism allows them to evade the antibodies generated by a particular vaccine. Viral escape represents a key obstacle to antiviral and vaccine development, the authors wrote. “Viral mutations that allow an infection to escape from recognition by neutralizing antibodies have prevented the development of a universal antibody-based vaccine for influenza or HIV and are a concern in the development of therapies for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection.”

Different types of viruses acquire genetic mutations at different rates, and HIV and influenza are among those that mutate the fastest. While understanding the rules that govern the evolution of escape mutations could inform therapeutic design, current techniques for identifying potential escape mutations are limited. “Escape has motivated high-throughput experimental techniques that perform causal escape profiling of all single-residue mutations to a viral protein,” the team noted. “Such techniques, however, require substantial effort to profile even a single viral strain, and testing the escape potential of many (combinatorial) mutations in many viral strains remains infeasible.”
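To see why exhaustive experimental profiling quickly becomes infeasible, it helps to count the candidates. The short sketch below is a back-of-the-envelope illustration, not a calculation from the paper. It assumes that each mutated position can take any of the 19 other standard amino acids, and it uses a protein length of 1,273 residues (about the size of the SARS-CoV-2 spike protein) purely as an example.

```python
from math import comb

# Assumption: each mutated site can switch to any of the 19 other
# standard amino acids (20 minus the residue already present).
ALTERNATIVES_PER_SITE = 19


def count_mutants(protein_length: int, num_mutations: int) -> int:
    """Count distinct variants carrying exactly `num_mutations` substitutions."""
    # Choose which positions mutate, then choose a new residue at each one.
    return comb(protein_length, num_mutations) * ALTERNATIVES_PER_SITE ** num_mutations


if __name__ == "__main__":
    length = 1273  # illustrative length, roughly that of the SARS-CoV-2 spike
    for k in (1, 2, 3):
        print(f"{k} substitution(s): {count_mutants(length, k):,} variants")
```

Under these assumptions, single substitutions number in the tens of thousands, pairs in the hundreds of millions, and triples in the trillions, all for one strain of one protein.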

For viral mutations to promote immune escape, they must help the virus change the shape of its surface proteins so that antibodies can no longer bind to them. However, the protein can’t change in a way that stops or adversely alters its function.

The MIT team decided to model these criteria using a type of computational model known as a language model, from the field of natural language processing (NLP). Such models were originally designed to analyze patterns in language, specifically the frequency with which certain words occur together. The models can then predict which words could be used to complete a sentence such as “Sally ate eggs for …” The chosen word must be both grammatically correct (the concept of syntax) and have the right meaning (semantics). In this example, an NLP model might predict “breakfast” or “lunch.”
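The sentence-completion idea can be illustrated with a toy example. The sketch below counts which word follows which in a tiny made-up corpus and uses those counts to complete “Sally ate eggs for …” It is far simpler than the neural language models used in practice, and the corpus and the function names (`train_bigram_counts`, `predict_next`) are invented here for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def predict_next(counts, previous_word, k=2):
    """Return up to k words most often seen after `previous_word`."""
    followers = counts.get(previous_word.lower(), Counter())
    return [word for word, _ in followers.most_common(k)]


# Hypothetical training corpus, made up for this example.
corpus = [
    "sally ate eggs for breakfast",
    "tom ate eggs for breakfast",
    "ann ate eggs for breakfast",
    "sally ate soup for lunch",
    "ann ate soup for lunch",
    "tom ate rice for dinner",
]

counts = train_bigram_counts(corpus)
print(predict_next(counts, "for"))  # ['breakfast', 'lunch']
```

A model this small only looks at the single preceding word. Modern language models condition on the whole surrounding context, which is what lets them capture both syntax and meaning.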

The researchers’ key insight was that this kind of model could also be applied to biological information such as genetic sequences. In that case, grammar is analogous to the rules that determine whether the protein encoded by a particular sequence is functional or not, and semantic meaning is analogous to whether the protein can take on a new shape that helps it evade antibodies. Therefore, a mutation that enables viral escape must maintain the grammaticality of the sequence but change the protein’s structure in a useful way. “To escape, a mutant virus must preserve infectivity and evolutionary fitness—it must obey a ‘grammar’ of biological rules—and the mutant must no longer be recognized by the immune system, which is analogous to a change in the ‘meaning’ or the ‘semantics’ of the virus,” the investigators explained. “If a virus wants to escape the human immune system, it doesn’t want to mutate itself so that it dies or can’t replicate,” Hie commented. “It wants to preserve fitness but disguise itself enough so that it’s undetectable by the human immune system.”

To model this process, the researchers trained an NLP model to analyze patterns found in genetic sequences, which allows it to predict new sequences that have new functions but still follow the biological rules of protein structure. Similar to how word changes can preserve a sentence’s grammar but alter its meaning, the machine learning algorithms modeled how escape can be achieved by mutations that preserve the biological syntax that governs viral infectivity, yet alter the virus’ semantics, so it is no longer recognized by neutralizing antibodies. “We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning,” the investigators stated. “Searching for mutations with both high grammaticality and high semantic change is a task that we call constrained semantic change search (CSCS).”
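The search for mutations that score well on both criteria can be sketched in a few lines. The example below is a minimal illustration, not the authors’ implementation. It assumes that each candidate mutation already has a grammaticality score and a semantic change score, and it combines the two by summing their ranks with a weighting parameter `beta`. The scores, the mutation names, and the rank-sum rule are all assumptions made for this sketch.

```python
def rank_ascending(values):
    """Return 1-based ranks, so the largest value receives the highest rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def cscs_rank(candidates, beta=1.0):
    """Order candidates from most to least likely escape mutation.

    `candidates` maps a mutation name to (grammaticality, semantic_change).
    The combined score is rank(semantic_change) + beta * rank(grammaticality),
    an assumed combination rule used here for illustration.
    """
    names = list(candidates)
    grammar_ranks = rank_ascending([candidates[n][0] for n in names])
    semantic_ranks = rank_ascending([candidates[n][1] for n in names])
    scores = {
        name: semantic + beta * grammar
        for name, grammar, semantic in zip(names, grammar_ranks, semantic_ranks)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical scores: (grammaticality, semantic change).
toy_candidates = {
    "mutA": (0.90, 0.10),  # viable, but looks the same to the immune system
    "mutB": (0.05, 0.95),  # looks very different, but probably nonfunctional
    "mutC": (0.70, 0.80),  # viable and different: the escape-like candidate
    "mutD": (0.10, 0.20),  # neither viable nor different
}

print(cscs_rank(toy_candidates)[0])  # mutC
```

Ranking rather than adding raw scores keeps the two criteria comparable even when they are measured on different scales. A mutation that excels on only one criterion, like `mutA` or `mutB` here, is outranked by one that does reasonably well on both.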

“Computationally, the goal of CSCS is to identify mutations that confer high fitness and substantial semantic changes at the same time,” commented Yoo-Ah Kim and Teresa M. Przytycka, from the NIH’s National Center for Biotechnology Information, National Library of Medicine, in an accompanying perspective titled “The language of a virus,” in the same issue of Science.

One significant advantage of this kind of modeling is that it requires only sequence information, which is much easier to obtain than protein structures. The model can also be trained on a relatively small amount of data. For their reported study, Hie, Bryson, and colleagues used 60,000 HIV sequences, 45,000 influenza sequences, and 4,000 coronavirus sequences.

“Language models are very powerful because they can learn this complex distributional structure and gain some insight into function just from sequence variation,” Hie commented. “We have this big corpus of viral sequence data for each amino acid position, and the model learns these properties of amino acid co-occurrence and co-variation across the training data.”

Influenza hemagglutinin protein color-coded by escape potential, as predicted by the constrained semantic change search (CSCS) model. [Brian Hie]

Once the model was trained, the researchers used it to predict sequences of the coronavirus spike protein, HIV envelope protein, and influenza hemagglutinin (HA) protein that would be more or less likely to generate escape mutations. For influenza, the model revealed that the sequences least likely to mutate and produce viral escape were in the stalk of the HA protein. This is consistent with recent studies showing that antibodies that target the HA stalk (which most people infected with the flu or vaccinated against it do not develop) can offer near-universal protection against any flu strain.

The model’s analysis of coronaviruses suggested that a part of the spike protein called the S2 subunit is least likely to generate escape mutations. It remains unclear how rapidly the SARS-CoV-2 virus mutates, so it is unknown how long the vaccines being deployed to combat the COVID-19 pandemic (at the time of writing) will remain effective. Initial evidence suggests that the virus does not mutate as rapidly as influenza or HIV. In their studies of HIV, the researchers found that the V1-V2 hypervariable region of the envelope protein has many possible escape mutations, which is consistent with previous findings, and they also found sequences that would have a lower probability of escape.

SARS-CoV-2 spike protein color-coded by escape potential, as predicted by the constrained semantic change search (CSCS) model. [Brian Hie]

“The language of viral evolution and escape proposed by Hie et al. provides a powerful framework for predicting mutations that lead to viral escape,” Kim and Przytycka noted. However, questions remain. Just as different people can interpret the same sentence differently depending on their past experience and fluency in the language, immune responses differ between individuals depending on factors such as past infections and overall immune system strength, they pointed out. “It will be interesting to see whether the proposed approach can be adapted to provide a ‘personalized’ view of the language of virus evolution.”

Since finalizing their paper for publication, Hie and colleagues have also applied their viral escape model to the new variants of SARS-CoV-2 that recently emerged in the U.K. and South Africa. That analysis, which has not yet been peer reviewed, flagged viral genetic sequences that should be further investigated for their potential to escape the existing vaccines, the researchers said.

The researchers are now working with others to use their model to identify possible targets for cancer vaccines that stimulate the body’s immune system to destroy tumors. They suggest that it could also be used to design small-molecule drugs that might be less likely to provoke resistance, for diseases such as tuberculosis. “There are so many opportunities, and the beautiful thing is all we need is sequence data, which is easy to produce,” Bryson said.