First, there was Delta. Then Omicron. Now, it’s the Omicron subvariant BA.2.12.1, and public health officials are keeping a close watch on the BA.4 and BA.5 subvariants. These waves of cases, caused by infections with new variants, have characterized the COVID-19 pandemic. But how can public health officials know which variants are likely to cause large numbers of cases and which will not take hold in the population?
Scientists have developed a machine learning model, called PyR0, that analyzes millions of SARS-CoV-2 genomes to predict which viral variants will likely dominate and cause surges in COVID-19 cases. It can also help researchers identify which parts of the viral genome are less likely to mutate, uncovering good targets for vaccines that will work against future variants.
The findings appear in Science in the paper, “Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness.”
The AI tool, trained using over six million SARS-CoV-2 genomes from the GISAID (Global Initiative on Sharing Avian Influenza Data) database, can estimate the effect of genetic mutations on the virus’s fitness. When tested on viral genomic data from January 2022, it predicted the rise of the BA.2 variant, which became dominant in many countries in March 2022. PyR0 would also have identified the Alpha variant (B.1.1.7) by late November 2020, a month before the World Health Organization listed it as a variant of concern.
“This kind of machine learning-based approach that looks at all the data and combines that into a single prediction is extremely valuable,” said Pardis Sabeti, MD, D.Phil, an institute member at the Broad Institute, professor at the Center for Systems Biology and the department of organismic and evolutionary biology at Harvard University, and a Howard Hughes Medical Institute investigator. “It gives us a leg up on identifying what’s emerging and could be a potential threat.”
PyR0 is based on a machine learning framework called Pyro, which was originally developed by a team at Uber AI Labs. It can analyze millions of genomes (all of the publicly available SARS-CoV-2 data) in about an hour. It groups similar sequences together, defining “clusters” of genomes by their shared mutations. Next, the model determines which mutations are becoming more common and estimates how quickly each mutation can cause the virus to spread. It also estimates how rapidly the number of cases of different variants will increase based on their genetic makeup.
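To make the idea concrete, here is a toy sketch of how growth could be attributed to individual mutations once genomes are grouped into clusters. It is not PyR0's code and does not use Pyro: an ordinary ridge regression in NumPy stands in for the model's probabilistic inference, and the function name `fit_mutation_effects`, the synthetic counts, and the additive-effects assumption are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_mutation_effects(features, counts, ridge=1e-3):
    """Toy estimate of per-mutation growth effects (illustrative only).

    features: (clusters, mutations) 0/1 matrix of which mutations each
              cluster carries; row 0 is treated as the reference cluster.
    counts:   (time_steps, clusters) matrix of sampled genomes per cluster.
    """
    features = np.asarray(features, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Log ratio of each cluster's count to the reference cluster (column 0).
    # The 0.5 pseudocount keeps the logarithm finite when a count is zero.
    log_ratio = np.log(counts[:, 1:] + 0.5) - np.log(counts[:, :1] + 0.5)

    # Least-squares slope of each log ratio over time: the cluster's
    # growth rate relative to the reference cluster.
    t = np.arange(counts.shape[0], dtype=float)
    t_centered = t - t.mean()
    slopes = t_centered @ log_ratio / (t_centered @ t_centered)

    # Explain each relative growth rate as a sum of effects of the mutations
    # the cluster has beyond the reference, with a small ridge penalty.
    x = features[1:] - features[0]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ slopes)


# Synthetic example: four clusters, two mutations, made-up daily effects.
features = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
true_effects = np.array([0.10, 0.03])
days = np.arange(20)[:, None]
counts = 1000.0 * np.exp(days * (features @ true_effects))

estimated = fit_mutation_effects(features, counts)
print(estimated)
```

In this sketch each mutation contributes independently to a cluster's growth rate, so a cluster carrying both mutations simply grows at the sum of the two effects. Working with ratios to a reference cluster means the estimate depends on the relative share of each cluster rather than on how many genomes happened to be sequenced on a given day.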
By identifying which mutations are important for the fitness of particular variants, the model also offers biological insight into how COVID-19 spreads and develops. For example, knowing the critical mutations can help scientists predict whether new variants will be more contagious or evade neutralizing antibodies, and can also help them decide which mutations to study in greater detail.
“The SARS-CoV-2 genome now has accumulated many mutations, so it becomes extremely challenging to interrogate all combinations of mutations,” said Martin Jankowiak, PhD, a machine learning fellow at the Broad Institute. “The advantage of this kind of analysis is that it looks at the entire genome holistically, and may point to mutations or variants that are receiving less attention in the lab.”
The study suggests that current increases in viral fitness stem from the virus’s ability to escape immune responses. The researchers add that public health officials, with advance warning of a variant’s sequence and characteristics, could implement specific measures to manage case counts. And knowing which mutations contribute to a variant’s survival, and are thus unlikely to change, can help researchers pick better targets for future vaccines.
New versions of this or similar models could further improve predictions by taking into account interactions between mutations. The researchers say that with further work, their model could help monitor other viruses that have enough genetic data.
“The amount of data that we have, together with the methods that we’ve developed, allow us to get a real-time view of the virus evolving in different locations around the world in a way that was just not possible during previous epidemics,” said Fritz Obermeyer, PhD, a machine learning fellow at the Broad Institute.