New research by scientists at the University of Glasgow suggests that machine learning (ML) models developed using viral genomes can be harnessed to predict the likelihood that any animal-infecting virus will go on to infect humans, given biologically relevant exposure. Most emerging infectious diseases of humans, including COVID-19, are caused by zoonotic viruses that originate in other animal species, so identifying high-risk viruses earlier, before they jump from animals to humans, can help to improve research and surveillance priorities. One application of the new models developed by Nardus Mollentze, PhD, Simon Babayan, PhD, and Daniel Streicker, PhD, suggested that they could have identified SARS-CoV-2 as a relatively high-risk coronavirus strain, without any prior knowledge of zoonotic SARS-related coronaviruses.

Reporting in PLOS Biology on development of the ML model, the scientists concluded that the machine learning approach demonstrated that “… the zoonotic potential of viruses can be inferred to a surprisingly large extent from their genome sequence, outperforming current alternatives.” Their published paper is titled, “Identifying and prioritizing potential human-infecting viruses from their genome sequences.”

Most emerging infectious diseases of humans are caused by viruses that originate from other animal species, but identifying zoonotic diseases prior to emergence is a major challenge because only a small minority of the estimated 1.67 million animal viruses are able to infect humans, the authors explained. “Determining which animal viruses may be capable of infecting humans is currently intractable at the time of their discovery precluding prioritization of high-risk viruses for early investigation and outbreak preparedness,” they wrote.

Most viruses are now discovered using untargeted genomic sequencing, which often involves many simultaneous discoveries but limited phenotypic data, so an ideal approach would quantify the relative risk of human infectivity upon relevant exposure from the viral sequence data alone, the team continued. “By identifying high-risk viruses warranting further investigation, such predictions could alleviate the growing imbalance between the rapid pace of virus discovery and lower throughput field and laboratory research needed to comprehensively evaluate risk.”

Current models can identify well-characterized human-infecting viruses from genomic sequences. However, because these algorithms are trained on closely related viruses, such as different strains of the same species, and may omit secondary characteristics of the viral genome that are linked to infection capability, they are less likely to find “signals of zoonotic status” that generalize across viruses. In contrast, the team noted, “We aimed to develop machine learning models that use features engineered from viral and human genome sequences to predict the probability that any animal-infecting virus will infect humans given biologically relevant exposure (here, zoonotic potential).”

Bats caught during zoonotic virus surveillance efforts (Madre de Dios, Peru) [Daniel Streicker, Mollentze N, et al., PLOS Biology, CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)]

To develop more accurate machine learning models using viral genome sequences, the researchers first compiled a dataset of 861 virus species from 36 families. They next built machine learning models, which assigned a probability of human infection based on virus taxonomy and/or relatedness to known human-infecting viruses. They then applied the best-performing model to analyze patterns in the predicted zoonotic potential of additional virus genomes sampled from a range of species.
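
As a rough illustration of the workflow described above (turn each genome into numeric features, fit a model that outputs a probability of human infection, then rank unseen viruses by that probability), the following Python sketch may help. It is not the authors' pipeline. The dinucleotide-frequency features, the small logistic regression, all function names, and the toy sequences and labels are assumptions made purely for illustration, and the scores carry no biological meaning.

```python
import math
from itertools import product

# The 16 possible dinucleotides, in a fixed order (assumed feature set).
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_features(seq):
    """Return the 16 dinucleotide frequencies of a nucleotide sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[d] / total for d in DINUCS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, seq):
    """Predicted probability of the positive class for one genome sequence."""
    w, b = model
    x = dinucleotide_features(seq)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def rank_by_risk(model, genomes):
    """Sort {name: sequence} by predicted probability, highest first."""
    scored = [(name, predict_proba(model, seq)) for name, seq in genomes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy training data: made-up sequences with made-up labels (1 = "infects humans").
train_seqs = ["GCGCGCGGCC", "CCGGCGCGCG", "ATATATTAAT", "TTAATATATA"]
train_labels = [1, 1, 0, 0]
model = train_logistic([dinucleotide_features(s) for s in train_seqs], train_labels)

# Hypothetical newly sequenced viruses to prioritize.
new_genomes = {"virus_A": "ATTAATAT", "virus_B": "GCGGCCGC"}
ranking = rank_by_risk(model, new_genomes)
for name, prob in ranking:
    print(f"{name}: {prob:.3f}")
```

The design point worth noting is the last step: the output is a ranked list rather than a yes/no label, which matches the article's framing of these predictions as a way to prioritize viruses for follow-up laboratory work, not as a final verdict.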

The researchers found that viral genomes may have generalizable features that are independent of virus taxonomic relationships, and which may preadapt viruses to infect humans. The team was able to develop machine learning models that were capable of identifying candidate zoonoses using viral genomes. “In requiring only a genome sequence, our approach has quantitative and qualitative advantages over alternative models for zoonotic risk assessment,” they concluded. They also suggested that routine proxies of zoonotic risk that can be applied to poorly characterized viruses, including virus taxonomy and relative phylogenetic proximity to human-infecting species, have “limited discriminatory power.” This, they say, “has far-reaching implications for how risk is perceived.” So while it might be intuitive to assume that newly identified viruses that are closely related to those already known to infect humans represent a threat, this assumption, the team pointed out, had to their knowledge never been tested.

The team acknowledged limitations to the use of models, as computer models represent only a preliminary step to identifying zoonotic viruses with potential to infect humans. Viruses flagged by the models will require confirmatory laboratory testing before pursuing major additional research investments, they pointed out. Further, while these models predict whether viruses might be able to infect humans, the ability to infect is just one part of broader zoonotic risk, which is also influenced by the virus’ virulence in humans, ability to transmit between humans, and the ecological conditions at the time of human exposure.

According to the authors, “Our findings show that the zoonotic potential of viruses can be inferred to a surprisingly large extent from their genome sequence. By highlighting viruses with the greatest potential to become zoonotic, genome-based ranking allows further ecological and virological characterization to be targeted more effectively.”

“These findings add a crucial piece to the already surprising amount of information that we can extract from the genetic sequence of viruses using AI techniques,” Babayan added. “A genomic sequence is typically the first, and often only, information we have on newly-discovered viruses, and the more information we can extract from it, the sooner we might identify the virus’ origins and the zoonotic risk it may pose. As more viruses are characterized, the more effective our machine learning models will become at identifying the rare viruses that ought to be closely monitored and prioritized for preemptive vaccine development.”

And, as the authors concluded, “Independently of the mechanisms involved, the performance of our models shows how increasingly ubiquitous and low-cost genome sequence data can inform decisions on virus research and surveillance priorities at the earliest stage of virus discovery with virtually no extra financial or time investment … Genome-based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven virus surveillance and increases the feasibility of downstream biological and ecological characterization of viruses.”