On January 11, 2020, the first draft genome of the SARS-CoV-2 virus was posted online by Chinese researchers. Quickly, many of the virus’s genes were quickly determined as it shares a large amount of similarity with other coronaviruses. However, the full complement of protein-coding genes remained unresolved.

Now, researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In the study, several protein-coding genes were confirmed and others were newly identified. The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since it began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.

This work is published in Nature Communications in the paper, “SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes.”

“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” said Manolis Kellis, PhD, professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.

The team used comparative genomics to elucidate the SARS-CoV-2 genome. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (the cause of the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.

Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. The team determined that the region that encodes a gene called ORF3a also encodes an additional gene, which they name ORF3c, which overlaps with ORF3a but in a different reading frame. The role of this new gene, as well as several other SARS-CoV-2 genes, is not known yet.

The authors wrote that they “find strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function.”

The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they also ruled out the possibility that there are any more conserved protein-coding genes yet to be discovered.

“We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” said Irwin Jungreis, PhD, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.

In addition to the annotation work, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that particular gene has evolved in the past with how much it has evolved since the current pandemic began.

They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host, Kellis said.

In one example, the researchers identified a region of the nucleocapsid protein that had many more mutations than expected from its historical evolution patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis said.

They wrote, “Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection.”

“The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein,” he added. “We speculate that those variants that don’t mutate that region get recognized by the human immune system and eliminated, whereas those variants that randomly accumulate mutations in that region are in fact better able to evade the human immune system and remain in circulation.”

The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein, and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

“Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t,” Jungreis said. “So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions.”

The annotated gene set and mutation classifications are available in the University of California at Santa Cruz Genome Browser.

“We can now go and actually study the evolutionary context of these variants and understand how the current pandemic fits in that larger history,” Kellis said. “For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”