In the largest screen to date for alternative genetic codes, a computer program named Codetta scanned more than 250,000 genome sequences from bacteria and archaea to identify five never-before-seen genetic codes.

This work is published in eLife in the paper “A computational screen for alternative genetic codes in over 250,000 genomes.”

Across most of the tree of life, the genetic code is universal. But scientists have discovered a handful of exceptions—alternative genetic codes in some organisms that show that the code can evolve to some degree. Without a comprehensive look at many genomes, it is difficult to draw general conclusions about the evolutionary trajectories of codon reassignment.

Now, Yekaterina (Kate) Shulgina and Sean Eddy, PhD, have developed a method to look at large numbers of genomes. Shulgina is a graduate student in the lab of Eddy, who is a professor of molecular and cellular biology and of applied mathematics at Harvard University and a Howard Hughes Medical Institute investigator.

Codetta is a computational method to predict the amino acid decoding of each codon from nucleotide sequence data. Until now, scientists using similar programs have been able to analyze hundreds of genome sequences. Codetta scales up scientists’ code-cracking ability substantially, letting the team systematically screen nearly all known bacteria and archaea—more than 250,000 genomes—for new genetic codes.

The new method is faster, more rigorous, and more comprehensive than previous efforts, said Ken Wolfe, FRS MRIA, an evolutionary geneticist at University College Dublin who was not involved with the research. “They looked at every genome that’s available for bacteria and archaea—essentially, all the data that exists.”

Surveying the genetic code in over 250,000 bacterial and archaeal genome sequences in GenBank, Shulgina and Eddy discovered five new reassignments of arginine codons (AGG, CGA, and CGG), representing the first sense codon changes in bacteria. Shulgina’s new codes are “going straight into the textbooks,” said Eddy.

Codetta reads a genome, then taps into a database of known proteins to compute a likely genetic code. “My method takes advantage of the fact that a lot is known about what proteins are expected to look like,” Shulgina said. The program can use that information to figure out which three-letter sets, or codons, in a particular genome sequence correspond to which amino acids.
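To make the idea concrete, here is a toy sketch in Python. It is not Codetta's actual algorithm or code: it replaces whatever statistical model the program uses with a simple majority vote, and the function name, the `min_support` threshold, and the input data are all hypothetical. It assumes the hard part is already done, namely that each codon in a genome has been paired with the amino acid a known protein family would be expected to have at that position.

```python
from collections import Counter, defaultdict

def infer_codon_table(observations, min_support=3):
    """Toy majority-vote inference of codon meanings (illustrative only).

    observations: iterable of (codon, expected_amino_acid) pairs, where the
    expected amino acid comes from comparing the genome with known proteins.
    """
    votes = defaultdict(Counter)
    for codon, amino_acid in observations:
        votes[codon.upper()][amino_acid] += 1

    table = {}
    for codon, counts in votes.items():
        best, n = counts.most_common(1)[0]
        # Leave the codon unassigned ("?") when the evidence is too thin.
        table[codon] = best if n >= min_support else "?"
    return table

# Made-up observations: AGG mostly lines up with methionine (M),
# CGT with arginine (R), and CGA is seen too rarely to call.
obs = [("AGG", "M")] * 5 + [("AGG", "R")] + [("CGT", "R")] * 4 + [("CGA", "R")] * 2
table = infer_codon_table(obs)
print(table)
```

In this made-up example, AGG is called as methionine because that is what it most often lines up with, while CGA is left unassigned for lack of evidence. A real screen has to weigh noisy and conflicting evidence far more carefully than a vote count can.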

Their analysis uncovered some surprises. The team discovered five instances where the code for the amino acid arginine was reassigned to a different amino acid. This is the first time scientists have seen such a swap in bacteria. The big question, Shulgina said, is why the code for arginine is so frequently changed. The answer could hint at the evolutionary forces responsible for forging new codes.

The authors write that, in a clade of uncultivated Bacilli, “the reassignment of AGG to become the dominant methionine codon likely evolved by a change in the amino acid charging of an arginine tRNA. The reassignments of CGA and/or CGG were found in genomes with low GC content, an evolutionary force which likely helped drive these codons to low frequency and enable their reassignment.”
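The two quantities in that explanation, GC content and codon frequency, are simple to compute. The Python sketch below is only an illustration with a made-up sequence and hypothetical function names; it is not taken from the paper or from Codetta. It shows how an AT-rich coding sequence can end up with no CGA or CGG codons at all.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq) if seq else 0.0

def codon_frequencies(coding_seq):
    """Relative frequency of each codon, reading in frame from position 0."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}

cds = "ATGAAATTTAGAAGATAA"  # made-up AT-rich coding sequence
freqs = codon_frequencies(cds)
print(f"GC content: {gc_content(cds):.2f}")
print(f"AGA frequency: {freqs.get('AGA', 0.0):.2f}")
print(f"CGA frequency: {freqs.get('CGA', 0.0):.2f}")
```

Here arginine is encoded only by the AT-rich codon AGA, so the GC-rich arginine codons never appear. A codon that has become this rare can, in principle, change meaning with little damage to the organism's proteins.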

The work’s practical implications are immediate: scientists using Codetta, which is freely available, will be able to correctly predict which proteins an organism is making. But the program might unlock more sweeping biological insights too.

Unearthing the full set of genetic codes used across life’s kingdoms could crack open a long-standing biological enigma: how an organism can change its genetic code at all. “There are all kinds of theories out there, but it’s still a real mystery,” Eddy said. “How does this possibly happen?”

Shulgina and Eddy are now on the prowl for even more new codes. Because alternative codes tend to crop up in small genomes, the team plans to turn Codetta loose on viruses and on cellular compartments like mitochondria and chloroplasts. “This is going to be rich hunting ground,” Eddy said.