In this corner: the state-of-the-art enzyme function prediction tool … the Bioinformatic Brawler … the Basic Local Alignment Search Tool for proteins … BLASTp! And in this corner: the contender … the In Silico Kid … contrastive learning–enabled enzyme annotation … CLEAN!
Neither the champ nor the challenger had to be told to have a clean fight. Both had trained to be computational tools that could predict the functions of enzymes from their amino acid sequences. Both had demonstrated that they could go the distance even when a contest involved unstudied or poorly understood enzymes.
So, what was the outcome? By a decision … CLEAN! It earned high scores for accuracy, reliability, and sensitivity. CLEAN, according to its creators at the University of Illinois Urbana-Champaign, promises to advance research in genomics, chemistry, industrial materials, medicine, pharmaceuticals, and more.
“Just like ChatGPT uses data from written language to create predictive text, we are leveraging the language of proteins to predict their activity,” said Huimin Zhao, PhD, a University of Illinois Urbana-Champaign professor of chemical and biomolecular engineering. “Almost every researcher, when working with a new protein sequence, wants to know right away what the protein does. In addition, when making chemicals for any application—biology, medicine, industry—this tool will help researchers quickly identify the proper enzymes needed for the synthesis of chemicals and materials.”
Zhao led the research team that developed CLEAN and evaluated its performance. The team’s results appeared in the journal Science, in a paper titled “Enzyme function prediction using contrastive learning.”
“We present a machine learning algorithm named CLEAN (contrastive learning–enabled enzyme annotation) to assign enzyme commission numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp,” the article’s authors wrote. “The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers—functions that we demonstrate by systematic in silico and in vitro experiments.”
With advances in genomics, many enzymes have been identified and sequenced, but scientists have little or no information about what those enzymes do, said Zhao, a member of the Carl R. Woese Institute for Genomic Biology at Illinois.
Other computational tools try to predict enzyme functions. Typically, they attempt to assign an enzyme commission number—an ID code that indicates what kind of reaction an enzyme catalyzes—by comparing a queried sequence with a catalog of known enzymes and finding similar sequences. However, these tools don’t work as well with less-studied or uncharacterized enzymes, or with enzymes that perform multiple jobs, Zhao said.
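To make the similarity-based approach concrete, here is a minimal Python sketch. It is not BLASTp and not CLEAN: it swaps in a crude shared k-mer (Jaccard) score for a real alignment, and the catalog sequences are invented for illustration. The function names (`parse_ec`, `annotate_by_similarity`) are likewise hypothetical. The one firm fact it relies on is that a fully specified enzyme commission number has four dot-separated fields, running from the broad reaction class down to a serial number.

```python
from typing import Dict, NamedTuple, Optional, Set, Tuple


class ECNumber(NamedTuple):
    """The four hierarchical fields of a fully specified EC number."""
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int


def parse_ec(ec: str) -> ECNumber:
    """Parse a string such as '3.4.21.4'. Partial codes like '1.1.1.-' are not handled."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return ECNumber(*(int(p) for p in parts))


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared k-mers divided by all k-mers; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate_by_similarity(
    query: str, catalog: Dict[str, str], k: int = 3
) -> Tuple[Optional[str], float]:
    """Return the EC label of the most similar catalog sequence and its score."""
    query_kmers = kmers(query, k)
    best_ec, best_score = None, -1.0
    for seq, ec in catalog.items():
        score = jaccard(query_kmers, kmers(seq, k))
        if score > best_score:
            best_ec, best_score = ec, score
    return best_ec, best_score


# Toy catalog: the sequences are made up; the EC strings serve only as labels.
TOY_CATALOG = {
    "MKTAYIAKQRQISFVKSHFSRQ": "1.1.1.1",
    "GSHMLEDPVDAFKLNGTWEE": "3.4.21.4",
}

if __name__ == "__main__":
    ec, score = annotate_by_similarity("MKTAYIAKQRQISFVKAHFSRQ", TOY_CATALOG)
    print(ec, round(score, 2), parse_ec(ec))
```

The sketch also hints at the weakness Zhao describes: a query that resembles nothing in the catalog still receives the label of its closest match, only with a low score, and each query gets a single label even if the enzyme performs several jobs.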
“We are not the first one to use AI tools to predict enzyme commission numbers, but we are the first one to use this new deep-learning algorithm called contrastive learning to predict enzyme function. We find that this algorithm works much better than the AI tools that are used by others,” Zhao said. “We cannot guarantee everyone’s product will be correctly predicted, but we can get higher accuracy than the other two or other three methods.”
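For readers unfamiliar with contrastive learning, the general idea can be sketched in a few lines of Python. The article does not say which loss function CLEAN uses, so the triplet margin loss below is a generic textbook stand-in, not the authors' implementation, and the two-dimensional "embeddings" are invented numbers. The aim of training on such a loss is to place an enzyme closer to one with the same function (the positive) than to one with a different function (the negative).

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = (0.0, 0.0)        # hypothetical embedding of a query enzyme
    same_function = (0.0, 1.0)  # hypothetical enzyme sharing its EC number
    other_function = (3.0, 4.0)  # hypothetical enzyme with a different EC number

    # Well-separated case: nothing left to learn from this triplet.
    print(triplet_margin_loss(anchor, same_function, other_function))
    # Roles swapped: the loss is positive, so training would adjust the embeddings.
    print(triplet_margin_loss(anchor, other_function, same_function))
```

In a real system the embeddings would come from a neural network trained on many such triplets; here they are fixed by hand only to show when the loss is zero and when it is not.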
The researchers validated their tool with both computational and in vitro experiments. They found that the tool could not only predict the function of previously uncharacterized enzymes but also correct the labels of enzymes that the leading software had mislabeled and correctly identify enzymes with two or more functions.
Zhao’s group is making CLEAN accessible online for other researchers seeking to characterize an enzyme or determine whether an enzyme could catalyze a desired reaction.
“We hope that this tool will be used widely by the broad research community,” Zhao said. “With the web interface, researchers can just enter the sequence in a search box, like a search engine, and see the results.”
Zhao said the group plans to expand the AI behind CLEAN to characterize other proteins, such as binding proteins. The team also hopes to further develop the machine-learning algorithms so that a user could search for a desired reaction and the AI would point to a proper enzyme for the job.
“There are a lot of uncharacterized binding proteins, such as receptors and transcription factors. We also want to predict their functions as well,” Zhao said. “We want to predict the functions of all proteins so that we can know all the proteins a cell has and better study or engineer the whole cell for biotechnology or biomedical applications.”