The human genome is three billion letters of code, and each person carries millions of variations. Artificial intelligence (AI) programs can find disease-related patterns in the genome far faster than humans can, and they spot things that humans miss. Someday, AI-powered genome readers may even predict the likelihood of diseases ranging from cancer to the common cold.

Unfortunately, AI’s recent popularity surge has led to a bottleneck in innovation, according to Peter Koo, PhD, assistant professor at the Cold Spring Harbor Laboratory.

“It’s like the Wild West right now. Everyone’s just doing whatever the hell they want,” says Koo. AI researchers are constantly building new algorithms from various sources, and it’s difficult to judge whether their creations will be good or bad. After all, asks Koo, how can scientists judge “good” and “bad” when dealing with computations that are beyond human capabilities?

To address this issue, the Koo lab created GOPHER (short for GenOmic Profile-model compreHensive EvaluatoR), a new method that Koo says helps researchers identify the most efficient AI programs for analyzing the genome. “We created a framework where you can compare the algorithms more systematically,” explains Ziqi Tang, a graduate student in Koo’s laboratory.

Members of the team working with Peter Koo, PhD, who is shown rear, center. [Cold Spring Harbor Lab]
The researchers published their work, “Evaluating deep learning for predicting epigenomic profiles,” in Nature Machine Intelligence.

“Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as a binary classification relying on peak callers to define functional activity. Recently, quantitative models have emerged to directly predict the experimental coverage values as a regression. As new models with different architectures and training configurations continue to emerge, a major bottleneck is forming due to the lack of ability to fairly assess the novelty of proposed models and their utility for downstream biological discovery,” write the investigators.

“Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight various modeling choices that affect generalization performance, including a downstream application of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.”
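The robustness idea the authors mention can be illustrated with a toy sketch. This is an illustration of the general concept only, not GOPHER’s actual code: the function name, the predictors, and the shift-based scoring below are all hypothetical stand-ins. The intuition is that a trustworthy model’s output should barely change when the input DNA sequence is nudged by a few positions.

```python
import numpy as np

def robustness_score(predict, sequence, n_shifts=20, max_shift=10, seed=0):
    """Toy robustness metric (a stand-in, not GOPHER's implementation):
    the standard deviation of a model's scalar prediction when the input
    is randomly shifted by a few positions. Stable models score near 0."""
    rng = np.random.default_rng(seed)
    preds = [
        predict(np.roll(sequence, int(rng.integers(-max_shift, max_shift + 1))))
        for _ in range(n_shifts)
    ]
    return float(np.std(preds))

seq = np.random.default_rng(1).normal(size=200)

# A shift-invariant "model" (total signal) is perfectly stable...
stable = robustness_score(lambda s: float(s.sum()), seq)

# ...while a "model" that reads a single fixed position is not.
fragile = robustness_score(lambda s: float(s[0]), seq)

print(stable, fragile)
```

A metric like this can be computed for any candidate model without opening its black box, which is what makes it useful for side-by-side comparisons.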

Method judges AI programs on several criteria

GOPHER judges AI programs on several criteria: how well they learn the biology of our genome, how accurately they predict important patterns and features, their ability to handle background noise, and how interpretable their decisions are. “AI are these powerful algorithms that are solving questions for us,” says Tang. But, she notes: “One of the major issues with them is that we don’t know how they came up with these answers.”
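One of those criteria, predictive accuracy, can be sketched as a simple benchmark loop. Everything below is a loose illustration under assumed names (the `pearson_r` helper and the stand-in “models” are hypothetical, not part of GOPHER): candidate models are scored by how well their predicted profiles correlate with an observed coverage track.

```python
import numpy as np

def pearson_r(pred, obs):
    """Accuracy criterion: correlation of predicted vs observed profile."""
    return float(np.corrcoef(pred, obs)[0, 1])

# Hypothetical candidate "models": simple smoothers standing in for
# trained networks that predict a coverage profile from an input signal.
def model_denoising(x):
    return np.convolve(x, np.ones(25) / 25, mode="same")  # averages out noise

def model_passthrough(x):
    return x  # passes measurement noise straight through

rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 20, 500))      # underlying "biology"
observed = signal + rng.normal(0, 0.3, 500)   # noisy experimental track
inputs = signal + rng.normal(0, 0.3, 500)     # an independent noisy replicate

scores = {
    "denoising": pearson_r(model_denoising(inputs), observed),
    "passthrough": pearson_r(model_passthrough(inputs), observed),
}
for name, r in scores.items():
    print(f"{name}: Pearson r = {r:.2f}")
```

Scoring every candidate on the same held-out data with the same metrics is what turns an apples-to-oranges comparison into a systematic one.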

GOPHER helped Koo and his team dig up the parts of AI algorithms that drive reliability, performance, and accuracy. The findings help define the key building blocks for constructing the most efficient AI algorithms going forward. “We hope this will help people in the future who are new to the field,” says Shushan Toneyan, another graduate student at the Koo lab.

Imagine feeling unwell and being able to determine exactly what’s wrong at the push of a button, says Koo. AI could someday turn this science-fiction trope into a feature of every doctor’s office. Similar to video-streaming algorithms that learn users’ preferences based on their viewing history, AI programs may identify unique features of our genome that lead to individualized medicine and treatments, continues Koo.

The team hopes GOPHER will help optimize such AI algorithms so that researchers can trust they’re learning the right things for the right reasons. “If the algorithm is making predictions for the wrong reasons, they’re not going to be helpful,” says Toneyan.