Scientists may now be one step closer to understanding the inner logic of artificial intelligence (AI) models used in genomics, thanks to a new tool from a team at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (CSHL). In a new paper published this week in Nature Machine Intelligence, the researchers describe a computational tool called Surrogate Quantitative Interpretability for Deepnets (SQUID), which uses surrogate modeling to help interpret how the deep neural networks (DNNs) underlying genomic AI models analyze the genome.
In the paper, titled “Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models,” the developers explain that SQUID uses “simple models with interpretable parameters” to “approximate the DNN function within localized regions of sequence space.” They claim that, unlike other methods, SQUID “removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation.” As evidence of its effectiveness, they present experiments showing that SQUID “consistently quantifies the binding motifs of transcription factors, reduces noise in attribution maps, and improves variant-effect predictions.”
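The core idea is to probe a trained DNN only in the neighborhood of a sequence of interest. Below is a minimal, hypothetical Python sketch of that step: it mutagenizes a wild-type sequence in silico, scores each variant with the DNN being interpreted, and keeps the (sequence, score) pairs for surrogate modeling. The `dnn_predict` function and the other names here are placeholders, not code from the paper.

```python
# Sketch of the "localized region of sequence space" idea: mutagenize a sequence
# of interest, score each variant with the trained genomic DNN, and collect the
# (sequence, prediction) pairs that a surrogate model will later be fit to.
import numpy as np

ALPHABET = np.array(list("ACGT"))
rng = np.random.default_rng(0)

def mutagenize(seq, n_variants=10_000, mut_rate=0.1):
    """Return n_variants copies of `seq`, each with random point substitutions."""
    base = np.array(list(seq))
    library = np.tile(base, (n_variants, 1))
    mask = rng.random(library.shape) < mut_rate              # positions to mutate
    library[mask] = rng.choice(ALPHABET, size=mask.sum())    # random substitutions
    return ["".join(row) for row in library]

def dnn_predict(sequences):
    """Placeholder for the genomic DNN being interpreted (hypothetical)."""
    return rng.normal(size=len(sequences))   # replace with real model predictions

wild_type = "ACGT" * 50                      # 200-bp sequence of interest (toy example)
library = mutagenize(wild_type)              # in silico variant library
y = dnn_predict(library)                     # DNN scores serve as the "measurements"
```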
“The tools that people use to try to understand these models have been largely coming from other fields like computer vision or natural language processing. While they can be useful, they’re not optimal for genomics,” explained Peter Koo, PhD, an assistant professor at CSHL and senior author on the paper. “What we did with SQUID was leverage decades of quantitative genetics knowledge to help us understand what these deep neural networks are learning.”
SQUID works by generating an in silico library of variant DNA sequences, training a surrogate model, known as a latent phenotype model, on that library using the MAVE-NN (Multiplex Assays of Variant Effects Neural Network) software, and then visualizing and interpreting the surrogate model’s parameters. With this tool, scientists can run thousands of virtual experiments simultaneously and identify which algorithms make the most accurate predictions about the variants.
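As a rough illustration of the surrogate-fitting step, the hedged sketch below continues from the earlier one, assuming the open-source `mavenn` Python package. Its global-epistasis (“GE”) regression lets a separate nonlinearity and noise model absorb distortions so that the additive parameters stay interpretable; the specific keyword values here are illustrative rather than taken from the paper.

```python
# Sketch of fitting a latent phenotype surrogate to the in silico library using
# MAVE-NN; assumes `library`, `y`, and `wild_type` from the previous sketch.
import mavenn

model = mavenn.Model(
    L=len(wild_type),                # sequence length
    alphabet="dna",
    gpmap_type="additive",           # one additive effect per position/nucleotide
    regression_type="GE",            # global-epistasis nonlinearity + noise model
    ge_noise_model_type="SkewedT",
)
model.set_data(x=library, y=y)       # variant sequences and their DNN scores
model.fit(epochs=50, batch_size=128, learning_rate=1e-3)

theta = model.get_theta()            # fitted surrogate parameters
additive_effects = theta["theta_lc"] # e.g. per-position, per-base effects (L x 4)
```

The fitted additive parameters can then be visualized, for instance as a sequence logo, to show which nucleotides the DNN treats as most important within that region.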
While virtual experiments can’t exactly replace laboratory tests, “they can be very informative for helping scientists form hypotheses for how a particular region of the genome works or how a mutation might have a clinically relevant effect,” said Justin Kinney, PhD, a CSHL associate professor and one of the co-authors of the study.
The scientists also described using SQUID to study epistatic interactions in cis-regulatory elements as a way to evaluate its performance. To test whether SQUID could handle this task, they “implemented a surrogate model that describes all possible pairwise interactions between nucleotides within a sequence.” They then used the model “to quantify the effects of pairs of putative AP-1 binding sites.” Their results demonstrated that the “pairwise-interaction models” they created yielded more accurate results than “additive surrogate models.” Specifically, SQUID was able to “quantify epistatic interactions that were otherwise obscured by global nonlinearities in the DNN.”
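In MAVE-NN terms, such a pairwise-interaction surrogate can be specified by changing the genotype-phenotype map type. The hedged sketch below, continuing the earlier hypothetical example, shows one way this might look; the interaction parameters it returns are where epistatic effects, such as those between two putative AP-1 binding sites, would appear.

```python
# Sketch of a pairwise-interaction surrogate; reuses `library`, `y`, and
# `wild_type` from the earlier sketches. Keyword values are illustrative.
pair_model = mavenn.Model(
    L=len(wild_type),
    alphabet="dna",
    gpmap_type="pairwise",           # additive terms plus all pairwise interaction terms
    regression_type="GE",
    ge_noise_model_type="SkewedT",
)
pair_model.set_data(x=library, y=y)
pair_model.fit(epochs=50, batch_size=128, learning_rate=1e-3)

theta = pair_model.get_theta()
pairwise_effects = theta["theta_lclc"]   # e.g. (L, 4, L, 4) interaction parameters
```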
Compared with several other methods, SQUID is more computationally demanding, its developers noted. They suggest it may be better suited to in-depth analysis of specific sequences, such as disease-associated loci, than to large-scale genome-wide analyses.