
Researchers can train artificial brain-like neural networks to classify images, such as cat pictures. Using a series of manipulated images, the scientists can figure out what part of the image (say, the whiskers) is used to identify it as a cat. However, when the same technology is applied to DNA, researchers are not certain what parts of the sequence are important to the neural net. This unknown decision process is known as a "black box". (Source: Ben Wigler/CSHL, 2021)
The human genome spans more than three billion base pairs of DNA, and the human brain cannot tell at a glance whether a given stretch of sequence binds a transcription factor or lies in an accessible region of the chromatin.
Such functional attributes can, however, be identified by artificial intelligence algorithms called neural networks, which are loosely modeled on the human brain and designed to recognize patterns.
Peter Koo, PhD, assistant professor at Cold Spring Harbor Laboratory (CSHL), and his collaborator Matt Ploenzke, PhD, of the Department of Biostatistics at Harvard University, use a type of neural network called a convolutional neural network (CNN) to develop a way to train machines to predict the function of DNA sequences.
These findings are reported in an article titled "Improving representations of genomic sequence motifs in convolutional networks with exponential activations," published in Nature Machine Intelligence.
The study reports that teaching the neural networks to predict the functions of short sequences enables them to decipher patterns in longer sequences. The initial experiments in the current study are conducted on synthetic DNA sequences; the scientists then generalize the results to real DNA sequences across several in vivo datasets. In future studies the researchers hope to analyze more complex DNA sequences that regulate gene activity critical to development and disease.
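For a concrete sense of what training on synthetic sequences involves, the sketch below (in Python; this is illustrative, not the authors' code, and the motif and sequence lengths are made up) builds a toy dataset: random background DNA with a short motif implanted into half the sequences, each one-hot encoded into the four-channel array a neural network reads.

```python
import numpy as np

BASES = "ACGT"
MOTIF = "TGACGTCA"  # hypothetical binding motif, used only for illustration

def one_hot(seq):
    """Encode a DNA string as an (L, 4) array, one channel per base."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        x[pos, idx[base]] = 1.0
    return x

def make_dataset(n_seqs=1000, length=200, rng=np.random.default_rng(0)):
    """Random background sequences; every other one gets the motif implanted (label 1)."""
    X, y = [], []
    for i in range(n_seqs):
        seq = list(rng.choice(list(BASES), size=length))
        label = i % 2
        if label:
            start = rng.integers(0, length - len(MOTIF))
            seq[start:start + len(MOTIF)] = list(MOTIF)
        X.append(one_hot("".join(seq)))
        y.append(label)
    return np.stack(X), np.array(y, dtype=np.float32)

X, y = make_dataset()
print(X.shape, y.shape)  # (1000, 200, 4) (1000,)
```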
Although artificial intelligence and deep learning scientists have figured out how to train computers to recognize images, they do not yet fully understand how the machines learn to classify objects so quickly and accurately. It is even harder to figure out how machines classify abstract patterns, such as motifs in DNA sequences, because humans cannot readily recognize the right answers themselves.
Machine-learning researchers train neural networks to recognize commonplace objects such as cats or airplanes by repeatedly presenting them with many labeled images. The program is then tested by presenting it with a new image of a cat or an airplane and noting whether it classifies the new image correctly.
Translating the same approach to test whether neural networks can detect sequence patterns in DNA is, however, not straightforward. A human can check whether the network has correctly identified an object like a cat or a dog, because the human brain can draw the same conclusion. That is not the case when detecting biologically meaningful patterns in DNA sequences.
The human brain cannot recognize functional patterns in DNA by eye. So even if a neural network highlights a series of motif patterns in a stretch of DNA, researchers cannot easily tell whether the computer has identified a meaningful pattern correctly.
Human programmers are therefore unable to judge the reasons behind the learning process that the neural networks undergo or the accuracy of the decisions they arrive at. This hidden process, which makes it difficult to trust the network's output, is what researchers refer to as a "black box".
“It can be quite easy to interpret these neural networks because they’ll just point to, let’s say, whiskers of a cat. And so that’s why it’s a cat versus an airplane. In genomics, it’s not so straightforward because genomic sequences aren’t in a form where humans really understand any of the patterns that these neural networks point to,” says Koo.
The authors train CNNs by showing them genomic DNA sequences, a learning process that resembles how the brain processes images.
CNNs have become increasingly popular and represent the state of the art in accurately predicting a variety of regulatory motifs in genomic DNA. Their success is due to their ability to learn patterns directly from the training data. However, as with many other deep learning algorithms, little is understood about the inner workings of CNNs, which is why they are labeled black boxes.
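As a rough picture of how such a network consumes one-hot encoded DNA, here is a minimal Keras sketch; the layer sizes and filter widths are illustrative assumptions, not the published architecture. The first convolutional layer's filters scan along the sequence for short patterns, playing the role of learned motif detectors.

```python
import numpy as np
import tensorflow as tf

def build_cnn(seq_len=200, n_filters=32, filter_width=19):
    """A small sequence CNN: first-layer filters act as motif scanners."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(n_filters, filter_width, padding="same",
                               activation="relu",
                               input_shape=(seq_len, 4)),   # one-hot DNA input
        tf.keras.layers.GlobalMaxPooling1D(),               # strongest match per filter
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),     # e.g., binds / does not bind
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Dummy one-hot sequences and labels just to show the call shapes; in practice
# these would be the synthetic or genomic datasets described above.
X = np.eye(4, dtype=np.float32)[np.random.randint(0, 4, size=(500, 200))]
y = np.random.randint(0, 2, size=500).astype(np.float32)

model = build_cnn()
model.fit(X, y, epochs=2, batch_size=64, validation_split=0.2)
```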
The study introduces a new method that encourages one layer of the CNN to learn important DNA patterns, allowing the network to build on them to identify more complex patterns. Koo's discovery makes it possible to peek inside the black box and identify some of the key features that drive the computer's decision-making.
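The "exponential activations" in the paper's title point to the key modification: the first convolutional layer uses an exponential rather than the usual rectified-linear activation, which the authors report pushes its filters toward cleaner, more interpretable motif representations. Below is a hedged sketch of how that swap might look in the toy model above; the published architecture and training details differ.

```python
import tensorflow as tf

def build_cnn_exp(seq_len=200, n_filters=32, filter_width=19):
    """Toy CNN with an exponential activation in the first layer only."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(n_filters, filter_width, padding="same",
                               activation=tf.math.exp,       # exponential activation
                               input_shape=(seq_len, 4)),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),         # later layers unchanged
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

After training, the first-layer filter weights can be inspected, or visualized as sequence logos, to see which motifs the network has learned, which is one common way of peeking inside the black box.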
But Koo has a larger purpose in mind for the field of artificial intelligence. There are two properties researchers usually try to improve in a neural network: interpretability and robustness. Interpretability refers to the ability of humans to decipher why a machine gives a certain prediction. Robustness is the ability to produce the right answer even when the data contain small mistakes. Usually, researchers focus on one or the other.
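As a simple illustration of what a robustness check could look like in this setting (an illustrative test, not the paper's evaluation protocol), one can randomly mutate a small fraction of bases in each input and ask how much the model's predictions change.

```python
import numpy as np

def mutate(X, rate=0.05, rng=np.random.default_rng(1)):
    """Flip a random fraction of positions to a random base, mimicking
    small errors or perturbations in the input sequences."""
    X = X.copy()
    n, length, _ = X.shape
    mask = rng.random((n, length)) < rate
    random_bases = np.eye(4, dtype=X.dtype)[rng.integers(0, 4, size=(n, length))]
    X[mask] = random_bases[mask]
    return X

# A robust model's predictions should barely move under such perturbations:
#   clean = model.predict(X)
#   noisy = model.predict(mutate(X))
#   print(np.abs(clean - noisy).mean())
```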
“What my research is trying to do is bridge these two together because I don’t think they’re separate entities. I think that we get better interpretability if our models are more robust,” says Koo.
Deep learning has the potential to make a significant impact in basic biology, but the major challenge is understanding the reasons behind its predictions. Koo's research develops methods to interpret black box models, with the goal of understanding the underlying mechanisms of sequence-function relationships in genetic regulation.
Koo hopes that if a machine can find robust and interpretable DNA patterns related to gene regulation, it will help geneticists understand how mutations affect cancer and other diseases. “We have teamed up with other members of the CSHL Cancer Center to investigate the sequence basis of epigenomic differences across healthy and cancer cells,” notes Koo.