Scientists have developed an artificial intelligence (AI) model, a neural network, that accurately predicts gene expression in yeast. The team validated the model’s predictions in high-throughput experiments, and the work opens the door to a broad range of scientific questions. The model can help design genes with customized levels of expression for gene therapies or industrial applications, and it can clarify the evolutionary mechanisms that regulate gene expression.

The findings were published in the journal Nature, in an article titled “The evolution, evolvability, and engineering of gene regulatory DNA.”

“This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” said Aviv Regev, PhD, a professor of biology at MIT, core member of the Broad Institute of Harvard and MIT, head of Genentech Research and Early Development, and the senior author of the study.

The researchers employed two key technologies to predict gene expression in the yeast Saccharomyces cerevisiae. The first measured expression of a gene that encodes yellow fluorescent protein (YFP) in yeast cells. The authors built a library of more than 30 million different 80-base-pair promoters (noncoding DNA sequences where transcription of a gene starts) and measured the YFP produced by cells carrying each promoter.

The researchers then used the YFP expression dataset to train an AI system called a convolutional neural network to predict gene expression from promoter sequence. They validated the network’s predictions on a new set of promoters.
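
To make the idea of a convolutional network reading a promoter more concrete, here is a minimal Python sketch of the basic operation. It is not the published model: the single hand-written filter, the function names, and the example promoter are all assumptions made for illustration. A real network learns many such filters, and their weights, from the expression data.

```python
# Illustrative only: one hand-written filter, not the network from the study.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence; one score per position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

def predict_expression(seq, kernel, weight=1.0, bias=0.0):
    """Toy readout: ReLU, global max pooling, then a linear layer."""
    activations = [max(0.0, s) for s in conv1d(one_hot(seq), kernel)]
    return weight * max(activations) + bias

# Hypothetical filter that responds to the motif "TATA" (weights set by hand).
TATA_KERNEL = one_hot("TATA")

# Made-up 80-base promoter: a GC repeat followed by one TATA motif.
promoter = "GC" * 38 + "TATA"
print(predict_expression(promoter, TATA_KERNEL))
```

The filter scores every 4-base window of the promoter, and the pooled maximum is high only when the motif is present somewhere in the sequence.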

The authors then tested the network’s predictions on random starting sequences. In computer-simulated evolutionary cycles, they used these predictions to change the starting sequences over ten rounds, generating promoter sequences predicted to drive very high or very low YFP expression. Using high-throughput assays, the researchers tested 500 computer-generated promoter sequences and confirmed that they drove extreme YFP expression, as the neural network had predicted.
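
The simulated evolution described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' procedure: `toy_oracle` is a made-up stand-in for the trained network, and the greedy rule of accepting the best single substitution in each round is one simple choice among many.

```python
BASES = "ACGT"

def toy_oracle(seq):
    """Stand-in for a trained model (assumption): the fraction of A/T bases."""
    return sum(base in "AT" for base in seq) / len(seq)

def single_mutants(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                yield seq[:i] + base + seq[i + 1:]

def evolve(seq, oracle, rounds=10, maximize=True):
    """Greedy simulated evolution: keep the best single mutant each round."""
    pick = max if maximize else min
    trajectory = [seq]
    for _ in range(rounds):
        seq = pick(single_mutants(seq), key=oracle)
        trajectory.append(seq)
    return trajectory

# A fixed 80-base sequence stands in for a random start so the run is repeatable.
start = "ACGT" * 20
high = evolve(start, toy_oracle, maximize=True)
low = evolve(start, toy_oracle, maximize=False)
print(toy_oracle(start), toy_oracle(high[-1]), toy_oracle(low[-1]))
```

Each trajectory holds the starting sequence plus one sequence per round, so the predicted expression can be tracked as mutations accumulate in either direction.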

“Our study involved cutting-edge machine learning (deep transformer neural network models) and high-throughput experimental (gigantic parallel reporter assays) techniques coupled with state-of-the-art computing infrastructure (tensor processing units),” said Eeshit Vaishnav, a PhD student at MIT and first author of the study.

The network can be used to gain deeper insights into evolutionary mechanisms. For example, the authors showed that three to four mutations are enough to turn a random starting sequence into one that drives very high or very low expression of YFP. They also showed that over half of all yeast genes are stabilized, such that changes in their promoter sequences do not change gene expression.
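
One way to picture this kind of stability is to ask a model about every single-base change to a promoter and count how many leave the predicted expression unchanged. The Python sketch below does that with a made-up stand-in for the model; `toy_oracle`, the `tolerance` parameter, and the example promoter are assumptions for illustration, and the study's own robustness measure may be defined differently.

```python
BASES = "ACGT"

def toy_oracle(seq):
    """Stand-in for a trained model (assumption): counts TATA motifs."""
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))

def single_mutants(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                yield seq[:i] + base + seq[i + 1:]

def robustness(seq, oracle, tolerance=0.0):
    """Fraction of single mutants whose prediction stays within tolerance."""
    baseline = oracle(seq)
    effects = [abs(oracle(mutant) - baseline) for mutant in single_mutants(seq)]
    neutral = sum(effect <= tolerance for effect in effects)
    return neutral / len(effects)

# Made-up 80-base promoter: a GC repeat followed by one TATA motif.
promoter = "GC" * 38 + "TATA"
print(len(list(single_mutants(promoter))))  # 80 positions x 3 alternative bases
print(robustness(promoter, toy_oracle))
```

In this toy case only the substitutions that hit the motif change the prediction, so most single mutations are neutral. A promoter with a high score of this kind would be one whose expression is buffered against mutation.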

Carl de Boer, PhD, an assistant professor at the school of biomedical engineering at the University of British Columbia, is an author of the study.

In addition to predicting how changes in promoter sequences in yeast affect gene expression, the team devised a way of representing which genes will be expressed and how gene expression will affect traits, using two-dimensional mathematical maps called fitness landscapes. These fitness landscapes allow a simpler depiction of past changes in gene expression and could help forecast the future evolution of noncoding sequences in organisms beyond yeast.

“We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” said Regev. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways.”

“I am also excited about the possibilities for machine learning researchers interested in interpretability. They can ask their questions in reverse, to better understand the underlying biology,” Regev added. “I believe these kinds of approaches will be important for many problems—like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations or designing new molecules.”

“Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” said Vaishnav. “This model can serve as an ‘oracle’ in evolutionary studies to conduct and interpret in silico experiments, predict which regulatory mutations affect expression and fitness, design or evolve new sequences with desired characteristics, determine how quickly selection achieves an expression optimum, identify signatures of selective pressures on extant regulatory sequences, visualize fitness landscapes, and characterize mutational robustness and evolvability.”

Martin Taylor, PhD, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, said the study shows that artificial intelligence can predict the effect of regulatory DNA changes and reveal the underlying principles that govern millions of years of evolution.

“There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” Taylor said. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse data sets.”

Vaishnav said, “The paper introduces a general framework for studying and designing gene regulatory DNA sequences to control gene expression. This framework could be applied to other organisms, including humans. For instance, this will eventually enable us to design regulatory sequences that would encode for expression of desired genes only under specified circumstances for gene therapy applications. The paper also addresses several fundamental open questions in the study of gene regulatory sequences, their evolutionary history, and future evolvability.”

Next, the team intends to perform a similar set of experiments to generate models that predict gene expression in human cells.