Although machine learning (ML) and other artificial intelligence tools are useful for analyzing the massive amounts of data generated by sequencing technologies, most ML tools are difficult for non-experts to access and use. Recently developed automated machine learning (AutoML) methods can automate the design and deployment of ML tools, but they still require a certain amount of expertise.
Now, a group of scientists at the Wyss Institute for Biologically Inspired Engineering at Harvard University and MIT has built a new AutoML platform designed for biologists with little to no ML experience. The platform, BioAutoMATED, can use sequences of nucleic acids, peptides, or glycans as input data, and its performance is comparable to that of other AutoML platforms while requiring minimal user input.
The platform is described in Cell Systems in the article, “BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences.”
“Our tool is for folks who don’t have the ability to build their own custom ML models, who find themselves asking questions like, ‘I have this cool data set, will ML even work for it? How do I get it into an ML model? The complexity of ML is what’s stopping me from going further with this data set, so how do I overcome that?’,” said Jackie Valeri, a graduate student in the lab of Wyss Core Faculty member Jim Collins, PhD. “We wanted to make it easy for biologists and experts in other domains to use the power of ML and AutoML to answer fundamental questions and help uncover biology that means something.”
To build an all-in-one AutoML tool for biology, the team modified three existing AutoML tools that each use a different approach for generating models: AutoKeras, which searches for optimal neural networks; DeepSwarm, which uses swarm-based algorithms to search for convolutional neural networks; and TPOT, which searches over non-neural-network models using a variety of methods including genetic programming and self-learning. BioAutoMATED produces standardized output results for all three tools, so that users can easily compare them and determine which type produces the most useful insights from their data.
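To illustrate why standardized outputs make comparison easier, here is a minimal sketch that scores predictions from several models with one shared metric (the coefficient of determination, R²) and ranks them. The function names, the choice of R², and the toy numbers are assumptions made for illustration; this is not BioAutoMATED's actual output format or code.

```python
from typing import Dict, List, Sequence, Tuple


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def rank_models(
    y_true: Sequence[float],
    predictions: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every model on the same held-out data; best score first."""
    scores = [(name, r_squared(y_true, y_pred))
              for name, y_pred in predictions.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy held-out measurements and made-up predictions (hypothetical values).
    held_out = [0.1, 0.4, 0.5, 0.9]
    toy_predictions = {
        "neural_net": [0.2, 0.4, 0.6, 0.8],
        "conv_net": [0.1, 0.5, 0.5, 0.9],
        "non_neural": [0.5, 0.5, 0.5, 0.5],
    }
    for name, score in rank_models(held_out, toy_predictions):
        print(f"{name}: R^2 = {score:.3f}")
```

Because every model is scored against the same held-out data with the same metric, the ranking reflects differences between the models rather than differences in how each tool reports its results.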
The team built BioAutoMATED to be able to take as inputs DNA, RNA, amino acid, and glycan sequences of any length, type, or biological function. BioAutoMATED automatically pre-processes the input data, then generates models that can predict biological functions from the sequence information alone.
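One common way to pre-process sequences for ML is one-hot encoding with padding to a shared length. The sketch below shows that general technique under stated assumptions: the alphabets, the function name, and the zero-padding scheme are illustrative choices, not a description of BioAutoMATED's internals. Glycans are typically branched structures and would need a richer representation than this character-level encoding.

```python
from typing import Dict, List

# Illustrative alphabets (assumption; not taken from BioAutoMATED).
ALPHABETS: Dict[str, str] = {
    "dna": "ACGT",
    "rna": "ACGU",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
}


def one_hot_encode(sequences: List[str], alphabet: str) -> List[List[List[int]]]:
    """Encode sequences as nested lists of shape
    (n_sequences, max_len, alphabet_size).

    Shorter sequences are right-padded with all-zero rows. Characters
    outside the alphabet (for example 'N') also become all-zero rows.
    """
    index = {char: i for i, char in enumerate(alphabet)}
    max_len = max((len(seq) for seq in sequences), default=0)
    encoded = []
    for seq in sequences:
        rows = []
        for pos in range(max_len):
            row = [0] * len(alphabet)
            if pos < len(seq):
                char = seq[pos].upper()
                if char in index:
                    row[index[char]] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded


if __name__ == "__main__":
    batch = one_hot_encode(["ACG", "T"], ALPHABETS["dna"])
    for seq_rows in batch:
        print(seq_rows)
```

Padding to a common length lets sequences of different sizes share one fixed-shape input, which is what most model types expect.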
To test-drive their new framework, the team first used it to explore how changing the sequence of the ribosome binding site affected translation efficiency in E. coli. They fed their sequence data into BioAutoMATED, which identified a model generated by the DeepSwarm algorithm that could accurately predict translation efficiency. This model performed as well as models created by a professional ML expert, but was generated in just 26.5 minutes and required only ten lines of input code from the user (other models can require more than 750). They also used BioAutoMATED to identify which areas of the sequence seemed to be the most important in determining translation efficiency, and to design new sequences that could be tested experimentally.
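One generic way to ask which positions in a sequence matter most to a trained model is in silico mutagenesis: substitute each position in turn and measure how much the model's prediction changes. The sketch below illustrates that idea only; the article does not say which interpretation method BioAutoMATED uses, and `toy_score` is a hypothetical stand-in for a trained model, not a real predictor of translation efficiency.

```python
from typing import Callable, List


def position_importance(
    sequence: str,
    score: Callable[[str], float],
    alphabet: str = "ACGU",
) -> List[float]:
    """Mean absolute change in score over all single substitutions
    at each position of the sequence."""
    baseline = score(sequence)
    importances = []
    for pos, original in enumerate(sequence):
        deltas = []
        for char in alphabet:
            if char == original:
                continue  # skip the unmutated character
            mutant = sequence[:pos] + char + sequence[pos + 1:]
            deltas.append(abs(score(mutant) - baseline))
        importances.append(sum(deltas) / len(deltas) if deltas else 0.0)
    return importances


def toy_score(seq: str) -> float:
    """Hypothetical model: counts G at positions 2 to 4 (0-indexed)."""
    return float(seq[2:5].count("G"))


if __name__ == "__main__":
    for pos, value in enumerate(position_importance("AAGGGAA", toy_score)):
        print(pos, round(value, 2))
```

Positions whose substitutions barely move the prediction score near zero, while positions the model relies on stand out, giving candidates for experimental follow-up.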
They then moved on to trials of feeding peptide and glycan sequence data into BioAutoMATED and using the results to answer specific questions about those sequences. The system generated highly accurate information about which amino acids in a peptide sequence are most important in determining an antibody’s ability to bind to the drug ranibizumab (Lucentis), and also classified different types of glycans into immunogenic and non-immunogenic groups based on their sequences.
“Ultimately, we were able to show that BioAutoMATED helps people 1) recognize patterns in biological data, 2) ask better questions about that data, and 3) answer those questions quickly, all within a single framework—without having to become an ML expert themselves,” said Katie Collins, a graduate student at the University of Cambridge who worked on the project while an undergraduate at MIT.
As with any other ML tool, predictions from models generated with BioAutoMATED need to be experimentally validated in the lab whenever possible. The team is hopeful that BioAutoMATED could be further integrated into the ever-growing set of AutoML tools, one day extending its function beyond biological sequences to any sequence-like object, such as fingerprints.
“Machine learning and artificial intelligence tools have been around for a while now, but it’s only with the recent development of user-friendly interfaces that they’ve exploded in popularity, as in the case of ChatGPT,” said Jim Collins. “We hope that BioAutoMATED can enable the next generation of biologists to faster and more easily discover the underpinnings of life.”