AI Model Creates Functional Proteins From Scratch
A team of researchers has created a new AI system that can design artificial enzymes from scratch. These enzymes showed activity similar to natural ones in lab tests, even though their synthetic amino acid sequences were very different from any known natural protein.
The study demonstrates that natural language processing, originally developed to read and write human language, can also capture some basic principles of biology. The AI program, called ProGen, was created by Salesforce Research and uses next-token prediction to assemble amino acids, one at a time, into artificial protein sequences.
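To make "next-token prediction" concrete, here is a minimal toy sketch in Python. It is not ProGen: the function `toy_next_token_probs` is a hypothetical stand-in for a trained language model, and its "avoid immediate repeats" rule is invented purely for illustration. The only point is the loop, which extends a sequence one amino acid at a time by sampling from a distribution conditioned on what has been generated so far.

```python
import random

# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained model: P(next amino acid | prefix).

    A real model would compute these probabilities with a neural network.
    Here every amino acid is equally likely, except that an immediate
    repeat of the last residue is made less likely (an invented toy rule).
    """
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        weights[prefix[-1]] = 0.2
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def generate(length, rng, start="M"):
    """Grow a sequence one token at a time until it reaches `length`."""
    seq = start
    while len(seq) < length:
        probs = toy_next_token_probs(seq)
        tokens = list(probs)
        next_aa = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        seq += next_aa
    return seq


sample = generate(30, random.Random(0))
print(sample)
```

Swapping the toy function for a model trained on real protein sequences is what would turn this loop into a protein generator; the sampling procedure itself stays the same.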
Scientists said the new technology could surpass directed evolution, the Nobel Prize-winning protein design technology, and that it will boost the 50-year-old field of protein engineering by accelerating the creation of new proteins for applications ranging from therapeutics to plastic degradation.
“The artificial designs perform much better than designs that were based on the evolutionary process,” said James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy, and an author of the work, which was recently published in Nature Biotechnology. An earlier version of the paper has been available on the preprint server bioRxiv since July 2021, where it received several dozen citations before being published in a peer-reviewed journal.
“The language model is learning aspects of evolution, but it’s different than the normal evolutionary process,” Fraser said. “We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that’s incredibly thermostable or likes acidic environments or won’t interact with other proteins.”
The scientists trained the machine learning model on the amino acid sequences of 280 million different proteins from various sources. They then fine-tuned it on 56,000 sequences from five lysozyme families, along with some additional information about these proteins. The model quickly produced a million sequences, and the researchers chose 100 to test based on how similar they were to natural protein sequences and how realistic their amino acid patterns, the sequence equivalent of grammar and meaning, appeared.
From this first group of 100, they made five artificial proteins and tested them in cells, comparing their activity to that of an enzyme from chicken egg whites called hen egg white lysozyme (HEWL). Similar lysozymes are found in human tears, saliva, and milk, where they protect against bacteria and fungi. Two of the artificial enzymes could break down bacterial cell walls with activity comparable to HEWL, yet their sequences were only about 18% identical to each other. The two sequences were about 90% and 70% identical, respectively, to any known protein.
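Figures such as "18% identical" refer to sequence identity: the share of aligned positions at which two proteins carry the same amino acid. The Python sketch below shows one simple way to compute it, assuming the two sequences have already been aligned to equal length, with "-" marking gaps. Skipping gap columns is just one common convention, the example sequences are made up, and the article does not say how the study computed its own figures.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap positions where both sequences match.

    Assumes the inputs are already aligned (same length, '-' for gaps).
    Columns containing a gap in either sequence are skipped, which is
    one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    return 100.0 * matches / compared if compared else 0.0


# Made-up example sequences, not taken from the study.
identity = percent_identity("MKALIV-GLT", "MKTLIVAGLS")
print(f"{identity:.1f}% identical")
```

In practice the alignment step itself matters a great deal, and dedicated alignment tools are normally used to produce the aligned inputs before a calculation like this one.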
Nikhil Naik, Ph.D., Director of AI Research at Salesforce Research, and the paper’s senior author, said: “Sequence-based models are very good at learning structure and rules when you train them with lots of data. They learn what words can go together, and also how they combine.”
The space of possible designs is vast. Lysozymes are small proteins, up to about 300 amino acids long. But with 20 possible amino acids at each position, there are an enormous number (20^300) of possible combinations. That’s more than taking all the humans who ever lived, multiplied by the number of sand grains on Earth, multiplied by the number of atoms in the universe.
Given so vast a space of possibilities, it is remarkable that the model can so readily generate working enzymes.
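The scale of that number is easy to check with a few lines of Python, which handles arbitrarily large integers. The sketch below computes 20^300 exactly and compares it with the product in the analogy above. The three estimates used for the comparison (about 10^11 humans ever born, 10^19 grains of sand, and 10^80 atoms in the observable universe) are rough order-of-magnitude assumptions, not figures from the study.

```python
ALPHABET_SIZE = 20     # standard amino acids
PROTEIN_LENGTH = 300   # upper end of lysozyme length cited in the article


def sequence_space_size(alphabet_size, length):
    """Number of distinct sequences of a given length over an alphabet."""
    return alphabet_size ** length


total = sequence_space_size(ALPHABET_SIZE, PROTEIN_LENGTH)
digits = len(str(total))

# Rough order-of-magnitude assumptions, for comparison only.
HUMANS_EVER_LIVED = 10 ** 11
SAND_GRAINS_ON_EARTH = 10 ** 19
ATOMS_IN_UNIVERSE = 10 ** 80
comparison = HUMANS_EVER_LIVED * SAND_GRAINS_ON_EARTH * ATOMS_IN_UNIVERSE

print(f"20^300 has {digits} digits (roughly 10^{digits - 1})")
print(f"comparison product is roughly 10^{len(str(comparison)) - 1}")
print("sequence space is larger:", total > comparison)
```

Under these assumptions the comparison product is about 10^110, while 20^300 is on the order of 10^390, so the sequence space is larger by hundreds of orders of magnitude.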
Ali Madani, Ph.D., founder of Profluent Bio, a former research scientist at Salesforce Research, and the paper’s first author, said: “This shows we are entering a new era of protein design with the ability to create functional proteins from scratch out-of-the-box. This is a new and versatile tool for protein engineers, and we’re excited to see the therapeutic applications.”
Reference: “Large language models generate functional protein sequences across diverse families” by Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser and Nikhil Naik, 26 January 2023, Nature Biotechnology.