Large language models also work for protein structures
">
The success of ChatGPT and its competitors is based on what are termed emergent behaviors. These systems, called large language models (LLMs), weren't trained to output natural-sounding language (or effective malware); they were simply tasked with tracking the statistics of word usage. But, given a large enough training set of language samples and a sufficiently complex neural network, their training resulted in an internal representation that "understood" English usage and a large compendium of facts. Their complex behavior emerged from a far simpler training.
A team at Meta has now reasoned that this sort of emergent understanding shouldn't be limited to languages. So it has trained an LLM on the statistics of the appearance of amino acids within proteins and used the system's internal representation of what it learned to extract information about the structure of those proteins. The result is not quite as good as the best competing AI systems for predicting protein structures, but it's considerably faster and still getting better.
LLMs: Not just for language
The first thing you need to know to understand this work is that, while the term "language" in the name "LLM" refers to their original development for language processing tasks, they can potentially be used for a variety of purposes. In fact, the term "large" is far more informative, in that all LLMs have a large number of nodes (the "neurons" in a neural network) and an even larger number of values that describe the weights of the connections among those nodes.
The task in this new work was to take the linear string of amino acids that form a protein and use that to predict how those amino acids are arranged in three-dimensional space once the protein is mature. This 3D structure is essential for the function of proteins. Knowing it can help us understand how proteins misbehave after they pick up mutations, or allow us to design drugs to inactivate the proteins of pathogens, among other uses. Predicting protein structures was a challenge that frustrated generations of scientists until this decade, when Google's AI group DeepMind announced a system that, for most practical definitions of "solved," solved the problem. Google's system was quickly followed by one developed along similar lines by the academic community.
Both of these efforts relied on the fact that evolution had already crafted large sets of related proteins that adopted similar 3D configurations. By lining up these related proteins, the AI systems could make inferences about where and what sort of changes could be tolerated while maintaining a similar structure, as well as how changes in one part of the protein could be compensated for by changes in another. These evolutionary constraints let the systems work out what parts of the protein must be close to each other in 3D space, and thus what the structure was likely to be.
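To make the idea of compensating changes concrete, here is a minimal sketch that measures how strongly two columns of an alignment vary together, using mutual information. This is a classic textbook statistic chosen for illustration; it is not the method used by DeepMind's system or its academic counterpart, and the four-sequence alignment is invented toy data.

```python
import math
from collections import Counter

def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    A high value means that knowing the residue in column i tells you a lot
    about the residue in column j, which hints that the two positions
    constrain each other.
    """
    n = len(alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    counts_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy alignment (invented): whenever position 1 switches from K to R,
# position 3 switches from E to D. Positions 0 and 2 never change.
toy_alignment = ["AKLE", "AKLE", "ARLD", "ARLD"]

covarying = column_mutual_information(toy_alignment, 1, 3)
unrelated = column_mutual_information(toy_alignment, 0, 1)
print(f"columns 1 and 3: {covarying:.2f} bits")
print(f"columns 0 and 1: {unrelated:.2f} bits")
```

In this toy case, the pair of columns that always change together scores higher than a pair in which one column never changes. Real alignments are far larger and noisier, and the actual systems use much more sophisticated machinery.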
The reasoning behind Meta's new work is that training an LLM-style neural network could be done in a way that would allow the system to sort out the same type of evolutionary constraints without needing to go about the messy business of aligning all the protein sequences in the first place. Just as the rules of grammar would emerge from training an LLM on language samples, the constraints imposed by evolution would emerge from training the system on protein samples.
Paying attention to amino acids
In practice, the researchers took a large sample of proteins and randomly blocked out the identity of a few individual amino acids. The system was then asked to predict the amino acid that should be present. In the process of this training, the system developed the ability to use information like statistics on the frequency of amino acids and the context of the surrounding protein to make its guesses. Implicit in that context are the things that required dedicated processing in the earlier efforts: the identity of proteins that are related by evolution, and what variation within those relatives tells us about what parts of the protein are near each other in 3D space.
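The masking step can be sketched in a few lines. This is only an illustration of the idea, not Meta's training code: the 15 percent masking fraction, the "#" placeholder, the function name, and the example sequence are all assumptions made for the sketch.

```python
import random

MASK = "#"  # hypothetical placeholder marking a hidden amino acid

def mask_sequence(sequence, fraction=0.15, seed=None):
    """Hide a random subset of amino acids in a protein sequence.

    Returns the masked sequence plus a dict mapping each hidden position
    to the amino acid that was there, which is what a model would be
    trained to predict. The default fraction is an assumption.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]
        residues[pos] = MASK
    return "".join(residues), targets

# Invented example sequence in one-letter amino acid codes.
masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", seed=0)
print(masked)
print(targets)
```

During training, the model would see only the masked string and be scored on how well it recovers the hidden entries; the only information available to it is the surrounding sequence.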
Assuming that reasoning is correct (and Meta was building on earlier research that suggested it was), the trick to developing a working system is getting the information contained in the neural network back out. Neural networks are often considered a "black box," in that we don't necessarily know how they come to their decisions. But that's becoming less true over time, as people build in features like the ability to audit the decision-making process.
In this case, the researchers relied on the LLM's ability to describe what's termed its "attention pattern." In practical terms, when you give the LLM a string of amino acids and ask it to evaluate them, the attention pattern is the set of features that it looks at in order to perform its analysis.
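For readers who want a more concrete picture, the sketch below computes an attention pattern using the standard scaled dot-product formula found in transformer-style models: every position in the sequence scores every other position, and the scores in each row are normalized so they sum to one. The query and key vectors here are invented toy numbers; a real model learns them during training and combines many such patterns across layers.

```python
import math

def attention_pattern(queries, keys):
    """Scaled dot-product attention weights for one attention head.

    queries and keys each hold one vector per sequence position.
    Row i of the result says how much position i attends to each position.
    """
    dim = len(queries[0])
    pattern = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        # Softmax, subtracting the max score for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        pattern.append([e / total for e in exps])
    return pattern

# Toy vectors for a three-residue sequence (invented values).
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for row in attention_pattern(toy_queries, toy_keys):
    print([round(weight, 3) for weight in row])
```

The output is a square grid with one row and one column per amino acid, which is why it is a natural starting point for asking which pairs of amino acids the model treats as related.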
To convert the attention pattern to a 3D structure, the researchers trained a second AI system on proteins with known 3D structures, teaching it to correlate each protein's attention pattern with its actual structure. Since we only have experimentally determined structures for a limited number of proteins, the researchers also used some of the structures predicted by one of the other AI systems as part of this training.
The resulting system was termed ESM-2. Once it was fully trained, ESM-2 was able to ingest a raw string of amino acids and output a 3D protein structure, along with a score that represents its confidence in the accuracy of that structure.