AlphaFold2, RoseTTAFold, and the future of structural biology
August 15, 2021 8:30 PM

Back in 2020, we discussed DeepMind's impressive advances in protein structure prediction, as evinced by their showing at CASP14. They've finally made good on their promise to publish their work on AlphaFold2! In the interim, though, David Baker's group at the University of Washington - perennially one of the top academic groups in the field - got impatient and re-engineered their own protein structure prediction approaches to come up with RoseTTAFold, which appears to be almost as good (and rather less computationally intensive).

Looking for slightly more comprehensible explanations of how it works? The usual science and tech sites (Ars Technica, Stat News, TechCrunch, etc.) have some basic explanations. Getting into more technical explanations that still provide some solid background on both the biology and the computational side, we've got a YouTube explanation that doesn't seem half-shabby, if you're a fan of video, and an article on an AI-focused site that's actually a bit heavier on the bio background. For commentary from researchers closer to the field, check out more from a member of the Oxford Protein Informatics Group (expanding on 2020 thoughts), and from Mohammed AlQuraishi, who also had some ruminations on the CASP14 results ("it feels like one's child has left home.") Broadly, there's no One Weird Trick that makes AlphaFold2 work: it takes a lot of extant ideas from the bioinformatics (multiple sequence alignments and evolutionary co-variation), protein modeling, and machine learning worlds, and engineers them into something really impressive.

How will this change structural biology? As a comment in Nature Structural and Molecular Biology describes, it's actually likely to be a huge boon for people tackling tough experimental structural techniques, which have for a long time used protein models when working with and solving structures from experimental data. And as EMBL-EBI points out, it's going to be helpful for hypothesis generation in a lot of areas of basic research, particularly for people working with protein families that had little or no existing structural information. But as Derek Lowe from In the Pipeline notes, and has discussed previously, it's also possible to oversell what this will mean for, say, drug discovery: determining protein structures is important and has historically often been quite challenging, but there are many parts of biomedical research for which it wasn't really the rate-limiting step. AlphaFold2 and RoseTTAFold also have some very real limits, as highlighted in this FEBS post - their ability to predict protein complexes is limited, and they can't handle proteins that bind cofactors or other non-protein ligands, or that have post-translationally modified amino acids, or that form several different conformations in the actual cell. Given that bioinorganic chemists estimate that a quarter of proteins bind metals - only one of the classes of cofactors under discussion! - this does mean that a significant subset of proteins is going to be less well-predicted than simpler cofactor-free monomeric proteins until AlphaFold2, RoseTTAFold, and similar algorithms undergo more development.

Want to predict some protein structures yourself? For researchers working on model organisms, DeepMind has helpfully already predicted structures for every protein and made them available to the public via EMBL-EBI. For everyone working on something odder, both projects are up on GitHub (AlphaFold2, RoseTTAFold), but the Baker group's also implemented RoseTTAFold on their Robetta server, while DeepMind's put a (slightly limited) version of AlphaFold2 on a public Colab notebook. Researchers have, of course, been digging into these tools and figuring out how to adapt them to work on more complicated problems, like this Colab notebook (dubbed ColabFold on GitHub), which tackles protein-protein complexes. (One of the lead people on that - Sergey Ovchinnikov - has a talk available online discussing both how AlphaFold2 works and their Colab setup.) Despite the caveats, it's all quite exciting.
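If you just want to pull one of those precomputed models rather than run a prediction yourself, something like the following minimal sketch works - with the caveats that the URL pattern and "_v1" version suffix are my assumptions based on the database's download links (they may change), and the UniProt accession here (human hemoglobin alpha) is just a placeholder:

    # Fetch a precomputed AlphaFold2 model from the EMBL-EBI AlphaFold database.
    # Assumptions: the AF-<accession>-F1-model_v1.pdb filename pattern, and the
    # example accession P69905 (human hemoglobin alpha chain) as a placeholder.
    import urllib.request

    accession = "P69905"  # substitute your own UniProt accession
    url = f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v1.pdb"
    with urllib.request.urlopen(url) as response:
        pdb_text = response.read().decode("utf-8")

    with open(f"AF-{accession}.pdb", "w") as handle:
        handle.write(pdb_text)
    print(f"Saved {len(pdb_text.splitlines())} PDB lines for {accession}")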
posted by ASF Tod und Schwerkraft (11 comments total) 23 users marked this as a favorite
 
Keeping my editorializing and additional overly-technical commentary to a, well, comment: I'll note that as someone who's arguably the target audience - I work with totally uncharacterized protein families that are very much not found in model organisms - I have, unsurprisingly, been spamming all available servers with my protein sequences, even knowing they may or may not get the metal-binding and cofactor sites or protein-protein interactions quite right. My expectations were actually a little low (my protein families are challenging!), but the consistency in results from AlphaFold2 and RoseTTAFold has been striking, especially compared to the best previous-generation efforts like trRosetta from the Baker group at the University of Washington or C-I-TASSER from the Zhang lab at the University of Michigan: ångström-level RMSDs between the best 5 models, not arrays of wildly different structures. The results definitely can't replace the need for crystal structures yet, at least for my proteins, but the ways they reflect my experimental data are pretty intriguing and are certainly giving me reason to prioritize some hypotheses over others, and they are remarkably consistent between AlphaFold2 and RoseTTAFold. So while I do think the "we've solved protein folding!!1!!1" announcements are slightly overblown (I suspect a solid majority of proteins qualify as so-called edge cases that aren't optimal candidates for structure prediction, namely proteins with cofactors, partners, context-dependent structures, or post-translational modifications), we're definitely much, much closer than I would have expected even a year ago.
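For anyone who wants to check that kind of agreement on their own predictions, here's a minimal sketch - assuming Biopython is installed, and using "model1.pdb" and "model2.pdb" as placeholder filenames for two single-chain models with matching residue numbering - that superposes the Cα traces and reports the RMSD:

    # Superpose two predicted models on their shared C-alpha atoms and report
    # the RMSD. "model1.pdb" and "model2.pdb" are placeholder filenames.
    from Bio.PDB import PDBParser, Superimposer

    parser = PDBParser(QUIET=True)
    s1 = parser.get_structure("m1", "model1.pdb")
    s2 = parser.get_structure("m2", "model2.pdb")

    # Collect C-alpha atoms keyed by residue number, then keep the shared set.
    ca1 = {res.id[1]: res["CA"] for res in s1[0].get_residues() if "CA" in res}
    ca2 = {res.id[1]: res["CA"] for res in s2[0].get_residues() if "CA" in res}
    shared = sorted(set(ca1) & set(ca2))

    sup = Superimposer()
    sup.set_atoms([ca1[i] for i in shared], [ca2[i] for i in shared])
    print(f"C-alpha RMSD over {len(shared)} residues: {sup.rms:.2f} Å")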

On a less bio-heavy level, I admit I'm really relieved to see that DeepMind's released their work not just as a publication but as open code on GitHub and Colab, since lack of transparency was a big concern when the CASP14 results came out in November. Similarly, I'm really relieved that RoseTTAFold is coming up with very similar results for my proteins (i.e. results seem less likely to be the figment of a lone machine learning instance over-trained on the contents of the protein [structure] data bank), and the Baker group (and the Zhang group, which I imagine has probably also been working on an AlphaFold-influenced I-TASSER variant) have always done a good job of not just making their algorithms public but of running public servers to make structure prediction easily accessible to other researchers.
posted by ASF Tod und Schwerkraft at 9:45 PM on August 15, 2021 [12 favorites]


I am very far from the research these days--are they doing anything with post-translational modifications and/or IDPs?

edit: sorry, you addressed that in your description! They are not (doing it well) :(
posted by Anonymous at 6:12 AM on August 16, 2021


For people who aren't in the weeds of protein folding but are interested in the implications of machine learning getting good at stuff, I recommend the "more from a member of the Oxford Protein Informatics Group" link by Carlos Outeiral Rubiera, particularly the second half under "Second act: it’s all about the goss."

Back when AlphaZero came out - when Google got state-of-the-art results in chess using an engine developed for Go - it sort of stood as an argument by DeepMind that hard problems are about to get easy. Existing chess engines represent many years of many people's expert time encoding knowledge about chess; AlphaZero encoded no knowledge about chess beyond the rules and still won, and isn't that something?

Rubiera's discussion doesn't exactly dispute that, because a similar thing happened here -- Google took a huge leap in the state of the art, without (as I understand it) starting with any additional insight into the underlying problem but only with better machine learning engineering. But it adds the wrinkle that only Google could have done it -- the "secret sauce" to make protein structure prediction useful was having some large fraction of the most skilled machine learning people in the world working on the problem, along with more compute power for exploration than most anyone else could have spent. To put it another way, Google could do this, but the entire world including Google couldn't have done very many other things like this at the same time.
posted by john hadron collider at 6:41 AM on August 16, 2021 [1 favorite]


john hadron collider: Existing chess engines represent many years of many people's expert time encoding knowledge about chess; AlphaZero encoded no knowledge about chess beyond the rules and still won, and isn't that something?

Do I remember correctly that the currently most powerful chess engine is Stockfish + NNUE, i.e. one which combines expert human knowledge, brute force calculation, and a neural network?
posted by clawsoon at 6:49 AM on August 16, 2021


RoseTTAFold looks great. I think it's very important that there's an open source / free software parallel to the machine learning work being done by Google/DeepMind and OpenAI. RoseTTAFold has a non-commercial-only license on it - that's a shame, but it's better than nothing.

There's a download of the linked RoseTTAFold paper from the University of Cambridge. It didn't take a lot to train RoseTTAFold: "Using eight 32GB V100 GPUs, it took about four weeks to train the model up to 200 epochs". That training time would cost about $70,000 to $100,000 if you lease the computers from a cloud computing service.

The training set seems awfully small for naive machine learning: 23,000 clusters / 200,000 protein chains. But I don't know much about the protein folding problem. I imagine there's just not a lot of training data available! AlphaFold trained on 170,000 protein structures, so that sounds like it's in the same ballpark.
posted by Nelson at 7:28 AM on August 16, 2021


That training time would cost about $70,000 to $100,000 if you lease the computers from a cloud computing service.

Of course, you will never just train one model if you're developing new methods. The DeepMind effort here effectively says 'hey, guys, this problem has a solution,' which then makes it much easier to justify sinking the time and money into training new, leaner models to solve it.

Also worth noting that the first solution in ML space is allllllways massively larger than it really needs to be. The original WaveNet took 20 minutes of server time to produce 1 second of audio; current WaveRNN can produce 16 kHz audio faster than realtime on a single thread in a phone. For the DeepMind folks, the least-risky option is to throw obscene amounts of compute at the problem and see if it works: this removes most of the second-guessing about whether compute is the bottleneck.
posted by kaibutsu at 7:52 AM on August 16, 2021 [3 favorites]


Nice to get some extra perspective on this. There was a lot of news in a short amount of time. I wrote it up (my article is in there...) but still feel I've only scratched the surface. Thanks for adding this context!
posted by BlackLeotardFront at 11:04 AM on August 16, 2021


ASF Tod und Schwerkraft: the consistency in results from AlphaFold2 and RoseTTAFold has been striking, especially compared to the best previous-generation efforts like trRosetta from the Baker group at the University of Washington or C-I-TASSER from the Zhang lab at the University of Michigan: ångström-level RMSDs between the best 5 models, not arrays of wildly different structures.

As someone in chemical engineering, but not with a protein folding background, is having very consistent structures a good thing? Given that the models weren't built to deal with cofactors or other species binding to the protein, does this agreement imply these methods have found the right structure? Or do these methods all settle into an attractive wrong structure?

I guess all of these models are opaque black boxes, so you can't say something like 'this algorithm is known to neglect these important effects, and so its odd structure is likely because our protein has that going on'.
posted by crossswords at 11:27 AM on August 16, 2021


(BTW my estimate of $70,000 to $100,000 may be eight times too big; it may be more like $10,000ish. I specced eight computers, each configured with a single V100 GPU. But I missed the thing about each machine being "GPU dies: 8 NVIDIA_TESLA_V100". It might be that you only need one machine like that to match the 8 GPUs the paper mentions.)
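The corrected back-of-the-envelope math looks roughly like this - with the big caveat that the $2-3 per V100 GPU-hour rate is my assumption, since on-demand pricing varies a lot by provider and region:

    # Rough cloud-cost estimate for the training run the paper describes:
    # 8 V100 GPUs running for about four weeks. The $/GPU-hour rates below
    # are assumed on-demand prices, not quotes from any specific provider.
    gpu_count = 8
    hours = 4 * 7 * 24  # four weeks of wall-clock time
    for rate in (2.0, 3.0):  # assumed $/GPU-hour
        cost = gpu_count * hours * rate
        print(f"${cost:,.0f} at ${rate:.2f}/GPU-hour")
    # Prints roughly $10,752 and $16,128 - i.e. the "$10,000ish" ballpark.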
posted by Nelson at 2:16 PM on August 16, 2021


Or do these methods all settle into an attractive wrong structure?

They can! If you are like me and have STRONG FEELINGS about the influence of, say, post-translational modifications on protein conformation, then someone producing a computed structure of something that is TOTALLY hyper-glycosylated in real life might make your eye twitch. That doesn't make this development not an achievement, though!
posted by Anonymous at 2:26 PM on August 16, 2021


As someone in chemical engineering, but not with a protein folding background, is having very consistent structures a good thing? [...] Or do these methods all settle into an attractive wrong structure?

What schroedinger said! I will not be surprised if the models I'm getting are off by more than DeepMind would like to advertise, because I do have STRONG FEELINGS about the importance of cofactors and protein-protein interactions (not expecting much in the way of PTMs on these particular proteins, but those can obviously be key too). And they might be just utterly wrong, too, having settled on some computationally attractive but incorrect structure, and there really isn't any way of being sure yet whether that's going to be the case.

But if it is, it is a qualitatively different failure mode than the previous failure modes I was seeing on structurally uncharacterized protein families using older prediction tools, which were essentially the algorithms flailing and coming up with a bunch of unrelated nonsense, and then coming up with different unrelated nonsense when I submitted, say, distant homologs from the same protein family. With these new systems, I'm seeing consistency between the predicted best models for a given protein sequence, between models for distant relatives from the same family, between models that take into account the partner protein (and which always place it in the same location, oriented similarly) and models that don't, and between models produced by AlphaFold2 and by RoseTTAFold (i.e. it's not just DeepMind's black box coming up with this protein fold - two distinct systems agree on it, though they are two systems that use similar approaches). Putting all of that together has left me willing to at least consider the possibility that there's something to the models, at least in terms of the overall protein fold, if not necessarily ångström-level accuracy. Arguably more importantly, I've done a bunch of biochemistry on this protein family, and some of the information I have (altering certain sets of amino acids alters my cofactor binding or my enzyme activity) is potentially consistent with the models, wherein key amino acids are located in a way that would potentially make sense for the active site of my protein, the substrates and cofactors it binds, etc. It's enough to make me consider prioritizing certain experiments over others - and those experiments themselves may help support (or refute!) the models. And, well, in the meantime: crystallography screening!

I should note that there may also arguably be some "implicit" training regarding cofactors, even if AlphaFold2 and RoseTTAFold aren't explicitly trying to place them. The presence of, say, a heme cofactor is going to put evolutionary pressure on a protein family to maintain a heme-binding axial ligand (only a handful of amino acid types are likely to show up here), a surrounding environment that's friendly to a porphyrin ring, possibly a second axial ligand within about 6 ångströms and positioned in a certain way, and so on. So even without the cofactor, it's certainly conceivable that these systems might pick up indirectly on the sequence and structural constraints a cofactor-containing protein family has, even if they're not directly predicting the cofactor itself. It's also true that if similar cofactor binding sites have already been structurally characterized, both systems can draw on those existing templates. But the extent to which these systems may be indirectly accounting for cofactor binding sites - and particularly the extent to which it's true for rarer cofactors or cofactors bound by a new protein fold - remains to be seen.
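As a very crude illustration of the kind of geometric constraint I mean, here's a toy sketch - assuming Biopython, a placeholder filename "predicted_model.pdb", and my own rough 4-7 Å window for a pair of axial ligands flanking a heme iron - that flags candidate ligand pairs in a cofactor-free model. It's the sort of sanity check I might run on my own predictions, not anything AlphaFold2 or RoseTTAFold does internally:

    # Toy scan of a predicted (cofactor-free) model for residue pairs that could
    # plausibly serve as the two axial ligands of a heme iron. The 4-7 Å window
    # and the His/Met/Cys candidate list are rough assumptions for illustration.
    from Bio.PDB import PDBParser

    LIGAND_ATOMS = {"HIS": ("NE2", "ND1"), "MET": ("SD",), "CYS": ("SG",)}

    parser = PDBParser(QUIET=True)
    model = parser.get_structure("pred", "predicted_model.pdb")[0]  # placeholder file

    candidates = []
    for res in model.get_residues():
        for atom_name in LIGAND_ATOMS.get(res.get_resname(), ()):
            if atom_name in res:
                candidates.append((res, res[atom_name]))

    for i, (res_a, atom_a) in enumerate(candidates):
        for res_b, atom_b in candidates[i + 1:]:
            if abs(res_a.id[1] - res_b.id[1]) < 4:
                continue  # skip residues that are close in sequence
            distance = atom_a - atom_b  # Bio.PDB atoms subtract to a distance in Å
            if 4.0 <= distance <= 7.0:
                print(f"{res_a.get_resname()}{res_a.id[1]}/{atom_a.get_id()} - "
                      f"{res_b.get_resname()}{res_b.id[1]}/{atom_b.get_id()}: "
                      f"{distance:.1f} Å")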
posted by ASF Tod und Schwerkraft at 4:26 PM on August 16, 2021



