Protein Complex Structure Predictions, Already?

I last wrote about the new advances in protein folding here – as the world has been hearing, computational techniques have moved the protein structure prediction field along more than anyone was expecting, or at least more than anyone was expecting quite yet. In that post, I mentioned that the high level of accuracy the new techniques had achieved meant that we would now be throwing our resources at harder problems, such as protein complexes (that is, predicting how individual proteins assemble into the physical arrays that they so often form in the living cell).

Well, that day has arrived rather sooner than I thought! The Baker group at the University of Washington has published this paper, where they used a combination of their own RoseTTAFold and DeepMind’s AlphaFold not only to generate structures of protein-pair complexes, but to predict, essentially out of thin air, which pairs form complexes in the first place. They used a library of predicted structures from the yeast proteome and searched through over 8 million potential pairings. That led to 1505 predicted complexes: 699 of these are already known and have structures in the PDB, 700 have some validation in the literature (but no detailed structure), and 106 were completely new.
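To give a rough sense of the scale of such a screen, here is a minimal Python sketch of enumerating and scoring candidate pairs from a proteome. The identifiers, the scoring function, and the cutoff below are placeholders of my own, not anything from the paper's actual pipeline:

```python
from itertools import combinations

# Toy stand-ins: in the real screen, each entry would be a full yeast protein
# (its sequence plus a predicted monomer structure), and there would be
# thousands of them. N proteins give N*(N-1)/2 candidate pairs, so a few
# thousand proteins already means millions of pairings.
yeast_proteins = {"YAL001C": "...", "YBR002W": "...", "YGR123C": "..."}

def interaction_score(id_a: str, id_b: str) -> float:
    """Hypothetical scoring function. The actual pipeline derives confidence
    from RoseTTAFold/AlphaFold models of the paired proteins; this placeholder
    just lets the sketch run end to end."""
    return 0.0

SCORE_CUTOFF = 0.75  # assumed threshold, not the paper's actual value

predicted_complexes = [
    (a, b, score)
    for a, b in combinations(sorted(yeast_proteins), 2)
    if (score := interaction_score(a, b)) >= SCORE_CUTOFF
]
print(f"{len(predicted_complexes)} candidate pairs pass the cutoff")
```

Even at that level of abstraction, you can see why the pair list has to be trimmed before anyone runs the expensive structure models on it.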

For the 699 that can be compared to experimental structures, the results are pretty good: 92% of them have half or more of the predicted residue-to-residue interactions present, so the modeling does indeed seem to be on the right track. The team then went on to the 806 others that have no such detailed structures, and they’re releasing those results. Interestingly, these involve some pairs in which proteins of unknown function interact with known proteins (presumably giving some clues to at least some of their actual roles in the cell). There are a few ternary complexes in there, too, where A was predicted to interact with B, and B with C, and the whole triple combination made computational sense as well. Looking at these, you can generally see that there’s a protein in the middle, with two completely different regions interacting with each of the partners, which is what you’d figure: predicting ternary (or larger) complexes where there are higher-order interactions between the partners is going to be a lot more computationally intensive.
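As a rough illustration of that kind of check (my own sketch, not the paper's actual evaluation code), you could score a model by the fraction of its predicted inter-chain residue contacts that show up in the experimental structure:

```python
def contact_recovery(predicted_contacts: set[tuple[int, int]],
                     experimental_contacts: set[tuple[int, int]]) -> float:
    """Fraction of predicted inter-chain residue contacts also present in the
    experimental structure. Each contact is a (residue in protein A,
    residue in protein B) pair, e.g. from a heavy-atom distance cutoff."""
    if not predicted_contacts:
        return 0.0
    return len(predicted_contacts & experimental_contacts) / len(predicted_contacts)

# Toy example: three of four predicted contacts are seen in the solved
# structure, so this model would clear a "half or more of the predicted
# contacts present" bar.
predicted = {(12, 87), (15, 90), (40, 101), (41, 102)}
experimental = {(12, 87), (15, 90), (41, 102), (60, 33)}
print(contact_recovery(predicted, experimental))  # 0.75
```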

And here’s a preprint from a multi-center European team that’s trying a similar sort of thing, but in the human proteome. They used AlphaFold2 to try to predict structures for over 65,000 human protein interaction pairs (identified from earlier experimental evidence), and found over 3,000 where the models came in with high-confidence predictions. 1371 of those have no homology to any known protein complex structures, which is impressive. They’re also seeing signs of repeating patterns in things like phosphorylation sites, which point to regulatory networks that could be involved in the formation of these complexes, to possible hot spots for mutations associated with malfunctions, and more. And they, too, are seeing their data extend to some larger multi-protein assemblies.

I have to say right off that I’m really struck by the way that these software approaches can already be applied to these problems. The single-protein prediction results were of course impressive by themselves, but it really does speak to the power of these techniques that they can be extended to protein-protein complexes. Remember, in all these cases, we’re not looking at new advances in our fundamental understanding of protein interactions, but rather at a spectacular success for modern pattern-recognition algorithms. There is a huge amount of (often high-quality) data to turn these algorithms loose on, as you can see by the way both of these papers ground themselves in the PDB’s vast collection of known protein structures and in the great amount of “protein interactome” experimental work that’s been done over the years. That’s exactly what you need to get them to deliver such results. The constant lesson of AI/ML approaches is that the more data you have for them to work with, and the better curated it is for quality and variety, the better off you are. The other side of that statement is that these approaches can also demonstrate the inescapable truth of the “Garbage In, Garbage Out” rule if you’re not careful, and in a spectacular, relentless, and comprehensive way.

These groups are extremely well aware of that, naturally, and that’s why they’ve been able to generate such eye-catching results. The European group, for example, mentions that their best results came from investigating putative protein-protein interaction pairs that two different experimental techniques could both agree on (affinity-based and complementation-based methods). If you’re going after the human proteome, you’re going to need all the help you can get to narrow down the problem – the Baker group paper notes (correctly) that you’re (1) working with a much greater number of potential interacting pairs to start with, (2) dealing with many proteins for which we have thinner data, because they’re unique to higher organisms, which means that we can’t necessarily see what the deeply conserved structural elements might be across a wider evolutionary range, and (3) dealing with a wider range of homolog/paralog proteins, thanks to gene duplication events and the like, which thins out the knowledge for any given member of the group.
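As a simple illustration of that sort of pre-filtering (a sketch with a couple of well-known human interactions plus made-up placeholders, not the group's actual workflow), you could keep only the candidate pairs that both experimental approaches report before spending any compute on structure prediction:

```python
# Hypothetical interaction lists from two different experimental techniques
# (say, affinity-based and complementation-based assays). EGFR-GRB2 and
# TP53-MDM2 are real, well-characterized interactions; the rest are placeholders.
affinity_pairs = {("EGFR", "GRB2"), ("TP53", "MDM2"), ("AAA1", "BBB2")}
complementation_pairs = {("EGFR", "GRB2"), ("TP53", "MDM2"), ("CCC3", "DDD4")}

# Only pairs supported by both techniques go on to structure prediction.
high_confidence_pairs = affinity_pairs & complementation_pairs
print(sorted(high_confidence_pairs))
# [('EGFR', 'GRB2'), ('TP53', 'MDM2')]
```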

So we have not solved the protein complex prediction problem yet, for sure, but we’ve made a lot more progress in the area than I would have expected. The three factors above are going to have to be addressed by a lot of time, money, and hard work, and there’s another big one to deal with as well: sheer computing power. You’d think that we would have enough of that available to deal with most anything, but the proteome will set you straight on that. Both of the efforts I’m writing about today picked their battles pretty carefully: neither with the yeast nor with the human data sets did they just turn the software loose on every possible protein pairing and go out to lunch. That’s just beyond our capabilities right now. At every step, both teams had to make judgment calls about computational speed versus accuracy of the resulting predictions, and we’re going to be making those tradeoffs for some time to come. Those tradeoffs become especially acute, very quickly, when you start talking about multi-protein complexes, as you’d imagine. So don’t listen to anyone who tells you that we’ve now got all that stuff sorted out, because the human species doesn’t have enough hard data or enough computing power to do anything of the kind yet.

But what we can do is exciting, and we’re just going to get better at it. The successful parts of these experiments will be used to refine new ones, and the continual growth in structural biology data is going to go right into the hopper as well. Chemical biology experiments have been doing nothing but getting more powerful and comprehensive over the years as well, and that sort of this-must-be-interacting-with-this data will also be scooped up and added to these approaches. When you think about it, this is how science works, anyway: you build on past results, made firmer and firmer by confirmation over time, and are able to reach out into new areas, which gradually firm up themselves, and so on. But you get to see this happening spectacularly quickly when it’s being driven by computational power (and by automation, as is the case with so much of the modern structural biology and chemical biology work). It’s like watching time-lapse footage of plants growing and flowering, or buildings being constructed, compared to the way we used to have to do this. 

This work has direct relevance to all the medicinal chemistry efforts to find small molecules that interfere with these protein-protein interactions, naturally. There’s already been a lot of work to try to learn the general rules of which protein surfaces fit together and how, but this will extend that even more. And it bears on a couple of other very hot areas in drug discovery as well: targeted protein degradation and the “molecular glues”. Both of these involve bringing proteins together through the intermediacy of small (and in some cases not-so-small) molecule third parties, and you can be sure that researchers in these areas will be reaching for all the computational help they can get, because both of them have been pretty stubbornly empirical.

But what these calculations can’t tell you, of course, is which protein-protein interactions to disturb, which proteins to degrade, or which proteins to glue together. My larger points from that earlier protein-folding post apply here as well. These protein-interaction studies do not answer the bigger, harder questions about cell biology, and they’re not designed to. They’re going to tell us where to look to better answer those questions, though, and that’s extremely valuable. Knowing that protein X interacts with protein Y tells you nothing (prima facie) about why that interaction is happening, or what it might be doing, or what disease state that might be associated with, or why you should care. But knowing that this interaction exists is a huge clue. You then apply what you already know about these proteins’ functions, their locations in the cell, their homologs in other species, any known mutations of them in animals or humans, and all that other data out there to form hypotheses about where this interaction fits in the gigantic scheme of What’s Going On In A Cell. And then you can see what other processes it might be hooked up to, what that means in a healthy cell and what diseases those might be involved with, which leads you on to a whole new pile of testable predictions. Man, is there ever a lot to do. But we’ve never had better, faster, and more capable tools to do it than we have today, and these protein-structure techniques are very powerful ones.