Not AlphaFold’s Fault

With all the talk about AlphaFold’s impact (some justified, some not), this new paper is certainly of interest. It’s from a team at MIT that’s applying the software’s protein structure predictions to “reverse docking”. The usual form of compound docking calculation starts from a list of candidate compounds and docks each of them into a target protein, generating (one hopes) a rough rank order of the ones most likely to bind. This can be (and is) done on compound collections up to humungous virtual libraries, and this sort of virtual screening has been a long-term goal of computational chemistry in general. As it stands, it can indeed be useful, but there’s no guarantee that it will be for any given case – you have to try it and check the results against experimental evidence to see how things are going. And since there are a lot of ways of running these calculations, you can sometimes find that one approach gives notably better (or, to be fair, notably worse!) results than another, but again, it can be difficult to impossible to know beforehand which way that’s going to go.
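For the non-computational folks, here is a bare-bones sketch of what that conventional forward workflow looks like in code. The dock_and_score function is just a stand-in for whatever docking engine you happen to be running (it is not anything from this paper), and "lower score means better predicted binding" is an assumed convention:

```python
# Sketch of conventional (forward) virtual screening: one target, many compounds.
# dock_and_score() is a hypothetical stand-in for a real docking engine; by the
# convention assumed here, lower (more negative) scores mean better predicted binding.

def forward_screen(target_structure, compound_library, dock_and_score):
    scores = {}
    for compound in compound_library:
        scores[compound] = dock_and_score(compound, target_structure)
    # Rank the library from best (most negative) predicted score to worst.
    return sorted(scores, key=scores.get)
```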

This paper, though, is looking at “reverse docking”, which is taking a given compound and running it past a list of protein structures, in an attempt to find out how many targets it might be hitting and which of those are most important or interesting. It’s fair to say that this is a less common calculation, but it would also be very useful if it could be made more dependable. You could take a crack at identifying targets for compounds that came out of phenotypic screening, and also survey the landscape with your drug candidates to look for other mechanisms and potential toxic interactions. The vast number of protein structure predictions made possible by AlphaFold and related techniques holds out some promise in that direction, since it’s true that one limitation on this idea (but not the only one, for sure) has been that many proteins haven’t had decent structures to dock against. But the same warnings as in the above paragraph apply, of course: one docking approach might be giving you more solid results against Protein #23 while a completely different one might be the right choice against Protein #17, and you still have no good way of knowing this from first principles.
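Reverse docking is basically the same loop turned sideways: one compound, many protein structures, and some cutoff to decide what counts as a predicted hit. Again, a bare sketch with made-up function names and an arbitrary placeholder cutoff, not the paper’s actual pipeline:

```python
# Sketch of reverse docking: one compound, many candidate target structures.
# dock_and_score() is the same hypothetical stand-in as above; the cutoff is an
# arbitrary placeholder, not a validated threshold from the paper.

def reverse_screen(compound, protein_structures, dock_and_score, cutoff=-7.0):
    scores = {name: dock_and_score(compound, structure)
              for name, structure in protein_structures.items()}
    # Keep proteins predicted to bind more tightly than the cutoff,
    # ranked from strongest predicted binding to weakest.
    hits = [name for name, score in scores.items() if score <= cutoff]
    return sorted(hits, key=scores.get)
```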

In this case, the authors took a set of about 39,000 compounds that included known drugs (among them known antibiotics), active natural products, and a range of other diverse structures, and found 218 that showed up as actives in a screen against E. coli cultures (50 micromolar concentration, measuring growth inhibition). I’m a little surprised that there were only 218 positives, but then again, I did several years of antibacterial drug discovery work and I was barely able to even annoy the little creatures, much less really slow down their growth or kill them, so there’s that. They then ran each of these compounds past the structures (as given by AlphaFold) of a large set of essential E. coli proteins. That’s an ambitious thing to do, but that’s what huge data sets like AlphaFold’s are there to enable, right? The 296 essentials were arrived at by consensus scoring of the numerous genome-knockout/knockdown screens that have been done in that species over the years, on the reasonable assumption that any proteins whose loss truly impaired growth were likely to have shown up among the hits in those lists. To their credit, the team also tried a number of different docking and scoring approaches to get around the difficulties mentioned in the last paragraph (or at least to highlight them for particular cases). Combining these various predictions into a consensus model improved things overall, which is good to see. But read on to see what “improved” means in this context, and contemplate what “unimproved” must look like, considering that many virtual screening efforts don’t actually go to this amount of trouble.
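For those wondering what “consensus” can mean in practice, one generic version is rank-averaging across the different docking programs: every compound/protein pair gets a rank from each method, and you average those ranks so that no single scoring function gets to dominate. This is an illustration of the general idea, not a description of the authors’ particular procedure:

```python
import numpy as np

# Generic rank-averaging consensus over several docking methods (an illustration
# of the idea, not the paper's specific procedure). Each method scores every
# compound/protein pair; averaging the ranks damps out any one method's quirks.

def consensus_ranks(score_matrices):
    """score_matrices: list of (n_compounds x n_proteins) arrays, one per method,
    where lower (more negative) docking scores are taken as better."""
    ranks = []
    for scores in score_matrices:
        flat = scores.ravel()
        order = flat.argsort()             # best (most negative) score first
        rank = np.empty_like(order)
        rank[order] = np.arange(flat.size) # position of each pair in that method's ranking
        ranks.append(rank.reshape(scores.shape))
    return np.mean(ranks, axis=0)          # lower mean rank = more consistently favored
```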

About 80% of the actual bacterial-growth screening hits were in fact members of known antibiotic classes, with the remainder being a mix of known cytotoxic compounds and some new wild cards. All in all, this is a good set to try this approach on, because you know in many cases what results you should expect from such a reverse-docking screen. 218 compounds versus 296 proteins gives you >64,000 combinations to calculate (and as noted above, these were done via several different computational methods), and I won’t go into the details of those computational approaches, but to the extent that I can judge them, they look good to me. So this is a pretty solid test and no small amount of effort. It’s especially valuable given the number of internal controls (compounds with known targets and compounds with known binding conformations within those targets). For comparison, they also did the same calculations with 100 randomly selected compounds from the set that showed no effect on bacterial growth at all.
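Just for scale, here’s the back-of-the-envelope arithmetic on how many docking runs a benchmark like this implies. The count of docking methods below is an assumed placeholder (the paper used several, but I’m not pinning down an exact number here):

```python
# Back-of-the-envelope scale of the benchmark described above.
actives  = 218   # growth-inhibiting hits from the ~39,000-compound screen
controls = 100   # randomly chosen inactive compounds run through the same calculations
proteins = 296   # consensus-essential E. coli proteins (AlphaFold structures)
methods  = 4     # placeholder: the paper used several docking/scoring approaches

print(actives * proteins)                         # 64528, the ">64,000" figure
print((actives + controls) * proteins)            # 94128 compound/protein pairs in all
print((actives + controls) * proteins * methods)  # 376512 docking runs, if there were 4 methods
```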

The authors state that “our approach predicted widespread compound and protein promiscuity for both active and inactive compounds”, but the question, as always, is how many of those predictions are false positives. That’s a notorious problem with docking and scoring approaches, and indeed, when you examine the data you find that the number of strong binding interactions predicted is basically the same for the active compounds and the inactive ones. That ain’t good. It’s like the old joke about economists predicting nine out of the last three recessions, but we are (in theory) supposed to understand protein structures and compound binding at a higher level than we understand macroeconomics. The paper looks at a specific set of 142 compound/target interactions from their set that have solid literature support and would be expected to be picked up by this approach, and notes that only 3 of them were predicted to have strong binding. So there are not only a heap of false positives (inactive compounds that are nonetheless predicted to bind to the active sites of key bacterial proteins, just like real antibiotics), but there are also tons of false negatives – interactions that are already known to exist but are not picked up. It’s only when you go up to the most stringent binding energy cutoffs that the modeling performs better than random guessing, and even then it’s not exactly wonderful.
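The way you put numbers on that problem is to sweep the binding-energy cutoff and watch two things at once: the fraction of the literature-backed interactions you recover, and the fraction of inactive-compound pairs that clear the same bar. Something like the sketch below, which is a generic version of that bookkeeping rather than the paper’s exact analysis:

```python
import numpy as np

# Generic false-positive / false-negative bookkeeping for a reverse-docking test
# set (a sketch of the idea, not the paper's exact analysis). Lower (more negative)
# docking scores are assumed to mean stronger predicted binding.

def recovery_vs_controls(known_pair_scores, control_scores, cutoffs):
    """known_pair_scores: scores for literature-backed compound/target pairs
    (e.g. the 142 mentioned above); control_scores: scores for pairs involving
    inactive compounds. Returns one row per cutoff."""
    known = np.asarray(known_pair_scores)
    ctrl = np.asarray(control_scores)
    rows = []
    for cutoff in cutoffs:
        true_pos_rate = float(np.mean(known <= cutoff))   # known interactions recovered
        false_pos_rate = float(np.mean(ctrl <= cutoff))   # inactive pairs called "binders"
        rows.append((cutoff, true_pos_rate, false_pos_rate))
    # The method is only earning its keep where true_pos_rate clearly beats false_pos_rate.
    return rows
```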

Now, one valuable feature of AlphaFold’s output is that each predicted structure comes with a confidence score (pLDDT), an estimate of how reliable that prediction is likely to be. Interestingly, there was no correlation between these confidence levels and the performance of the modeling in this case. I should say here that these disconcerting results do not have to be laid at AlphaFold’s doorstep. In fact, when the docking is run against proteins whose structures are known experimentally, things don’t really look any better! This strongly suggests that the problem is in the docking-and-scoring part of the process, not the protein-structure part. The authors went on to apply machine-learning techniques to that part, trying to see if there was some way to weight the results toward the more useful docking approaches. It looks like a good deal of work went into this, but I would call the results only partly successful. Some ML approaches did seem to improve the accuracy of the predictions (although not to any sort of startling degree), but others didn’t (or were slightly worse), and as usual with these comparisons, there seems to have been no way of predicting a priori which way things would go. You can get a read on the situation with a data set like this one, where there’s a lot of experimental ground truth out there for both the proteins and the compounds, but if you were going out into a less-explored area you’d have no way of knowing whether you were improving your results or not.
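For reference, “machine learning on the scoring step” can be as simple as treating each compound/protein pair as a row of per-method docking scores and training a classifier on the pairs whose true status is known. The sketch below uses scikit-learn’s random forest as an assumed example of that general idea, not as a reconstruction of the authors’ models:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Minimal sketch of ML-assisted rescoring in general (not the authors' exact models):
# each compound/protein pair becomes a feature vector of per-method docking scores,
# and a classifier is trained on pairs whose true binding status is known.

# X: (n_pairs x n_methods) array of docking scores, one column per docking method
# y: 1 if the pair is a known/true interaction, 0 otherwise
def rescore_with_ml(X, y):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    # Cross-validated AUC gives a read on whether the combined scores separate
    # true binders from non-binders any better than the individual methods do.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    model.fit(X, y)
    return model, auc
```

As the post notes, the catch is that this only tells you anything when you already have ground truth to train and validate against, which is exactly what you lack in a less-explored target space.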

So AlphaFold is probably giving us reasonable protein structures here, but what this study highlights is our (very limited) ability to do useful things with them! This is one of the reasons that I keep pouring cold water on the “AlphaFold will revolutionize drug discovery” hype. There are parts of the drug discovery process where protein structures could in theory be very useful, but we have a lot of trouble getting the kind of use out of them that you’d imagine. It would have been interesting to run this paper’s experimental design past some of the “The Revolution Is Here!” hypesters a few months ago, because I am sure that none of them would have predicted what actually happened here. And of course there are plenty of other factors that slow down drug discovery where knowledge of protein structure is of even less help, and some of those are among the biggest problems we face in this business, such as picking the right targets to work on in the first place.

But zooming back to this particular problem – docking and scoring of small-molecule/protein interactions – all I can say is that as ugly as it is now, it has improved over the years. (Which should give you some idea of how ridiculous some of the hype about it has been in the earlier cycles!) And it’s continuing to improve. It obviously has a long way to go still (I mean, look at these results and try to argue otherwise), and we’re nowhere near where some people would like to believe we are. But at the same time, there is no reason that computational approaches like this can’t work. We’re not up against any physical laws here. It’s just a very, very hard problem, even with 2022 hardware and software. It’s been a long road even to get here, and it’s going to be a long road from here on up, but we should be able to get there . . . eventually.