Fooling the Protein Folding Software

Here’s a preprint that’s been out for a while, but I wanted to call attention to it because its subject is of great interest to a lot of researchers: the protein structure predictions of programs like RoseTTAFold and AlphaFold. The paper specifically examines the former for robustness, rather than focusing on the accuracy of the predictions themselves. That is, what happens when you change a given protein’s sequence just a bit and have the program predict the structure again? This is a variety of “adversarial attack”, which is the fancy ML/AI way of describing the sort of tire-kicking that you should give any new research technique, computational or not. What sorts of challenges can it stand up to? When do its outputs start becoming less reliable, and how do you know when you’ve crossed into that zone?

It’s been shown many times that proteins with high sequence homology almost always have very similar structures, and “high homology” can start kicking in even around 40% sequence identity. The exceptions to this tend to be very interesting cases indeed and are worth studying all by themselves, because they’re so rare. In this work, the authors use Blocks Substitution Matrices (BLOSUM) as a measure of similarity. That’s a lookup technique, used for many years now, based on the actual sequence alignments of conserved regions (blocks) in related proteins, showing how likely various substitutions are in nature among such matched sets. There are several of these tables – BLOSUM80 (built from blocks clustered at the 80% identity level) is the one for checking very closely related proteins, while BLOSUM45 is for more distantly related ones. The current paper uses the one in the middle, BLOSUM62. These matrices score any particular amino acid substitution as zero if it shows up just as often as you’d expect by random chance, positively if it’s seen more often than chance in the experimental data, and negatively if it’s seen less often than expected (disfavored in nature).
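
If you want to get a feel for those numbers, here’s a minimal sketch using Biopython’s bundled copy of BLOSUM62. The library choice and the particular residue pairs are mine for illustration; the paper just uses BLOSUM62 as its yardstick for how “natural” a substitution looks:

```python
# A quick look at BLOSUM62 scores, via Biopython's bundled copy of the matrix.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

# A conservative swap: leucine for isoleucine (similar size, both hydrophobic)
print(blosum62["L", "I"])   # positive: seen more often than chance

# A drastic swap: glycine for tryptophan
print(blosum62["W", "G"])   # negative: disfavored in nature

# Keeping the residue the same scores highest of all for that position
print(blosum62["W", "W"])
```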

Now, you could get false readouts in a 3D similarity-scoring system with just a straight read on the spatial coordinates of the structures – for example, if a whole protein subunit is otherwise predicted well but is moved to a different position in space as an entire unit. The authors use a distance-geometry method that isn’t sensitive to that sort of thing, but it’s computationally intensive if you compute it over every pair of amino acids in the proteins. So they focus on the two residues that are predicted to be furthest apart, which takes it from an O(n²) problem down to roughly an O(n) one, and I simply cannot believe that I’m using “Big-O” notation twice in about two weeks’ worth of blog posts. I really must not be living my life in the proper manner.
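
To see why internal distances dodge that rigid-body problem, here’s a toy comparison built on pairwise C-alpha distance matrices – the full O(n²) version, before any shortcut. The paper’s actual metric is more elaborate than this, and all the names here are my own for illustration:

```python
# Compare two structures via their internal distances rather than raw coordinates,
# which makes the comparison blind to rotations and translations of a whole unit.
import numpy as np

def distance_matrix(coords: np.ndarray) -> np.ndarray:
    """All-against-all distances for an (n, 3) array of C-alpha coordinates."""
    diffs = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

def distance_rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMS difference between the two distance matrices: O(n^2) work, but
    unaffected if one structure is shifted or rotated as an entire unit."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    return float(np.sqrt(np.mean((da - db) ** 2)))

# A translated copy of the same structure scores zero, as it should:
rng = np.random.default_rng(0)
structure = rng.normal(size=(100, 3))
shifted = structure + np.array([50.0, 0.0, 0.0])
print(distance_rmsd(structure, shifted))  # ~0.0
```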

Anyway. It’s been shown many times that image-recognition algorithms can be thrown off (often grievously) by careful substitution of a small number of pixels in the original image. It’s believed that pretty much any “trained-up” neural-network system is vulnerable to this sort of thing, although these adversarial attacks will of course vary in type between different systems. The question here is whether a system like RoseTTAFold has a similar weakness, and the answer is yes, indeed. Every neural-network system has its flaws. The authors found that substitution of as few as five amino acids can lead the program to predict very different three-dimensional structures, which is the equivalent of an image-recognition system declaring that a picture of Richard Nixon is in fact a begonia after a few subtle stripes are applied.
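
In cartoon form, the attack amounts to mutating a handful of positions while keeping every swap BLOSUM62-plausible (so the sequence still looks like something nature could produce), then re-running the predictor and seeing how far the structure moves. The sketch below just does random sampling under those assumptions – the paper’s actual search is far more systematic – and the commented-out predict() call is a placeholder for whatever RoseTTAFold wrapper you happen to have, not a real API:

```python
# Generate candidate sequences that differ from the wild type at only a few
# positions, using only substitutions that BLOSUM62 scores as non-negative.
import random
from Bio.Align import substitution_matrices

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def plausible_mutants(sequence: str, n_sites: int = 5, n_candidates: int = 20):
    """Yield sequences mutated at n_sites positions with nature-friendly swaps."""
    for _ in range(n_candidates):
        seq = list(sequence)
        for pos in random.sample(range(len(seq)), n_sites):
            allowed = [aa for aa in AMINO_ACIDS
                       if aa != seq[pos] and BLOSUM62[seq[pos], aa] >= 0]
            if allowed:
                seq[pos] = random.choice(allowed)
        yield "".join(seq)

# for mutant in plausible_mutants(wild_type_sequence):
#     prediction = predict(mutant)   # placeholder for your structure predictor
#     deviation = distance_rmsd(prediction, wild_type_prediction)
```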

This certainly does not mean that RoseTTAFold is useless (nor is AlphaFold, which is surely vulnerable to the same sort of thing but is not openly available for such testing). It just means that you have to check how strong a given structure prediction is. The authors usefully suggest using the most effective adversarial attacks they discovered for exactly that purpose, which gives a direct score of how robust a given prediction is. They found a strong correlation between this robustness and the accuracy of the predicted structures (versus experimental data) when such variations are put in, so this really does look like a measure of trustworthiness. I would expect something like this to become a standard step in the use of these protein predictions – it had better be – as well as a sign of where the algorithms themselves will need some shoring up.
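
If a check like that does become routine, the robustness number itself could be as simple as the worst-case structural deviation across a pool of near-identical sequences, with small meaning trustworthy. That’s my framing of it, not the paper’s exact definition:

```python
# Worst-case deviation as a robustness score: feed in however far each mutant's
# predicted structure landed from the wild-type prediction (e.g. from the
# distance_rmsd sketch above) and keep the largest value. Small is reassuring.
def robustness_score(deviations: list[float]) -> float:
    return max(deviations, default=0.0)

print(robustness_score([0.4, 1.1, 0.7]))  # 1.1 -- the worst case dominates
```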