Making Up Proteins

One of the biggest themes of the last few decades of discovery in biology and chemistry is the constant effort to extract knowledge from the billions of years of evolutionary optimization that we find ourselves surrounded by. It’s not easy, because there are no annotations in the code and no documentation left lying around. We’re left with the output from untold millennia of “Hey, whatever works”. So although the answers to the “Why” and “How” questions come hard, we at least know that when we study living systems in detail we are seeing Things That Have Been Proven To Work.

Protein structure and function (and the underlying RNA and DNA sequences) make a perfect landscape for exploring these ideas. Looking closely, we can start to reconstruct how we ended up with the proteins we have (there are several mechanisms at work), and we’re busy working out why they have the folds and shapes that they do. As it stands, that collection is very large, but it’s a lot smaller than it (theoretically) could be. That’s what allows the current protein-structure prediction programs to work as well as they do: proteins end up using this large-but-finite list of motifs over and over again, and those motifs are associated with particular amino acid sequences.

What software like AlphaFold is doing is very much (in human terms) like looking over a protein sequence and saying “OK, I’ve seen those six or eight amino acids in that order before. . .yeah, that generally makes a turn like this thing here – and when that happens, it can go a couple of different ways. If you get these residues coming up next, it’s generally a short spacer to make room for one of those hinky-looking flattish sheet things coming in at an angle, but if it’s the other ones, the ones with a proline in the middle, then it’s a sign that it’s gonna bend around like so instead, and when it does that it means that there’s usually another sort-of-matching bend later in the sequence that’s going to come around and fit in with it, so I should check for that, too. . .” So just imagine yourself having learned all of those little motifs you could from the existing protein structures and what tends to lead to what and match up with what, and using that knowledge relentlessly and thoroughly at completely inhuman speed and efficiency. And there you are.
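To make that hand-waving a bit more concrete, here’s a deliberately cartoonish sketch of that pattern-lookup intuition in Python. The motif table is invented for illustration, and this is nothing like AlphaFold’s actual machinery (which learns these associations with deep neural networks rather than a lookup table), but it captures the flavor of “seen this local sequence before, expect this local structure”:

```python
# Toy illustration only: the motif-to-structure table below is made up.
# Real predictors learn millions of soft, context-dependent associations;
# this just shows the "what tends to lead to what" bookkeeping in miniature.

MOTIF_GUESSES = {
    "PG": "tight turn",       # proline-glycine pairs often kink the chain
    "VIV": "beta strand",     # runs of branched hydrophobics favor sheets
    "AEL": "alpha helix",     # classic helix-friendly residues
}

def guess_local_structure(seq):
    """Slide over the sequence and report any motif matches."""
    guesses = []
    for k in (2, 3):
        for i in range(len(seq) - k + 1):
            motif = seq[i:i + k]
            if motif in MOTIF_GUESSES:
                guesses.append((i, motif, MOTIF_GUESSES[motif]))
    return guesses

print(guess_local_structure("MAELVIVKPGA"))
# [(8, 'PG', 'tight turn'), (1, 'AEL', 'alpha helix'), (4, 'VIV', 'beta strand')]
```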

But as mentioned above, the number of protein shapes that we have is still nothing compared to the number that could be. So how come we have what we have? Wouldn’t you figure that there could be other folds and loops that would also work, but that for some reason evolution just never got around to? Or perhaps there aren’t? Maybe protein sequence/structure/function relationships have constraints on them that we don’t yet understand? And thus if you ran the whole evolution-of-life thing over and over, you’d always wind up with something recognizably like what we have now? Obviously, no one knows, and we don’t quite have the power as a species to run those experiments, nor the time (nor the funding, come to think of it). But what we can do is try to explore unusual protein geometries and see if they seem to be useful, and shed some light on the problem from that direction.

That’s where this new paper comes in. The thing is, making those new protein sequences is a matter for some thought as well. If you just wander out there, starting at random and looking for function, you can expect to take a rather long time to discover things. I mean, let’s say that you improve on Nature by a thousandfold in speed of experiments: that means that you should have some interesting results in only a few hundred thousand years. So ruling that out, you can take known functional proteins and start mutating them. But the problem there is that your starting point is already optimized in some direction, and the number of changes you need to make to find new activities might take you through some “activity deserts” where the intermediates are nonfunctional. And that comes down to how you’re assaying function, as well. In a living cell, a protein that mutates and loses its initial function is surrounded by huge numbers of possibilities to fit in somewhere else, and occasionally one manages to. But you’re not going to be picking that up in a few targeted in vitro assays, are you? Another option is to try to compute your way through and predict new functions de novo, but let’s be honest: press releases aside, we really don’t have the knowledge to do that yet. In either case, you might well find that most of your hits are things that aren’t very far from where you started.

The paper linked above tries to crack this problem using technology similar to what’s used in computational approaches to human language. If you feed vast amounts of meaningful text into such a system, it will assemble gargantuan lists of correlations. A sentence starting with “I watered the. . .” is far more likely to end with “lawn” or “houseplants” than it is to end with “leopard” or “harpoon”. This is how the predictive-text features on a smartphone messaging platform work. With a bit more context, these things become even more powerful. The phrase “Cut the sodium. . .” will have a different ending if the context is dietary advice than it will if it’s a procedure for a Birch reduction. And this is one part of how systems like ChatGPT work. If the sample size is large enough, the system will have seen human-generated text that has branched off in several directions from that start, and will look for more context to decide whether to go down the “. . .for cardiovascular health” pathway as opposed to the less-common but equally valid “. . .under a layer of inert solvent” one.
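As a minimal sketch of that correlation-counting idea (with a toy corpus I’ve made up for the occasion), just tallying which word follows which already gets you a crude predictive-text engine:

```python
from collections import Counter, defaultdict

# A toy corpus standing in for vast amounts of meaningful text.
corpus = [
    "i watered the lawn",
    "i watered the lawn",
    "i watered the houseplants",
    "cut the sodium for cardiovascular health",
    "cut the sodium under a layer of inert solvent",
]

# Tally which word follows which: a bigram model, one word of context.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict_next(word):
    """Return the most frequently observed continuation."""
    return follows[word].most_common(1)[0][0]

print(predict_next("watered"))  # -> "the"
print(follows["sodium"])        # "for" and "under" tie: one word of context
                                # can't tell diet advice from a Birch reduction
```

With only one word of context, the model can’t choose between the two “sodium” continuations; closing that gap with much longer contexts (and much subtler statistics) is part of what separates a phone keyboard from something like ChatGPT.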

The idea of using such language models on protein sequences is certainly not a new one, and it’s been applied in several different ways before. But the current paper is trying to see if these algorithms are now robust enough to generate plausibly functional proteins, without knowing what function you have in mind. If you have a deep and varied knowledge of “what amino acid tends to come next” in a huge number of situations, you can presumably generate things that kind of look like they should or could be real proteins, but aren’t.

And thus, ProGen, a 1.2-billion-parameter neural network trained on a database of 280 million protein sequences, all tagged with some of that extra-context information (protein family, mechanistic and biological functions, etc.). The team tried this out with lysozyme proteins, which are certainly a class that we know a lot about in both structure and function. They generated one million lysozyme-ish sequences, and then selected 100 of them based on how well the model seemed to generate them and how different they were from known sequences. Their lengths (93 to 179 residues) were certainly comparable to known lysozymes, but they included specific pairwise interactions between amino acids that have never been seen in any natural lysozymes. These were compared to a positive control group, 100 selected from about fifty-six thousand curated lysozymes from the real world.
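The paper’s actual pipeline is more elaborate, but the shape of that generate-then-filter step might look something like the following sketch. Everything here is my assumption rather than the authors’ code: score_fn stands in for the model’s own confidence in a sequence (e.g. its log-likelihood), and difflib is a crude placeholder for real sequence-identity calculations done with proper alignment tools:

```python
import difflib

def max_identity(candidate, known_seqs):
    """Rough stand-in for percent sequence identity: the best similarity
    ratio against any known sequence. Real work would use alignment
    tools (BLAST, MMseqs2) rather than difflib."""
    return max(difflib.SequenceMatcher(None, candidate, k).ratio()
               for k in known_seqs)

def select_candidates(generated, known_seqs, score_fn, n=100, id_ceiling=0.9):
    """Keep the n best-scoring generated sequences that stay below a
    sequence-identity ceiling to anything already known."""
    novel = [s for s in generated if max_identity(s, known_seqs) < id_ceiling]
    return sorted(novel, key=score_fn, reverse=True)[:n]

# Hypothetical usage: a million generated sequences whittled down to 100.
# picks = select_candidates(generated_million, curated_lysozymes,
#                           score_fn=model_log_likelihood)
```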

72% of those natural lysozymes expressed well in cell-free protein synthesis, and 72% of the artificial ones did, too. They then turned these loose on a standard assay of fluorescein-labeled bacterial cell walls, which are engineered to be fluorescence-quenched until the structures are broken up by enzymatic action. 59% of the natural lysozyme proteins met the cutoff for functionality, while 73% of the artificial ones did. Some of those had rather low sequence homology to the natural enzymes, but worked as efficiently as the “real” ones. The different residues were evenly spread across the sequences as well, so it’s not like they clustered into “differences that make no difference” regions (i.e., some of the mutations were in the active sites and in other regions that are known to influence high-level conformational changes). Even going back and picking a new set of sequences deliberately chosen for 40% or lower sequence identity to any known lysozymes still produced some active enzymes.

Now, sequence identity is one thing, but that takes us back to something earlier in this post. Perhaps you can get similar overall structures from very different sequences, and that turned out to be the case here. Using AlphaFold to predict the structures of the new artificial sequences showed that they roughly matched known lysozymes in three dimensions, and that was the case for the low-sequence-identity ones as well. In this case, then, we see that there are far more ways than are known in nature to arrive at more or less the same place, structurally (and functionally).
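As a rough sketch of that kind of check, one might superimpose a predicted structure onto a known lysozyme with Biopython and report a C-alpha RMSD. The file names here are hypothetical, and the naive residue-for-residue pairing below only works for closely comparable chains; the low-identity comparisons in the paper would call for proper structural alignment (TM-align and the like):

```python
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
pred = parser.get_structure("predicted", "artificial_lysozyme_model.pdb")
ref = parser.get_structure("reference", "natural_lysozyme.pdb")

# Collect alpha-carbon atoms; this assumes the chains pair up
# residue-for-residue, which is a big simplification.
pred_ca = [atom for atom in pred.get_atoms() if atom.get_id() == "CA"]
ref_ca = [atom for atom in ref.get_atoms() if atom.get_id() == "CA"]
n = min(len(pred_ca), len(ref_ca))

sup = Superimposer()
sup.set_atoms(ref_ca[:n], pred_ca[:n])  # fixed atoms first, then moving
print(f"C-alpha RMSD after superposition: {sup.rms:.2f} angstroms")
```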

You’re very likely not going to be able to use these techniques, then, to arrive at totally new protein folds doing totally new things. But you can expand what’s known about the pathways that evolution didn’t take. It’ll be interesting to see if some protein classes are more constrained than the lysozymes, for example, and some of them surely are. As an extreme example, consider the photosynthesis protein RuBisCO, which by enzymatic standards just barely seems to work at all and has proven spectacularly difficult to improve by mutation or computational design (but is nonetheless the keystone for most of the life on the surface of the earth). I would not expect to generate a big ol’ list of alternate RuBisCOs, because it seems to be wedged into a pretty tight slot already.

This commentary at Nature calls the technique “hallucinating functional protein sequences”, and that’s pretty accurate. I particularly like the use of a phrase from Frances Arnold in her Nobel lecture, that “today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it”. At both the DNA and the protein level, we can sequence like crazy, so “read” is indeed pretty well taken care of. And thanks to CRISPR and many other editing techniques, we can write fairly well, too. AlphaFold (and RoseTTAFold, etc.) are doing a good job of turning those letters into structures. But what to write? Having a keyboard in front of you is not sufficient to produce a poem, a novel (or, I should add, a blog post).

Jorge Luis Borges gave us the vision of the Library of Babel, the huge (but not infinite!) set of all the same-sized books that could be produced from a given set of letters. Everything that can or could be written down is there, every secret and every truth about everything, along with every version with every minor typographical error, every pernicious error and mistake that could be written down about all of them, and every possible commentary on them as well. The perfectly phrased instructions for finding and learning anything, the ill-phrased ones, the garbled ones, the utterly wrong ones. All there. None of us can write anything down that isn’t in that collection. And there is a Library of Babel of protein sequences, too, where all of that applies in exactly the same way.

But to stick with the language metaphor via Borges, we can recognize the letters in any of the books in the library we might pick up, and read them off in order. We can write down letter combinations and hit “print”. AlphaFold (and RoseTTAFold, etc.) will take those sequences of letters and recognize the similarities to known laundry lists, airport thrillers, holy texts, tax forms, or sonnets, and fit them into the structural categories they know about as well as they can. And what this new ProGen technique will do is take a bunch of known recipes for (say) pasta sauce or bread rolls and produce a bunch of new ones that might look a bit weird at first inspection, but turn out to produce edible and acceptable pasta sauces or breads when you actually try them out, because there are in the end a lot of ways to get there, far more than chefs have ever actually tried.

But using software to create our own amazing dishes (make me something great with scallops in it that no restaurant has ever served anything close to!), our own rousing anthems (find me a melody line that doesn’t make me think of any other song I’ve ever heard!), or our own proteins (build me an enzyme that does this reaction, although it’s never been seen in any living system!). . .that, we’re still working on. It’s really, really hard. But it might not be impossible, either.