Tiny Proteins, Getting Sorted Out

I last wrote about the world of tiny proteins here, but there’s a lot more going on now. To recap, the “official” figure for the number of human genes that code for proteins is between 19 and 20 thousand (I think that the current number is 19,370). I can never think about such figures without remembering the days in the late 1990s when the rush to sequence the human genome was on. Younger observers may not realize it, but the estimates at the time for how many such genes there would be invariably ran much, much higher than that number. It’s a monument to our times that the USPTO received far more gene patent applications than the number of genes that actually turned out to exist.

But that number is in the end artificial, because it’s up to us to decide what a gene is. There are “start” and “stop” codons in the genome, and promoters and enhancers and all sorts of markers in the sequences that tell you that a real open reading frame is coming up (or ending), and you can of course look for the mRNA that gets transcribed off and the proteins that it gets translated into. But there’s a lot of variety in those things, starting with exons, introns, and splicing variations, leading to some fuzziness about what gets included on the list. A really big cutoff is sheer length: the line is drawn at proteins of 100 amino acids or longer, because it gets harder and harder (computationally) to work with the shorter ones. Longer sequences let you do tremendously useful comparisons for point mutations and other changes that illustrate evolutionary relationships and suggest classes of protein function, but as you go to shorter lengths the statistics get nasty. Too many things look like too many other things! In fact, every human gene shares at least one 15-mer sequence with some other human gene, and on average any given gene matches over 300 others at the 15-mer level. The 100-residue cutoff was agreed on to avoid slipping off into this mass of numerical noise.
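To make the "too many things look like too many other things" problem concrete, here is a minimal sketch of how you might count which sequences share at least one k-mer. The three sequences are made up for illustration, the function names are my own, and I'm using nucleotide strings only because they're easy to type (the same code works on amino acid strings). This is not how the 15-mer figures above were generated, just the general idea.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_partners(genes, k):
    """Map each gene name to the set of other genes sharing at least one k-mer."""
    index = defaultdict(set)  # k-mer -> names of the genes that contain it
    for name, seq in genes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    partners = {name: set() for name in genes}
    for names in index.values():
        for name in names:
            partners[name] |= names - {name}
    return partners


# Toy sequences, invented for this example (not real genes).
genes = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "TTGGCGTACGAAACC",
    "geneC": "ATGCCCCCCCCCTAA",
}

for k in (8, 3):
    print(k, shared_kmer_partners(genes, k))
```

With the longer window only geneA and geneB find each other, but at k = 3 geneA also picks up geneC through nothing more than a shared ATG. That is the short-sequence noise problem in miniature: the shorter the window, the more chance matches pile up.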

But that was done for our own convenience, and if molecular biology teaches us anything, it’s that things were not set up for our own convenience. There are, in fact, plenty of very important and active proteins that are shorter than 100 amino acids. To be sure, many of these are produced by cleavage of larger proteins – as witness the beta-amyloid protein of Alzheimer’s fame, 40 or so contentious amino acids excised from the middle of a 751-amino acid precursor protein. But there are real proteins that are truly coded for down there at those lengths, and now there’s a coordinated international effort to try to get a handle on them. As that article shows, after a winnowing process there are now over seven thousand candidates from a close study of short mRNAs associated with ribosomes, and the next step is to figure out how many of these actually get turned into functional proteins. And then there’s the “figure out that function” part, which will keep everyone occupied.

The headline on that piece is very likely correct, though – this is really unexplored biology, and there’s every expectation that we will find out some interesting and important things by dealing with it systematically. The real action is expected to be between the current 100-residue cutoff and about fifty residues, since below that level it’s thought to be harder for proteins to keep defined structures. But there will no doubt be exceptions to that, with particularly strong or favorable interactions holding things together even in shorter cases.

While this is going on, it’s also worth considering the usual disconnects between biomolecules of this sort and small molecule drugs. I got to thinking about this while my (then-future) wife and I were dating years ago. We were both working at the same (now-vanished) drug company in New Jersey, she in the molecular biology labs and I over in med-chem territory. Sequencing was a strenuous work process in those days (mid-1990s) involving whopping big gels, high voltages, and finicky software, as was all of molecular biology by today’s standards. I remember her talking about tiny little DNA fragments, pesky little oligos of (say) 100 base pairs, as if they were ant-sized objects at the edge of visibility. I started thinking of a 100-base-pair stretch of DNA, and immediately got a mental picture of a double helix about a yard wide stretching off through the ceiling and up taller than the trees. That’s because I tend to think of my molecules as being roughly 12 to 18 inches wide in front of me, like I’m holding a plate or a cooking dish, so by the time you picture a hydrogen-bonded pair of bases and the sugar/phosphate backbones (small molecules in themselves), that’s roughly the scale. Huge, in other words, considering that one base pair’s worth of molecule is plenty large by med-chem standards.
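For anyone who wants to check that mental picture, here is a back-of-the-envelope sketch using the usual textbook dimensions for B-form DNA: about 3.4 Å of rise per base pair and a helix roughly 20 Å across. The constant and function names are my own, and real DNA bends and varies, so treat the output as a rough scale estimate.

```python
# Approximate textbook values for B-form DNA (assumed, not measured here).
RISE_PER_BP_ANGSTROM = 3.4      # helix length added per base pair
HELIX_DIAMETER_ANGSTROM = 20.0  # width of the double helix


def scaled_length_yards(n_bp, width_yards=1.0):
    """Length of an n_bp helix if its width were blown up to width_yards."""
    length_angstrom = n_bp * RISE_PER_BP_ANGSTROM
    return width_yards * length_angstrom / HELIX_DIAMETER_ANGSTROM


length_yd = scaled_length_yards(100)
print(f"{length_yd:.0f} yards, or about {length_yd * 3:.0f} feet")
```

A 100-base-pair fragment comes out around seventeen times as long as it is wide, so at a yard across it runs to roughly fifty feet: through the ceiling and up past a good many trees.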