Ask The Oracle (ChatGPT and Chemistry)

Here’s a timely paper in J. Chem. Inf. Modeling, titled “Do Large Language Models Understand Chemistry? A Conversation with ChatGPT”. In case you’re wondering, the answer is “No”, and as far as I’m concerned, the answer is “no” for any subject whatsoever. But that’s because I’m more of a stickler about that verb “to understand”.

And since LLMs are in the end very swift and capable text-rearrangers, there is no understanding involved, despite what our own brains might tell us when we see some of the impressively detailed responses that these models can provide. It’s hard not to react that way, because we associate human language with, well, humans and only humans, and our theory-of-mind settings default to “other humans understand things and act accordingly”. We’re not used to such fluent mimicry, and we can’t help but attribute qualities to things like ChatGPT that they simply don’t possess. It’s been very instructive to see how many tasks (and perhaps whole occupations) can be simulated by rearranging words in ways that people have arranged words in the past, though.

My daughter and I were talking about this the other evening, since she tried asking ChatGPT what it could tell her about “Derek Lowe, chemist”. It responded with a fluent mishmosh of publicly available information about me, all of which was fine, but it also threw in for good measure that I’d won the Paul Gassman award, which I can assure you is not the case. But it came across with the same mechanical confidence as all the rest of the reply, which is the common experience with these things. I should ask ChatGPT about the Talking Heads: “Facts all come with points of view / Facts don’t do what I want them to. . .”

The authors of the article linked in the first paragraph take on all these objections, but note that they might not apply to later versions of these models – at least, they will presumably be trained on better-quality factual data, so they will be less likely to hallucinate in their current style. That’s illustrated when they try asking ChatGPT to convert simple chemical names into their SMILES strings (or vice versa). As it stands, the version tested gets this right about a quarter of the time for simple linear alkanes, but on the other hand it certainly didn’t see big piles of SMILES strings during its training. And if you wanted it to do that, you could (although I’d rely more on the algorithmic approach to generate SMILES than the let’s-memorize-a-bunch-of-stuff-and-run-with-it approach, personally).
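For linear alkanes specifically, the algorithmic route is about as simple as cheminformatics gets, since hydrogens are implicit in SMILES and an unbranched chain of n carbons is just “C” repeated n times. A toy sketch (the function name and the crude name-parsing are my own illustration, not anything from the paper):

```python
# IUPAC stem -> carbon count for the first few linear alkanes
ALKANE_STEMS = {
    "meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5,
    "hex": 6, "hept": 7, "oct": 8, "non": 9, "dec": 10,
}

def alkane_name_to_smiles(name: str) -> str:
    """SMILES for a linear alkane: hydrogens are implicit, so n carbons is 'C' * n."""
    stem = name.lower().removesuffix("ane")
    if stem not in ALKANE_STEMS:
        raise ValueError(f"not a recognized linear alkane name: {name}")
    return "C" * ALKANE_STEMS[stem]
```

So “butane” deterministically becomes “CCCC” every time, with no training data required, which is exactly why memorizing text strings is the wrong tool for this job.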

The authors also try it out on logP data, the octanol/water partition coefficient that’s used as a measure of “greasiness”. The model was able to pull out some more reasonable values from the literature it was trained on, with an overall error of about 30% compared to experimental values (at least for the ones that didn’t come back as “unknown”). And for Senator Chris Murphy’s benefit, this was not accomplished by ChatGPT “knowing” anything about hydrophobicity, of course – these are just numbers that appeared in the context of the compound name and the text string “logP” in the model’s training set. It did a pretty good job when asked what the geometry of various coordination compounds was, as well as the symmetry point group of various simple molecules. But of course it gets a significant number of these wrong in every case, too, and there is no way of knowing if you’re getting a good answer or a bad one – they all look the same and are delivered without confidence levels or error bars.
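That “about 30%” figure is just a mean relative error against experimental values, computed over the compounds the model actually answered for. A minimal sketch of that comparison (the numbers below are placeholders for illustration, not values from the paper):

```python
def mean_relative_error(predicted, experimental):
    """Mean of |pred - exp| / |exp| over the pairs where the model gave a number."""
    pairs = [(p, e) for p, e in zip(predicted, experimental) if p is not None]
    if not pairs:
        raise ValueError("no usable predictions")
    return sum(abs(p - e) / abs(e) for p, e in pairs) / len(pairs)

# Placeholder logP values, purely illustrative; None = model said "unknown"
model_values = [2.1, None, 0.5, 3.8]
experimental = [1.8, 1.2, 0.6, 3.5]
```

Note the asymmetry: the “unknown” answers drop out of the average entirely, so the error figure only describes the cases where the model committed to a number.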

As mentioned, you’d have to think that the accuracy of these LLMs on factual material will improve. It already is – the latest version of ChatGPT seems to hallucinate less than before, and even the publicly available free version has been tuned up not to write (for example) an advertisement extolling the benefits of dimethylmercury skin lotion (which, some months ago, I found it would blithely produce). I should try it out with some less-famous poison and see what it thinks, but those gaps can all be filled. The larger gaps, those that can only be filled in by what we use words like “understand” to mean, are another question entirely. Perhaps that can be simulated convincingly with enough processing power, and perhaps not. And if it can, we will then continue the long-running argument about what we’re simulating, and when (or if) an imitation might as well be the real thing. . .