Can LLMs lie?
Trying to think through whether Scott Alexander is right to use intentional language to describe what AIs do
Could a text-based LLM hold a concept of truth which is the same as ours?
Scott Alexander has a post in which he talks about LLMs lying, and I don’t think that’s a good use of language. I don’t think they can lie; I don’t think they can tell the truth; I don’t think they have any sense of whether their output is “true” or not.
But there’s clearly one level on which I’m wrong: LLMs are getting more accurate. They’re not like GPT-2, which spat out sentences that were mostly grammatical but semantically garbage. Current LLMs produce reasonably accurate and on-topic statements. They’ve got a grip on semantics in a way that past models simply didn’t. And Alexander cites a paper that found an internal state inside LLMs that functions essentially as a “lying indicator”: if this particular variable has one value, the LLM “tells the truth”; if it has a different value, the LLM “lies.” This is pretty impressive! It suggests that on some internal level, the LLM can differentiate between statements that are *true and *untrue, and can sometimes use these features of statements correctly in discourse.
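To make that finding a bit more concrete: work in this vein typically trains a simple “probe” (often just a linear classifier) on a model’s hidden activations for statements labelled true or false. Here’s a minimal sketch of the idea; the activations below are synthetic stand-ins, not the cited paper’s method or data, but the shape of the experiment is the same: if a simple classifier over internal states can separate true from false statements, the model is representing that distinction somewhere.

```python
# Minimal sketch of a "truth probe" on hidden states. Assumes you could
# extract a hidden-layer activation vector per statement; here the
# activations are synthetic stand-ins, not real model internals.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 256  # hypothetical hidden-state dimensionality
truth_direction = rng.normal(size=dim)  # pretend "truth" lives along one direction

def fake_activation(is_true: bool) -> np.ndarray:
    """Stand-in for a hidden-state vector on a true or false statement."""
    noise = rng.normal(size=dim)
    return noise + (1.0 if is_true else -1.0) * truth_direction

# Build a labelled dataset of activations for true and false statements.
X = np.stack([fake_activation(i % 2 == 0) for i in range(400)])
y = np.array([i % 2 == 0 for i in range(400)])

# Fit the probe on most of the data, test on the rest.
probe = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
print("held-out probe accuracy:", probe.score(X[300:], y[300:]))
```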
But whatever idea constitutes the AI’s version of *true and *untrue, it can’t be the same as our understanding of truth. We learn what truth is as young children through, I think, correspondence with the world. A statement is true if it reflects how the world really is. You say to a kid, “The bunny is white,” and they say, “No! That’s not true, the bunny is black.” They look at the world and check whether the statement represents it accurately. This is only the first turn of the screw; truth becomes much more complex later on. But all versions of truth must start with that grounding in correspondence, mustn’t they?
I’m making an argument from I-can’t-think-of-anything-else, which is a bad kind of argument in general, but I think it holds up here. Psychologically, and maybe logically, our conception of truth has to be grounded in the world.
But LLMs don’t have any access to the world, so whatever their version of truth is, it’s not that. The fact that they are able to use words like “true” and “false” in discourse means that they must have developed models of what’s true and what’s false. But they’re only mathematico-linguistic constructions. They’re not grounded in anything but word correspondences.
I was wondering if there was any way an LLM could observe enough stuff in its linguistic training data to discover truth in there, without looking at the world. For example, there are things you can say about words. In theory, without ever looking at the world, it could discover correspondence truth in situations like the statement, “The word ‘bow’ is spelled the same as the word ‘bow.’” But I don’t think that’s going to hold up. There’s just so little ChatGPT can observe about the text itself. For example, ChatGPT can’t count, so it can’t observe whether words in its data have the same number of letters, or how they’re spaced.
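One concrete reason the counting point holds, as I understand it (the details below are an illustration, not something from Alexander’s post): the model doesn’t see letters at all, it sees token ids. A quick sketch, assuming the tiktoken package is installed; “cl100k_base” is, as far as I know, the encoding used by recent OpenAI chat models.

```python
# Sketch of why letter-level facts are mostly invisible to the model:
# it receives token ids, not characters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["bow", "correspondence", "signified"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> token ids {token_ids} -> pieces {pieces}")

# A word often arrives as a single opaque id (or a couple of chunks),
# so "how many letters does this word have?" is not something the
# model can simply read off its input.
```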
Saussure said that language is a system of signs, each sign being made up of a signifier (word) and a signified (concept). You can picture signs as fitting together a bit like a jigsaw; on one side of the jigsaw pieces are the signifiers/words, and on the other side are the signifieds/concepts. The signifieds determine the shape of each piece and where it is placed in the jigsaw. AI reproduces the language by learning it from the signifier side. It correctly mimics the use of each piece, but it never has any access to the signified side, let alone to the complex ways in which signifieds (concepts) relate to the real world. And of course, it has no access to the real world itself.
Here’s an analogy: imagine trying to learn chess only by watching chess games. I don’t think they even did that with AlphaZero; even there, the system is given the rules of the game before it starts learning. In theory, just by watching lots and lots of chess games, you could reason out the rules. But in practice, that would be very hard. That’s what LLMs are doing with language, I think. They’re obviously doing it very successfully, in the sense that they manage to “play lots of good games of chess,” i.e. have successful conversations using language correctly. But they don’t know the wiring underneath, and the wiring underneath isn’t reproduced in their internal states, because they simply lack the access to the external world that this would require.