Note: If you are unfamiliar with transformer architecture, I suggest reading Part I and Part II first.
Somehow the 9,216 attention heads and ninety-six MLP blocks in ChatGPT work together across its ninety-six layers to produce sensible responses when you ask it to write a poem or a cover letter or some Python code. You can have a perfect understanding of the architecture and the training process, but it is still totally mysterious what algorithms LLMs actually learn to use.
For the most part, transformer models are black boxes. We don’t have great answers to questions about what’s going on under the hood. But we’ve begun to see, albeit through a glass darkly, how to interpret and make sense of the internal processes of trained models.
In this post, I’m going to talk about some of my favorite papers in mechanistic interpretability—a nascent field aiming to understand LLMs and other deep learning models. In essence, this is the very beginning of cognitive science for artificial minds.
The empirical findings are fascinating on their own. But as a professional philosopher, I also find the results very philosophically interesting. I’m going to allow myself a little more indulgence than in previous posts in framing things with reference to my native academic field, but I’ll try to save as much of that as possible for the end.
Induction Heads
Attention heads are the big innovation of the transformer architecture, and they are at least somewhat interpretable. The point of attention heads is to move information from tokens earlier in the sequence to later in the sequence.1 As we saw in Part I, we can easily understand, for any particular input text, what the attention heads are actually attending to. For every token, each attention head produces a probability distribution over the current token and all previous tokens, so we can just read off how much weight it's giving to itself and to its predecessors in the sequence. However, it's harder to understand, systematically, what the purpose of a particular attention head is and what kind of information it moves from earlier tokens.
Two recent papers from Anthropic have made a substantial amount of progress in figuring out what some attention heads do individually and have also shed light on how attention heads can work together across layers to learn in context.
These papers gave me more of a feel for how trained transformer models actually work than anything else I’ve read, so I want to provide a very abbreviated walkthrough of the main results here.
Skip Trigrams
The very simplest type of transformer model has a single layer with just attention heads and no MLP. So, the model will embed some text, run it through the attention heads, add the output onto the original embedding, and then unembed to make predictions about the next token.
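To make the data flow concrete, here is a toy sketch (my own illustration, not code from the paper) of a one-layer, attention-only forward pass. Positional embeddings and layer norm are omitted for brevity, and all the sizes are made up:

```python
import torch
import torch.nn as nn

class OneLayerAttnOnly(nn.Module):
    def __init__(self, vocab_size=50257, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):
        x = self.embed(token_ids)                     # (batch, seq, d_model)
        T = token_ids.shape[1]
        # Causal mask: each position may only attend to itself and earlier tokens.
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = x + attn_out                              # add the heads' output back onto the embedding
        return self.unembed(x)                        # next-token logits at every position
```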
Of course, you can't do too much with this kind of a model, but Anthropic discovered such models are more or less an "ensemble of bigram and skip trigram" models. As we've seen, a bigram model simply makes a prediction about the next token based only on the current one. So a bigram model's prediction of the next token in "Whereof one" will be the same as its prediction of the next token in "Whereof one cannot speak, thereof one", since "one" is the last token both times.
Skip trigrams, however, add a bit more power. A skip trigram is a sequence like "keep … in mind". If you are making next-token predictions and you've already seen "keep" somewhere before, and the current token is "in", it's a good idea to make "mind" a bit more likely. Note that in between "keep" and "in" can come text of arbitrary length:
Keep in mind that we have to run errands later
Keep our finances in mind
Keep the following five fundamental principles of writing crisp and concise prose in mind
If the model is a mere bigram model, it can't look arbitrarily far back to find "keep" when it sees "in". But attention heads will let us find information from earlier in the sequence.
How does that work mechanistically? Well, for some attention head, the query for "in" will match well with the key for "keep", and then the head will modify the embedding of "in" so that "mind" is rendered more likely as the next word. Because "in" gets to look back at every token before it, it doesn't matter how far back "keep" is.
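Stripped of the linear algebra, the behavior amounts to something like the following lookup-table sketch (my own toy illustration; the rules and scores are made up): bigram scores depend only on the current token, while each skip-trigram rule boosts a completion whenever its first token appears anywhere earlier in the context.

```python
from collections import defaultdict

# Toy, hand-written "learned" statistics; a real model encodes these in its weights.
bigram_scores = {"in": {"the": 2.0, "a": 1.5}}
skip_trigrams = {("keep", "in"): {"mind": 3.0}}  # saw "keep" earlier, now at "in": boost "mind"

def next_token_scores(context):
    current = context[-1]
    scores = defaultdict(float, bigram_scores.get(current, {}))
    for earlier in context[:-1]:  # "keep" can be arbitrarily far back
        for completion, boost in skip_trigrams.get((earlier, current), {}).items():
            scores[completion] += boost
    return dict(scores)

print(next_token_scores(["keep", "our", "finances", "in"]))
# {'the': 2.0, 'a': 1.5, 'mind': 3.0}
```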
Induction Heads
Things get more interesting when we add more layers to a model. Consider now a two-layer attention-only transformer. Here, text is embedded, then passed through some attention heads, their value is added back onto the original embedding, the new embedding is passed through another round of attention heads, their value is added back on, and then predictions are made.
Now the attention heads in Layer 1 can interact and compose with the attention heads in Layer 2.
The most interesting and important result discovered thus far is the emergence of something Anthropic calls induction heads, which copy text from earlier in the sequence.
As it turns out, text is often repetitive. To riff on Anthropic’s own example, consider:
Dave went to see Dr. Hilty because of his depression. After taking a detailed history, Dr.
A good prediction here is that Hilty is the next word, even though Hilty is relatively rare after Dr. (If you only knew the current token was Dr., you might give highest probability to Smith or some ultra-common last name, but since we've already met Dr. Hilty in this story, Hilty is a better guess than Smith now.)
Schematically, we can think of an induction head as adhering to the following pattern of inference: [A][B]…[A]→[B]. This means: if you see token [A] followed by token [B], and then you see [A] again, render [B] more likely than it started out as your next token prediction.2
Mechanistically, here's how induction heads work. In the first layer, there is some attention head that attends to the previous token. So, for this attention head in the first layer, Hilty's query will match up with (the first) Dr.'s key. The attention head will move information over to the embedding for Hilty that somehow or other encodes the fact that Dr. precedes it.
Now that Hilty's embedding has this information, when it goes through the induction head in the second layer, the induction head can let Hilty's key sync up with the (second occurrence of) Dr.'s query. It can then move information from Hilty forward so that Hilty is more likely to be predicted after the second occurrence of Dr.
So, in effect, the first layer changes Hilty's embedding so that its key in the second layer can match the (second) Dr.'s query.
Why can't this happen with single-layer transformers? The answer is that when the text is passed through the first layer, Hilty doesn't yet have any information in its embedding about what tokens precede it. So, there's no way to make its key sync up with the (second) Dr.'s query (at least using only linear operations). You have to first tell Hilty about what comes before, so that it can alert the (second) Dr. and get copied over.
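One common trick for spotting candidate induction heads in a real model is to feed in a sequence of random tokens repeated twice and check whether a head, at each position in the second copy, attends back to the token that followed the same token in the first copy. Here is a rough sketch of that idea in my own code (not taken from the papers), assuming a Hugging Face GPT-2 model that can return attention weights:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

seq_len = 50
first_half = torch.randint(0, model.config.vocab_size, (1, seq_len))
ids = torch.cat([first_half, first_half], dim=1)   # random tokens, repeated: [A][B]...[A][B]...

with torch.no_grad():
    attentions = model(ids, output_attentions=True).attentions  # one (1, heads, T, T) tensor per layer

T = ids.shape[1]
rows = torch.arange(seq_len, T)   # positions in the second copy
cols = rows - seq_len + 1         # the token right after that token's first occurrence

for layer, layer_attn in enumerate(attentions):
    for head in range(layer_attn.shape[1]):
        score = layer_attn[0, head][rows, cols].mean().item()
        if score > 0.4:           # arbitrary threshold, just for illustration
            print(f"layer {layer}, head {head}: induction-like score {score:.2f}")
```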
If you want to visualize the attention patterns in their two-layer model, click here. (Sorry, I couldn't get it to embed directly on this page.) You can create your own similar visualizations for other models in this Colab notebook.
In their follow-up paper, Anthropic found lots of induction heads in larger models with MLPs as well. There's also relatively strong evidence that induction heads are behind a fair amount of transformer models' impressive performance, which I'll briefly discuss in the next optional subsection.
Induction Heads and In-Context Learning (Optional)
In-context learning is the ability to pick up on the peculiarities of a passage to make more accurate predictions. If you give an LLM a passage, it won't make particularly accurate predictions about the next token at the start of the passage. You just don't have much to go on when predicting what the first word, second word, or third word will be. But as the passage increases in length, you have a lot more contextual information, and that should improve the LLM's accuracy. In the example above, there was no good way to predict that Hilty would come after the first occurrence of Dr., but there was contextual information that Hilty would come after the second occurrence.
We can operationalize the concept by considering the average accuracy of the prediction of late tokens versus earlier ones. (Anthropic uses the average accuracy of the prediction for the 500th token versus the 50th.) As it turns out:
Transformer models consistently form induction heads after training on a few billion tokens.
In-context learning improves dramatically at the same time.
Here’s one of many suggestive graphs from the Induction Heads paper.
Of course, causation is tough to establish, especially in large models with MLPs when so much of the process is still a black box. But at the least, induction heads seem to be an important part of the story.
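To make the operationalization above a bit more concrete, here's a minimal sketch (my own; it uses per-token loss rather than accuracy, which is how I read the paper's in-context learning score). It assumes you already have a tensor of per-token losses from running some model over a batch of long passages:

```python
import torch

def in_context_learning_score(token_losses: torch.Tensor,
                              early: int = 49, late: int = 499) -> float:
    """token_losses: (num_passages, seq_len) per-token cross-entropy losses.

    A more negative score means the model predicts the 500th token much better
    than the 50th, i.e. it is getting a lot of mileage out of the context.
    """
    return (token_losses[:, late] - token_losses[:, early]).mean().item()
```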
It's also worth noting that induction heads tend to do more than just literal copying. By definition, a head is an induction head if it obeys the [A][B]…[A]→[B] schema discussed above. But a lot of the very same heads will also obey schemata like [A*][B*]…[A]→[B], where [A*] and [B*] denote tokens similar to [A] and [B] respectively. For example, if [A*] is "el", [B*] is "perro", and [A] is "the", then an induction head will often, as a matter of empirical fact, attend back to "perro" and raise the probability of "dog". So, as it turns out, the very same attention heads responsible for literal copying can also execute more abstract pattern-matching.
Memory and Brain Surgery
In transformer models, we can easily store, edit, and generally fiddle with the hidden state activations. Although the algorithms the model has learned are hard to decipher, we can discover a lot about models through selective tinkering.
In this paper, Kevin Meng, David Bau, and their collaborators played around with GPT-2 and similar models to understand how and where such models stored memories.
Let's start with an example to see how this works. Consider: "The Space Needle is in downtown". If you feed GPT-2 this text, it will assign high probability to "Seattle" as the next token.
But how does it know and remember that the Space Needle is in downtown Seattle? The authors figured out the answer by doing some minor brain surgery on the model. Here’s the idea in brief:
1. Run "The Space Needle is in downtown" through the model as normal and store all the hidden state activations of the model, i.e., the initial embedding, the queries, keys, and values of each attention layer, the input and output of each MLP, etc.
2. Take the initial embeddings for "The", "Space", and "Needle", corrupt each by adding a lot of noise, and run the model. Make sure there's enough noise that the model, when run with the corrupted input, no longer outputs "Seattle" with high probability. Store all the hidden states again.
3. Selectively restore some of the hidden states from the corrupted run to the values from the original run and see when a restoration will get you "Seattle" with high probability.
To make things more concrete: we run, in effect, "[nonsense] [nonsense] [nonsense] is in downtown" and store the internals of the model. We then see what happens if we change, say, the output of the MLP in layer 17 for the third token3 from its nonsense value back to the value it had for the uncorrupted input. If the model now outputs "Seattle" with high probability, then you know that the model has retained this fact about the Space Needle in MLP 17 or somewhere downstream from there. Further tinkering in the same vein can clue you in more.
And indeed, it turns out that midlayer MLPs actually do tend to store facts like where the Space Needle is!
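Here is a rough sketch of the corrupt-and-restore recipe above using forward hooks on a Hugging Face GPT-2 XL model (my own simplification for illustration, not the authors' code; the layer, token position, and noise scale are arbitrary choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

ids = tok("The Space Needle is in downtown", return_tensors="pt")["input_ids"]

# 1. Clean run: cache every layer's MLP output.
clean_mlp = {}
def cache_hook(layer):
    def hook(module, inp, out):
        clean_mlp[layer] = out.detach()
    return hook

handles = [block.mlp.register_forward_hook(cache_hook(i))
           for i, block in enumerate(model.transformer.h)]
with torch.no_grad():
    model(ids)
for h in handles:
    h.remove()

# 2. Corrupted run with one restoration: add noise to the subject tokens'
#    embeddings, but patch a single MLP output back to its clean value.
subject_positions = [0, 1, 2, 3]    # "The", " Space", " Need", "le" (the tokenizer splits "Needle")
restore_layer, restore_pos = 17, 3  # arbitrary choices, just for illustration

def corrupt_embeddings(module, inp, out):
    out = out.clone()
    out[0, subject_positions] += 3.0 * torch.randn_like(out[0, subject_positions])
    return out

def restore_mlp(module, inp, out):
    out = out.clone()
    out[0, restore_pos] = clean_mlp[restore_layer][0, restore_pos]
    return out

h1 = model.transformer.wte.register_forward_hook(corrupt_embeddings)
h2 = model.transformer.h[restore_layer].mlp.register_forward_hook(restore_mlp)
with torch.no_grad():
    logits = model(ids).logits
h1.remove(); h2.remove()

# Does restoring this one activation bring back " Seattle" as the top prediction?
print(tok.decode(logits[0, -1].argmax().item()))
```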
Counterfactual Editing
Once you’ve located where memories and facts are stored within a model, you can also selectively edit those facts without any retraining. Suppose you found that the model stored the location of the Space Needle in MLP 17. Using similar tricks, you find that the model stores the fact that the Eiffel Tower is in Paris in MLP 14. You can then exploit this information to edit the model so that it will think the Space Needle is in Paris.
There are a lot of interesting applications for such selective editing, but it also lets you evaluate how the model represents certain sorts of associations and counterfactuals. For instance, if you surgically edit the model so that the Space Needle is in Paris:
Will it still think Vancouver is under 3 hours away by car?
Will it expect most visitors to be French nationals?
Will it believe that T-Mobile Park (home of the Seattle Mariners) is in Paris as well?
In general, when you selectively edit a fact in a trained model, you want some other facts to change, but you don’t want too much to change. I don’t think that we have a full picture yet of how this works with LLMs, but the authors do an admirable job of exploring how various facts co-vary using different methods of knowledge editing. I’ll refer interested readers to the paper itself for more.
What I Find Most Interesting about This Paper
We already knew that transformer models somehow memorized information. It's not especially surprising that they often store these facts in their MLPs. What I did find surprising, however, was just how precisely the authors were able to locate the memories and surgically edit them without retraining.
In general, gradient descent is a blunt tool for getting a model to perform well. We know that the tweaks it makes will improve the model over the course of training, but the whole model can change with every update, and we have little control over exactly how the parameters shift. This research, by contrast, points to the ability to improve models with much more targeted interventions.
In the future, I expect we’ll be able to do even more—perhaps implanting new modules, or adding or removing functionality in a way that doesn’t require a bunch of new training. There’s some reason for optimism, then, that brain surgery will give us more ability in the future to ensure that models actually do what we want.
Representing Truth?
It’s often tempting to talk about what an LLM “believes” about the world. But, if you want to get all philosophical about it, it’s not clear that LLMs believe anything at all about the world. They’re just trained to generate text—getting tweaked based on their predictive loss and (sometimes) based on how much they tell us things we like to hear. For all we know, they might learn to chat without tracking anything like truth or meaning, and if they don’t bother representing whether what they’re saying is true, then they don’t really have beliefs.
Indeed, Murray Shanahan argues as much in a recent paper:
A bare-bones LLM doesn’t “really” know anything because all it does, at a fundamental level, is sequence prediction. Sometimes a predicted sequence takes the form of a proposition. But the special relationship propositional sequences have to truth is apparent only to the humans who are asking questions…
[K]nowing that the word “Burundi” is likely to succeed the words “The country to the south of Rwanda is” is not the same as knowing that Burundi is to the south of Rwanda. […]. If you doubt this, consider whether knowing that the word “little” is likely to follow the words “Twinkle, twinkle” is the same as knowing that twinkle twinkle little. The idea doesn’t even make sense.
However, it looks like such philosophical worries might actually be misguided! Collin Burns, Haotian Ye, and their collaborators present evidence that models actually do internally represent whether what they’re saying is true.
The rough version of the strategy is to look at the hidden states of the model and see if they look any different for true statements and false statements in a systematic way.
There’s an immediate challenge, however. We aren’t primarily interested in whether the statements themselves are true or false, but instead in the model’s psychological states. Does the model—in some sense—believe or represent the statements as true or false? Like humans, it might have false beliefs if it has any beliefs at all. But the authors found a way to get around this complication.
The basic idea works as follows. Take a bunch of yes/no questions that have a right answer, such as (x1) "Is Luanda the capital of Angola?", (x2) "Does Mike Trout play for the Pittsburgh Pirates?", (x3) "Was Abraham Lincoln the third president of the United States?", (x4) "Are cats mammals?", etc.
Then make two versions of each by appending a "Yes" or a "No" at the end. So, (x1+) is "Is Luanda the capital of Angola? Yes", and (x1-) is "Is Luanda the capital of Angola? No".
Next, run each statement through the model and keep track of the hidden states. Following the authors, I’ll use φ(xi+) to refer to the hidden states when fed statement (xi+) and φ(xi-) to refer to the hidden states when fed (xi-).
Finally, we want to see if we can find a (relatively) simple way of mapping any hidden state φ into probabilities. For this to work, we want at least two things. First, we want p(xi+) to be approximately the same as 1 - p(xi-). Second, we want p(xi+) to be distinct from p(xi-), at least most of the time. Otherwise, we’d represent the model as having totally incoherent beliefs or as having totally trivial beliefs (where everything is assigned probability 50%). So, the authors train a small model to learn a probability function based on the hidden states that satisfies these two requirements.
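Here's a minimal sketch of that probe-training step (my own reconstruction in the spirit of the paper's method, omitting details such as normalization of the hidden states). It assumes you've already extracted φ(xi+) and φ(xi-) into two tensors, hidden_pos and hidden_neg, each of shape (number of statements, hidden size):

```python
import torch
import torch.nn as nn

d_model = 768  # hypothetical hidden size
probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_loss(p_pos, p_neg):
    # Requirement 1: p(xi+) should be close to 1 - p(xi-).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Requirement 2: push probabilities away from the trivial answer of 0.5.
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

for step in range(1000):
    p_pos = probe(hidden_pos).squeeze(-1)  # hidden_pos: tensor of φ(xi+) hidden states
    p_neg = probe(hidden_neg).squeeze(-1)  # hidden_neg: tensor of φ(xi-) hidden states
    loss = probe_loss(p_pos, p_neg)
    opt.zero_grad()
    loss.backward()
    opt.step()
```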
The authors argue that this method, in some sense, captures what the model actually believes. There are three major sanity checks. The first is that the learned probability functions turn out to be relatively accurate. That is, the probability function generated from the hidden states tends to assign high probability to statements that are actually true and low probability to statements that are actually false. The fact that the learned probabilities tend to be relatively good guides to reality provides some evidence that they actually represent something like the beliefs of the model.
The second is that the method for transforming the hidden states into probabilities works across different domains. So, if we learn the transformation for one data set, it still works well for other data sets. This indicates the model is actually tracking truth rather than other features peculiar to the data that happen to correlate with truth.
The third is that the probabilities generated are more accurate than the actual answers the LLMs produce. Empirically, we know LLMs often produce false text and have a tendency to hallucinate answers to prompts. For example, none of the following papers actually exist according to Google Scholar:
And, despite the incredulous stares he received, David Lewis was a committed modal realist till the end of his life:
So, the fact that the probabilities are better than the outputs suggests that we aren’t just recovering from the internal states what the model plans to actually output.
Harry Frankfurt famously distinguished between bullshitting and lying. Lying is intentionally uttering something you believe to be false. Bullshitting, on the other hand, is speaking with no regard at all for the truth. The above paper suggests that LLMs, at least sometimes, are lying to us, not just bullshitting: the probabilities recovered from the internal states are more accurate than the outputs of the model.
I still have some doubts as to whether the learned probability function truly represents the beliefs of the model. But I think the general approach is promising and can be developed further in the future.
Some Philosophical Comments
This project of mindreading beliefs based on the hidden states of a model turns a traditional behaviorist approach in philosophy on its head. Since Ramsey, decision theorists have proven representation theorems for beliefs. More or less, what this amounts to is a systematic way of inferring beliefs from behavior. If you'll bet me at three-to-one odds that it will rain, then I can infer you're at least 75% confident that it will rain.
An LLM doesn’t make bets, and when it outputs text, it may well be lying or bullshitting. So, the authors here totally ignore its behavior. Instead, they look inside its head and represent it based on the structure of its thought. Such mindgazing is anathema to the quasi-behaviorist tradition in decision theory and game theory, but it seems like the right approach—or at least a very important approach—for understanding what AI systems believe and care about, and in general what sorts of attitudes they have to propositions.
Alien Minds
LLMs are minds, or at least they're mind-like. They can chat and learn and seem to know a lot of things. But in the space of possible minds, they are quite different from human or animal minds. Even if they can write prose or poetry that's often indistinguishable from the output of a real human, they don't think in the same way we do.
W.V.O. Quine once wrote:
Different persons growing up in the same language are like different bushes trimmed and trained to take the shape of identical elephants. The anatomical details of twigs and branches will fulfill the elephantine form differently from bush to bush, but the overall outward results are alike.
Quine was wrong—or at least overstating the case—when it came to different persons. Our brains are architecturally similar, and we now know a bit about what goes on under the hood (or skull) of humans.
But for LLMs, despite the recent work on interpretability, we don't have anything like a mature cognitive science. The outward result is convincing chatbots that sound human, but we don't really know how they work.
Further Reading and Resources
Main papers discussed above
Anthropic’s A Mathematical Framework for Transformer Circuits
Neel Nanda’s accompanying YouTube video
Anthropic’s follow up paper on Induction Heads
Neel Nanda’s accompanying YouTube video with Charles Frye
Meng, Bau, et al. on Locating and Editing Factual Associations in GPT
Meng and Bau discussing the paper on Yannic Kilcher’s YouTube channel
Burns, Ye, et al. on Discovering Latent Knowledge in Language Models without Supervision
Shanahan paper on Talking about LLMs
Early posts in this series
For the purposes of this post, I'll still be assuming that we're dealing with decoder-only models.
The induction head itself will raise the probability of [B]. An MLP or other attention head could, of course, lower the probability of [B] to cancel out the effect of the induction head.
As you'll see from the graph, "Needle" is tokenized into "Need" and "le", so really, we'd look at the fourth token. But as in previous posts, I'm identifying tokens with words.