In the last post, I gave a basic walkthrough of the transformer architecture. There were a bunch of misrepresentations of how embeddings work, but I think the pedagogical cost was worth it for a first pass. I also left out some cool conceptual stuff about attention heads that you probably shouldn’t be inundated with until you have a basic feel for the architecture.
So, in this post, I’m going to make a number of clarifications and corrections and provide some conceptual color. I’ll again try to make this as accessible as possible to people without a background in machine learning or linear algebra, but I think some of the discussion will be a bit more abstract, especially when I talk about attention heads.
How Embeddings Actually Work
In Part I, I presented embeddings as a hand-crafted list of human-interpretable features. In the toy model for ‘Amy saw Will’, each word got a score for Future, Tool, Fame, and Male that I picked by hand.
When I told you that GPT-3 Small had 768-dimensional embeddings (instead of 4), you may have wondered which 764 other features the programmers chose in addition to Future, Tool, Fame, and Male.
The answer is that they made no choices at all about what the 768 dimensions correspond to.
Really, what happens is the programmer selects a number of dimensions and says “Let there be exactly that many dimensions in the embedding vector for each word.” This gets the model to associate a vector of that size with each word in the vocabulary, but the programmer gives no guidance about what the dimensions encode. The model figures that out on its own. (I’ll talk more about training in Part III.)
Now, for transformers, the way we ultimately get good embeddings is that we just give each word a totally random vector to begin with, and the model improves those vectors with training. But I think it’s good to get some intuition and conceptual grasp on what’s going on and how it differs from the hand-crafted thing.
In ancient times (i.e., in 2013 or so), there was a different algorithm called Word2Vec that also embedded words as fairly high-dimensional vectors. There were a number of different algorithms for this, but let’s just take a bird’s-eye view of one (that itself will contain some fudges).1
The idea is to take a large body of text (all of Wikipedia, say), and see how well you can make predictions of the next word given the two previous words. Suppose, for instance, the model comes across a sentence like ‘There once was a logician named Quine.’
It will take its current embedding for ‘there’ and for ‘once’ and generate a probability distribution over the vocab for what the next word will be. The higher the probability it gives to ‘was’, the better the score. It will then make tiny adjustments to those vectors so that the new probability for ‘was’ given ‘there’ and ‘once’ is a bit higher. It will next make a prediction over the vocab given its current embedding for ‘once’ and ‘was’ about what the next word will be and then make tweaks based on how well it predicted ‘a’. And so on until it’s done reading Wikipedia a few times. At the start, the embeddings are just random, but as the model gets better and better, more and more useful information will be encoded in the embeddings about the properties of each word, and the model will work out on its own how to do this. There’s no guidance about how the embedding should go. It just ultimately ends up with some way of encoding information to make good predictions.
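If it helps to see this concretely, here’s a minimal numpy sketch of that kind of training loop. It’s not the actual Word2Vec algorithm; the tiny corpus, the embedding size, and the learning rate are all made up for illustration. The point is just that the vectors start out random and get nudged, prediction by prediction, toward whatever encoding makes the next-word guesses better.

```python
import numpy as np

# Toy version of "predict the next word from the two previous words."
# Not the real Word2Vec; everything here is simplified for illustration.
rng = np.random.default_rng(0)

corpus = "there once was a logician named quine".split()
vocab = sorted(set(corpus))
word_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                     # vocabulary size, embedding dimension

E = rng.normal(0, 0.1, size=(V, D))      # one random vector per word to begin with
W = rng.normal(0, 0.1, size=(2 * D, V))  # maps two concatenated embeddings to next-word scores
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(500):                            # "reading" the corpus a few hundred times
    for i in range(len(corpus) - 2):
        w1, w2, nxt = (word_id[w] for w in corpus[i:i + 3])
        ctx = np.concatenate([E[w1], E[w2]])    # current embeddings of the two previous words
        probs = softmax(ctx @ W)                # distribution over the vocab for the next word

        # Cross-entropy gradient: make tiny adjustments so the true next word
        # gets a bit more probability next time.
        d_logits = probs.copy()
        d_logits[nxt] -= 1.0
        d_ctx = W @ d_logits
        W -= lr * np.outer(ctx, d_logits)
        E[w1] -= lr * d_ctx[:D]
        E[w2] -= lr * d_ctx[D:]

ctx = np.concatenate([E[word_id["there"]], E[word_id["once"]]])
print(softmax(ctx @ W)[word_id["was"]])         # probability of "was" given "there once"
```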
Transformers’ embeddings ultimately work similarly, except there are a lot of intermediate steps between the initial embedding and the final predictions: a bunch of layers, each with attention heads and MLPs.
No Human Interpretability
When we embed words as vectors of hundreds or thousands of dimensions, the coordinates need not correspond to anything human-legible like ‘redness’, ‘fruitiness’, or ‘footweariness’. To see why, imagine one embedding with a word’s ‘fruitiness’ score as the x-axis (i.e., first dimension) and the word’s ‘footweariness’ score as the y-axis. So ‘apple’ is encoded as [1, 0, …] and ‘shoe’ is encoded as [0, 1, …]. An alternative but equivalent embedding system just rotates the x- and y-axes 45 degrees counterclockwise. After we rotate the axes, [1, 0, …] corresponds to a word that gets around a .71 ‘fruitiness’ score and also a .71 ‘footweariness’ score. A fruit/footwear mixture is not really a human concept, but the same information can be encoded in this alternative coordinate system as in the first one. In other words, there’s no a priori guarantee of a privileged basis for the embedding that maps neatly to features we’d think of.
Takeaway for Embeddings
The biggest lies I told in the last post were that the coordinates of an embedding were hand-selected and based on concepts the programmers chose. In reality, all that’s specified is the number of dimensions. The model figures out the rest with training, and there’s no guarantee we can understand what the coordinates mean.
Attention Heads
What I said in the last post about attention heads was basically accurate conceptually, but there were a few details I left out and at least one other way of looking at attention heads that I wanted to cover.
Small Dimensionality
I mentioned that models have a lot of different attention heads. And in the toy model, the attention head I presented was two-dimensional (i.e., two query, key, and value coordinates) while the embedding was four-dimensional.
In general, attention heads have much lower dimension than the embedding. In GPT-3 Small, the embedding is 768-dimensional, but heads are only 64-dimensional. In full-scale GPT-3, the embedding is 12,288-dimensional, but heads are only 128-dimensional.
This means each head can only pay attention to some small part of the information in the embedding. That makes sense, since heads will have relatively small tasks like looking one word back, finding a subject for a verb, figuring out tense, and so on. They don’t need to look at all the information that comes with every word.
Moreover, at least in the GPT series, we always have (dimension of the head)×(heads per layer) = dimension of embedding. This generally means that heads can (but need not) stay out of one another’s way, since they each can in principle interact with their own subspace without interference.
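To check the arithmetic: GPT-3 Small has 12 heads per layer, so 64 × 12 = 768, and full GPT-3 has 96 heads per layer, so 128 × 96 = 12,288 (the per-layer head counts come from the GPT-3 paper rather than anything above).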
Why Queries, Keys, Values, and Outputs are Kind of Fake
Note: This section involves a bit of linear algebra but is totally skippable.
The story I gave about queries, keys, values, and outputs is basically the standard story you’ll get anywhere.
You feed a head an embedding. It computes queries and keys for each token. Based on how well the queries and keys line up, it decides how much attention each token should pay to the others.
The head also calculates values for each token. It moves the values around based on the attention pattern. Then it calculates the output, which moves things from the smaller space (2 dimensions in the toy model, 64 in GPT-3 Small) back to the larger space (4 dimensions in the toy model, 768 in GPT-3 Small).
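Here’s that standard story as a short numpy sketch for a single head, with GPT-3 Small sizes and random matrices standing in for the learned ones. (The variable names are mine, and I’m leaving out details like causal masking that a real implementation would include.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_head = 3, 768, 64   # e.g. three tokens, GPT-3 Small sizes
x = rng.normal(size=(n_tokens, d_model)) # one embedding per token (random stand-in)

# The head's learned instructions, as matrices (random stand-ins, scaled to keep numbers tame).
Q, K, V = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model) for _ in range(3))
O = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)

queries, keys, values = x @ Q, x @ K, x @ V   # each token's query, key, and value

scores = queries @ keys.T / np.sqrt(d_head)   # how well each query lines up with each key
scores -= scores.max(axis=1, keepdims=True)   # (numerical stability)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # attention pattern: rows sum to 1

moved  = A @ values           # move values around according to the attention pattern
output = moved @ O            # map from the 64-dimensional head space back to 768 dimensions
print(A.shape, output.shape)  # (3, 3) (3, 768)
```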
Let’s look first at why the separation of queries and keys is mathematically (though not computationally) a bit artificial. (This is a point made very well in Anthropic’s A Mathematical Framework for Transformer Circuits, linked in the resources below.)
Call our embedding x. (If you like, you can use the embedding for ‘Amy saw Will’ from Part I.) The head has instructions for computing the keys and queries. These instructions are encoded as the matrices Q and K.
So, you compute the queries for any embedding x fed to the head by multiplying it by Q. Recall, in our toy model from Part I, the head calculated queries as follows: the first coordinate of the query was the sum of the first and second coordinate of the embedding. The second coordinate of the query was just the second coordinate of the embedding. If you want to encode those as a matrix, you do it like this:
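$$Q = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}$$

(Here I’m writing the queries as xQ, so each column of Q gives the recipe for one query coordinate: the first column adds the first two embedding coordinates, and the second just copies the second coordinate.)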
You compute the keys for any embedding x fed to the head by multiplying it by K. In our toy model from Part I, the head used this matrix:
(In general, a model could use any matrices it liked for Q and K. They are of size dimension of the embedding by dimension of the head.)
Now, you end up taking the dot product of the queries and keys to get the (unnormalized) attention pattern. We have to flip the rows and columns of x and K here to get that to work (i.e., take the transpose). This looks like:
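$$(xQ)(xK)^\top = (xQ)(K^\top x^\top)$$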
Here’s the thing to notice. No matter what embedding x gets fed to this attention head, the head always takes x, multiplies it by QK^T, and then multiplies that by x^T. So instead of a matrix for queries and another matrix for keys, we could have had a single matrix Z = QK^T associated with the head.
In other words, the conceptual division here into queries and keys is not mandatory. You can instead think of there being a single matrix for generating the attention pattern for every attention head.
Now, a little math is a dangerous thing, so we should keep in mind some important features of this Z matrix.
Recall that our string’s matrix x has as many rows as there are tokens and as many columns as there are features in the embedding. Let’s move out of the toy model and think about GPT-3 Small with 768 embedding dimensions and 64 head dimensions. So, Q and K both have 768 rows but only 64 columns.
When we multiply Q and K together (really, Q by K^T), we end up with a matrix that is 768 rows by 768 columns. However, all the vectors in it occupy at most a 64-dimensional space. To make this clearer for those rusty on linear algebra, imagine I had three vectors [1, 0, 0], [0, 1, 0], and [.5, .5, 0]. If I stacked those vectors on top of one another, I’d get a 3x3 matrix. But all three vectors lie entirely in the xy-plane, since their z-coordinate is 0. In other words, the vectors all lie within a 2-dimensional subspace of the larger 3-dimensional space. Likewise, all the vectors in Z actually lie in a 64-dimensional subspace of the larger 768-dimensional space.
Computationally, this matters. Q is 768 x 64, so it has 49,152 parameters. K has the same. So, there are really only around 100,000 parameters the computer needs to learn and remember for the query-key side of a given attention head. But if it had to learn Z directly, it would have to deal with 589,824 parameters.
Moreover, we don’t want attention heads to attend to the full 768-dimensional space; we want each one to focus on only some of the information in the embedding. So, by learning Q and K separately (each with only 64 columns), we ensure that each attention head only worries about a small portion of the information available.
The fact that Q and K always get multiplied together does mean, however, that we shouldn’t have any hope of interpreting the entries in either the Q or the K matrix. That is, we can’t look at a row in Q and interpret it directly without looking at K as well. If you were, say, to multiply each vector in K by 2, you could multiply each vector in Q by 1/2 and end up with the same Z matrix.
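A quick numpy check of these last few claims, with random stand-ins for Q and K at GPT-3 Small sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 3, 768, 64

x = rng.normal(size=(n_tokens, d_model))
Q = rng.normal(size=(d_model, d_head))
K = rng.normal(size=(d_model, d_head))

# Separate queries and keys vs. the single combined matrix Z = Q K^T: same scores.
Z = Q @ K.T
print(np.allclose((x @ Q) @ (x @ K).T, x @ Z @ x.T))   # True

# Z is 768 x 768, but its rank is at most 64.
print(Z.shape, np.linalg.matrix_rank(Z))               # (768, 768) 64

# Scaling Q down and K up by the same factor leaves Z (and the scores) unchanged.
print(np.allclose((Q / 2) @ (2 * K).T, Z))             # True
```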
On the value side, we can make the same point. The standard story is that we calculate the values by taking our embedding x and multiplying it by the value matrix V. We move information by taking the matrix A encoding the attention pattern and multiplying it by xV. To move from the smaller space (of 64 dimensions) back to the larger space (of 768 dimensions), we multiply AxV by the output matrix O. But again, V always gets multiplied directly by O no matter what the embedding x is. So we could really use a single VO matrix instead of separating V and O.
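And the same kind of check on the value/output side, with a random row-normalized stand-in for the attention pattern A:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_head = 3, 768, 64

x = rng.normal(size=(n_tokens, d_model))
V = rng.normal(size=(d_model, d_head))
O = rng.normal(size=(d_head, d_model))

A = rng.random(size=(n_tokens, n_tokens))
A /= A.sum(axis=1, keepdims=True)          # rows sum to 1, like a real attention pattern

# Applying V then O vs. a single combined VO matrix: same result.
print(np.allclose((A @ x @ V) @ O, A @ x @ (V @ O)))   # True
```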
Further Reading and Resources
Colab notebooks for visualizations
Short Colab notebook for this series. (Basically, just showing off parts of the Bertviz and Circuitsvis libraries for attention visualization.)
Neel Nanda’s exploratory visualization demo
Other transformer walkthroughs
Jay Alammar’s Illustrated Transformer
Andrej Karpathy’s YouTube walkthrough coding up a small transformer trained on Shakespeare text
Anthropic’s A Mathematical Framework for Transformer Circuits. (This improved my understanding of transformers more than anything else and will be linked to multiple times in these posts.)
Neel Nanda’s helpful accompanying YouTube video