← All Inside Large Language Models modules

Module 4 — How Self-Attention Works

How models read context · hands-on · about 30 minutes.

Module 3 introduced the conceptual structure of attention: assign weights to preceding tokens according to their relevance. The question that remains is how these weights are computed. The model does not receive them externally — it derives them from the token embeddings themselves, using three quantities introduced from information retrieval terminology: queries, keys, and values. This module presents the complete computation on a small example whose arithmetic can be verified by hand.

Queries, keys, and values

The operation is structurally analogous to retrieval from a key-value store. Each token produces three vectors derived from its embedding by linear projection:

Query (q) — represents the information the attending token seeks.
Key (k) — represents the content each token advertises for matching against a query.
Value (v) — the information the token contributes when it is attended to.

The compatibility between a query and a key is measured by their dot product, which yields the relevance score. The score is large when query and key are aligned in the embedding space and small otherwise. The full formula computed by every transformer is:

\[ \text{Attention}(q, K, V) \;=\; \text{softmax}\!\left(\frac{q \cdot K^{\top}}{\sqrt{d}}\right) V \]

The formula is read as follows: compute the dot product of the query with each key; divide by \( \sqrt{d} \) (the scaling factor that maintains numerical stability of the softmax for large dimensions); apply softmax to obtain a probability distribution over the keys; and compute the weighted sum of the values. The resulting vector is the token's new context-aware representation.

Where the formula comes from

One piece of notation causes most of the confusion. \( q \cdot k \) with a lower-case \( k \) is a single dot product — one number. But the formula uses a capital \( K \), and capital \( K \) is every key stacked into a table, one row per token. So \( q \cdot K^{\top} \) compares your query against all the keys at once and returns a whole list of scores — one per token — not a single number. With that cleared up, the formula is just four plain steps:

Score every token. Dot the query with each key — \( q \cdot K^{\top} \) — to get one relevance score per token. A score is large when the query and that key point in a similar direction.
Steady the scores. Divide by \( \sqrt{d} \), where \( d \) is the length of each vector. Longer vectors make the dot products swing large; this division keeps the next step from over-reacting to them.
Turn scores into weights. Apply softmax to convert the scores into positive weights that add up to 100% — the bars in the activity just below.
Blend the values. Multiply those weights by \( V \), the table of value vectors. The result is a weighted average of the values — mostly the high-weight tokens — and that blended vector is the token's new, context-aware representation.

In one line: score against every key, steady, softmax into weights, then blend the values. The lower-case \( q \) means we are following one attending token; capital \( K \) and \( V \) are the full sets it looks across. Every token runs these same four steps in parallel, and together they form one attention layer.

Softmax on its own

Before running the full computation, isolate the softmax step. Each token below carries a raw relevance score — the kind a dot product \( q \cdot k \) produces. Drag the scores and watch them turn into attention weights. Two things to notice: the weights always sum to 100%, and lifting one score raises its share faster than linearly — yet the lowest score is never pushed all the way to zero. That is what makes the attention "soft" rather than winner-take-all.

This activity needs JavaScript. The lesson below still covers everything.

Execute the computation

Three tokens are displayed, each with its own query, key, and value vectors. Select the attending token, and its query is scored against all three keys. The dot-product scores are converted to softmax weights, and the values are aggregated into the resulting output vector.

This activity needs JavaScript. The lesson below still covers everything.

The "self" in self-attention

The mechanism is termed self-attention because the queries, keys, and values are all derived from the same input sequence — the tokens attend to one another. Every token executes the operation just performed in parallel: it constructs a query, scores it against every key, and computes a weighted sum of the corresponding values. The result is a new sequence of vectors in which each token's representation has been updated to incorporate the most relevant contextual information.

Self-attention expressed in code

# each token's embedding is projected into a query, key, and value
Q = X @ Wq;  K = X @ Wk;  V = X @ Wv

scores  = Q @ K.T / sqrt(d)      # every query scored against every key
weights = softmax(scores, axis=-1)  # the bars you saw, per token
out     = weights @ V            # context-aware vector for each token

Wq, Wk, and Wv are learned projection matrices. Training adjusts these matrices so that semantically related tokens produce aligned query-key pairs — the model thereby discovers which tokens to attend to from the training data.

The whole matrix at once

The single token you just stepped through is one row of Q @ K.T. Every token computes its own row in parallel, producing a full grid of attention weights. Here is that grid as a heatmap — brighter means more attention — for a short sentence whose word types make the pattern legible.

This activity needs JavaScript. The lesson below still covers everything.

AI anchor — the computational core of every transformer layer This computation — project to Q, K, V; compute scores; softmax; weighted sum — is executed within every layer of GPT, Claude, and Llama, billions of times per response. Production models execute multiple attention "heads" in parallel (different heads typically specialize in distinct linguistic features such as grammatical structure or long-range references) and stack dozens of such layers, but each individual head performs precisely the computation derived here. You have now examined the explicit arithmetic at the core of modern AI architectures.

Check your understanding

Answer a short set of questions on self-attention.

This activity needs JavaScript.

Why this matters next Self-attention is the central operation, but a complete transformer comprises additional components: positional encodings to represent token order, a feed-forward sub-layer, and residual connections that maintain training stability. Module 5 assembles these components into the complete transformer block — the unit that is replicated dozens of times in a production LLM.

Summary: self-attention computes its weights by linearly projecting each token into a query, key, and value vector, computing dot-product scores between each query and key, normalizing the scores with softmax, and computing a weighted sum of the values — formally, \( \text{softmax}(qK^{\top}/\sqrt{d})V \) — the core computation of every transformer.

Next: The Transformer Block →