← All Inside Large Language Models modules

Module 4 — How Self-Attention Works

How models read context · hands-on · about 30 minutes.

Module 3 gave you the intuition: weigh the earlier tokens by how much they matter. But where do those weights actually come from? The model does not get them from a human — it computes them, from the token vectors themselves, using three ingredients with funny names: queries, keys, and values. This module opens the box and runs the real calculation on numbers small enough to follow by eye.

Queries, keys, and values

Think of it like a search. Every token produces three little vectors from its embedding:

The match between a query and a key — their dot product — is the relevance score. High when the query and key point the same way, low otherwise. Here is the exact formula every transformer runs:

\[ \text{Attention}(q, K, V) \;=\; \text{softmax}\!\left(\frac{q \cdot K^{\top}}{\sqrt{d}}\right) V \]

Read left to right: score every key against the query, divide by \( \sqrt{d} \) to keep numbers stable, softmax into weights that sum to 1, then blend the values by those weights. That blend is the token's new, context-aware vector.

Run it yourself

Three tokens, each with its own query, key, and value vector. Pick which token is doing the attending — its query gets scored against all three keys. Watch the dot-product scores become softmax weights, and the values blend into one output vector.

This activity needs JavaScript. The lesson below still covers everything.

Why "self"-attention

It is called self-attention because the queries, keys, and values all come from the same sequence — the tokens are attending to each other. Every token, in parallel, does the search you just ran: builds a query, scores it against every key, and pulls in a weighted blend of the values. The result is a new set of vectors where each token has absorbed the context most relevant to it.

Self-attention in code — read only, nothing to install
# each token's embedding is projected into a query, key, and value
Q = X @ Wq;  K = X @ Wk;  V = X @ Wv

scores  = Q @ K.T / sqrt(d)      # every query scored against every key
weights = softmax(scores, axis=-1)  # the bars you saw, per token
out     = weights @ V            # context-aware vector for each token

Wq, Wk, Wv are learned matrices. Training shapes them so the right tokens end up matching — the model discovers what to pay attention to.

AI anchor — the engine inside every transformer layer This exact computation — project to Q, K, V; score; softmax; blend — runs inside every layer of GPT, Claude, and Llama, billions of times per response. Real models run many attention "heads" in parallel (one head might track grammar, another long-range references) and stack dozens of layers, but each head is doing precisely what you just did by hand. You have now seen the literal arithmetic at the heart of modern AI.

Check your understanding

A few questions about self-attention. You will get a score.

This activity needs JavaScript.

Why this matters next Self-attention is the heart, but a transformer wraps more around it: a way to encode word order, a feed-forward layer, and "residual" shortcuts that keep training stable. Module 5 assembles these into the full transformer block — the unit that repeats dozens of times in a real LLM.
One-sentence summary: self-attention computes its weights by projecting each token into a query, key, and value, scoring every query against every key with a dot product, softmaxing into weights, and blending the values — the literal \( \text{softmax}(qK^{\top}/\sqrt{d})V \) at the core of every transformer.

Next: The Transformer Block →