Module 4 — How Self-Attention Works
Module 3 gave you the intuition: weigh the earlier tokens by how much they matter. But where do those weights actually come from? The model does not get them from a human — it computes them, from the token vectors themselves, using three ingredients with funny names: queries, keys, and values. This module opens the box and runs the real calculation on numbers small enough to follow by eye.
Queries, keys, and values
Think of it like a search. Every token produces three little vectors from its embedding:
- Query (q) — "what am I looking for?" The token doing the attending broadcasts this.
- Key (k) — "what do I offer?" Every token advertises this, like a label.
- Value (v) — "what I'll actually contribute if you attend to me." The content that gets passed along.
The match between a query and a key — their dot product — is the relevance score. High when the query and key point the same way, low otherwise. Here is the exact formula every transformer runs:
Read left to right: score every key against the query, divide by \( \sqrt{d} \) to keep numbers stable, softmax into weights that sum to 1, then blend the values by those weights. That blend is the token's new, context-aware vector.
Run it yourself
Three tokens, each with its own query, key, and value vector. Pick which token is doing the attending — its query gets scored against all three keys. Watch the dot-product scores become softmax weights, and the values blend into one output vector.
This activity needs JavaScript. The lesson below still covers everything.
Why "self"-attention
It is called self-attention because the queries, keys, and values all come from the same sequence — the tokens are attending to each other. Every token, in parallel, does the search you just ran: builds a query, scores it against every key, and pulls in a weighted blend of the values. The result is a new set of vectors where each token has absorbed the context most relevant to it.
# each token's embedding is projected into a query, key, and value Q = X @ Wq; K = X @ Wk; V = X @ Wv scores = Q @ K.T / sqrt(d) # every query scored against every key weights = softmax(scores, axis=-1) # the bars you saw, per token out = weights @ V # context-aware vector for each token
Wq, Wk, Wv are learned matrices. Training shapes them so the right tokens end up matching — the model discovers what to pay attention to.
Check your understanding
A few questions about self-attention. You will get a score.
This activity needs JavaScript.