Module 3 — Attention, Intuitively

How models read context · hands-on · about 25 minutes.

The bigram model of Module 1 has a fundamental limitation: it conditions only on the immediately preceding token. Natural language exhibits dependencies that span far longer distances. Predicting the verb that follows "the keys to the cabinet…" requires conditioning on "keys," several tokens back, rather than on the nearer noun "cabinet," to get the number agreement right ("were" rather than "was"). The mechanism that lets a model condition on tokens at arbitrary distances, assigning each a learned weight, is attention: the central architectural mechanism of virtually every modern LLM.

The formal question addressed by attention

When predicting the next token, attention computes the answer to: for each previously observed token, what weight should it receive in the current prediction? It assigns each previous token a non-negative weight in \( [0, 1] \), with the weights summing to 1. A large weight indicates that the corresponding token is highly relevant to the current prediction; a weight near zero indicates that the token contributes negligibly.
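As a small illustration of these properties, the sketch below turns a set of relevance scores for the phrase "the keys to the cabinet" into weights. The scores are hand-picked for illustration and do not come from any trained model, and the helper name `to_weights` is hypothetical. The conversion it uses is the softmax step described later in this lesson.

```python
import math

# Hand-picked relevance scores for predicting the verb after
# "the keys to the cabinet". Illustrative values, not from a trained model.
tokens = ["the", "keys", "to", "the", "cabinet"]
scores = [0.1, 3.0, 0.2, 0.1, 1.5]

def to_weights(scores):
    """Exponentiate and normalize (softmax) so the weights are positive and sum to 1."""
    peak = max(scores)                             # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = to_weights(scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>8}  {weight:.3f}")

print("sum of weights:", round(sum(weights), 6))
print("most relevant token:", tokens[weights.index(max(weights))])
```

Because "keys" was given the largest score, it receives the largest weight, while the two occurrences of "the" contribute almost nothing.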

Identifying the relevant context

The model is predicting the highlighted blank below. First identify which earlier word the prediction most directly depends on, then display the attention weights to compare with your assessment.


Sharp versus diffuse attention distributions

Attention is not all-or-nothing. The same relevance scores can produce a sharp distribution concentrated on a single token or a diffuse distribution spread across many tokens, depending on the temperature parameter of the softmax that converts scores to weights. (Softmax is the step that turns a set of raw relevance scores into positive weights that sum to 1; the next module builds it up from scratch.) A low temperature concentrates the weights and a high temperature spreads them. Adjust the temperature to watch the weights concentrate or spread, and the weighted "context" vector the model passes forward change accordingly.
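The effect of temperature can be sketched in a few lines. This is a minimal illustration, assuming the common convention of dividing the scores by the temperature before the softmax; the scores are made-up values, not taken from a model.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (temperature must be > 0)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 3.0, 0.2, 0.1, 1.5]   # illustrative relevance scores only

# Low temperature sharpens the distribution; high temperature flattens it.
for temperature in (0.25, 1.0, 4.0):
    weights = softmax(scores, temperature)
    print(temperature, [round(w, 3) for w in weights])
```

At the lowest temperature nearly all of the weight lands on the highest-scoring token; at the highest temperature the weights move toward uniform, although their ordering stays the same.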


Attention expressed in code

# a relevance score for each earlier token, then softmax to get weights
scores  = query @ keys.T            # how well each token matches what we need
weights = softmax(scores)            # positive, sum to 1 — the bars you saw
context = weights @ values          # a weighted blend of the earlier tokens

These three lines are the core of the attention operation. Module 4 develops the construction of query, keys, and values in detail, but the operation is always the same: compute scores, apply softmax to obtain weights, and compute a weighted sum of the values.
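To run the three steps end to end, one option is the NumPy sketch below. The `query`, `keys`, and `values` arrays are random stand-ins with arbitrarily chosen toy sizes; in a real model they come from learned projections, which are not shown here. The `softmax` helper is written out by hand rather than taken from a library.

```python
import numpy as np

def softmax(x):
    """Turn raw scores into positive weights that sum to 1."""
    exps = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return exps / exps.sum()

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
num_tokens, dim = 5, 4              # toy sizes, chosen arbitrarily

# Random stand-ins for illustration only.
query = rng.normal(size=dim)                 # what the current position is looking for
keys = rng.normal(size=(num_tokens, dim))    # one key vector per earlier token
values = rng.normal(size=(num_tokens, dim))  # one value vector per earlier token

scores = query @ keys.T             # shape (num_tokens,): one score per earlier token
weights = softmax(scores)           # shape (num_tokens,): positive, sums to 1
context = weights @ values          # shape (dim,): weighted blend of the value vectors

print("weights:", np.round(weights, 3))
print("context shape:", context.shape)
```

The transformer paper additionally divides the scores by the square root of the key dimension before the softmax; that scaling is left out here to mirror the three-line snippet above.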

AI anchor: the foundational result of "Attention Is All You Need"

The attention mechanism is the architectural innovation behind the current generation of language models. The 2017 paper that introduced the transformer, titled "Attention Is All You Need," showed that older sequential (recurrent) architectures could be replaced by this single operation: each token attends directly to the other tokens, weighted by learned relevance. The capacity to model long-range dependencies, such as connecting a pronoun to its antecedent fifty tokens earlier, is what allows models such as GPT and Claude to stay coherent over passages where a bigram model would lose the thread after a few tokens.


Why this matters next

The conceptual structure of attention has been introduced: assign weights to preceding tokens. Module 4 develops the underlying machinery (queries, keys, and values) that lets the model compute these weights from the input itself rather than from manually specified scores.

Summary: attention enables a model, when predicting the next token, to assign each previously observed token a weight quantifying its current relevance — at any distance — and this single mechanism is what enables modern LLMs to maintain coherence over long passages of text.
