Module 3 — Attention, Intuitively

How models read context · hands-on · about 25 minutes.

Module 1's bigram model had a fatal flaw: it remembered only the single previous word. Real sentences have dependencies that reach much further back. To predict the word that follows "the keys to the cabinet…", you must look past "cabinet" all the way back to "keys" to know the verb is "were," not "was." The mechanism that lets a model reach back and weigh earlier words is attention, the central idea behind virtually every modern LLM.

The core question attention answers

When predicting the next token, attention asks: of all the earlier tokens, how much should each one matter right now? It assigns every previous token a weight between 0 and 1, and the weights add up to 1. A high weight means "this word is highly relevant to what I'm about to predict"; a near-zero weight means "ignore this one for now."
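
As a minimal sketch of that weighting, the snippet below turns one relevance score per earlier word into weights with a softmax. The scores are made up for illustration and do not come from a real model, and `softmax` is a small helper defined here just for the example.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for predicting the word after
# "the keys to the cabinet" (illustrative numbers only).
tokens = ["the", "keys", "to", "the", "cabinet"]
scores = [0.1, 3.0, 0.2, 0.1, 1.5]

weights = softmax(scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>8}  {weight:.2f}")
```

Every weight lands strictly between 0 and 1, the weights sum to 1, and "keys" receives the largest weight because it was given the largest score.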

See where the model should look

Below, the model is predicting the highlighted blank. First decide for yourself which earlier word the prediction most depends on — then reveal the attention weights and see if you agree.

This activity needs JavaScript. The lesson below still covers everything.

Sharp vs. diffuse attention

Attention is not all-or-nothing. The same relevance scores can produce a sharp focus on one word or a diffuse spread across many, depending on how decisively the softmax (the same function from Course 4) turns the scores into weights. Drag the focus dial and watch the weights concentrate or spread, and watch the blended "context" the model carries forward change with them.
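
One common way to build such a dial is to divide the scores by a temperature before the softmax: a small temperature sharpens the weights, a large one flattens them. The lesson does not say how its widget is implemented, so treat this as an assumption; the names `softmax_with_temperature` and `temperature` are illustrative, and the scores are made up.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores / temperature; temperature must be positive."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 3.0, 0.2, 0.1, 1.5]   # one fixed set of made-up scores

sharp = softmax_with_temperature(scores, temperature=0.25)   # low temperature: focused
diffuse = softmax_with_temperature(scores, temperature=4.0)  # high temperature: spread out

print("sharp:  ", [round(w, 2) for w in sharp])
print("diffuse:", [round(w, 2) for w in diffuse])
```

The scores never change between the two calls; only the temperature does. With the low setting nearly all of the weight goes to the highest-scoring token, while the high setting spreads it much more evenly.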

The shape of attention — read only, nothing to install
# a relevance score for each earlier token, then softmax to get weights
scores  = query @ keys.T            # how well each token matches what we need
weights = softmax(scores)            # positive, sum to 1 — the bars you saw
context = weights @ values          # a weighted blend of the earlier tokens

That is the whole idea. Module 4 opens up exactly where query, keys, and values come from — but the move is always: score, soften into weights, blend.
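
For readers who want to run the three steps, here is a self-contained sketch in plain Python. It uses lists instead of the matrix operator `@`, so nothing needs installing. The vectors are tiny made-up numbers, and `dot` and `attend` are hypothetical helper names. Real transformers also divide the scores by the square root of the vector length before the softmax; that step is left out here to match the three lines above.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Score each earlier token, soften into weights, blend the values."""
    scores = [dot(query, key) for key in keys]               # score
    weights = softmax(scores)                                # soften into weights
    width = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(width)]                        # blend
    return weights, context

# Made-up two-number vectors for three earlier tokens.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [4.0, 0.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, context = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

Because the second key matches the query best, it receives most of the weight, and the blended context ends up close to the second value vector.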

AI anchor: "Attention Is All You Need"

Attention is the mechanism that broke language AI open. The 2017 paper that introduced the transformer was titled "Attention Is All You Need," and it replaced older step-by-step (recurrent) models with this one move: let every token directly look at the other tokens and weigh them. That ability to reach across a long passage, for example connecting a pronoun to the noun it refers to fifty words earlier, is why models like GPT and Claude can stay coherent over paragraphs where the bigram model fell apart after three words.

Check your understanding

A few questions about attention. You will get a score.

Why this matters next

You have the intuition: weigh the earlier tokens. Module 4 opens the box and shows the actual machinery (queries, keys, and values) that lets the model compute those weights instead of you setting them by hand.

One-sentence summary: attention lets a model, when predicting the next token, assign every earlier token a weight that says how much it matters right now, reaching back as far as needed, and that single idea is what makes modern LLMs coherent over long text.

Next: How Self-Attention Works →