Module 1 — Predicting the Next Token
Strip away the mystery and a large language model does exactly one thing, over and over: it looks at the text so far and predicts the next token. A token is roughly a word (we will get precise in Module 2). To write a sentence, the model predicts one token, adds it to the text, then predicts the next from the longer text — again and again. That is the entire engine. Everything else in this course exists to make that one prediction good.
This is conditional probability — the Course 1 idea
If you took Course 1, you have already met the heart of this. The model estimates:
\[ P(\text{next token} \mid \text{text so far}) \]
Read it as: "given what I have seen, how likely is each possible next word?" The simplest version conditions on just the one previous word — a bigram model. It learns these probabilities by a method you already understand: counting. Go through some text, and for every word, tally which words came right after it. Turn those tallies into fractions and you have \( P(\text{next} \mid \text{previous}) \).
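Written out as a formula, the counting recipe above looks roughly like this. Here \( v \) stands for the previous word, \( w \) for a candidate next word, and \( \text{count}(v, w) \) for the number of times \( w \) came right after \( v \); these symbols are introduced only for this illustration.
\[ P(\text{next} = w \mid \text{previous} = v) = \frac{\text{count}(v, w)}{\sum_{w'} \text{count}(v, w')} \]
As a made-up example (the numbers are not taken from the course corpus): if "the" is followed by "cat" 3 times and by "dog" 1 time, then
\[ P(\text{cat} \mid \text{the}) = \frac{3}{3 + 1} = 0.75, \qquad P(\text{dog} \mid \text{the}) = \frac{1}{3 + 1} = 0.25 \]
The fractions for one previous word always add up to 1, which is what makes them a probability distribution.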
Build the model by counting
Below is a tiny "training corpus" — a dozen short sentences. We counted, for every word, what tends to follow it. Pick a word and see the distribution the model learned. Tall bars are words it expects next; flat-and-many means it is unsure.
Generate text by sampling, one token at a time
Now run the loop. Start from a word, let the model sample a likely next word, move to that word, and sample again: a sentence builds itself. Because the model remembers only one word back, the result is fluent in short bursts but wanders over a whole sentence. That limitation is exactly what attention (Module 3) fixes.
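    from collections import Counter, defaultdict

    # Stand-in corpus so the snippet runs on its own (an assumption);
    # swap in the tokens of your real training text.
    tokens = "the cat sat on the mat".split()

    # Count which token follows each token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    # Normalize to probabilities: P(next | "the"), exactly the bars you saw above.
    dist = counts["the"]
    total = sum(dist.values())
    probs = {w: c / total for w, c in dist.items()}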
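The generation loop described in this section can be sketched in the same spirit. This is a minimal illustration and not the code behind the interactive activity: the three-sentence `corpus`, the helper names `build_counts` and `generate`, and the `max_tokens` cap are all assumptions made for this example. It counts followers within each sentence, then repeatedly samples the next word in proportion to its count and stops early if the current word was never followed by anything.

```python
import random
from collections import Counter, defaultdict

# Hypothetical stand-in corpus; the lesson's own sentences are not reproduced here.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]


def build_counts(sentences):
    """Tally, for every word, which words came right after it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_tokens=8, rng=None):
    """Sample one word at a time, conditioning only on the previous word."""
    rng = rng or random.Random()
    out = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # dead end: this word never had a follower in the corpus
        words = list(followers)
        weights = list(followers.values())
        # Raw counts work as weights; choices() normalizes them internally.
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


counts = build_counts(corpus)
print(" ".join(generate(counts, "the", rng=random.Random(0))))
```

Passing a seeded `random.Random` makes a run repeatable; leaving `rng` out gives a different sentence each time. Every adjacent pair in the output appeared somewhere in the corpus, yet the sentence as a whole can still drift, which is the one-word-memory limitation described above.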
A real LLM replaces "count the previous word" with a deep network that reads thousands of previous tokens, but it is still producing \( P(\text{next} \mid \text{context}) \). Same target, vastly richer context.
Check your understanding
A few questions about next-token prediction. You will get a score.