Module 1 — Predicting the Next Token

The core idea · hands-on · about 25 minutes.

Strip away the mystery and a large language model does exactly one thing, over and over: it looks at the text so far and predicts the next token. A token is roughly a word (we will get precise in Module 2). To write a sentence, the model predicts one token, adds it to the text, then predicts the next from the longer text — again and again. That is the entire engine. Everything else in this course exists to make that one prediction good.

This is conditional probability — the Course 1 idea

If you took Course 1, you have already met the heart of this. The model estimates:

\[ P(\text{next token} \mid \text{the text so far}) \]

Read it as: "given what I have seen, how likely is each possible next word?" The simplest version conditions on just the one previous word — a bigram model. It learns these probabilities by a method you already understand: counting. Go through some text, and for every word, tally which words came right after it. Turn those tallies into fractions and you have \( P(\text{next} \mid \text{previous}) \).
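
Written out, the counting recipe is a single fraction. Here \( \text{count}(a, b) \) is shorthand introduced for this note (not notation from Course 1) for the number of times word \( b \) appears right after word \( a \) in the training text:

\[ P(\text{next} = b \mid \text{previous} = a) = \frac{\text{count}(a, b)}{\sum_{w} \text{count}(a, w)} \]

As a made-up example, suppose "the" is followed by "cat" twice and by "dog" once. Then:

\[ P(\text{cat} \mid \text{the}) = \frac{2}{2 + 1} = \frac{2}{3}, \qquad P(\text{dog} \mid \text{the}) = \frac{1}{3} \]

The fractions for any one previous word always add up to 1.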

Build the model by counting

Below is a tiny "training corpus" — a dozen short sentences. We counted, for every word, what tends to follow it. Pick a word and see the distribution the model learned. Tall bars are words it expects next; flat-and-many means it is unsure.
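
If you want to see the counting step without the interactive chart, here is a minimal sketch. The three sentences are a hypothetical stand-in, not the lesson's actual corpus, and `train_bigram` is a name made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in corpus; the lesson's own sentences are not reproduced here.
SENTENCES = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def train_bigram(sentences):
    """Return {previous_word: {next_word: probability}} learned by counting."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Pair each word with the one right after it, within a sentence only.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {w: c / total for w, c in followers.items()}
    return model

model = train_bigram(SENTENCES)

# Print a rough text version of the bar chart for one word.
for word, p in sorted(model["the"].items(), key=lambda kv: -kv[1]):
    print(f"{word:>6} {'#' * round(p * 20)} {p:.2f}")
```

In this toy corpus "the" is followed by "cat" and "dog" a third of the time each, and by "mat" and "rug" a sixth of the time each, so the model is fairly unsure after "the". After "sat" it has only ever seen "on", so that distribution is a single tall bar.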

Generate text by sampling, one token at a time

Now run the loop. Start from a word, let the model pick a likely next word, move to that word, pick again — and a sentence builds itself. Because it only remembers one word back, the result is fluent in short bursts and wanders over a whole sentence. That limitation is exactly what attention (Module 3) fixes.
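
The loop can be sketched in a few lines. This is an illustration under assumptions: the corpus is a made-up toy, `generate` is a hypothetical helper name, and sampling uses the raw follow-counts as weights.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus, not the lesson's own sentences.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def generate(start, max_tokens=8, seed=None):
    """Sample a word sequence from the bigram counts, one token at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(out[-1])
        if not followers:  # this word was never followed by anything
            break
        words = list(followers)
        weights = list(followers.values())  # raw counts work as sampling weights
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(out)

print(generate("the", seed=0))
```

Each step looks only at `out[-1]`, the single most recent word. That is the one-word memory described above: every pair of neighbouring words is plausible, but nothing ties the start of the sentence to its end.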

The same idea in code — read only, nothing to install
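from collections import Counter, defaultdict

# stand-in corpus (an assumption for illustration); any list of word tokens works
tokens = "the cat sat on the mat and the dog sat on the rug".split()

# count which token follows each token
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

# normalize the counts for "the" into P(next | "the"), like the bars above
dist = counts["the"]
total = sum(dist.values())
probs = {w: c / total for w, c in dist.items()}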

A real LLM replaces "count the previous word" with a deep network that reads thousands of previous tokens, but it is still producing \( P(\text{next} \mid \text{context}) \). Same target, vastly richer context.
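
To get a feel for what "richer context" buys you, here is a small sketch that conditions on two previous words instead of one. It is still plain counting, not a neural network, and both the corpus and the `count_followers` helper are made up for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus for illustration.
tokens = "the cat sat on the mat and the cat ran to the dog".split()

def count_followers(tokens, context_size):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1
    return counts

one_word = count_followers(tokens, 1)
two_words = count_followers(tokens, 2)

print(dict(one_word[("the",)]))        # followers of "the"
print(dict(two_words[("on", "the")]))  # followers of "on the"
```

With one word of context, "the" has three different followers in this corpus. With two words, "on the" has only one. More context narrows the prediction, but longer contexts also repeat far less often in any finite text, so pure counting runs out of examples quickly. That is one reason a real model learns a network rather than a lookup table.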

AI anchor: every chatbot reply is this loop

When ChatGPT or Claude answers you, it is running this same loop: predict the next token from everything so far, append it, and repeat, often hundreds of times, until it predicts a special "stop" token. The fluency comes from a far better next-token estimate than counting one word back, but the move is identical. The model does not write out a plan for the whole sentence first; the reply is built one probable token at a time. Hold onto that: it explains both why LLMs are so fluent and, in Module 8, why they confidently make things up.
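
The stopping rule can be sketched as follows. Everything here is an assumption for illustration: the `<stop>` marker, the hand-written probability table, and the `predict` and `reply` names are hypothetical stand-ins, not how any particular chatbot is implemented.

```python
import random

STOP = "<stop>"  # hypothetical end marker; real models use their own special token

# Hand-written toy probabilities (an assumption, not learned from data).
TABLE = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, STOP: 0.3},
    "dog": {"sat": 0.6, STOP: 0.4},
    "sat": {STOP: 1.0},
}

def predict(context):
    """Stand-in for the model: return P(next | context) as a dict.

    This toy version looks only at the last token; a real LLM uses all of `context`.
    """
    return TABLE[context[-1]]

def reply(prompt, max_tokens=50, seed=None):
    """Predict, append, repeat, until the stop marker is sampled."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_tokens):
        dist = predict(context)
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == STOP:
            break
        context.append(token)
    return context

print(" ".join(reply(["the"], seed=0)))
```

Note that `reply` never decides the length of the answer up front. The reply ends only when the stop marker happens to be the sampled token, or when the `max_tokens` safety limit is reached.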

Why this matters next

Our model treated words as bare labels: it has no idea that "cat" and "dog" are similar. Module 2 fixes that by turning each token into a vector of numbers (an embedding), so the model can finally see which words are related.

One-sentence summary: a language model predicts the next token from the text so far: it estimates \( P(\text{next} \mid \text{context}) \), the simplest version just counts which word follows which, and text is generated by sampling that distribution one token at a time.