
Module 1 — Predicting the Next Token

The core idea · hands-on · about 25 minutes.

Fundamentally, a large language model performs a single operation iteratively: it conditions on the text generated so far and predicts the next token. A token is approximately a word (precise definitions are introduced in Module 2). To generate a sentence, the model predicts one token, appends it to the context, then predicts the next token conditioned on the extended context — and so on. This is the entire generation procedure. Every other component developed in this course serves to improve the quality of this single prediction.

The conditional-probability formulation

For students who completed Course 1, the underlying concept has already been introduced. The model estimates:

\[ P(\text{next token} \mid \text{the text so far}) \]

This is read as: "given the observed context, what is the probability distribution over possible next tokens?" The simplest model conditions only on the immediately preceding token — termed a bigram model. It estimates these probabilities by a method already familiar from Course 1: counting. Process a corpus, tabulate the frequency with which each word follows each other word, and normalize the counts to obtain the conditional probability \( P(\text{next} \mid \text{previous}) \).
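Written out, the counting estimate can be stated as follows. The notation \( C(u, w) \), meaning the number of times word \( w \) immediately follows word \( u \) in the corpus, is introduced here for illustration and is not part of the original lesson:

\[ P(w \mid u) = \frac{C(u, w)}{\sum_{w'} C(u, w')} \]

As a hypothetical example, if "the" is followed by "cat" 3 times and by "dog" once, the estimate is \( P(\text{cat} \mid \text{the}) = 3/4 \) and \( P(\text{dog} \mid \text{the}) = 1/4 \).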

Constructing the model by counting

Below is a small training corpus of approximately a dozen sentences. For each preceding word, the distribution over next tokens has been computed from the counts. Select a word to display the model's learned distribution: tall bars correspond to high-probability next tokens, while a flat distribution spread over many tokens has high entropy, meaning the model is uncertain about what comes next.
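The link between a flat distribution and uncertainty can be made concrete with Shannon entropy. The sketch below is a minimal illustration; the function name `entropy_bits` and both example distributions are hypothetical and are not taken from the lesson's corpus.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a distribution given as {token: probability}."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical distributions, chosen only to contrast a peaked and a flat shape.
peaked = {"cat": 0.9, "dog": 0.05, "mat": 0.05}
flat = {"cat": 0.25, "dog": 0.25, "mat": 0.25, "sat": 0.25}

print(entropy_bits(peaked))  # roughly 0.57 bits: one token dominates
print(entropy_bits(flat))    # 2.0 bits: four equally likely tokens
```

A distribution that puts all its mass on one token has entropy 0, and a uniform distribution over the same tokens has the largest entropy possible for that set.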


Text generation via iterated sampling

The generation loop proceeds as follows: starting from a seed word, sample a next token from the model's conditional distribution, condition on that token to sample the subsequent token, and iterate. Because the bigram model conditions on only the previous token, the output exhibits local fluency but lacks coherence across longer spans. This limitation is the motivation for attention (Module 3).
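The loop described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the course's own implementation: the helper names `train_bigram` and `generate`, the `max_tokens` cap, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, seed_word, max_tokens=10, rng=None):
    """Sample up to max_tokens tokens, each conditioned only on the previous one."""
    rng = rng or random.Random()
    output = [seed_word]
    for _ in range(max_tokens):
        dist = counts.get(output[-1])
        if not dist:  # the last token was never followed by anything in the corpus
            break
        words = list(dist.keys())
        weights = list(dist.values())
        # random.choices accepts unnormalized weights, so raw counts work directly
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return output

# Hypothetical toy corpus, not the lesson's corpus.
tokens = "the cat sat on the mat . the dog sat on the rug .".split()
counts = train_bigram(tokens)
print(" ".join(generate(counts, "the", rng=random.Random(0))))
```

Every adjacent pair in the output appeared somewhere in the corpus, which is why the text looks locally plausible. Nothing ties a token to anything further back than its immediate predecessor, so longer spans drift.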


The equivalent procedure in code

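from collections import Counter, defaultdict

# A tiny stand-in corpus (an assumption for this snippet); the lesson's
# corpus would be split into tokens the same way.
tokens = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which token follows each token.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

# Normalize the counts for "the" into P(next | "the"),
# the same quantity the bars above display.
dist = counts["the"]
total = sum(dist.values())
probs = {w: c / total for w, c in dist.items()}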

A production LLM replaces the bigram counting procedure with a deep network that conditions on thousands of preceding tokens — but the model's output remains the conditional distribution P(next | context). The estimated quantity is the same; the conditioning context is substantially larger and the estimator substantially more expressive.
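In symbols, one common way to write this contrast is the following, where \( x_t \) denotes the token at position \( t \) (notation assumed here for illustration, not introduced in the lesson). Any distribution over a token sequence factors exactly into next-token conditionals:

\[ P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1}) \]

The bigram model approximates each factor by discarding all but the most recent token:

\[ P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-1}) \]

An LLM estimates the same factor while conditioning on as much of \( x_1, \dots, x_{t-1} \) as fits in its context window.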

AI anchor: every chatbot response is generated by this loop

When ChatGPT or Claude generates a response, it runs essentially this loop: predict the next token from the preceding context, append it, and repeat, typically hundreds of times, until the model emits a designated stop token. The fluency of the output comes from a far more accurate next-token estimator, not from a fundamentally different generation procedure. The procedure contains no separate step that plans a sentence in advance; the sentence is produced one probable token at a time. This property accounts for both the fluency of LLM output and, as discussed in Module 8, the phenomenon of confident factual confabulation.
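A minimal sketch of a predict, append, repeat loop that ends on a stop token is given here. Everything named in it is an assumption made for illustration: the `<stop>` string, the lookup-table "model" `toy_next_token_dist`, and the `max_new_tokens` safety cap do not describe how any particular chatbot is implemented.

```python
import random

STOP = "<stop>"  # hypothetical stop-token name chosen for this sketch

def toy_next_token_dist(context):
    """Hypothetical stand-in for a model: returns {token: probability} for the next token."""
    table = {
        "hello": {"world": 0.7, "there": 0.3},
        "world": {STOP: 1.0},
        "there": {STOP: 1.0},
    }
    # A real model would use the whole context; this toy looks only at the last token.
    return table.get(context[-1], {STOP: 1.0})

def generate_until_stop(next_token_dist, prompt, max_new_tokens=50, rng=None):
    """Predict, append, repeat until the stop token or the length cap is reached."""
    rng = rng or random.Random()
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_dist(context)
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == STOP:
            break
        context.append(token)
    return context[len(prompt):]  # only the newly generated tokens

print(generate_until_stop(toy_next_token_dist, ["hello"], rng=random.Random(1)))
```

The loop takes the estimator as an argument, so swapping the lookup table for a far better predictor would change the output quality without changing the loop itself.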

Check your understanding

Answer a short set of questions on next-token prediction.


Why this matters next

The model developed here treats words as opaque labels: it has no internal representation of the fact that "cat" and "dog" are semantically related. Module 2 addresses this limitation by representing each token as a learned vector of real numbers (an embedding), which lets the model encode semantic relationships between words.

Summary: a language model predicts the next token from the preceding context — formally, it estimates \( P(\text{next} \mid \text{context}) \). The simplest variant estimates this distribution by counting, and text is generated by iteratively sampling from the resulting distribution.

Next: Tokens & Embeddings →

Mini-game: the live tokenizer

Before tokens reach the model, text is split into pieces. Type anything — numbers, punctuation, math, a sentence — and watch exactly how an LLM sees it.
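To give a rough feel for what "split into pieces" means, here is a deliberately simplified, rule-based splitter. It is an assumption-laden sketch: LLM tokenizers generally use learned subword vocabularies instead of a fixed rule like this one, and the name `simple_tokenize` is hypothetical.

```python
import re

# A piece is either a run of letters, digits, or underscores,
# or a single character that is neither a word character nor whitespace.
PIECE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word-like runs and individual punctuation marks."""
    return PIECE.findall(text)

print(simple_tokenize("3.14 is pi, roughly!"))
# ['3', '.', '14', 'is', 'pi', ',', 'roughly', '!']
```

Even this crude rule shows that a number such as 3.14 may reach the model as several separate pieces.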
