Module 7 — Sampling & Generation

From probabilities to text · hands-on · about 30 minutes.

A trained model does not directly output text. At each step it outputs a probability distribution over all possible next tokens. Converting this distribution into a single selected token is a distinct operation termed sampling, and it governs whether the model's output is deterministic or varied. The model and its weights are unchanged; only the sampling procedure differs. This module examines the three principal sampling parameters: temperature, top-k, and top-p.

The decision sampling resolves

Suppose the next-token distribution is "the" 65%, "a" 20%, "that" 7%, "one" 3%, plus a long tail of rarer options. One option is to always select the highest-probability token (greedy decoding), which is deterministic but prone to repetition. An alternative is to sample from the distribution, so that lower-probability tokens are occasionally selected; this is more varied but less predictable. Each sampling parameter is a transformation applied to this distribution prior to sampling.
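The two options can be compared directly with a minimal sketch using only Python's standard library. The `greedy` and `sample` helpers are illustrative names, and the `<other>` entry is an assumption that lumps the long tail into the remaining 5%.

```python
import random

# Illustrative next-token distribution from the text. "<other>" is an
# assumption: it stands in for the long tail and holds the remaining 5%.
probs = {"the": 0.65, "a": 0.20, "that": 0.07, "one": 0.03, "<other>": 0.05}

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed only so the run is repeatable
greedy_picks = [greedy(probs) for _ in range(5)]
sampled_picks = [sample(probs, rng) for _ in range(1000)]

print(greedy_picks)                                 # the same token every time
print({t: sampled_picks.count(t) for t in probs})   # counts roughly track probs
```

Greedy decoding returns "the" on every call, while the sampled counts approximate the original percentages, so the rarer tokens do appear now and then.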

The three parameters

Temperature: divides the logits before the softmax. A low temperature (→0) sharpens the distribution toward greedy decoding; a temperature of 1 leaves it unchanged; a high temperature (>1) flattens it toward uniform. In effect it rescales the model's confidence.
Top-k: retains only the k highest-probability tokens, sets the remainder to zero, and renormalizes. This imposes a fixed upper bound on how far into the low-probability tail sampling can reach.
Top-p (nucleus sampling): retains the smallest set of tokens whose cumulative probability reaches p (e.g. 0.9), discards the rest, and renormalizes. The cutoff is adaptive: fewer tokens are retained when the model is confident, more when it is uncertain.
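Each of the three transformations can be sketched as a small function and applied to the example distribution. This is an illustrative sketch, not any library's implementation: the function names are hypothetical, the `<other>` entry again assumes the tail holds the remaining 5%, and the logits are recovered as log-probabilities purely for demonstration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = {tok: z / temperature for tok, z in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(z - peak) for tok, z in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def renormalize(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return renormalize({tok: probs[tok] for tok in kept})

def top_p(probs, p):
    """Keep the smallest most-likely-first set whose mass reaches p."""
    kept, cumulative = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = probs[tok]
        cumulative += probs[tok]
        if cumulative >= p:
            break
    return renormalize(kept)

base = {"the": 0.65, "a": 0.20, "that": 0.07, "one": 0.03, "<other>": 0.05}
logits = {tok: math.log(p) for tok, p in base.items()}  # assumed logits

sharp = softmax_with_temperature(logits, 0.5)  # "the" rises above 0.65
flat = softmax_with_temperature(logits, 2.0)   # "the" falls below 0.65
top2 = top_k(base, 2)                          # only "the" and "a" survive
nucleus = top_p(base, 0.9)                     # "the", "a", "that" survive

for name, dist in [("T=0.5", sharp), ("T=2.0", flat), ("k=2", top2), ("p=0.9", nucleus)]:
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```

With p = 0.9 the nucleus needs three tokens here because "the" and "a" together cover only 85%; a more confident distribution would need fewer.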

Observe the effect on the distribution

Beginning from the same distribution, adjust the parameters. Observe which tokens are retained and how the probabilities are renormalized, then sample to view the resulting text.

Sampling expressed in code

logits     = model(context)[-1]            # scores for the next token
probs      = softmax(logits / temperature)  # temperature reshapes confidence (needs temperature > 0)
probs      = top_k(probs, k=40)             # keep the 40 most likely
probs      = top_p(probs, p=0.9)            # …then the nucleus inside that
next_token = sample(probs)                  # roll the weighted die

These are exactly the knobs real APIs expose — the Anthropic API offers temperature, top_p, and top_k; the OpenAI API offers temperature and top_p. They do not modify the model; they determine only how its output distribution is reduced to a single token.
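The five-step pipeline above can be made runnable with only the standard library. The sketch below is one possible reading of it, not a reference implementation: `sample_next_token` is a hypothetical helper, the logits are made-up numbers standing in for `model(context)[-1]`, and treating a temperature of 0 as greedy decoding is an assumed convention that avoids dividing by zero.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, k=40, p=0.9, rng=None):
    """Pick one token id from a list of logits (hypothetical helper)."""
    rng = rng or random
    if temperature <= 0:
        # logits / 0 is undefined, so temperature 0 is treated as greedy here.
        return max(range(len(logits)), key=lambda i: logits[i])

    # softmax(logits / temperature), shifted by the max for stability
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    weights = [math.exp(z - peak) for z in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top_k: ids of the k most likely tokens, most likely first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in order)

    # top_p: shortest prefix whose share of the kept mass reaches p
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= p:
            break

    # sample: random.choices normalizes the weights itself
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)                 # fixed seed only for repeatability
greedy_id = sample_next_token(logits, temperature=0)
draws = [sample_next_token(logits, temperature=0.8, k=3, p=0.9, rng=rng)
         for _ in range(200)]

print(greedy_id)           # 0, the highest-scoring token
print(sorted(set(draws)))  # ids outside the top 3 never appear
```

Because top-p is applied after top-k, the threshold p is measured against the probability mass that survived top-k, which matches the order of the steps in the pipeline.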

Watch it generate, token by token

Everything above acted on a single distribution. Here the same three parameters drive a real (if tiny) language model trained on a small corpus: press Generate and watch it sample an actual sentence one token at a time, re-rolling the weighted die at every step. Turn the temperature down and it loops; turn it up and it unravels.
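The token-by-token loop can be imitated offline with a toy bigram model. This is a sketch under stated assumptions, not the model behind the activity: the corpus, the `generate` function, and the use of log-counts as stand-in logits are all invented for illustration.

```python
import math
import random
from collections import Counter, defaultdict

# Tiny made-up corpus; a bigram model only remembers which token follows which.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat saw the dog ."
tokens = corpus.split()

follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def generate(start, steps, temperature, rng):
    """Extend `start` one token at a time, sampling again at every step."""
    out = [start]
    for _ in range(steps):
        counts = follows.get(out[-1])
        if not counts:
            break  # no known successor for this token
        candidates = list(counts)
        if temperature <= 0:
            # Temperature 0 is treated as greedy decoding (assumed convention).
            choice = max(candidates, key=counts.get)
        else:
            # Log-counts play the role of logits and are divided by temperature.
            weights = [math.exp(math.log(counts[c]) / temperature) for c in candidates]
            choice = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(choice)
    return " ".join(out)

greedy_text = generate("the", 8, 0.0, random.Random(0))
varied_text = generate("the", 8, 1.5, random.Random(1))

print(greedy_text)  # the cat sat on the cat sat on the
print(varied_text)  # changes with the seed
```

The greedy run shows the looping behaviour described above. A bigram model can only choose successors it has seen, so raising the temperature here makes the choice among them closer to uniform rather than producing truly incoherent text.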

AI anchor: why a deterministic model produces variable output

When ChatGPT or Claude produces a different response to an identical prompt, the model's weights are unchanged; the variation arises from sampling. A low temperature yields more predictable, near-deterministic output (appropriate for code generation or factual tasks); a higher temperature yields more variable output (appropriate for brainstorming or creative writing). The "regenerate" function simply draws a new sample. Every token produced by an LLM is the result of this operation: a probability distribution, transformed by these parameters, sampled once.

Check your understanding

Answer a short set of questions on sampling.

Why this matters next

You now understand how an LLM converts probabilities into text, and that it always samples from a distribution rather than retrieving stored information. Module 8 develops the consequence of this fact: why models produce hallucinations, why fluency does not imply factual accuracy, and how to use these models effectively notwithstanding this limitation.

Summary: a language model outputs a probability distribution over the next token, and sampling — temperature to rescale confidence, top-k and top-p to restrict the tail — is the distinct operation that reduces this distribution to a single token, accounting for why the same model can produce deterministic or variable output.
