
Module 7 — Sampling & Generation

From probabilities to text · hands-on · about 30 minutes.

Your trained model does not output words. At every step it outputs a probability for every possible next token: a whole distribution. Turning that distribution into one actual token is a separate decision called sampling, and it is where a model goes from sounding robotic to sounding creative. Same model, same weights; only the sampling rule changes. This module covers the knobs: temperature, top-k, and top-p.

The choice the model leaves you

Suppose the next-token distribution is: "the" 60%, "a" 25%, "that" 10%, "one" 5%. You could always take the top one (greedy), which is safe but repetitive. Or you could roll a weighted die and let the long tail occasionally win, which is more varied but riskier. Every generation knob is just a way of reshaping this distribution before the die is rolled.
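
To make the two options concrete, here is a minimal sketch in Python using only the standard library. The four-token distribution is the one from the paragraph above; the function names `greedy` and `sample` and the fixed seed are illustrative choices, not part of any real model API.

```python
import random

# The example distribution from the text (probabilities sum to 1).
dist = {"the": 0.60, "a": 0.25, "that": 0.10, "one": 0.05}

def greedy(dist):
    """Always return the single most likely token."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Roll a weighted die: each token wins in proportion to its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(dist, rng) for _ in range(10_000)]
counts = {t: draws.count(t) for t in dist}

print(greedy(dist))  # always "the"
print(counts)        # roughly proportional to 60 / 25 / 10 / 5
```

Greedy returns the same token on every call, while the weighted draw lets "one" through about one time in twenty.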


The three knobs

See them reshape the distribution

Starting from the same distribution, each knob setting changes which tokens survive and how the remaining probabilities rescale. Sampling from the reshaped distribution then produces the text.
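
The reshaping can be sketched directly on the four-token example. This is an illustrative sketch, not a library implementation: the helper names `apply_temperature`, `top_k`, and `top_p` are assumptions chosen to match the knob names, and each helper renormalizes so the surviving probabilities sum to 1.

```python
def apply_temperature(dist, temperature):
    """Rescale confidence: raise each probability to 1/T, then renormalize.
    This is equivalent to dividing the logits by T before the softmax."""
    scaled = {t: p ** (1.0 / temperature) for t, p in dist.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

def top_k(dist, k):
    """Keep the k most likely tokens, then renormalize."""
    kept = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {t: p / total for t, p in kept}

def top_p(dist, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p (the nucleus), then renormalize."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {t: prob / total for t, prob in kept}

dist = {"the": 0.60, "a": 0.25, "that": 0.10, "one": 0.05}
for temperature in (0.5, 1.0, 2.0):
    reshaped = apply_temperature(dist, temperature)
    print(temperature, {tok: round(p, 3) for tok, p in reshaped.items()})
print(top_k(dist, 2))    # only "the" and "a" survive
print(top_p(dist, 0.9))  # "the", "a", "that" survive; "one" is trimmed
```

A temperature below 1 sharpens the distribution toward "the", a temperature above 1 flattens it toward the tail, and top-k and top-p both cut the tail: top-k by a fixed count, top-p by cumulative probability.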


Sampling in code — read only, nothing to install
logits = model(context)[-1]            # scores for the next token
probs  = softmax(logits / temperature)  # temperature reshapes confidence
probs  = top_k(probs, k=40)             # keep the 40 most likely
probs  = top_p(probs, p=0.9)            # …then the nucleus inside that
next   = sample(probs)                  # roll the weighted die

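The three knobs in this listing correspond to the sampling parameters that hosted LLM APIs expose. Anthropic's API accepts temperature, top_p, and top_k; OpenAI's API accepts temperature and top_p but has no top_k parameter. They do not change the model, only how its distribution is collapsed into one token.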
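
The five-line listing above leaves `softmax`, `top_k`, `top_p`, and `sample` undefined. For readers who want something they can run, here is one possible self-contained version in plain Python. It is a sketch under assumptions: the function `sample_next_token`, the toy vocabulary, and the choice to measure the top-p threshold against the renormalized top-k mass are illustrative, and real inference libraries may order or combine these steps differently.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, k=40, p=0.9, rng=random):
    """Return the index of the sampled token.

    Temperature must be > 0; this sketch does not special-case greedy decoding.
    """
    probs = softmax([x / temperature for x in logits])
    # Rank token indices from most to least likely and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Within those, keep the nucleus. The threshold is measured against the
    # renormalized top-k mass, matching the order of steps in the listing.
    top_mass = sum(probs[i] for i in ranked)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i] / top_mass
        if cumulative >= p:
            break
    # random.choices normalizes the weights itself.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

vocab = ["the", "a", "that", "one"]  # toy vocabulary (hypothetical)
logits = [math.log(q) for q in (0.60, 0.25, 0.10, 0.05)]
rng = random.Random(0)
print([vocab[sample_next_token(logits, 0.8, k=3, p=0.9, rng=rng)] for _ in range(8)])
```

Calling it with a very low temperature, such as 0.05, concentrates almost all of the probability on the top token, so the nucleus shrinks to that single token and the output stops varying.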

AI anchor: why the same model feels different every time

When ChatGPT or Claude gives you a different answer to the identical prompt, the weights did not change; the sampling did. A low temperature makes a model sound careful and nearly deterministic (good for code or facts); a higher one makes it sound inventive (good for brainstorming or fiction). "Regenerate" simply rolls the weighted die again. Every token you have ever read from an LLM came out of exactly this step: a probability distribution, reshaped by these knobs, sampled once.


Why this matters next

You now know how an LLM turns probabilities into text, and that it is always sampling from a distribution, never looking anything up. Module 8 follows that fact to its consequence: why models hallucinate, why fluent does not mean true, and how to use them well anyway.

One-sentence summary: a language model outputs a probability distribution over the next token, and sampling (temperature to rescale confidence, top-k and top-p to trim the tail) is the separate step that collapses that distribution into one actual token, which is why the same model can sound robotic or creative.
