Module 8 — Why LLMs Hallucinate & How to Use Them Well
This is the payoff. Across seven modules you built the whole machine: next-token prediction, tokens and embeddings, attention, the transformer block, training, and sampling. Now you put it to work on the question everyone actually has: why do these models confidently make things up, and how do I use them well anyway? Hallucination is not a flaw bolted on. It falls directly out of everything you built: an LLM is a next-token sampler, and a next-token sampler always produces something plausible whether or not it knows anything.
The model never says "I don't know" on its own
At every step the model has a probability distribution and it samples from it, full stop. There is no separate "do I actually know this?" check. When the training text strongly supports a continuation, the distribution is sharp and the output is usually right. When it doesn't, the distribution is flat and the model still picks something: fluent, confident, and possibly invented. That confident pick with nothing behind it is a hallucination, and you can watch it happen on the tiny model you trained.
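The sharp-versus-flat contrast can be sketched in a few lines of Python. This is a minimal illustration, not the course's tiny model: the four candidate tokens and both sets of logits are made-up values, and entropy (in bits) is used here as one rough way to measure how flat a distribution is.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: low means sharp, high means flat."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def sample(tokens, probs, rng):
    """Draw one token. Note there is no 'I don't know' branch."""
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and logits, chosen only for illustration.
tokens = ["Paris", "Lyon", "Rome", "Oslo"]
sharp_logits = [8.0, 1.0, 0.5, 0.0]  # training text strongly supports one answer
flat_logits = [0.2, 0.1, 0.0, 0.1]   # training text supports nothing in particular

rng = random.Random(0)  # fixed seed so repeated runs match
for name, logits in [("sharp", sharp_logits), ("flat", flat_logits)]:
    probs = softmax(logits)
    picks = [sample(tokens, probs, rng) for _ in range(5)]
    print(name, "entropy:", round(entropy_bits(probs), 2), "samples:", picks)
```

In both cases `sample` returns a token from the list. Nothing in the sampling step distinguishes a well-supported answer from a guess; only the shape of the distribution differs, and the reader of the output never sees that shape.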
Fluent is not the same as true
Almost everything an LLM writes is grammatical and confident, because fluency is exactly what next-token prediction optimizes. Truth is not the objective; it is a frequent side effect of fluency when the training data was accurate and dense on a topic. So the dangerous case is the plausible-but-wrong answer: maximally fluent, quietly false. Below, judge a few model claims the way you should judge them in real life.
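To see why truth is only a side effect, here is a toy sketch in which a bigram counter stands in for the language model. It is far simpler than a transformer, and the three-sentence corpus is invented for illustration, but the objective has the same shape: predict the next word from what the training text contained.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the wrong claim appears more often than the right one.
corpus = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Normalize the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: count / total for word, count in counts[prev].items()}

counts = train_bigram(corpus)
probs = next_word_probs(counts, "is")
best = max(probs, key=probs.get)
print(probs)
print("most likely continuation:", best)
```

The counter prefers whichever continuation the corpus repeated most. It has no notion of which sentence is correct, so accuracy in the output depends entirely on accuracy and density in the training text.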
How to use them well
- Ground it — give the model the source text (paste the doc, the data, the code). Answering from material in its context window is far more reliable than answering from memory.
- Verify the checkable — names, dates, numbers, citations, APIs. These are exactly where a fluent guess hides. Confirm them against a real source.
- Match the temperature to the task — low for facts and code, higher for brainstorming. (Module 7.)
- Mind the context window — the model only "sees" what fits in its context. Beyond that it is back to memory, where hallucination lives.
- Treat it as a brilliant, fast, unreliable intern — wonderful at drafts and breadth, never the final authority on a fact.
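The temperature advice in the list can be made concrete with a small sketch. It assumes the usual formulation, in which logits are divided by the temperature before the softmax; the code in Module 7 may differ in detail, and the logits here are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print("temperature", t, "->", [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top candidate, which suits facts and code. A higher one spreads probability across the alternatives, which suits brainstorming. Neither setting adds knowledge: a low temperature only makes the model commit more firmly to whatever it already ranked first.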
docs = retrieve(question, knowledge_base)  # pull real source text
prompt = f"Answer ONLY from these sources:\n{docs}\n\nQ: {question}"
answer = model.generate(prompt, temperature=0.2)  # low temp, grounded
This is retrieval-augmented generation (RAG), one of the most common production patterns for cutting hallucination. It does not change the model; it moves the answer's evidence into the context window, where the model is far more reliable, and turns the temperature down.
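The snippet above leaves `retrieve` undefined. Below is a minimal runnable sketch of the retrieval and prompt-building steps, under loud assumptions: word-overlap scoring stands in for a real retriever, the `max_chars` budget is a crude stand-in for a token-based context limit, the knowledge base is invented, and `score` and `build_prompt` are hypothetical helper names. The model call itself is omitted because it depends on whichever model API you use.

```python
import re

def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(question, passage):
    """Toy relevance score: number of words shared with the question."""
    return len(words(question) & words(passage))

def retrieve(question, knowledge_base, top_k=2):
    """Return up to top_k passages that share at least one word with the question."""
    ranked = sorted(knowledge_base, key=lambda p: score(question, p), reverse=True)
    return [p for p in ranked[:top_k] if score(question, p) > 0]

def build_prompt(question, docs, max_chars=500):
    """Keep passages in rank order until the (assumed) context budget is spent."""
    kept, used = [], 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break  # anything past the budget is invisible to the model
        kept.append(doc)
        used += len(doc)
    sources = "\n".join(f"- {doc}" for doc in kept)
    return (
        "Answer ONLY from these sources. "
        "If they do not contain the answer, say so.\n"
        f"{sources}\n\nQ: {question}"
    )

# Hypothetical knowledge base, for illustration only.
knowledge_base = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping is free on orders over 50 dollars.",
]
question = "How long is the refund window?"
docs = retrieve(question, knowledge_base)
prompt = build_prompt(question, docs)
print(prompt)
```

The budget check is where the context-window advice bites: a passage that does not fit is simply dropped, and the model falls back to memory for whatever it contained. The extra instruction to say when the sources lack the answer is an addition in this sketch, not part of the original prompt.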
Check your understanding
A synthesis quiz spanning the whole course. You will get a score.