Module 8 — Why LLMs Hallucinate & How to Use Them Well

Capstone · hands-on · about 30 minutes.

This is the payoff. Across seven modules you built the whole machine: next-token prediction, tokens and embeddings, attention, the transformer block, training, and sampling. Now you put it to work answering the question everyone actually has: why do these models confidently make things up, and how do I use them well anyway? Hallucination is not a flaw bolted on from outside. It falls directly out of everything you built: an LLM is a next-token sampler, and a next-token sampler always produces something plausible whether or not the model knows anything.

The model never says "I don't know" on its own

At every step the model has a probability distribution and samples from it, full stop. There is no separate "do I actually know this?" check. When the training text strongly supports a continuation, the distribution is sharp and the output is usually right. When it doesn't, the distribution is flat and the model still picks something: fluent, confident, and possibly invented. That confident but invented output is a hallucination, and you can watch it happen on the tiny model you trained.
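To make the sharp-versus-flat contrast concrete, here is a minimal sketch using only the Python standard library. The candidate tokens and logits are made-up illustrative values, not output from any real model, and the helper names (`softmax`, `entropy_bits`, `sample`) are hypothetical. It shows that the sampling step returns a token in both cases, and that entropy is one way to measure how flat a distribution is.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: near 0 is sharp, log2(n) is fully flat."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def sample(tokens, probs, rng):
    """Draw one token. Note that there is no 'abstain' branch."""
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidate next tokens and made-up logits (assumptions for illustration).
tokens = ["Paris", "Lyon", "Rome", "Oslo"]
sharp = softmax([9.0, 2.0, 1.0, 0.5])  # continuation strongly supported by training text
flat = softmax([1.1, 1.0, 0.9, 1.0])   # model has little to go on

rng = random.Random(0)  # fixed seed so repeated runs match
for name, probs in [("sharp", sharp), ("flat", flat)]:
    picked = sample(tokens, probs, rng)
    print(f"{name}: entropy={entropy_bits(probs):.2f} bits, sampled={picked!r}")
```

The sampler has the same shape in both branches: it always returns one of the candidate tokens. Any "I don't know" behavior would have to come from somewhere else, for example a caller that inspects the entropy before trusting the pick.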

Fluent is not the same as true

Nearly everything an LLM writes is grammatical and confident-sounding, because fluency is exactly what next-token prediction optimizes. Truth is not the objective; it is a frequent side effect of fluency when the training data was accurate and dense on a topic. So the dangerous case is the plausible-but-wrong answer: maximally fluent, quietly false. The activity in this section asks you to judge a few model claims the way you should judge them in real life.

How to use them well

Grounding in code: retrieval-augmented generation, the standard fix
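docs    = retrieve(question, knowledge_base)       # pull real source text (assumed: a list of passages)
sources = "\n\n".join(docs)                        # join passages so the prompt shows text, not a list repr
prompt  = f"Answer ONLY from these sources:\n{sources}\n\nQ: {question}"
answer  = model.generate(prompt, temperature=0.2)  # low temperature, grounded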

This is RAG, one of the most common production patterns for cutting hallucination. It does not change the model; it moves the answer's evidence into the context window, where the model is far more reliable than when it answers from memory, and turns the temperature down so sampling favors the highest-probability tokens.
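The retrieval and prompt-building steps can be sketched in runnable form as follows. This is a toy under stated assumptions, not a production retriever: `retrieve` here ranks passages by simple word overlap (real systems typically use embeddings or a search index), `tokenize` and `build_prompt` are hypothetical helpers, the knowledge base is invented example data, and the extra instruction to admit when the sources lack the answer is an addition for illustration. The model call itself is left as a comment because `model` is the lesson's placeholder.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, knowledge_base, k=2):
    """Return up to k passages that share the most words with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(p)), p) for p in knowledge_base]
    scored = [(score, p) for score, p in scored if score > 0]  # drop unrelated passages
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]

def build_prompt(question, docs):
    """Put the evidence in the context window, numbered so answers can cite it."""
    sources = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    return (
        "Answer ONLY from these sources. "
        "If they do not contain the answer, say you do not know.\n"
        f"{sources}\n\nQ: {question}"
    )

# Invented example data (assumption for illustration).
knowledge_base = [
    "The warranty on the X100 blender lasts 24 months.",
    "The X100 blender has a 1.5 litre jug.",
    "Returns are accepted within 30 days of purchase.",
]
question = "How long is the X100 warranty?"

docs = retrieve(question, knowledge_base)
prompt = build_prompt(question, docs)
print(prompt)
# answer = model.generate(prompt, temperature=0.2)  # `model` is a placeholder, not defined here
```

Numbering the sources is a small design choice that supports verification: a reader can check each claim in the answer against the passage it cites.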

AI anchor: you understand the whole machine now

You can now explain, end to end, what happens when you type into ChatGPT or Claude: your text becomes tokens, embeddings carry meaning, attention reads context across stacked transformer blocks, training on a huge corpus shaped every weight, and at each step a probability distribution is sampled into the next token. Hallucination, creativity, and brilliance are all the same mechanism seen from different angles. You did not just use a language model; you understand one.

Check your understanding

A synthesis quiz spanning the whole course. You will get a score.

Where you go next

You have finished Inside Large Language Models. You built every part (token, embedding, attention, block, training, sampling), and you can now reason about what these models can and cannot be trusted to do. Take that straight into how you prompt, ground, and verify them in your own work.
One-sentence summary: an LLM is a next-token sampler with no built-in sense of truth, so it always produces fluent output whether or not it knows — which is why it hallucinates, and why using it well means grounding it in real sources, verifying the checkable, and treating fluency as a draft rather than a fact.
