Module 8 · Project — Why LLMs Hallucinate and How to Use Them Effectively

Project · synthesis · about 30 minutes.

This module consolidates the course. The preceding seven modules developed the complete architecture: next-token prediction, tokens and embeddings, attention, the transformer block, training, and sampling. This module applies that understanding to a central practical question — why do these models generate confident but false statements, and how can they be used effectively despite this? Hallucination is not an externally introduced defect; it follows directly from the architecture. An LLM is a next-token sampler, and a next-token sampler always produces plausible output regardless of whether the underlying information is present.

The model has no intrinsic mechanism for expressing uncertainty

At every step the model produces a probability distribution and samples from it; there is no separate verification of whether the information is actually known. When the training data strongly supports a particular continuation, the distribution is concentrated and the output is typically correct. When the training data does not support a continuation, the distribution is diffuse and the model nonetheless samples a token — producing fluent, confident, and potentially fabricated output. This failure mode is termed hallucination, and it can be observed directly on the small model trained earlier in the course.
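
The contrast between a concentrated and a diffuse distribution can be made concrete in a few lines of Python. The sketch below is illustrative only: the four-token vocabulary and the logit values are invented for the example and do not come from the course model. It computes the entropy of each distribution and samples one token from each.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: low means concentrated, high means diffuse."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical vocabulary and logits, invented for illustration.
vocab = ["Paris", "Lyon", "Rome", "Oslo"]
well_supported = softmax([8.0, 1.0, 0.5, 0.2])     # one continuation is strongly favoured
unsupported = softmax([1.0, 1.0, 1.0, 1.0])        # no continuation is favoured

rng = random.Random(0)
for name, probs in [("well supported", well_supported), ("unsupported", unsupported)]:
    token = rng.choices(vocab, weights=probs, k=1)[0]
    print(f"{name}: entropy = {entropy_bits(probs):.2f} bits, sampled {token!r}")
```

The uniform case has an entropy of exactly 2 bits (log2 of 4), the maximum for four tokens, yet the sampler still returns a token and nothing in the output marks it as a guess.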

Fluency does not imply factual accuracy

LLM output is almost always grammatical and confident in tone, because fluency is precisely what the next-token prediction objective optimizes. Factual accuracy is not the training objective; it is a frequent by-product of fluency when the training data on a given topic was accurate and well represented. The problematic case is therefore the plausible-but-incorrect output: maximally fluent yet false. The activity for this section asks you to evaluate several model claims using the criteria appropriate to real-world use.

How to use them well

Provide grounding context — supply the model with the relevant source material (the document, data, or code). Answering from material present in the context window is substantially more reliable than answering from the model's parameters.
Verify verifiable claims — names, dates, numerical values, citations, and API references are exactly the elements most susceptible to fluent fabrication. Confirm them against an authoritative source.
Match the temperature to the task — low for factual and code-generation tasks, higher for ideation (Module 7).
Account for the context window — the model conditions only on information within its context window. Beyond that limit, it relies on its parameters, where hallucination is most likely.
Treat the model as a capable but unreliable assistant — highly effective for drafting and breadth of coverage, but not an authoritative source for any specific fact.
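
The temperature guideline in the list above can be illustrated numerically. The sketch below assumes the common formulation in which logits are divided by the temperature before the softmax is applied; the logit values are invented for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A low temperature concentrates probability on the top-ranked token, while a higher one spreads it across the alternatives. Lowering the temperature makes output more repeatable, but it does not supply knowledge the model lacks.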

Grounding in code — retrieval-augmented generation

docs    = retrieve(question, knowledge_base)   # pull real source text
prompt  = f"Answer ONLY from these sources:\n{docs}\n\nQ: {question}"
answer  = model.generate(prompt, temperature=0.2)  # low temp, grounded

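This is retrieval-augmented generation (RAG), one of the most widely used production techniques for reducing hallucination. It does not modify the model; it places the supporting evidence within the context window, where the model is substantially more reliable than when answering from its parameters. The low sampling temperature in the example is a separate, complementary choice rather than part of RAG itself.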
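
A slightly fuller version of the three-line example above might look like the following. This is a minimal sketch under stated assumptions, not a production implementation: the keyword-overlap `retrieve` function stands in for a real retriever (which would typically rank by embedding similarity), the `knowledge_base` contents are invented, and the character budget `max_chars` is a crude stand-in for a token-based context limit. The model call is omitted because it depends on the provider's API.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, knowledge_base, top_k=2):
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(p)), p) for p in knowledge_base]
    scored = [(s, p) for s, p in scored if s > 0]           # drop passages with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)     # stable sort keeps input order on ties
    return [p for _, p in scored[:top_k]]

def build_prompt(question, docs, max_chars=2000):
    """Place retrieved passages in the prompt, staying within a size budget."""
    sources, used = [], 0
    for i, doc in enumerate(docs, start=1):
        entry = f"[{i}] {doc}"
        if used + len(entry) > max_chars:
            break                                           # stop before exceeding the budget
        sources.append(entry)
        used += len(entry)
    joined = "\n".join(sources) if sources else "(no sources found)"
    return (
        "Answer ONLY from these sources. "
        "If they do not contain the answer, say so.\n"
        f"{joined}\n\nQ: {question}"
    )

# Hypothetical knowledge base, invented for illustration.
knowledge_base = [
    "The Orion service exposes its health check at /healthz on port 8080.",
    "Deployments of the Orion service are triggered from the release branch.",
    "The cafeteria is open from 8 am to 3 pm on weekdays.",
]

question = "Which port does the Orion health check use?"
docs = retrieve(question, knowledge_base)
prompt = build_prompt(question, docs)
print(prompt)
# The prompt would then be sent to a model at a low temperature.
```

Instructing the model to say when the sources do not contain the answer gives it a fluent alternative to fabrication, although the reply should still be checked against the retrieved passages.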

AI anchor: the complete model architecture in retrospect

You can now describe, end to end, the computation performed when text is submitted to ChatGPT or Claude: the text is tokenized, embeddings encode semantic meaning, attention incorporates context across stacked transformer blocks, training on a large corpus determined every weight, and at each step a probability distribution is sampled to produce the next token. Hallucination, creativity, and accurate reasoning are the same mechanism observed under different conditions. You have progressed from using a language model to understanding its construction.
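
The last step in that description, repeated sampling from a next-token distribution, can be sketched with a toy model. The sketch below replaces the transformer with a hand-written lookup table of next-token probabilities, an assumption made purely to keep the example short; in a real LLM the distribution at each step is computed from the whole context by the embedding, attention, and transformer layers.

```python
import random

# Hypothetical next-token probabilities keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(rng, max_tokens=10):
    """Autoregressive loop: look up a distribution, sample, append, repeat."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(dist)
        weights = [dist[c] for c in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens[1:]                      # drop the <start> marker

rng = random.Random(0)
for _ in range(3):
    print(" ".join(generate(rng)))
```

Every run yields a grammatical three-word sentence, whichever branch is sampled. The loop contains no step that checks whether the sentence is true, which is the point made throughout this module.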

Project — synthesis across the course

A synthesis quiz integrating material from across the course.

Conclusion

You have completed Inside Large Language Models. You have developed every component (token, embedding, attention, transformer block, training, sampling) and can now reason rigorously about the capabilities and limitations of these models. Apply this understanding to how you prompt, ground, and verify them in practice.

Summary: an LLM is a next-token sampler with no intrinsic representation of factual accuracy, so it always produces fluent output regardless of whether the information is present — which is the source of hallucination, and the reason effective use requires grounding the model in authoritative sources, verifying verifiable claims, and treating fluent output as a draft rather than an established fact.
