Inside Large Language Models
You have used ChatGPT and Claude. This course opens them up. A large language model is, at its heart, doing one thing over and over: predicting the next token. Everything else — embeddings, attention, the transformer — is machinery built to make that one prediction astonishingly good. Here you will build that machinery from the bottom up, and every piece runs live in your browser.
This is genuinely hands-on. You will train a real bigram model on a tiny corpus and watch it babble, then sharpen it; place tokens in an embedding space and find their nearest neighbours; turn attention weights up and down and see which earlier words a prediction leans on; work through the query–key–value math of self-attention with hand-driven sliders; train a tiny neural language model live and watch its loss curve fall; and turn the temperature dial to feel the difference between dull and unhinged text. Each module also shows the matching Hugging Face / PyTorch idea in read-only form, so you can recognize it later with nothing to install now. Each module ends with a short mastery check; pass it to mark the module complete.
The core idea
Module 1 · Predicting the Next Token
The whole game in one move: given the words so far, what comes next? Activity: build a live bigram model from a small corpus, see the probability bars, and sample sentences from it. AI anchor: this is exactly the conditional probability from Course 1, scaled up.
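As a rough illustration of the bigram idea, the sketch below counts which word follows which and then samples from those counts. The corpus, the function names, and the word-level splitting are assumptions made for this example, not the course's own code.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the raw counts after `prev` into probabilities P(next | prev)."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

def sample_sentence(counts, start, max_words=8, seed=0):
    """Repeatedly sample the next word from the bigram distribution."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(max_words - 1):
        followers = counts.get(words[-1])
        if not followers:  # dead end: this word never had a successor
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

# Hypothetical toy corpus, not the one used in the course.
corpus = "the cat sat on the mat . the dog sat on the rug ."
model = train_bigram(corpus)
print(next_word_probs(model, "the"))
print(sample_sentence(model, "the"))
```

In this toy corpus "the" is followed once each by "cat", "mat", "dog", and "rug", so each gets probability 0.25, while "sat" is always followed by "on".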
Module 2 · Tokens & Embeddings
Models do not see words — they see numbers. Activity: turn text into tokens, place each token as a vector on a 2D map, and find a token’s nearest neighbours by meaning. AI anchor: every prompt becomes a sequence of embeddings first.
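A minimal sketch of the nearest-neighbour search, assuming hand-picked 2D vectors and cosine similarity as the measure of closeness. Real embeddings are learned and have hundreds or thousands of dimensions; the values and names here are purely illustrative.

```python
import math

# Hypothetical 2D embeddings chosen by hand for illustration only.
embeddings = {
    "king": (0.9, 0.8),
    "queen": (0.85, 0.9),
    "apple": (-0.7, 0.3),
    "banana": (-0.8, 0.2),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(token, k=1):
    """Return the k other tokens whose vectors point most nearly the same way."""
    scored = [
        (cosine(embeddings[token], vec), other)
        for other, vec in embeddings.items()
        if other != token
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

print(nearest("king"))
print(nearest("apple"))
```

With these vectors, "queen" is the closest neighbour of "king" and "banana" the closest neighbour of "apple", because each pair points in nearly the same direction.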
How models read context
Module 3 · Attention, Intuitively
To predict the next word, which earlier words matter? Activity: move an attention slider across a sentence and watch the model lean on some words and ignore others. AI anchor: "Attention Is All You Need" is the title of the 2017 paper that introduced the transformer, the idea that unlocked modern AI.
Module 4 · How Self-Attention Works
Open the box: queries, keys, and values. Activity: set a query and watch the dot-product scores become softmax weights that blend the values into one output vector. AI anchor: the actual computation inside every transformer layer.
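The query, key, and value steps can be sketched for a single query in plain Python. The vectors below are made up, and dividing the scores by the square root of the vector length follows the usual scaled dot-product convention, which is an assumption about what the module itself shows.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score the query against every key, then blend the values by weight."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Hypothetical keys and values for three earlier tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The query points along the first axis, so the first and third keys score equally and both outweigh the second; the output vector is pulled toward their values.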
Module 5 · The Transformer Block
Stack the parts into the unit that repeats dozens of times in a real LLM. Activity: walk a token through positional encoding, self-attention, a residual add, and a feed-forward layer. AI anchor: GPT and Claude are deep stacks of this one block.
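One way to picture that walk is the heavily simplified sketch below. It assumes sinusoidal positional encoding, uses each vector directly as its own query, key, and value, and leaves out the learned projections, multiple heads, causal masking, and layer normalization that real blocks include. It also adds a residual around the feed-forward layer, as real blocks typically do. All names and weights are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    """Element-wise vector addition, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]

def positional_encoding(pos, dim):
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def self_attention(xs):
    """Simplified: each vector acts as its own query, key, and value."""
    dim = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) for i in range(dim)])
    return out

def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between (biases omitted)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def transformer_block(token_vectors, w1, w2):
    # 1. Add position information to each token vector.
    xs = [add(x, positional_encoding(pos, len(x)))
          for pos, x in enumerate(token_vectors)]
    # 2. Self-attention, then a residual add.
    xs = [add(x, a) for x, a in zip(xs, self_attention(xs))]
    # 3. Feed-forward layer, then another residual add.
    return [add(x, feed_forward(x, w1, w2)) for x in xs]

# Hypothetical inputs: three 2D token vectors and small fixed weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[0.5, -0.5], [0.25, 0.75], [-1.0, 1.0]]  # 3 hidden units x 2 inputs
w2 = [[0.1, 0.2, 0.3], [-0.3, 0.2, 0.1]]       # 2 outputs x 3 hidden units
print(transformer_block(tokens, w1, w2))
```

The output has the same shape as the input, which is what lets the block be stacked many times. In a full model the positional encoding is normally added once at the input rather than inside every block.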
Making it generate
Module 6 · Training a Tiny Language Model
Where does the "knowledge" come from? Activity: train a real neural language model on a small text, epoch by epoch, and watch the loss curve fall as its samples get more coherent. AI anchor: the same gradient descent from Course 4, applied to language.
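To make the falling loss curve concrete, here is a sketch that trains about the smallest possible neural language model: a single table of logits mapping the previous character to the next one, fitted by full-batch gradient descent on cross-entropy loss. The text, learning rate, and epoch count are assumptions, and the model is far simpler than what the module trains.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(text, epochs=50, lr=1.0):
    """Fit next-character logits by gradient descent; return the loss per epoch."""
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    size = len(chars)
    logits = [[0.0] * size for _ in range(size)]  # one row per previous char
    losses = []
    for _ in range(epochs):
        grads = [[0.0] * size for _ in range(size)]
        loss = 0.0
        for prev, nxt in pairs:
            probs = softmax(logits[prev])
            loss -= math.log(probs[nxt])  # cross-entropy for this pair
            for j in range(size):
                # Gradient of cross-entropy w.r.t. logits: probs - one_hot.
                grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
        for i in range(size):
            for j in range(size):
                logits[i][j] -= lr * grads[i][j] / len(pairs)
        losses.append(loss / len(pairs))
    return losses

# Hypothetical training text.
text = "hello world hello there"
losses = train(text)
print(losses[0], losses[-1])
```

The first recorded loss equals the log of the vocabulary size, because the untrained model assigns equal probability to every character. Each epoch then moves probability toward the character pairs that actually occur, so the average loss falls.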
Module 7 · Sampling & Generation
The model gives probabilities — how do they become text? Activity: turn the temperature dial and switch on top-k and top-p sampling, watching the output move from robotic to creative to incoherent. AI anchor: the settings behind every chatbot reply.
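The three controls can be sketched as small functions over a list of logits. The logits and function names are hypothetical, and libraries differ in details such as the order in which the filters are applied.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax. Low = sharper, high = flatter."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
cold = apply_temperature(logits, 0.5)
hot = apply_temperature(logits, 2.0)
base = apply_temperature(logits, 1.0)
print(cold, hot)
print(top_k_filter(base, 1), top_p_filter(base, 0.8))
print(sample(top_p_filter(base, 0.8), random.Random(0)))
```

A low temperature concentrates probability on the top token, which reads as robotic, while a high temperature spreads it out, which drifts toward incoherent. Top-k and top-p both cut off the unlikely tail before sampling; top-k keeps a fixed number of tokens, whereas top-p keeps however many are needed to reach the chosen probability mass.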
Capstone
Module 8 · Capstone · Why LLMs Hallucinate & How to Use Them Well
Put it together: a model trained to sound fluent is not trained to be true. Activity: see how a confident wrong answer is generated, what the context window can and cannot hold, and turn that into practical habits for trusting and verifying AI. A synthesis check ties every module together.