Module 6 — Training a Tiny Language Model

Making it generate · hands-on · about 30 minutes.

You have built the whole architecture — embeddings, attention, the transformer block. But a freshly-built model is random: its weights are noise, so its predictions are noise. Where does the "knowledge" come from? The same place it did in Course 4: training. In this module you train a real language model live, in your browser, and watch its loss fall and its output crawl from gibberish toward English.

What "training a language model" means

It is exactly the Course 4 loop, with one specific job: predict the next token. In this tiny model, each token is a single character. For every position in the training text, the model predicts a probability distribution over the possible next characters, and we measure how surprised it was by the character that actually came. That surprise is the loss (cross-entropy): the negative log of the probability the model gave to the correct character. Gradient descent then nudges every weight to reduce that surprise. Repeat over the whole text many times (each pass is an epoch), and the predictions sharpen.
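To make "surprise" concrete, here is a minimal sketch in plain Python. The helper names (`next_char_pairs`, `cross_entropy`) and the probability values are made up for illustration and are not taken from the in-browser model. It shows how a text becomes (current character, next character) training pairs, and how the loss for one prediction is the negative log of the probability given to the correct character.

```python
import math

def next_char_pairs(text):
    """Pair each character with the character that follows it."""
    return [(text[i], text[i + 1]) for i in range(len(text) - 1)]

def cross_entropy(probs, target):
    """Surprise at the true next character: -log of the probability given to it."""
    return -math.log(probs[target])

pairs = next_char_pairs("hello")
# pairs == [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]

# Hypothetical predictions after seeing 'h' (illustrative numbers only).
confident = {"e": 0.9, "l": 0.05, "o": 0.05}
unsure = {"e": 0.1, "l": 0.45, "o": 0.45}

loss_confident = cross_entropy(confident, "e")  # about 0.105: barely surprised
loss_unsure = cross_entropy(unsure, "e")        # about 2.303: very surprised
```

The training loss reported for a whole text is typically the average of this per-position value over all pairs.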

The model below is a genuine trainable next-character model — real weights, real cross-entropy, real gradient descent. Nothing here is faked: when you click Train, it actually learns.

This activity needs JavaScript. The lesson below still covers everything.

Reading the loss curve

The falling curve is the model getting less surprised by its training text — better at predicting the next character. Early on it drops fast (easy wins, like learning that a space often follows certain letters), then flattens as it squeezes out the harder patterns. This is the identical shape you saw training neural nets in Course 4, because it is the same process — only the task ("predict the next token") is specific to language.
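A useful reference point when reading the curve is the loss of a model that knows nothing. If every character is given equal probability, the cross-entropy is the natural log of the vocabulary size, and a freshly initialised model usually starts close to that value. The sketch below assumes a 27-character vocabulary (26 lowercase letters plus space) and a loss measured with the natural log; the vocabulary of the in-browser model may differ. It also converts a loss into perplexity, the effective number of equally likely choices the model is torn between.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy of a model that gives every character equal probability."""
    return math.log(vocab_size)

def perplexity(loss):
    """Effective number of equally likely next characters for a given loss."""
    return math.exp(loss)

vocab_size = 27                        # assumed: 26 lowercase letters plus space
start = uniform_loss(vocab_size)       # about 3.30
start_choices = perplexity(start)      # 27: pure guessing
later_choices = perplexity(1.0)        # about 2.72: far fewer plausible options
```

A curve that never drops meaningfully below the uniform value suggests the model has not learned anything yet.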

The training loop in code — read only, nothing to install
for epoch in range(n_epochs):
    optimizer.zero_grad()                    # reset gradients (PyTorch-style frameworks accumulate them)
    logits = model(inputs)                   # predict next-token scores
    loss = cross_entropy(logits, targets)    # how surprised were we?
    loss.backward()                          # gradients (Course 4 backprop)
    optimizer.step()                         # nudge every weight downhill

These are the same steps that trained the spiral classifier in Course 4. An LLM is trained by this same loop, just with a transformer, far more text, and far more compute.
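For readers who want to run the loop without any framework, here is a minimal sketch in plain Python. It is not the transformer from this course and not the code behind the in-browser demo: it swaps in the simplest possible next-character model, a table of scores with one row per current character, and the toy corpus is made up. The structure is the same, though: predict, measure cross-entropy, compute gradients (here by hand, starting from zero each epoch), and step downhill.

```python
import math

text = "the cat sat on the mat. the cat ate the rat."  # made-up toy corpus
chars = sorted(set(text))
index = {c: i for i, c in enumerate(chars)}
V = len(chars)
pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]

# One row of scores (logits) per current character; all zeros means uniform guesses.
weights = [[0.0] * V for _ in range(V)]

def softmax(row):
    """Turn a row of scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(weights, pairs, lr):
    """One full-batch gradient descent step; returns the loss before the update."""
    grads = [[0.0] * V for _ in range(V)]      # gradients start from zero each epoch
    loss = 0.0
    for x, y in pairs:
        probs = softmax(weights[x])            # predict next-character probabilities
        loss -= math.log(probs[y])             # cross-entropy for this position
        for j in range(V):
            # Gradient of cross-entropy w.r.t. the scores: probs - one_hot(target)
            grads[x][j] += probs[j] - (1.0 if j == y else 0.0)
    n = len(pairs)
    for i in range(V):
        for j in range(V):
            weights[i][j] -= lr * grads[i][j] / n   # nudge every weight downhill
    return loss / n

losses = [train_epoch(weights, pairs, lr=1.0) for _ in range(200)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

Because the weights start at zero, the first recorded loss equals the log of the vocabulary size, and later values are lower. Plotting `losses` should give the same fast-then-flat shape described above, on a much smaller scale.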

AI anchor: this is "pre-training," scaled to the whole internet

What you just did in seconds, on a dozen sentences, is what training GPT or Claude does, on a large fraction of the internet, for months, across thousands of GPUs. The objective is the same: predict the next token, minimize the surprise. All the model's apparent knowledge of facts, grammar, and reasoning is a side-effect of getting very, very good at that one prediction over an enormous amount of text. There is no separate "learn facts" step; it all falls out of next-token prediction.

Check your understanding

A few questions about training. You will get a score.

Why this matters next

Your trained model outputs a probability for every next character. But how do those probabilities become actual text, and why does the same model sound robotic one moment and wildly creative the next? Module 7 is all about sampling: temperature, top-k, and top-p.

One-sentence summary: training a language model is the Course 4 loop aimed at one task (predict the next token, measure the surprise as cross-entropy loss, and use gradient descent to lower it over many epochs), and all of an LLM's knowledge is a side-effect of getting good at exactly that.
