Module 6 — Training a Tiny Language Model

Making it generate · hands-on · about 30 minutes.

The complete architecture — embeddings, attention, the transformer block — has now been assembled. However, a newly initialized model has random weights, and therefore produces random predictions. The model's knowledge is acquired through the same procedure introduced in Course 4: training. In this module you train a real language model interactively in the browser and observe its loss decrease and its generated output progress from random characters toward well-formed text.

The definition of training a language model

Training a language model is the Course 4 optimization loop applied to a specific objective: next-token prediction. At every position in the training text, the model produces a probability distribution over the next character, and the loss (cross-entropy) quantifies the model's surprise at the character that actually occurred. Gradient descent then adjusts every weight to reduce this loss. Iterating over the entire corpus many times — each complete pass termed an epoch — progressively sharpens the predictions.
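To make the objective concrete, here is a minimal sketch of how a text becomes next-character training examples and how cross-entropy scores a single prediction. The example string and both probability tables are hypothetical values chosen for illustration; they are not outputs of the model in this module.

```python
import math

text = "hello"

# Each position supplies one training example: the current character is the
# input and the character that follows it is the target.
pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]

def cross_entropy(probs, target):
    """Surprise, in nats, at the character that actually occurred."""
    return -math.log(probs[target])

# Hypothetical predicted distributions over the character that follows "l".
confident = {"h": 0.05, "e": 0.05, "l": 0.10, "o": 0.80}
uniform = {"h": 0.25, "e": 0.25, "l": 0.25, "o": 0.25}

loss_confident = cross_entropy(confident, "o")  # most probability on the right answer
loss_uniform = cross_entropy(uniform, "o")      # no idea: every character equally likely

print(pairs)
print(f"confident: {loss_confident:.3f}  uniform: {loss_uniform:.3f}")
```

The loss depends only on the probability assigned to the character that actually came next, so moving probability toward the correct character is the only way to lower it.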

The model below is a genuine trainable next-character model, with real weights, a real cross-entropy objective, and real gradient descent. The training is not simulated: clicking Train executes actual optimization.

Interpreting the loss curve

The decreasing curve indicates that the model is becoming less surprised by its training text — that is, more accurate at predicting the next character. The loss decreases rapidly at first (capturing easily-learned regularities, such as which characters frequently precede a space), then plateaus as the remaining patterns are progressively more difficult to learn. This is the same loss-curve shape observed when training neural networks in Course 4, because the underlying optimization process is identical; only the objective ("predict the next token") is specific to language modeling.
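A reference point can help when reading the curve. A model that guesses uniformly over V characters has a cross-entropy of ln V, so a freshly initialized model should start near that value if its initial predictions are close to uniform. The sketch below assumes a hypothetical 27-character vocabulary and converts a few loss values back into the typical probability given to the correct character.

```python
import math

vocab_size = 27  # hypothetical: 26 lowercase letters plus space

# A uniform guesser assigns 1/vocab_size to every character, so its
# cross-entropy is ln(vocab_size) regardless of the text.
baseline_loss = math.log(vocab_size)

def typical_probability(loss):
    """Geometric-mean probability assigned to the correct character."""
    return math.exp(-loss)

for loss in (baseline_loss, 2.0, 1.0):
    p = typical_probability(loss)
    print(f"loss {loss:.2f} -> correct character gets about {p:.3f}")
```

Because the conversion is exponential, each drop of 1.0 in the loss multiplies the typical probability of the correct character by e, roughly 2.7.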

The training loop expressed in code

for epoch in range(n_epochs):
    optimizer.zero_grad()                    # clear gradients left over from the previous step
    logits = model(inputs)                   # predict next-token scores
    loss = cross_entropy(logits, targets)    # how surprised were we?
    loss.backward()                          # gradients (Course 4 backprop)
    optimizer.step()                         # nudge every weight downhill

This is the same loop that trained the spiral classifier in Course 4. A production LLM is trained by essentially this loop, differing only in the use of a transformer architecture, a substantially larger corpus, and substantially greater compute.
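The loop can also be run end to end without any framework. The sketch below is an illustration under stated assumptions, not the model used in this module: a bigram lookup table of logits stands in for the network, the gradient of the cross-entropy is written out by hand instead of calling a backward pass, and the corpus, learning rate, and epoch count are hypothetical choices.

```python
import math

text = "the cat sat on the mat. the cat ate the rat."
chars = sorted(set(text))
index = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# Training examples: (current character id, next character id).
pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]

# The whole "model" is a V x V table of logits: row i scores every character
# that could follow character i. All zeros means uniform predictions.
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(lr):
    """One pass over the corpus followed by one gradient-descent step."""
    grads = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for i, t in pairs:
        probs = softmax(logits[i])
        loss -= math.log(probs[t])
        # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
        for j in range(V):
            grads[i][j] += probs[j] - (1.0 if j == t else 0.0)
    n = len(pairs)
    for i in range(V):
        for j in range(V):
            logits[i][j] -= lr * grads[i][j] / n  # step downhill
    return loss / n  # mean loss measured before this epoch's update

losses = [train_epoch(lr=1.0) for _ in range(300)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The first reported loss equals ln V because the table starts at zero, and later values fall as the table learns which characters tend to follow which. A lookup table can only condition on one previous character, which is the limitation that attention and the transformer block are meant to remove.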

AI anchor: pre-training at internet scale. The procedure executed here in seconds on a dozen sentences is the same procedure used to pre-train models such as GPT or Claude, applied to a large fraction of the public internet, over a period of months, across thousands of GPUs. The objective is identical: predict the next token, minimizing the cross-entropy loss. The model's apparent knowledge of facts, grammar, and reasoning is an emergent consequence of optimizing this single objective over an extremely large corpus. Pre-training has no separate fact-acquisition stage; these capabilities arise from next-token prediction.

Check your understanding

Answer a short set of questions on training.

Why this matters next: the trained model outputs a probability distribution over the next character. How is this distribution converted into actual text, and why can the same model produce deterministic output in one configuration and highly variable output in another? Module 7 addresses sampling: temperature, top-k, and top-p.

Summary: training a language model is the Course 4 optimization loop applied to a single objective — predict the next token, measure the cross-entropy loss, and apply gradient descent to minimize it over many epochs — and all of an LLM's knowledge is an emergent consequence of optimizing this objective.
