
Module 5 — The Transformer Block

How models read context · hands-on · about 30 minutes.

Self-attention is the heart of a transformer, but a few more parts are wrapped around it to make a complete, trainable unit — the transformer block. A real LLM is mostly just this one block stacked dozens of times. In this module you assemble it: first fix a problem attention has on its own (it ignores word order), then walk a token through the full block end to end.

The problem: attention is blind to order

Self-attention treats its inputs as a set, not a sequence. Without position information, "the dog bit the man" and "the man bit the dog" are the same bag of tokens, so each token receives exactly the same attention output in both sentences, yet the sentences mean opposite things. Attention alone cannot tell them apart. The fix is positional encoding: before attention, add a position-dependent vector to each token's embedding, stamping each one with where it sits.
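The order-blindness claim can be checked numerically. The sketch below is a minimal illustration, not the lesson's own implementation: it assumes single-head attention with no causal mask, random stand-in weights, a toy four-word vocabulary, and sinusoidal position vectors (one common choice; learned position embeddings are another). The helper names `self_attention` and `sinusoidal_positions` are made up for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model), no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def sinusoidal_positions(seq_len, d_model):
    """Fixed sine/cosine position vectors, one row per position (d_model must be even)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d_model = 8
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}         # toy vocabulary (assumed)
embed = rng.normal(size=(len(vocab), d_model))           # random stand-in embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

a = [vocab[w] for w in "the dog bit the man".split()]
b = [vocab[w] for w in "the man bit the dog".split()]

# Without positions: "bit" is at index 2 in both sentences and sees the same
# set of tokens, so its output vector is the same either way.
bit_a = self_attention(embed[a], w_q, w_k, w_v)[2]
bit_b = self_attention(embed[b], w_q, w_k, w_v)[2]
same_without_positions = np.allclose(bit_a, bit_b)

# With positions: "dog" and "man" now carry different position stamps in the
# two sentences, so "bit" no longer receives identical context.
pe = sinusoidal_positions(len(a), d_model)
bit_a_pos = self_attention(embed[a] + pe, w_q, w_k, w_v)[2]
bit_b_pos = self_attention(embed[b] + pe, w_q, w_k, w_v)[2]
same_with_positions = np.allclose(bit_a_pos, bit_b_pos)

print(same_without_positions, same_with_positions)
```

The comparison uses `np.allclose` rather than exact equality because the two sentences sum the same terms in a different order, which can differ in the last floating-point digits.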


The full block, step by step

With position baked in, a token flows through the block in order: normalize, self-attention, add the residual, normalize again, feed-forward, add the residual. The vector changes at each stage, and the two residual shortcuts add a stage's input back to its output.


What each piece is for

A transformer block in code — read only, nothing to install
def block(x):
    x = x + attention(norm(x))   # self-attention + residual shortcut
    x = x + ffn(norm(x))         # feed-forward + residual shortcut
    return x

# a real LLM is just this block, stacked — GPT-3 stacks it 96 times
for blk in blocks:
    x = blk(x)

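Notice that x = x + ... appears twice: those are the residuals. They are a large part of why a 96-layer stack still trains: each block adds an adjustment to the signal rather than replacing it, so the input (and, during training, the gradient) always has a direct path through every layer.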
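For readers who want something they can actually run, here is one possible NumPy version of the same block. It is a sketch under stated assumptions rather than any particular model's code: toy sizes, random untrained weights, a single attention head, no causal mask, a ReLU feed-forward layer, and a layer norm without learned scale or shift. The names `make_params`, `layer_norm`, and the weight keys are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_LAYERS = 16, 64, 4  # toy sizes, chosen only for illustration

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_params(scale=0.02):
    """Random weights for one block (small values, as is typical at initialization)."""
    shapes = {
        "w_q": (D_MODEL, D_MODEL), "w_k": (D_MODEL, D_MODEL),
        "w_v": (D_MODEL, D_MODEL), "w_o": (D_MODEL, D_MODEL),
        "w_1": (D_MODEL, D_FF), "w_2": (D_FF, D_MODEL),
    }
    return {name: rng.normal(scale=scale, size=shape) for name, shape in shapes.items()}

def attention(x, p):
    """Single-head self-attention followed by an output projection."""
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    scores = q @ k.T / np.sqrt(D_MODEL)
    return softmax(scores) @ v @ p["w_o"]

def ffn(x, p):
    """Position-wise feed-forward network: expand, ReLU, project back."""
    return np.maximum(0.0, x @ p["w_1"]) @ p["w_2"]

def block(x, p):
    x = x + attention(layer_norm(x), p)  # self-attention + residual shortcut
    x = x + ffn(layer_norm(x), p)        # feed-forward + residual shortcut
    return x

blocks = [make_params() for _ in range(N_LAYERS)]
x0 = rng.normal(size=(5, D_MODEL))       # 5 token vectors (embeddings + positions)

x = x0
for p in blocks:
    x = block(x, p)

# How far the stack moved the vectors, relative to their original size.
relative_change = np.linalg.norm(x - x0) / np.linalg.norm(x0)

# A block whose weights are all zero passes its input through unchanged:
# the residual path carries x even when the sublayers contribute nothing.
zero_params = {name: np.zeros_like(w) for name, w in blocks[0].items()}
identity_when_zero = np.array_equal(block(x0, zero_params), x0)

print(x.shape, relative_change, identity_when_zero)
```

The zero-weight check is a small way to see "adjust rather than replace" in action: without the `x +` terms, the same block would output all zeros and the input would be lost.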

AI anchor: the unit that is copied dozens of times

Every large language model you have used is, structurally, this block repeated: GPT-3 stacks it 96 times, larger models more. Tokens enter at the bottom as embeddings + positions and rise through the stack, each block letting them attend to context and refine their vectors, until the top layer's vectors are turned into next-token probabilities. The whole giant is built from the one block you just walked through; there is no extra magic, only scale and repetition.
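The last step, turning the top layer's vectors into next-token probabilities, can be sketched as a matrix multiply followed by a softmax. This is an illustrative assumption about the output stage, not a description of a specific model: `top` is a random stand-in for the final block's output, `w_unembed` is a hypothetical output projection, the vocabulary is a four-word toy, and the final normalization that real models usually apply first is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dog", "bit", "man"]   # toy vocabulary (assumed)
d_model = 8

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

top = rng.normal(size=(5, d_model))                  # stand-in for the top block's output
w_unembed = rng.normal(size=(d_model, len(vocab)))   # hypothetical output projection

logits = top[-1] @ w_unembed   # the last position's vector predicts the next token
probs = softmax(logits)        # one probability per vocabulary entry
next_token = vocab[int(np.argmax(probs))]

print(dict(zip(vocab, probs.round(3))), next_token)
```

With random weights the chosen token is meaningless; the point is only the shape of the computation: one vector in, one probability per vocabulary entry out.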


Why this matters next

You have built the architecture. But an untrained transformer is random: it knows nothing. Module 6 trains a real tiny language model live, so you can watch its loss fall and its output go from gibberish to almost-English as it learns.

One-sentence summary: a transformer adds positional encoding to its input so order is visible, then each block runs self-attention and a feed-forward network, each wrapped in a residual shortcut, and a full LLM is mostly just this one block stacked dozens of times.
