Module 5 — The Transformer Block
Self-attention is the heart of a transformer, but a few more parts are wrapped around it to make a complete, trainable unit — the transformer block. A real LLM is mostly just this one block stacked dozens of times. In this module you assemble it: first fix a problem attention has on its own (it ignores word order), then walk a token through the full block end to end.
The problem: attention is blind to order
Self-attention treats its inputs as a set, not a sequence. To it, "the dog bit the man" and "the man bit the dog" contain exactly the same tokens, so each token ends up with exactly the same attention output in both sentences, even though the sentences mean opposite things. Attention alone cannot tell them apart. The fix is positional encoding: before attention, add a position-dependent vector to each token's embedding, stamping each one with where it sits.
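The order-blindness described above can be checked numerically. What follows is a minimal sketch, not a real model: the four-dimensional embeddings in EMBED are made-up values, the toy attention uses the inputs directly as queries, keys, and values (no learned weights), and the sinusoidal scheme is one common choice of positional encoding, assumed here because the lesson does not name a specific one.

```python
import math

# Hypothetical 4-dimensional embeddings, invented for this illustration.
EMBED = {
    "the": [0.1, 0.3, -0.2, 0.5],
    "dog": [0.9, -0.4, 0.2, 0.1],
    "bit": [-0.3, 0.8, 0.6, -0.1],
    "man": [0.4, 0.2, -0.7, 0.3],
}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy self-attention: queries, keys, and values are the inputs themselves."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs

def positional_encoding(position, d):
    """Sinusoidal encoding (an assumed scheme): sin on even dims, cos on odd dims."""
    return [
        math.sin(position / 10000 ** (i / d)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / d))
        for i in range(d)
    ]

def output_for(sentence, word, use_position):
    """Attention output for the first occurrence of `word` in `sentence`."""
    tokens = sentence.split()
    vectors = []
    for pos, tok in enumerate(tokens):
        vec = EMBED[tok]
        if use_position:
            pe = positional_encoding(pos, len(vec))
            vec = [a + b for a, b in zip(vec, pe)]
        vectors.append(vec)
    return self_attention(vectors)[tokens.index(word)]

def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))

A = "the dog bit the man"
B = "the man bit the dog"

same_without_position = close(output_for(A, "dog", False), output_for(B, "dog", False))
same_with_position = close(output_for(A, "dog", True), output_for(B, "dog", True))

print("dog looks identical without position:", same_without_position)
print("dog looks identical with position:", same_with_position)
```

Without the positional vectors, the output for "dog" matches across the two sentences up to floating-point rounding. Once position is added, the two outputs differ, which is the information the rest of the block needs to separate biter from bitten.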
The full block, step by step
With position baked in, a token flows through the block, and its vector changes at each stage. Notice the two residual shortcuts, each of which adds a stage's input back to its output.
What each piece is for
- Positional encoding — stamps order onto the tokens so the block knows sequence.
- Self-attention — lets each token gather context from the others (Module 4).
- Residual + normalize — adds the input back and rescales, so deep stacks train without the signal vanishing. (This is the overfitting/stability discipline from Course 4, built into the architecture.)
- Feed-forward network — a small per-token neural net (Course 4!) that transforms each token's gathered context into something richer.
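Two of the pieces listed above are small enough to sketch directly. This is an illustration under stated assumptions rather than how any particular model implements them: the normalize step is assumed to be layer normalization and is shown without its learned scale and shift, the feed-forward net is assumed to use ReLU, and the weights W1, B1, W2, B2 are hypothetical hand-picked numbers, not trained values.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale one token's vector to roughly zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def feed_forward(vec, w1, b1, w2, b2):
    """Two-layer per-token network: expand, apply ReLU, project back."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
B2 = [0.0, 0.0]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])

# The same network is applied to every token independently.
tokens = [[1.0, -2.0], [0.5, 0.5]]
outputs = [feed_forward(t, W1, B1, W2, B2) for t in tokens]

print(normalized)
print(outputs)
```

The feed-forward step never looks at neighboring tokens. Mixing information across tokens is attention's job; this network only reworks what each token has already gathered.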
def block(x):
    x = x + attention(norm(x))  # self-attention + residual shortcut
    x = x + ffn(norm(x))        # feed-forward + residual shortcut
    return x

# a real LLM is just this block, stacked; GPT-3 stacks it 96 times
for blk in blocks:
    x = blk(x)
Notice x = x + ... twice — those are the residuals. They are why a 96-layer stack still trains: each block adjusts the signal rather than replacing it.
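To see why "adjusts rather than replaces" matters at depth, here is a toy comparison of 96 blocks with and without the shortcut. It is a sketch under loud assumptions: sublayer is a hypothetical stand-in for attention or the feed-forward net that simply returns a small update, normalization is left out to isolate the effect of the residual, and only the forward signal is tracked, not training itself.

```python
import math

DEPTH = 96  # same depth the lesson quotes for GPT-3

def sublayer(x):
    """Hypothetical stand-in for attention or the FFN: returns a small update."""
    return [0.01 * math.tanh(v) for v in x]

def block_with_residual(x):
    x = [a + b for a, b in zip(x, sublayer(x))]  # x = x + sublayer(x)
    x = [a + b for a, b in zip(x, sublayer(x))]  # second shortcut
    return x

def block_without_residual(x):
    x = sublayer(x)  # output replaces the input
    x = sublayer(x)
    return x

def run(block, x, depth=DEPTH):
    for _ in range(depth):
        x = block(x)
    return x

def magnitude(x):
    return math.sqrt(sum(v * v for v in x))

start = [1.0, -0.5, 0.25, 2.0]
kept = run(block_with_residual, start)
lost = run(block_without_residual, start)

print("start:           ", magnitude(start))
print("with residual:   ", magnitude(kept))
print("without residual:", magnitude(lost))
```

With the shortcut, the original vector is still recognizable after 96 blocks, nudged a little by each one. Without it, every block overwrites its input with the small update, and the signal shrinks toward zero long before the last layer.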