Module 5 — The Transformer Block

How models read context · hands-on · about 30 minutes.

Self-attention is the central operation of a transformer, but several additional components surround it to form a complete, trainable unit — the transformer block. A production LLM consists primarily of this single block replicated many times. In this module you assemble the block: first by addressing a limitation of attention in isolation (its insensitivity to token order), then by tracing a single token through the complete block.

The limitation: attention is invariant to token order

Self-attention treats its input as an unordered set rather than a sequence. Consequently, "the dog bit the man" and "the man bit the dog" contain identical tokens, and without positional information (and without a causal mask) attention computes exactly the same output vector for each token in both sentences, despite their opposite meanings; attention alone cannot distinguish them. The standard remedy is positional encoding: prior to attention, a position-dependent vector is added to each token's embedding, encoding the token's position within the sequence.
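A minimal sketch of this effect, assuming toy two-dimensional embeddings and an attention function that uses the embeddings directly as queries, keys, and values (no learned projections, no causal mask). The names EMB, attend, and add_positions are illustrative and not taken from the course code, and the sinusoidal scheme is only one common choice of positional encoding. Without positions, the vector computed for "dog" is the same in both sentences; once a position vector is added, it differs.

```python
import math

# Toy embeddings (hypothetical values, chosen only for illustration).
EMB = {"the": [1.0, 0.0], "dog": [0.0, 1.0], "bit": [1.0, 1.0], "man": [-1.0, 1.0]}

def attend(xs):
    """Unmasked self-attention with queries = keys = values = xs."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) / total for i in range(d)])
    return out

def add_positions(xs):
    """Add a sinusoidal position vector to each embedding."""
    d = len(xs[0])
    result = []
    for pos, x in enumerate(xs):
        pe = []
        for i in range(d):
            angle = pos / (10000 ** (2 * (i // 2) / d))
            pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        result.append([a + b for a, b in zip(x, pe)])
    return result

def output_for(word, sentence, use_positions):
    """Attention output at the position where `word` first occurs."""
    xs = [EMB[w] for w in sentence]
    if use_positions:
        xs = add_positions(xs)
    return attend(xs)[sentence.index(word)]

def close(u, v, tol=1e-9):
    return all(abs(a - b) <= tol for a, b in zip(u, v))

s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()

same_without = close(output_for("dog", s1, False), output_for("dog", s2, False))
same_with = close(output_for("dog", s1, True), output_for("dog", s2, True))
print(same_without, same_with)  # prints: True False
```

The comparison uses a small tolerance because the two sentences sum the same terms in a different order, which can change the last floating-point digit.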

The complete block, stage by stage

With positional information incorporated, a token is propagated through the block one stage at a time. Each stage transforms the token's vector; note in particular the two residual connections, which add each sub-layer's input to its output.

The function of each component

Positional encoding — encodes token order so that the block is sensitive to sequence.
Self-attention — enables each token to incorporate context from the other tokens (Module 4).
Residual connection and normalization — the block first normalizes its input, feeds that through the sub-layer, then adds the result back onto the original (un-normalized) input. Modern GPT-style models normalize before each sub-layer (pre-norm), enabling deep stacks to train without the gradient signal vanishing. This applies the training-stability principles from Course 4 at the architectural level.
Feed-forward network — a small per-token neural network (Course 4) that further transforms each token's contextualized representation.
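The normalize, transform, add-back pattern described in the list above can be sketched for a single token vector. This is an illustration under stated assumptions: layer_norm omits the learned scale and shift that real models include, and toy_ffn uses small hand-picked weights with a ReLU, whereas a trained model learns its weights and may use a different activation. All names and values here are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical weights: expand 3 -> 4 dimensions, then project back 4 -> 3.
W1 = [[0.5, -0.5, 0.1], [0.25, 0.75, -0.2], [-1.0, 0.5, 0.3], [0.1, 0.2, 0.4]]
W2 = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.5, 0.1, 0.2], [0.1, 0.1, -0.2, 0.6]]

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def toy_ffn(x):
    """Per-token feed-forward network: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def prenorm_residual(x, sublayer):
    """Pre-norm residual: normalize, apply the sub-layer, add onto the original input."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

token = [4.0, 1.0, 1.0]
normed = layer_norm(token)
out = prenorm_residual(token, toy_ffn)
print(normed)
print(out)
```

Note that the sub-layer sees only the normalized vector, while the addition uses the original, un-normalized token, so the input passes through unchanged apart from the added update.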

A transformer block expressed in code

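def block(x):
    # attention, ffn, and norm are assumed to be defined elsewhere;
    # norm(x) is applied before each sub-layer (pre-norm)
    x = x + attention(norm(x))   # self-attention + residual shortcut
    x = x + ffn(norm(x))         # feed-forward + residual shortcut
    return x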

# a real LLM is just this block, stacked — GPT-3 stacks it 96 times
for blk in blocks:
    x = blk(x)
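The schematic above leaves attention, norm, and ffn undefined. A runnable sketch in plain Python might look like the following, under several simplifying assumptions that are not taken from the course code: a single attention head with a causal mask, small random weights, a ReLU feed-forward network, layer normalization without learned scale and shift, and a toy width of 4 with a stack of 3 blocks. The helper names (make_block, rand_matrix, and so on) are hypothetical.

```python
import math
import random

D = 4  # model width (hypothetical toy size)
rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def norm(x, eps=1e-5):
    """Layer norm applied to each token (row) independently."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def make_block():
    """Build one block with its own randomly initialized weights."""
    wq, wk, wv, wo = (rand_matrix(D, D) for _ in range(4))
    w1, w2 = rand_matrix(D, 4 * D), rand_matrix(4 * D, D)

    def attention(x):
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
        weights = []
        for i, row in enumerate(scores):
            # causal mask: token i attends only to positions 0..i
            masked = [s / math.sqrt(D) if j <= i else -math.inf for j, s in enumerate(row)]
            weights.append(softmax(masked))
        return matmul(matmul(weights, v), wo)

    def ffn(x):
        hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
        return matmul(hidden, w2)

    def block(x):
        x = add(x, attention(norm(x)))   # self-attention + residual shortcut
        x = add(x, ffn(norm(x)))         # feed-forward + residual shortcut
        return x

    return block

blocks = [make_block() for _ in range(3)]
x_in = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(5)]  # 5 tokens

x = x_in
for blk in blocks:
    x = blk(x)
print(len(x), len(x[0]))  # prints: 5 4
```

The output has the same shape as the input, which is what allows identical blocks to be stacked to any depth.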

The two occurrences of x = x + ... are the residual connections. They are what permits a 96-layer stack to train successfully: each block adjusts the representation rather than replacing it, preserving the gradient signal through the depth of the network.
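The effect on the gradient can be illustrated with a deliberately simplified scalar model. This is a toy under stated assumptions, not a transformer: each "layer" is the hypothetical function f(x) = 0.5 * tanh(x), and the gradient of the output with respect to the input is accumulated with the chain rule across 96 layers, once without and once with the residual shortcut.

```python
import math

DEPTH = 96

def layer(x):
    """A toy scalar sub-layer (hypothetical): f(x) = 0.5 * tanh(x)."""
    return 0.5 * math.tanh(x)

def layer_grad(x):
    """Derivative of the toy sub-layer: f'(x) = 0.5 * (1 - tanh(x)^2)."""
    return 0.5 * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, residual):
    """Return d(output)/d(input) for DEPTH stacked layers via the chain rule."""
    grad = 1.0
    for _ in range(DEPTH):
        if residual:
            grad *= 1.0 + layer_grad(x)   # d/dx [x + f(x)] = 1 + f'(x)
            x = x + layer(x)
        else:
            grad *= layer_grad(x)         # d/dx [f(x)] = f'(x)
            x = layer(x)
    return grad

plain = input_gradient(1.0, residual=False)
with_residual = input_gradient(1.0, residual=True)
print(plain, with_residual)
```

Without the shortcut every factor is at most 0.5, so the product shrinks toward zero; with it, every factor is at least 1 because the identity path contributes the constant term, and the gradient survives the full depth.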

AI anchor: the block that is replicated throughout the model

Nearly every large language model is, structurally, this block replicated: GPT-3 stacks it 96 times, and some later models stack more. Tokens enter at the base of the stack as embeddings combined with positional encodings and propagate upward, each block enabling them to attend to context and refine their representations, until the final layer's vectors are projected into next-token probabilities. Apart from the embedding layer and that final projection, the entire model is constructed from the single block examined here, distinguished only by scale and repetition.
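The last step, projecting the final layer's vector into next-token probabilities, can be sketched as a matrix multiplication followed by a softmax. The four-word vocabulary, the weight matrix W_UNEMBED, and the input vector below are all hypothetical stand-ins; real models use vocabularies of tens of thousands of tokens and typically apply a final normalization first, which is omitted here.

```python
import math

# Hypothetical vocabulary and output projection (one row of weights per vocabulary entry).
VOCAB = ["dog", "man", "bit", "the"]
W_UNEMBED = [
    [0.9, 0.1, -0.3],
    [0.2, 0.8, 0.1],
    [-0.5, 0.4, 0.7],
    [0.3, -0.2, 0.2],
]

def next_token_probs(final_vector):
    """Project the last layer's vector to logits, then softmax into probabilities."""
    logits = [sum(w * h for w, h in zip(row, final_vector)) for row in W_UNEMBED]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for the top block's output at the last position of the sequence.
final_vector = [1.0, -0.5, 0.25]
probs = next_token_probs(final_vector)
best = VOCAB[probs.index(max(probs))]
print(dict(zip(VOCAB, probs)))
print(best)  # prints: dog
```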

Why this matters next

The architecture is now complete. However, a transformer with randomly initialized weights encodes no information. Module 6 trains a small language model interactively, allowing you to observe the loss decrease and the generated output progress from random characters to approximately well-formed text as training proceeds.

Summary: a transformer block adds positional encoding to make token order observable, then applies self-attention and a feed-forward network, each wrapped in a residual connection — and a complete LLM consists principally of this single block stacked many times.
