Module 5 — Backpropagation Intuition

Putting neurons together · hands-on · about 30 minutes.

Module 4 treated the gradient-computation step as a black box: "compute, for each weight, the partial derivative of the loss." This computation is backpropagation. Despite the name, it is a direct application of the chain rule of calculus, building on the derivatives and gradients from Course 2, applied recursively from the output back toward the input. This module derives the computation explicitly so that the gradient flow can be observed weight by weight.

The principle: attributing error to each weight

After a forward pass yields a prediction and a loss, backpropagation answers: what is the contribution of each weight to the loss? The computation begins at the output, where the error is directly observable (prediction minus target), and proceeds backward through the network. A weight near the output contributes to the loss directly; a weight in an earlier layer contributes through every downstream computation in which its output participates — and the chain rule precisely formalizes the multiplication of these intermediate influences.
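As a worked illustration, consider a hypothetical two-weight network. The symbols here are assumptions chosen for this sketch, not notation taken from the activity: input \( x \), hidden pre-activation \( z = w_1 x \), hidden activation \( h = \sigma(z) \), prediction \( \hat{y} = w_2 h \), and \( \text{loss} = \tfrac{1}{2}(\hat{y} - y)^2 \). Under those assumptions the chain rule gives:

\[
\frac{\partial \text{loss}}{\partial w_2}
  = \frac{\partial \text{loss}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}
  = (\hat{y} - y)\, h
\]

\[
\frac{\partial \text{loss}}{\partial w_1}
  = \frac{\partial \text{loss}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w_1}
  = (\hat{y} - y)\, w_2\, \sigma'(z)\, x
\]

The weight next to the output, \( w_2 \), picks up only one factor beyond the output error. The earlier weight, \( w_1 \), picks up one factor for every computation that sits between it and the loss.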

Output error — begin with \( \hat{y} - y \), the deviation of the prediction from the target.
Backward step — propagate this error to the preceding layer, scaled by the weights through which it flowed and by the derivative of each neuron's activation function (the chain rule).
Gradient — for each weight, the gradient equals (the error signal arriving at its neuron) × (the input the weight carried). This quantity is \( \partial\text{loss}/\partial w \).
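The three steps above can be sketched by hand in plain Python. This is a minimal illustration under stated assumptions, not the code behind the activity: the network (one input, two sigmoid hidden neurons, one linear output), the loss \( \tfrac{1}{2}(\hat{y} - y)^2 \), the numeric values, and every function name are hypothetical choices made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass: two sigmoid hidden neurons feeding one linear output."""
    z = [w * x for w in w_hidden]              # hidden pre-activations
    h = [sigmoid(zi) for zi in z]              # hidden activations
    y_hat = sum(wo * hi for wo, hi in zip(w_out, h))
    return h, y_hat

def half_squared_error(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, w_hidden, w_out):
    """Manual backpropagation for the assumed network."""
    h, y_hat = forward(x, w_hidden, w_out)

    # Output error: derivative of 0.5 * (y_hat - y)^2 with respect to y_hat.
    delta_out = y_hat - y

    # Gradient for the output weights: error at the neuron * input the weight carried.
    grad_w_out = [delta_out * hi for hi in h]

    # Backward step: scale the error by the weight it flowed through
    # and by the sigmoid derivative, which is h * (1 - h).
    delta_hidden = [delta_out * wo * hi * (1.0 - hi) for wo, hi in zip(w_out, h)]

    # Gradient for the hidden weights: same rule, one layer earlier.
    grad_w_hidden = [d * x for d in delta_hidden]
    return grad_w_hidden, grad_w_out

# Arbitrary illustrative values.
x, y = 1.5, 0.5
w_hidden = [0.4, -0.6]
w_out = [0.7, 0.3]

grad_w_hidden, grad_w_out = backward(x, y, w_hidden, w_out)
print("d loss / d w_hidden:", grad_w_hidden)
print("d loss / d w_out:   ", grad_w_out)
```

Note that `delta_hidden` reuses `delta_out` rather than recomputing it. That reuse of error signals from later layers is what makes the backward sweep cheap.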

Now use the full network trainer below to see many steps at once:

Click Train one step below. The network executes a forward pass, computes the loss, and runs backpropagation; the trainer then draws each connection's contribution to the gradient, with thicker lines for larger gradient magnitudes. Because every update is the learning rate times the gradient, weights with the largest gradients undergo the largest updates. With a well-chosen learning rate the loss trends downward across iterations, though, as you saw in Module 4, too large a step can overshoot and send it back up.
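The effect of the learning rate can be reproduced without the trainer. The sketch below is a deliberately tiny stand-in: a single weight `w`, the loss \( (w x - y)^2 \), and the two learning rates are all assumptions picked so the behaviour is easy to see. They are not the trainer's settings.

```python
def train(learning_rate, steps=5, w=0.0, x=1.0, y=2.0):
    """Gradient descent on loss = (w * x - y)^2 for a single weight w."""
    losses = []
    for _ in range(steps):
        error = w * x - y
        losses.append(error ** 2)
        grad = 2.0 * error * x            # d loss / d w
        w -= learning_rate * grad         # update: w <- w - eta * grad
    losses.append((w * x - y) ** 2)       # loss after the final update
    return losses

well_chosen = train(learning_rate=0.1)
too_large = train(learning_rate=1.1)

print("lr = 0.1:", [round(value, 3) for value in well_chosen])
print("lr = 1.1:", [round(value, 3) for value in too_large])
```

With \( x = 1 \), each update multiplies the error by \( 1 - 2\eta \). For \( \eta = 0.1 \) that factor is 0.8, so the loss shrinks at every step; for \( \eta = 1.1 \) it is \( -1.2 \), so each step overshoots the minimum and the loss grows.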

Backpropagation in PyTorch

pred = model(X)                 # forward pass
loss = loss_fn(pred, y)         # one number
loss.backward()                 # BACKPROP — fills every weight's .grad via the chain rule
optimizer.step()                # update: w ← w − η · w.grad
optimizer.zero_grad()           # clear grads for the next step

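The single loss.backward() call corresponds to the backward propagation visualized above: it computes \( \partial\text{loss}/\partial w \) for every trainable weight by reverse-mode automatic differentiation and stores the result in that weight's .grad attribute. Because PyTorch adds new gradients into .grad rather than overwriting it, optimizer.zero_grad() clears the stored values before the next step. Manual application of the chain rule is unnecessary; the framework implements it.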
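To make the idea behind loss.backward() concrete, here is a toy scalar autograd sketch in plain Python. It is not PyTorch's implementation and does not use PyTorch: the class `Value`, its attributes, and the tiny tanh network at the end are hypothetical names and choices invented for illustration. Each operation records its inputs and local derivatives during the forward pass, and `backward()` then walks that record in reverse, multiplying and accumulating as the chain rule prescribes.

```python
import math

class Value:
    """A scalar that remembers how it was computed (hypothetical toy class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # Values this one was computed from
        self._local_grads = local_grads    # d(self) / d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Post-order walk: every node is listed after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                                # d loss / d loss
        for node in reversed(order):                   # loss first, inputs last
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local       # chain rule, accumulated

# Forward pass of an assumed one-hidden-neuron network with squared-error loss.
x, y = Value(1.5), Value(0.5)
w1, w2 = Value(0.4), Value(0.7)
diff = w2 * (w1 * x).tanh() - y
loss = diff * diff

loss.backward()                                        # fills .grad on every Value
print("d loss / d w1:", w1.grad)
print("d loss / d w2:", w2.grad)
```

The `+=` in `backward()` mirrors the accumulation behaviour that makes clearing gradients between steps necessary in a real framework.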

AI anchor: the algorithm that enabled deep learning

Backpropagation, popularized by Rumelhart, Hinton, and Williams in 1986, is the algorithm that makes training deep networks computationally tractable: it computes all the gradients in a single efficient backward pass, rather than perturbing each weight separately and re-running the forward pass as finite differences would require. Modern frameworks such as PyTorch, TensorFlow, and JAX are built around an automatic differentiation ("autograd") engine that implements this algorithm at the scale of billions of parameters. The connection-by-connection gradient flow shown above is the same computation, at far larger scale, that trains large production models.
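The cost difference can be made tangible by counting forward passes. The sketch below assumes a hypothetical linear model with 1,000 weights and a half-squared-error loss; the helper names and the data are invented for the example. Finite differences need one extra loss evaluation per weight, whereas the analytic gradient comes from one forward pass followed by one backward sweep.

```python
def make_counted_loss(xs, y):
    """Return a loss over a weight list, plus a counter of how often it ran."""
    calls = {"forward": 0}

    def loss(ws):
        calls["forward"] += 1
        y_hat = sum(w * x for w, x in zip(ws, xs))     # linear model
        return 0.5 * (y_hat - y) ** 2

    return loss, calls

def finite_difference_grads(loss, ws, eps=1e-6):
    """Estimate each partial derivative by nudging one weight at a time."""
    base = loss(ws)
    grads = []
    for i in range(len(ws)):
        nudged = list(ws)
        nudged[i] += eps
        grads.append((loss(nudged) - base) / eps)      # one forward pass per weight
    return grads

def backprop_grads(ws, xs, y):
    """Analytic gradient: one forward pass, then one backward sweep."""
    y_hat = sum(w * x for w, x in zip(ws, xs))
    error = y_hat - y                                  # output error
    return [error * x for x in xs]                     # error * input, per weight

n = 1000
xs = [((i % 7) - 3) / 10.0 for i in range(n)]          # arbitrary inputs
ws = [0.01 * ((i % 5) - 2) for i in range(n)]          # arbitrary weights
y = 1.0

loss, calls = make_counted_loss(xs, y)
fd = finite_difference_grads(loss, ws)
bp = backprop_grads(ws, xs, y)

print("forward passes used by finite differences:", calls["forward"])
print("largest disagreement:", max(abs(a - b) for a, b in zip(fd, bp)))
```

Here finite differences run the model `n + 1` times to produce what the backward sweep yields after a single forward pass, and that gap widens in proportion to the number of weights.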

Why this matters next

The complete training algorithm has now been derived: forward pass, loss, backpropagation, weight update. Module 6 examines what additional capability is gained by increasing depth. You will train networks of increasing depth on a spiral dataset that is not linearly separable and observe the resulting decision boundary, which a purely linear model cannot represent.

Summary: backpropagation is the chain rule of calculus applied recursively in the reverse direction through the network. It begins with the output error and computes the partial derivative of the loss with respect to every weight, providing the gradients used by gradient descent. Weights whose perturbations have the largest effect on the loss receive the largest gradients and the largest updates.
