Module 4 — Gradient Descent in Practice
Module 3 ran a network whose weights were already good. Now the real question: how does a network find good weights from a pile of random ones? The answer is the engine under all of deep learning — gradient descent — and you met its core idea in Course 2. Here you will run it live, watch a loss curve fall, and discover that one dial, the learning rate, makes the difference between a network that learns and one that blows up.
The training loop
Training repeats four steps, over and over, for many passes through the data (each full pass is an epoch):
- Forward pass — run the inputs through the current weights to get predictions.
- Loss — measure how wrong those predictions are with a single number. For yes/no problems we use log-loss (cross-entropy); lower is better.
- Gradient — compute, for every weight, which way to nudge it to lower the loss (that is backprop, Module 5).
- Update — step every weight a little in that downhill direction. How big a step is the learning rate.
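The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the network used in this module: the model is a single sigmoid neuron, the toy data set is invented, and the names `sigmoid`, `log_loss`, `eta`, and `losses` are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-12):
    # Clip so log() never sees exactly 0 or 1.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

rng = np.random.default_rng(0)

# Toy yes/no data (an assumption for illustration): label is 1 when x0 + x1 > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = rng.normal(size=2)  # start from random weights
b = 0.0
eta = 0.1               # learning rate

losses = []
for epoch in range(200):                # each pass over the data is one epoch
    p = sigmoid(X @ w + b)              # 1. forward pass: predictions
    losses.append(log_loss(y, p))       # 2. loss: one number, lower is better
    grad_w = X.T @ (p - y) / len(y)     # 3. gradient of the loss w.r.t. each weight
    grad_b = float(np.mean(p - y))
    w -= eta * grad_w                   # 4. update: step downhill, scaled by eta
    b -= eta * grad_b

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

For a single sigmoid neuron the gradient can be written by hand, as in step 3. A multi-layer network needs backprop to get the same quantity for every weight.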
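The update step, written for a single weight \( w \) and loss \( L \):
\[ w \leftarrow w - \eta \, \frac{\partial L}{\partial w} \]
Here \( \eta \) (eta) is the learning rate. Too small and training crawls; too big and it overshoots the valley and the loss explodes. Hit Train and watch a real network learn, then drag the learning rate and break it on purpose.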
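To see the effect of the learning rate away from the canvas, here is a small sketch on a one-weight "valley", the toy loss \( w^2 \). The loss, the helper name `descend`, and the three learning rates are assumptions picked for illustration; they are not the settings of the network in the activity.

```python
def descend(eta, steps=20, w0=1.0):
    """Run gradient descent on loss(w) = w**2; return the loss after each step."""
    w = w0
    history = []
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - eta * grad    # step against the gradient, scaled by eta
        history.append(w * w)
    return history

for eta in (0.01, 0.3, 1.1):
    print(f"eta={eta}: loss after 20 steps = {descend(eta)[-1]:.3g}")
```

On this loss each step multiplies \( w \) by \( 1 - 2\eta \). With \( \eta = 0.01 \) that factor is 0.98, so the loss shrinks slowly. With \( \eta = 0.3 \) it is 0.4, so the loss collapses toward zero. With \( \eta = 1.1 \) it is \(-1.2\), so every step lands farther up the opposite wall of the valley and the loss grows without bound.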
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.1),  # η, the dial you're turning
              loss='binary_crossentropy')        # the loss it minimizes
model.fit(X, y, epochs=200)                      # run the loop for 200 passes
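.fit() is the loop above: forward, loss, gradient, update. Within each epoch Keras repeats those four steps once per mini-batch (32 examples by default), so one epoch usually means several weight updates. Everything you're watching on the canvas is what happens inside that one call.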
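For intuition only, here is a rough NumPy sketch of the kind of loop a call like .fit() runs: epochs on the outside, mini-batches on the inside, one update per batch. The function name `fit_sketch`, the single-neuron model, and the toy data are hypothetical. This is not how Keras is implemented, and unlike Keras it records the loss on the full data set at the end of each epoch.

```python
import numpy as np

def fit_sketch(X, y, epochs=20, eta=0.1, batch_size=32, seed=0):
    """Mini-batch gradient descent for one sigmoid neuron (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])  # random starting weights
    b = 0.0
    history = []
    for _ in range(epochs):
        order = rng.permutation(len(y))              # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))  # forward pass on the batch
            grad_w = xb.T @ (p - yb) / len(yb)       # gradient of the batch log-loss
            grad_b = float(np.mean(p - yb))
            w -= eta * grad_w                        # one update per batch
            b -= eta * grad_b
        # Simplification: score the whole data set once the epoch is done.
        p_all = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-12, 1.0 - 1e-12)
        history.append(float(-np.mean(y * np.log(p_all)
                                      + (1.0 - y) * np.log(1.0 - p_all))))
    return w, b, history

# Toy data, invented for this example: label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

w, b, history = fit_sketch(X_toy, y_toy)
print(f"epoch 1 loss {history[0]:.3f}, epoch {len(history)} loss {history[-1]:.3f}")
```

With 200 examples and a batch size of 32, each epoch here performs 7 updates, so 20 epochs means 140 weight updates in total.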
Check your understanding
A few questions about training. You will get a score.