Module 4 — Gradient Descent in Practice

Putting neurons together · hands-on · about 30 minutes.

Module 3 operated on a network whose weights were fixed in advance. The central question now is how a network learns appropriate weights starting from a random initialization. The procedure is gradient descent, the optimization algorithm underlying all of deep learning, whose core principles were introduced in Course 2. In this module you will run gradient descent interactively, observe the loss curve descend, and see how a single hyperparameter, the learning rate, determines whether training converges or diverges.

The training loop

Training iterates four operations across many passes through the data (each complete pass is termed an epoch):

Forward pass — propagate the inputs through the current weights to produce predictions.
Loss — quantify the prediction error as a single scalar. For binary classification the standard choice is the log-loss (cross-entropy); lower values indicate better fit.
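Gradient — compute, for every weight, the partial derivative of the loss with respect to that weight. The derivative indicates how the loss changes as the weight increases, so its negative points in the direction of decreasing loss (computing it is the topic of Module 5: backpropagation).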
Update — adjust every weight by a small step in the direction of decreasing loss. The step magnitude is governed by the learning rate.
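
The four steps above can be sketched in plain NumPy for the simplest possible network, a single sigmoid neuron. This is a minimal illustration rather than the code behind the activity: the function name train_neuron, the OR-gate toy data, the learning rate of 0.5, and the use of the whole dataset in every update are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, learning_rate=0.5, epochs=200, seed=0):
    """Full-batch gradient descent on one sigmoid neuron (hypothetical helper).

    Returns the learned weights, the bias, and the loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # small random initialization
    b = 0.0
    history = []
    for _ in range(epochs):
        # 1. Forward pass: predictions from the current weights
        p = sigmoid(X @ w + b)
        # 2. Loss: mean log-loss, clipped so that log(0) never occurs
        p_safe = np.clip(p, 1e-12, 1 - 1e-12)
        loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
        history.append(loss)
        # 3. Gradient of the mean log-loss with respect to w and b
        error = p - y
        grad_w = X.T @ error / len(y)
        grad_b = error.mean()
        # 4. Update: step against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Toy data (assumed for illustration): the logical OR of two inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

w, b, history = train_neuron(X, y)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

The gradient lines rely on a standard result for a sigmoid output paired with log-loss: the derivative of the loss with respect to the neuron's pre-activation is the prediction minus the label, averaged over the samples.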

\[ w \;\leftarrow\; w \;-\; \eta \,\frac{\partial \,\text{loss}}{\partial w} \]

Here \( \eta \) (eta) denotes the learning rate. If \( \eta \) is too small, training proceeds slowly; if it is too large, the updates overshoot the loss minimum and the loss diverges. Click Train to execute training on a real network — then vary the learning rate to induce instability deliberately.
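
The effect of the learning rate can also be reproduced without a network. The sketch below applies the update rule to a one-parameter toy loss, w squared, chosen here only because its gradient (2w) is easy to verify by hand; the function name descend and the three learning rates are assumptions for the example, not values taken from the activity.

```python
def descend(learning_rate, w=1.0, steps=20):
    """Apply the gradient descent update to the toy loss w**2.

    Returns the loss after each of the `steps` updates.
    """
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2 with respect to w
        w = w - learning_rate * grad   # the update rule from the text
        losses.append(w ** 2)
    return losses

for eta in (0.01, 0.3, 1.1):
    losses = descend(eta)
    print(f"eta={eta}: loss after {len(losses)} steps = {losses[-1]:.3g}")
```

For this toy loss each update multiplies w by (1 - 2η). With η = 0.01 the loss shrinks slowly, with η = 0.3 it collapses toward zero within a few steps, and with η = 1.1 the sign of w flips on every step while its magnitude grows, so the loss increases without bound. The exact threshold between convergence and divergence is specific to this loss; a real network has its own, which is usually found by experiment.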

The equivalent training loop in Keras

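from tensorflow.keras.optimizers import SGD

# Assumes `model` is an already-built Keras model with a sigmoid output,
# and that X, y hold the training inputs and their 0/1 labels.
model.compile(optimizer=SGD(learning_rate=0.1),  # η: the learning rate hyperparameter
              loss='binary_crossentropy')        # log-loss, the quantity being minimized
model.fit(X, y, epochs=200)                      # run the training loop for 200 epochs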

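The .fit() method implements the training loop described above: forward pass, loss, gradient, update. By default Keras runs these four steps once per mini-batch of 32 samples rather than once per epoch, so a single epoch can contain several weight updates. The activity above visualizes the internal behavior of this single method call.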
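
One detail worth making concrete is how epochs relate to weight updates. With its default arguments, fit shuffles the data each epoch and processes it in mini-batches of 32 samples, performing one update per batch. The NumPy sketch below imitates only that batching schedule, not Keras internals; the generator name iterate_minibatches and the dataset size of 100 are assumptions for the example.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size=32, epochs=200, seed=0):
    """Yield (epoch, indices) pairs: shuffled mini-batches, one full pass per epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle at the start of every epoch
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]

# Each yielded batch would receive one forward pass, loss, gradient, and update.
n_samples = 100
updates = sum(1 for _ in iterate_minibatches(n_samples))
print(updates)  # 4 batches per epoch (32 + 32 + 32 + 4 samples) x 200 epochs = 800
```

When the dataset is smaller than the batch size, every epoch consists of a single batch, and the number of updates equals the number of epochs.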

AI anchor: a single optimization algorithm trains every modern network

Gradient descent, together with widely used variants such as Adam, trains essentially every neural network in production: image models, recommendation systems, and language models with hundreds of billions of parameters. The scale is far larger, but the loop is identical to the one demonstrated here: forward pass, loss, gradient, update. The learning rate (and its schedule) is one of the most consequential hyperparameters in deep-learning practice, as the activity above illustrates directly.

Check your understanding

Answer a short set of questions on training.

Why this matters next

One step in the training loop was treated as a black box: the computation of the gradient with respect to each weight. Module 5 derives this computation. Backpropagation is an application of the chain rule of calculus, propagated backward through the network; the derivation is presented one gradient at a time.

Summary: a network learns by iterating a training loop — forward pass, loss, gradient, weight update — across many epochs, with each iteration adjusting every weight in the direction of decreasing loss. The learning rate determines the step magnitude, and its selection determines whether training converges smoothly or diverges.
