
Module 7 — Training Real Networks

What makes it deep learning · hands-on · about 30 minutes.

Module 6 established that a deep network has the capacity to represent essentially any decision boundary. This expressive capacity has a corresponding cost: a network sufficiently flexible to separate the spirals is also sufficiently flexible to memorize the individual training points, including their noise component, and consequently to fail on data it has not encountered. This phenomenon is overfitting — the principal practical challenge in training real networks. This module presents the diagnostic procedure for detecting overfitting and three standard techniques for mitigating it.

Diagnostic: comparing training and validation loss

The standard diagnostic is to reserve a subset of the data — a validation set — that is not used during training, and to monitor its loss alongside the training loss:

Training loss keeps decreasing, apart from small step-to-step fluctuations: the network is fitting the data to which it has access.
Validation loss initially decreases, then increases: the network is now fitting noise specific to the training set that does not generalize.
The gap between the two curves is the empirical measure of overfitting. A small gap at a low validation loss indicates a model that generalizes well; a large and growing gap indicates memorization.
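The diagnostic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the activity: `overfitting_report` is a hypothetical helper name, and the loss values are made-up numbers chosen to show the typical shape of the two curves.

```python
def overfitting_report(train_loss, val_loss):
    """Summarize per-epoch training and validation losses."""
    if len(train_loss) != len(val_loss):
        raise ValueError("loss histories must have the same length")
    # Gap at each epoch: how much worse the model does on held-out data.
    gaps = [v - t for t, v in zip(train_loss, val_loss)]
    # Epoch (0-indexed) at which validation loss was lowest.
    best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return {"gaps": gaps, "best_epoch": best_epoch, "final_gap": gaps[-1]}


# Illustrative numbers only: training loss keeps falling, while
# validation loss bottoms out at epoch 2 and then rises.
train = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val = [0.95, 0.70, 0.55, 0.60, 0.72, 0.85]

report = overfitting_report(train, val)
print(report["best_epoch"])           # 2
print(round(report["final_gap"], 2))  # 0.77
```

A gap that grows after the best epoch, as in these numbers, is the signature of overfitting described above.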

Three standard techniques constrain this gap:

Weight decay (L2 regularization): adds a penalty proportional to the squared magnitude of the weights, biasing the network toward simpler, smoother functions.
Dropout: randomly deactivates a subset of neurons at each training step, so the network cannot rely on any single neuron to encode a memorized training example. All neurons are active at inference time.
Early stopping: halts training once validation loss has stopped improving and keeps the weights from the iteration with the lowest validation loss, before subsequent overfitting can take hold.
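The mechanics of the three techniques can be sketched in NumPy. This is a simplified illustration under stated assumptions, not how any particular framework implements them: the function names are hypothetical, the dropout shown is the common "inverted" variant that rescales surviving units, and the early-stopping rule is a patience-based one.

```python
import numpy as np


def l2_penalty_and_grad(weights, lam):
    """Weight decay: penalty lam * sum(w^2) and its gradient 2 * lam * w."""
    return lam * np.sum(weights ** 2), 2.0 * lam * weights


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units while
    training and rescale the survivors; do nothing at inference time."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate  # True with prob. 1 - rate
    return activations * keep / (1.0 - rate)


def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch): stop once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


rng = np.random.default_rng(0)

w = np.array([1.0, -2.0, 3.0])
penalty, grad = l2_penalty_and_grad(w, lam=1e-3)  # grad is added to the loss gradient

hidden = np.ones(1000)
dropped = dropout(hidden, rate=0.3, rng=rng)      # roughly 30% of entries become 0

# Illustrative validation losses: the best value is at epoch 2.
print(early_stopping_epoch([0.95, 0.70, 0.55, 0.60, 0.72, 0.85], patience=2))  # (4, 2)
```

Rescaling the surviving activations by 1 / (1 - rate) keeps their expected sum unchanged, which is why nothing needs to be adjusted when dropout is switched off at inference time.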

The interactive activity for this module trains a deliberately over-parameterized network on a small, noisy dataset, so the two loss curves visibly diverge. Enabling weight decay and dropout and re-training reduces the gap between the curves and lowers the validation loss, the quantity relevant to generalization.
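A similar experiment can be reproduced offline on a much smaller scale. The sketch below deliberately swaps the neural network for polynomial regression with an L2 penalty (ridge regression), because that can be solved exactly in a few lines of NumPy; the synthetic dataset, the polynomial degree, and the penalty strength are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = sin(2*pi*x) + noise.
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=30)

# Reserve the last 10 points as the validation set.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

DEGREE = 9  # deliberately more flexible than the data warrants


def features(x):
    """Polynomial features 1, x, ..., x^DEGREE."""
    return np.vander(x, DEGREE + 1, increasing=True)


def fit(x, y, lam):
    """Minimize ||A w - y||^2 + lam * ||w||^2 via augmented least squares."""
    A = features(x)
    n_weights = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n_weights)])
    y_aug = np.concatenate([y, np.zeros(n_weights)])
    w, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    return w


def mse(w, x, y):
    return float(np.mean((features(x) @ w - y) ** 2))


results = {}
for lam in (0.0, 1e-3):  # without and with the L2 penalty
    w = fit(x_train, y_train, lam)
    results[lam] = {
        "train": mse(w, x_train, y_train),
        "val": mse(w, x_val, y_val),
        "weight_norm": float(np.linalg.norm(w)),
    }
    print(f"lam={lam:g}  train={results[lam]['train']:.4f}  "
          f"val={results[lam]['val']:.4f}  |w|={results[lam]['weight_norm']:.1f}")
```

The penalty always raises the training error slightly and shrinks the weights. Whether it also lowers the validation error depends on the data and on the penalty strength, which is exactly why the validation curve has to be monitored rather than assumed.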


Regularization in Keras

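from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

# X, y: training features and binary labels, assumed to be defined already
model = Sequential([Input(shape=(X.shape[1],))])
model.add(Dense(32, activation='relu',
                kernel_regularizer=l2(1e-3)))   # weight decay
model.add(Dropout(0.3))                          # dropout: drop 30% of neurons
model.add(Dense(1, activation='sigmoid'))        # output layer (binary task assumed)
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(X, y, epochs=500, validation_split=0.3,   # hold out 30% to watch
          callbacks=[EarlyStopping(patience=10,     # stop when val stops improving
                                   restore_best_weights=True)])  # keep the best epoch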

Each technique is configured with a single line of code. The validation_split argument produces the second loss curve displayed in the activity; the regularizer, the dropout layer, and the early-stopping callback constrain the gap between the curves.
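One practical detail is worth a sketch: Keras takes the validation_split samples from the end of the arrays passed to fit, before any shuffling, so a dataset stored in class order would yield a validation set containing only the last class. A minimal way to guard against this is to shuffle features and labels together first. The helper name `shuffle_together` and the tiny dataset below are hypothetical.

```python
import numpy as np


def shuffle_together(X, y, seed=0):
    """Apply one random permutation to X and y so rows stay paired."""
    X, y = np.asarray(X), np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    order = np.random.default_rng(seed).permutation(len(X))
    return X[order], y[order]


# Hypothetical ordered dataset: all class-0 rows first, then all class-1 rows.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

X_shuf, y_shuf = shuffle_together(X, y, seed=42)
# X_shuf and y_shuf can now be passed to model.fit(..., validation_split=0.3).
```

Fixing the seed keeps the split reproducible between runs, which makes the two loss curves comparable when a regularization setting is changed.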

AI anchor: why large-scale models generalize despite their capacity

Models with billions of parameters have more than enough capacity to memorize much of their training data. That they generalize anyway is commonly attributed, in large part, to the principles introduced here applied at scale: regularization such as weight decay and dropout, large and diverse training datasets, and principled stopping criteria. Teams training large-scale models routinely monitor validation loss and watch for the same overfitting gap demonstrated in this activity.

Check your understanding

Answer a short set of questions on overfitting and regularization.


Why this matters next

Every component has now been introduced: the neuron, activation functions, the network architecture, the training loop, backpropagation, depth, and the regularization techniques that enable generalization. Module 8 is the capstone: you will assemble these components to train a network end-to-end, from a random initialization to a working classifier, making the architecture and training-procedure decisions yourself.

Summary: a network with sufficient capacity to represent arbitrary decision boundaries can memorize its training data rather than learn the underlying pattern (overfitting), diagnosed empirically by an increasing gap between decreasing training loss and increasing validation loss. Standard mitigations are weight decay, dropout, and early stopping.
