Module 6 — What Depth Buys You

What makes it deep learning · hands-on · about 30 minutes.

By this point, the complete training algorithm has been derived:

Forward pass — the network makes a prediction
Loss — the prediction is compared to the target
Backpropagation — error is traced back through every weight
Weight update — each weight is nudged in the right direction
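
As a reminder, the four steps can be sketched for a single sigmoid neuron in plain Python. This is a minimal illustration rather than the derivation from the earlier modules: the binary cross-entropy loss, the learning rate, and the names `train_step`, `w`, and `b` are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, lr=0.5):
    """One pass through the four steps for a single neuron (illustrative only)."""
    # 1. Forward pass: the neuron makes a prediction.
    p = sigmoid(w * x + b)
    # 2. Loss: binary cross-entropy (an assumed choice of loss).
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # 3. Backpropagation: for sigmoid + cross-entropy, dLoss/dz = p - y.
    dz = p - y
    dw, db = dz * x, dz
    # 4. Weight update: step against the gradient.
    return w - lr * dw, b - lr * db, loss

w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=1.0, y=1.0)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss shrinks across iterations because each update moves the weight and bias against the gradient. A deeper network repeats step 3 once per layer, but the loop itself keeps the same shape.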

A natural question follows: why use more than one hidden layer?

A shallow network of modest width can draw only relatively simple decision boundaries. The word "deep" in "deep learning" means many layers composed together, and each additional layer lets the boundary grow more complex.

In this module you will train a network on a classically difficult dataset: two interleaved spirals. A shallow model of modest width fails to separate them. A sufficiently deep model produces a boundary that wraps around both arms.

Two interleaved spiral arms. Arm A (blue) and Arm B (red) weave around each other — no single straight or gently curved line can separate them. A deep network learns a boundary that wraps around both arms.
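
One way to generate such a dataset is sketched below. The lesson does not specify how its spirals are built, so the function name `make_spirals` and the parameters `turns` and `noise` are assumptions for illustration; arm B is simply arm A rotated by half a turn.

```python
import math
import random

def make_spirals(n_per_arm=100, turns=1.5, noise=0.0, seed=0):
    """Return (points, labels) for two interleaved spiral arms (hypothetical helper)."""
    rng = random.Random(seed)
    points, labels = [], []
    # Arm A has label 0; arm B has label 1 and is arm A rotated by pi radians.
    for label, phase in ((0, 0.0), (1, math.pi)):
        for i in range(n_per_arm):
            t = i / (n_per_arm - 1)              # position along the arm, 0 to 1
            radius = t                           # radius grows with the angle
            angle = 2 * math.pi * turns * t + phase
            x = radius * math.cos(angle) + rng.gauss(0, noise)
            y = radius * math.sin(angle) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

points, labels = make_spirals(n_per_arm=100, turns=1.5, noise=0.0)
print(len(points), labels.count(0), labels.count(1))
```

Because the radius grows with the angle, any ray from the origin crosses both arms in alternation, which is why no single straight line can put all of arm A on one side.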

The insufficiency of a single hidden layer

A single neuron defines a linear decision boundary (Module 1). A single hidden layer permits a non-linear boundary composed of a small number of folds (Module 2). However, certain patterns — spirals, checkerboards, or any data distribution in which a single class is partitioned into multiple disconnected regions — require many folds in the decision boundary. Composing layers provides this capacity efficiently: each layer non-linearly transforms the representation passed to the next layer, so folds compose into curves and curves compose into more complex shapes.
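
The way folds compose can be seen in a one-dimensional toy case. The sketch below is an illustration under assumed names (`tent`, `compose`, `count_linear_pieces`), not the network trained in this module: each "layer" is a tent-shaped fold built from two ReLU units, and stacking the same fold repeatedly doubles the number of straight segments in the output.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    # One fold built from two ReLU units: rises on [0, 0.5], falls on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def compose(x, depth):
    # Apply the same fold `depth` times, like stacking `depth` narrow layers.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth, grid=1024):
    # Grid points are dyadic fractions, so for depth <= 10 every breakpoint
    # lands on the grid and the arithmetic is exact in floating point.
    ys = [compose(i / grid, depth) for i in range(grid + 1)]
    rising = [b > a for a, b in zip(ys, ys[1:])]
    turns = sum(1 for s, t in zip(rising, rising[1:]) if s != t)
    return turns + 1

for depth in (1, 2, 3, 4):
    print(depth, count_linear_pieces(depth))
# prints:
# 1 2
# 2 4
# 3 8
# 4 16
```

Four stacked folds use 8 ReLU units and produce 16 segments. A single hidden layer of n ReLU units on a one-dimensional input can produce at most n + 1 segments, so it would need at least 15 units to match.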

Width — the number of neurons in each layer. Greater width increases the number of folds expressible in a single layer.
Depth — the number of layers. Greater depth permits compositions of folds, so the decision boundary can curve recursively.
The trade-off — depth is typically more parameter-efficient than width: a deep-but-narrow network can represent certain functions that a shallow-but-wide network requires substantially more parameters to approximate.
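
To make the width and depth comparison concrete, it helps to count parameters. The helper below, `dense_param_count`, is a hypothetical name introduced for this sketch; it assumes fully connected layers with one bias per neuron. The layer sizes use the two inputs and one output of the spiral problem.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network (hypothetical helper).

    layer_sizes lists the input size, each hidden width, and the output size.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deep_narrow = dense_param_count([2, 16, 16, 16, 1])   # three hidden layers of 16
shallow_wide = dense_param_count([2, 152, 1])         # one hidden layer of 152

print(deep_narrow, shallow_wide)  # prints: 609 609
```

The two architectures have the same parameter budget. The count alone does not say which one fits the spirals better; it only fixes the budget so that a comparison between them is fair.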

Select a depth and width below, then click Train. Begin with a shallow architecture (1 hidden layer) — the resulting boundary is insufficiently complex to separate the spirals. Increase depth and re-train: the data and the training loop are unchanged, but the resulting boundary now wraps around both spiral arms.

Adding depth in Keras

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Input(shape=(2,)),               # two input features per point
    Dense(16, activation='relu'),    # hidden layer 1
    Dense(16, activation='relu'),    # hidden layer 2  ← depth
    Dense(16, activation='relu'),    # hidden layer 3  ← more depth
    Dense(1, activation='sigmoid'),  # output
])

Each additional Dense line constitutes one additional layer and one additional opportunity to apply a non-linear transformation to the representation. This is the entire mechanism of "increasing depth" in code; the training loop introduced in Module 4 is unchanged.
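
Since depth is just a repeated line, it can be turned into a parameter. The following is a sketch, assuming a helper named `build_mlp` (not part of Keras or of this course) that mirrors the depth and width controls of the activity above.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

def build_mlp(depth, width):
    """Build a binary classifier with `depth` hidden layers (hypothetical helper)."""
    layers = [Input(shape=(2,))]                        # two input features per point
    for _ in range(depth):
        layers.append(Dense(width, activation='relu'))  # one more hidden layer
    layers.append(Dense(1, activation='sigmoid'))       # single probability output
    return Sequential(layers)

shallow = build_mlp(depth=1, width=16)
deep = build_mlp(depth=3, width=16)
print(shallow.count_params(), deep.count_params())
```

Both models can be compiled and trained with the same loop; only the `depth` argument differs.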

AI anchor: depth is the central architectural choice at scale

Production networks for image recognition and language modeling comprise not three layers but dozens to hundreds, typically organized into repeating block structures. A large language model is a deep stack of structurally identical layers, termed transformer blocks. The phenomenon demonstrated in this module, depth yielding decision boundaries that a shallow network could match only with far greater width, is the same principle by which large-scale models capture structure that shallow architectures cannot capture efficiently. Depth is not an incidental detail of the architecture; it is the defining design choice of the field.

Why this matters next

Depth provides substantial expressive capacity, potentially more than is desirable. A network with sufficient depth to separate the spirals also has sufficient capacity to memorize individual training points and fail to generalize to new data. Module 7 addresses this failure mode, overfitting, and the principal regularization techniques that constrain it: weight decay, dropout, and early stopping.

Summary: "deep" denotes the composition of many layers, and each additional layer permits the decision boundary to become more complex. Consequently, a deep network can represent decision boundaries (such as those required for interleaved spirals) that a shallow network could match only with far more neurons, which makes depth, rather than width alone, the defining architectural property of the field.
