Module 6 — Derivatives & Gradient Descent

Pillar 3 · Optimization · hands-on · about 30 minutes.

You now understand that a loss function returns a single number (a scalar) quantifying a model's error. This module addresses the central question of machine learning: how is that quantity minimized? The answer is gradient descent: iteratively following the slope toward lower values. The first prerequisite is the slope itself, which is measured by the derivative.

The derivative: the slope of a function at a point

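The derivative of a function is its slope: the rate at which the output changes with respect to the input. Geometrically, it is the slope of the line tangent to the curve at a given point. A steep increasing curve has a large positive derivative; a steep decreasing curve a large negative derivative; a flat region a derivative of zero. Over a small interval, the slope is approximated by the ratio of the change in output to the change in input, and the derivative is the value this ratio approaches as the interval shrinks.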

\[ \text{slope} \;=\; \frac{\text{change in output}}{\text{change in input}} \;=\; \frac{\Delta y}{\Delta x} \]
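The ratio above can be evaluated numerically by choosing a small change in input. The following is a minimal sketch, not part of the lesson: the example curve `f`, the helper `slope_at`, and the step size `h` are all assumptions chosen for illustration.

```python
def f(x):
    """Example curve (hypothetical): a parabola with its lowest point at x = 3."""
    return (x - 3.0) ** 2


def slope_at(func, x, h=1e-5):
    """Approximate the derivative as change in output / change in input.

    Uses a symmetric (central) difference over the interval [x - h, x + h].
    """
    delta_y = func(x + h) - func(x - h)
    delta_x = 2.0 * h
    return delta_y / delta_x


# The exact derivative of this curve is 2 * (x - 3), so the estimates
# should be close to -6 (decreasing), 0 (flat), and 4 (increasing).
for x in (0.0, 3.0, 5.0):
    print(f"x = {x}: estimated slope = {slope_at(f, x):.4f}")
```

The sign of each estimate matches the description above: negative where the curve is falling, near zero where it is flat, and positive where it is rising.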

Manual computation of derivatives is not required here; what matters is the single principle underlying training: the sign of the derivative tells you which way is uphill. Where the derivative \( f'(w) \) is positive, the output increases as the input increases; where it is negative, the output decreases as the input increases. The sign of \( f'(w) \) therefore gives the direction of increase, and its negative, \( -f'(w) \), gives the direction of decrease, which is the direction in which gradient descent steps to reduce the loss.

Minima: points at which the derivative is zero

A minimum of a function is a point of locally lowest value; on a smooth curve the slope there is zero. (The converse does not hold: a maximum also has zero slope.) The lowest point of a loss function corresponds to the best model achievable under that loss. Gradient descent is a procedure for approaching such a minimum without prior knowledge of its location, using only the local slope at the current point.
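As a worked illustration, consider an assumed example curve (not one taken from the lesson), \( f(w) = (w - 3)^2 \). Its derivative, and the point at which that derivative vanishes, are:

\[ f'(w) \;=\; 2\,(w - 3), \qquad f'(w) = 0 \;\iff\; w = 3 \]

For \( w < 3 \) the slope is negative and for \( w > 3 \) it is positive, so the downhill direction points toward \( w = 3 \) from either side, and \( f(3) = 0 \) is the lowest value the curve attains.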

The activity below illustrates this local-information constraint: the global shape of the curve is unknown, and only the slope at the current position is available. Repeatedly taking sufficiently small steps in the downhill direction approaches a minimum.

Move the point along the curve below and observe the tangent line. The slope of the tangent is the derivative at that point; at the minimum of the curve the tangent is horizontal and the slope is \( 0 \).

At a minimum the tangent line is horizontal, so the derivative (slope) is \( 0 \), which is why gradient descent comes to rest there.

Gradient descent: the iterative update rule

From any point on the curve, evaluate the slope and take a small step in the direction of decreasing value, then repeat. This update rule is the basis of essentially all model training:

\[ w_{\text{new}} \;=\; w_{\text{old}} \;-\; \eta \cdot \text{slope} \]

You will also see this written compactly with an update arrow: \( w \leftarrow w - \eta \cdot \text{slope} \). The arrow \( \leftarrow \) means update — "replace \( w \) with the value on the right." It is an instruction, not an equation (just as \( w = w + 1 \) in code increases \( w \) rather than claiming the two are equal).
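The update rule can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions rather than the lesson's own implementation: the loss curve \( (w - 3)^2 \), its hand-written derivative, the starting point, and the names `loss`, `slope`, and `gradient_descent` are all hypothetical.

```python
def loss(w):
    """Assumed example loss: a parabola whose minimum is at w = 3."""
    return (w - 3.0) ** 2


def slope(w):
    """Derivative of the assumed loss, written out by hand."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, eta, steps):
    """Repeat the update w <- w - eta * slope and record every value of w."""
    history = [w]
    for _ in range(steps):
        w = w - eta * slope(w)  # the update rule from the text
        history.append(w)
    return history


path = gradient_descent(w=0.0, eta=0.1, steps=50)
print(f"start: w = {path[0]:.4f}, loss = {loss(path[0]):.4f}")
print(f"end:   w = {path[-1]:.4f}, loss = {loss(path[-1]):.8f}")
```

Note that the line `w = w - eta * slope(w)` is an assignment, not an equation: it overwrites `w` with the value on the right, exactly as the update arrow describes.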

Here \( \eta \) (eta) is the learning rate, which determines the step size. If \( \eta \) is too small, convergence is slow; if it is too large, the updates overshoot the minimum and may diverge. The activity below allows you to set the initial point on a loss curve and adjust \( \eta \). Identify the value that reaches the minimum most rapidly without diverging.
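The effect of the learning rate can also be explored numerically. The sketch below assumes the example loss \( (w - 3)^2 \) with slope \( 2(w - 3) \); the function `run` and the particular values of \( \eta \) are hypothetical, and the thresholds at which this curve converges or diverges do not carry over to other loss functions.

```python
def slope(w):
    """Derivative of the assumed loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


def run(eta, w=0.0, steps=20):
    """Apply the update rule a fixed number of times and return the final w."""
    for _ in range(steps):
        w = w - eta * slope(w)
    return w


# Compare a very small, a moderate, a well-matched, and a too-large rate.
for eta in (0.01, 0.1, 0.5, 1.1):
    w_final = run(eta)
    distance = abs(w_final - 3.0)
    print(f"eta = {eta}: w after 20 steps = {w_final:.4f}, "
          f"distance from minimum = {distance:.4f}")
```

On this particular curve the smallest rate is still far from the minimum after 20 steps, the moderate rates arrive close to it, and the largest rate overshoots by more on every step, so the distance from the minimum grows instead of shrinking.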

The gradient: the vector of partial derivatives

Real models have millions of parameters rather than one. The gradient is the collection of partial derivatives — one per parameter — assembled into a vector that points in the direction of steepest increase. Gradient descent steps in the opposite direction. The one-dimensional case demonstrated here generalizes directly: a real model performs gradient descent in a space of millions of dimensions.
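The same update applies to each parameter when there is more than one. As a minimal sketch with only two parameters, the code below assumes a made-up loss \( (w_1 - 1)^2 + 2(w_2 + 2)^2 \) whose partial derivatives are written out by hand; the names `loss`, `gradient`, and `step` are hypothetical.

```python
def loss(w):
    """Assumed two-parameter loss with its minimum at w = [1, -2]."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2


def gradient(w):
    """Vector of partial derivatives, one entry per parameter."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]


def step(w, eta):
    """Move every parameter against its own partial derivative."""
    g = gradient(w)
    return [w_i - eta * g_i for w_i, g_i in zip(w, g)]


w = [0.0, 0.0]
for _ in range(100):
    w = step(w, eta=0.1)

print(f"w = [{w[0]:.4f}, {w[1]:.4f}], loss = {loss(w):.8f}")
```

Nothing in `step` depends on the number of parameters, which is why the procedure extends to models with many more of them.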

AI anchor: this is how models learn

Nearly every neural network, including every large language model, is trained by gradient descent or a close variant of it, and logistic regression can be trained the same way. The model produces predictions, a loss function quantifies their error, backpropagation computes the gradient of the loss with respect to every weight, and the update rule \( w \leftarrow w - \eta \cdot \text{gradient} \) adjusts each weight in the direction of decreasing loss. Iterated over many examples, the loss decreases and the model learns. The learning rate is among the most important hyperparameters in deep learning; an inappropriate value causes either slow convergence or divergence, as the activity demonstrates.
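The predict, measure, update cycle described above can be sketched on a toy problem. Everything here is an assumption made for illustration: the data are made up, the model has a single weight, the loss is mean squared error, and the gradient is derived by hand as a stand-in for what backpropagation would compute in a real network.

```python
# Made-up training data for illustration: the targets follow y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def predict(w, x):
    """One-weight model (hypothetical): prediction = w * x."""
    return w * x


def mse_loss(w):
    """Mean squared error of the model over the data set."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w):
    """Derivative of the loss with respect to w, derived by hand.

    In a real network, backpropagation would compute this automatically.
    """
    return sum(2.0 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(w=0.0, eta=0.05, steps=30):
    """Apply w <- w - eta * gradient repeatedly, recording the loss."""
    losses = [mse_loss(w)]
    for _ in range(steps):
        w = w - eta * mse_gradient(w)
        losses.append(mse_loss(w))
    return w, losses


w_trained, losses = train()
print(f"initial loss: {losses[0]:.4f}")
print(f"final loss:   {losses[-1]:.8f}")
print(f"learned w:    {w_trained:.4f}")
```

Because the made-up targets were generated with a weight of 2, the learned weight should end up close to 2 and the loss close to zero.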

Check your understanding

Predict the behavior of gradient descent under various conditions.

Why this matters next

Gradient descent is the optimization procedure underlying every model trained in Courses 3 and 4. The loss function it minimizes was introduced in Module 1; the gradient it follows is the vector introduced in Module 4; and in Module 8 you will trace a single gradient step of a real model. Optimization integrates the preceding mathematical pillars.

Summary: a derivative is the slope of a function; gradient descent iteratively updates a parameter in the direction of decreasing loss via the rule \( w \leftarrow w - \eta \cdot \text{slope} \), where the learning rate \( \eta \) controls the step size — and this single procedure is the mechanism by which essentially every machine-learning model is trained.
