
Module 6 — Derivatives & Gradient Descent

Pillar 3 · Optimization · hands-on · about 30 minutes.

You now know what a loss function is: a number measuring how wrong a model is. This module answers the question at the heart of machine learning: how does a model make that number smaller? The answer is gradient descent, which means following the slope downhill. First we need the slope, which is what a derivative measures.

The derivative: slope at a point

The derivative of a function is its slope — how fast the output changes as you nudge the input. On a curve, it's the steepness of the line that just touches the curve at a point. Steep and rising → large positive derivative; steep and falling → large negative; flat → zero.

\[ \text{slope} \;=\; \frac{\text{change in output}}{\text{change in input}} \;=\; \frac{\Delta y}{\Delta x} \]
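
To make that ratio concrete, here is a minimal Python sketch that estimates a slope by nudging the input a tiny amount in each direction. The curve `f`, the helper name `slope_at`, and the nudge size `h` are illustrative choices made for this example, not something the lesson prescribes.

```python
def f(x):
    """Example curve: a parabola whose lowest point is at x = 3."""
    return (x - 3) ** 2


def slope_at(func, x, h=1e-6):
    """Estimate the slope of func at x.

    Change in output divided by change in input, for a tiny change
    in input (2*h wide, centered on x).
    """
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (0.0, 3.0, 5.0):
    print(f"x = {x}: slope is about {slope_at(f, x):.3f}")
```

The estimates should come out close to -6 at x = 0 (steep and falling), close to 0 at x = 3 (flat), and close to 4 at x = 5 (rising).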

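You don't need to compute derivatives by hand here. You need the one fact that drives training: the sign of the derivative tells you which way is uphill. A positive derivative means the curve rises as the input increases; a negative one means it falls. Move against that sign and you go downhill. That sign is the compass.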

Minima: where the slope is zero

A function's minimum is its lowest point — and there, the slope is flat (zero). A loss function's minimum is the best the model can do. Gradient descent is a procedure for walking to that bottom without ever being told where it is, using only the local slope under your feet.
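
As a small worked example (the function is chosen for illustration, not taken from the demo), consider a bowl-shaped loss with a single parameter \( w \):

\[ L(w) \;=\; (w - 3)^2, \qquad \text{slope} \;=\; \frac{dL}{dw} \;=\; 2\,(w - 3), \qquad \text{slope} = 0 \;\iff\; w = 3 \]

To the left of \( w = 3 \) the slope is negative and to the right it is positive, so stepping against the slope moves toward the bottom from either side.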

Gradient descent: roll the ball downhill

Stand somewhere on the curve. Measure the slope. Take a small step in the downhill direction. Repeat. The step rule is the heart of all model training:

\[ w_{\text{new}} \;=\; w_{\text{old}} \;-\; \eta \cdot \text{slope} \]

Here \( \eta \) (eta) is the learning rate — how big a step to take. Too small and you crawl; too big and you overshoot and bounce — maybe forever. The demo below lets you drop a ball on a loss curve and tune \( \eta \). Find the rate that reaches the bottom fastest without blowing up.
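
The step rule can be sketched in a few lines of Python. This assumes the example loss \( (w - 3)^2 \), whose slope is \( 2(w - 3) \); the function names, the starting point, and the four learning rates are hypothetical choices for illustration and are not the values used by the demo.

```python
def slope(w):
    """Slope of the example loss (w - 3)**2."""
    return 2 * (w - 3)


def gradient_descent(w_start, eta, steps):
    """Apply w_new = w_old - eta * slope repeatedly; return every w visited."""
    w = w_start
    path = [w]
    for _ in range(steps):
        w = w - eta * slope(w)
        path.append(w)
    return path


for eta in (0.01, 0.4, 1.0, 1.1):
    final_w = gradient_descent(w_start=0.0, eta=eta, steps=20)[-1]
    print(f"eta = {eta}: w after 20 steps = {final_w:.4f}")
```

On this particular curve, the smallest rate is still far from the bottom at w = 3 after 20 steps, 0.4 lands on it almost exactly, 1.0 bounces between 0 and 6 without ever settling, and 1.1 moves further away with every step.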


The gradient: slope in many directions at once

Real models have millions of parameters, not one. The gradient is just the collection of slopes — one per parameter — bundled into a vector that points in the direction of steepest increase. Gradient descent steps in the opposite direction. The 1-D ball you're rolling is the same idea; a real model rolls downhill in a million-dimensional bowl.
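
Here is a minimal sketch of the same idea with two parameters instead of a million. The loss function, the helper names `gradient` and `step`, and the learning rate are all assumptions made for this example. The gradient is estimated by nudging one parameter at a time, which is fine for a toy but is not how real frameworks compute it.

```python
def loss(w):
    """Example bowl over two parameters, lowest at w = [1, -2]."""
    return (w[0] - 1) ** 2 + 3 * (w[1] + 2) ** 2


def gradient(func, w, h=1e-6):
    """Estimate one slope per parameter by nudging each parameter in turn."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad


def step(w, grad, eta):
    """Move every parameter against its own slope."""
    return [w_i - eta * g_i for w_i, g_i in zip(w, grad)]


w = [0.0, 0.0]
eta = 0.1
for _ in range(100):
    w = step(w, gradient(loss, w), eta)

print([round(w_i, 3) for w_i in w])
```

After 100 steps the parameters should sit very close to the bottom of the bowl at [1, -2]. The only change from the 1-D version is that the slope is now a list with one entry per parameter.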

AI anchor: this is literally how models learn. Nearly every neural network, including every large language model, is trained by gradient descent or a close variant of it, and logistic regression commonly is too. The model makes predictions, a loss function scores how wrong they are, backpropagation computes the gradient of that loss with respect to every weight, and the step rule \( w \leftarrow w - \eta \cdot \text{gradient} \) nudges each weight slightly downhill. Repeat over millions of examples and the loss falls: the model "learns." The learning-rate slider you're tuning is one of the most important knobs in all of deep learning; set it wrong and training either crawls or explodes, exactly as you'll see below.
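
The predict, score, gradient, step loop described above can be sketched on a toy model. This assumes a straight-line model \( y = w x + b \) with mean squared error as the loss, and made-up data; the gradient is written out by hand here rather than computed by backpropagation, and the learning rate and step count are arbitrary illustrative values.

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0  # the model's two parameters
eta = 0.05       # learning rate


def mean_squared_error(w, b):
    """Loss: average squared gap between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = mean_squared_error(w, b)

for _ in range(2000):
    # 1. Predict, and measure the error on each example.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # 2. Gradient of the mean squared error with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # 3. Step rule: nudge each parameter against its slope.
    w -= eta * grad_w
    b -= eta * grad_b

final_loss = mean_squared_error(w, b)
print(f"w = {w:.3f}, b = {b:.3f}, loss {initial_loss:.3f} -> {final_loss:.6f}")
```

The parameters should end up close to w = 2 and b = 1, the line the data was generated from, with the loss falling from its starting value to nearly zero.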

Tune the training

Predict what happens to gradient descent under different conditions. You'll get a score.


Why this matters next: Gradient descent is the engine under every model you'll train in Courses 3 and 4; running this loop is what "training" means. The loss function it minimizes came from Module 1; the gradient it follows is a vector from Module 4; and in Module 8 you'll watch a real model take one gradient step by hand. Optimization is where all the other pillars meet.

One-sentence summary: a derivative is the slope of a function; gradient descent repeatedly steps a parameter downhill with the rule \( w \leftarrow w - \eta \cdot \text{slope} \), where the learning rate \( \eta \) controls step size, and this single loop is how nearly every modern machine-learning model learns.
