
Module 2 — Regression: Fitting a Line

Supervised learning · hands-on · about 30 minutes.

The simplest supervised model predicts a continuous numerical value: the sale price of a house, the duration of a delivery, the expected revenue from a campaign. This task is called regression, and its foundational model is the straight line. In this module you will adjust the line's parameters manually, observe how the error responds, and then compute the optimal line in closed form — the same calculation performed by scikit-learn's LinearRegression.

A line is a model with two parameters

A straight-line model has two parameters: a slope \( w \) and an intercept \( b \). Given an input \( x \), its prediction is:

\[ \hat{y} \;=\; w\,x + b \]

The slope \( w \) controls the line's inclination, and the intercept \( b \) shifts it vertically. Fitting the model means selecting the values of \( w \) and \( b \) that bring the line as close as possible to the observed data points, where closeness is measured along the vertical direction by a precisely defined error criterion.
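As a minimal sketch, the prediction rule can be written as a one-line function. The name `predict` and the parameter values are hypothetical, chosen only to illustrate the formula.

```python
def predict(x, w, b):
    """Return the line's prediction y_hat = w * x + b for a single input x."""
    return w * x + b

# Hypothetical parameter values, chosen only for illustration.
w, b = 2.0, 1.0

y_hat = predict(3.0, w, b)  # 2.0 * 3.0 + 1.0
print(y_hat)                # 7.0
```

Increasing `w` tilts the line more steeply, while changing `b` moves every prediction up or down by the same amount.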

Residuals: quantifying prediction error

For an observation \( (x_i, y_i) \), the residual is the signed vertical distance between the observed value and the model's prediction: \( y_i - \hat{y}_i \). Observations above the line yield positive residuals; those below yield negative residuals. The objective is to make every residual small in magnitude.
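To make the sign convention concrete, the sketch below computes residuals for three made-up observations. The data and the line parameters are assumptions for illustration, not values from the lesson's dataset.

```python
# Hypothetical toy data and line parameters (not from the lesson's dataset).
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.0]
w, b = 2.0, 1.0

# Residual = observed value minus predicted value, one per observation.
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
print(residuals)  # [0.5, -1.0, 0.0]
```

The first point lies above the line (positive residual), the second lies below it (negative residual), and the third lies exactly on it.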

The loss function: mean squared error

To collapse all residuals into a single scalar objective for minimization, we square each residual — preventing positive and negative errors from canceling, and penalizing large errors disproportionately more than small ones — and take the mean. This is the mean squared error (MSE):

\[ \text{MSE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (w x_i + b)\big)^2 \]
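A direct translation of this formula into Python might look like the following sketch. The function name `mse` and the toy data are hypothetical.

```python
def mse(xs, ys, w, b):
    """Mean of the squared residuals y_i - (w * x_i + b)."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.0]

good = mse(xs, ys, w=2.0, b=1.0)  # (0.25 + 1.0 + 0.0) / 3
bad = mse(xs, ys, w=0.0, b=0.0)   # (12.25 + 16.0 + 49.0) / 3
print(good < bad)                 # True: the closer line has the smaller MSE
```

Because each residual is squared, the single residual of magnitude 1.0 contributes four times as much to the first total as the residual of magnitude 0.5.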

Adjust \( w \) and \( b \) with the sliders — or manipulate the line directly: drag its body up or down to shift it, and drag either blue endpoint to change its tilt. The red vertical segments represent the residuals; the MSE displayed is the mean of their squared lengths. Attempt to minimize the MSE manually, then click Auto-fit to compare your result against the closed-form optimum.


How the optimal line is computed

You just minimized the MSE by hand, nudging \( w \) and \( b \) until the error stopped dropping. A computer doesn't guess and check like that. It has two standard ways to find the best line, and for a straight line both arrive at the same answer: the normal equation computes it directly, and gradient descent converges to it to within numerical tolerance.

1. The normal equation (ordinary least squares, OLS). This is a formula that takes the data and computes the optimal \( w \) and \( b \) directly, in a single step, with no trial and error and no repetition. Plug in the points, and the best-fit values come straight out. It is fast and precise for a straight line through a modest amount of data, which is why it is the classic textbook method. Its drawback is cost: the formula involves inverting a matrix whose size grows with the number of input features, and the work needed for that inversion grows roughly with the cube of that number, so it becomes far too slow once a model has very many parameters.

2. Gradient descent (Course 2, Module 6). Instead of solving a formula, this method starts from any line at all and then repeatedly takes a small step "downhill" on the MSE surface — each step lowering the error a little — until it can no longer improve. It is the slower choice for a simple line, but unlike the normal equation it keeps working when a model has millions of parameters, which is exactly the situation in deep learning. That scalability is why gradient descent, not the normal equation, trains modern neural networks.
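For a single input variable, the normal equation reduces to two well-known formulas: \( w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \) and \( b = \bar{y} - w\,\bar{x} \), where \( \bar{x} \) and \( \bar{y} \) are the means of the inputs and targets. The sketch below implements them in plain Python. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b for one input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - w * x_mean
    return w, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```

No loop over candidate lines is needed: one pass over the data yields the minimizing parameters.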

For the line above, both methods land on the same minimum. The LinearRegression().fit() call below solves the least-squares problem directly under the hood rather than iterating with gradient descent.
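The agreement between the two methods can be checked on a small example. The sketch below is illustrative only: the data, the learning rate `lr`, and the step count are assumptions tuned for this toy problem, and a learning rate that is too large would make the iteration diverge.

```python
def closed_form(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean

def gradient_descent(xs, ys, lr=0.02, steps=5000):
    """Minimize the MSE of y = w * x + b by repeated small downhill steps."""
    n = len(xs)
    w, b = 0.0, 0.0  # arbitrary starting line
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical noisy data scattered around a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

w_cf, b_cf = closed_form(xs, ys)
w_gd, b_gd = gradient_descent(xs, ys)
print(round(w_cf, 3), round(b_cf, 3))  # 2.01 0.97
print(abs(w_gd - w_cf) < 1e-6 and abs(b_gd - b_cf) < 1e-6)  # True
```

The closed-form route finishes in one pass, while gradient descent needs thousands of iterations here. That is the trade-off described above for a simple line.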

The equivalent procedure in scikit-learn — executable in the browser

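from sklearn.linear_model import LinearRegression

# X_train, y_train, and X_test are assumed to be defined already:
# the X arrays are 2-D with shape (n_samples, 1), and y_train is 1-D.
model = LinearRegression()
model.fit(X_train, y_train)       # learns w (slope) and b (intercept)
pred = model.predict(X_test)      # ŷ = wx + b on unseen data
print(model.coef_, model.intercept_)  # coef_ is an array holding w; intercept_ is b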

The .fit() method corresponds to the Train stage from Module 1 — the same optimization you performed manually with the sliders, executed automatically. Click Run it yourself to fit the model in your browser, then modify the data or parameters and re-execute.

When you run it, the program prints these values (and draws the scatter-plus-line chart):

slope     w = 2.593
intercept b = 6.355
R^2 score   = 0.915
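The R^2 score in this output is not defined elsewhere in the module. Under its usual definition as the coefficient of determination, it equals one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The sketch below assumes that definition. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # error of the model
    ss_tot = sum((y - y_mean) ** 2 for y in ys)            # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and predictions, for illustration only.
ys = [3.5, 4.0, 7.0]
preds = [3.0, 5.0, 7.0]

score = r_squared(ys, preds)
print(round(score, 3))       # 0.826
print(r_squared(ys, ys))     # 1.0 for perfect predictions
```

A score near 1 means the line explains most of the variation in the targets, and a score of 0 means it does no better than always predicting the mean.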

AI anchor: regression underlies most numerical prediction systems

Automated home-value estimation, demand forecasting, click-through-rate prediction, delivery-time estimation, and the expected-revenue models that drive online advertising auctions are all instances of regression. In each case the output is a continuous quantity, and the model is trained by minimizing a squared-error (or analogous) loss. Even the final layer of a neural network is, in many cases, a regression: a weighted sum of the form \( w \cdot x + b \), structurally identical to the line you fitted above but defined over a much higher-dimensional input space.
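The step from one input to many can be sketched in a few lines. The function name `linear_unit` and all the numbers below are hypothetical. With a single feature, the same function reduces to the line \( \hat{y} = wx + b \).

```python
def linear_unit(x, w, b):
    """Weighted sum w . x + b for a feature vector x and a weight vector w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical three-feature input and weights, for illustration only.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.25
print(linear_unit(x, w, b))          # 0.5 - 2.0 + 6.0 + 0.25 = 4.75

# With one feature this is exactly the straight-line model.
print(linear_unit([3.0], [2.0], 1.0))  # 7.0
```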

Check your understanding

Answer a short set of questions on linear models, residuals, and the mean squared error.


Why this matters next

Passing the line's output through a sigmoid (S-shaped) curve transforms a regression on a continuous value into a regression on a probability. This construction defines logistic regression and provides the foundation for classification (Module 3). The squared-error framework introduced here also reappears in Module 8 as the principal tool for distinguishing genuine fit from overfitting.
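As a preview only, the sigmoid construction can be sketched as follows. The function names and parameter values are hypothetical, and Module 3 remains the place where logistic regression is developed properly.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict_probability(x, w, b):
    """Pass the line's output w * x + b through the sigmoid."""
    return sigmoid(w * x + b)

# Hypothetical parameters, for illustration only.
w, b = 2.0, -4.0
print(predict_probability(2.0, w, b))        # 0.5, because w * x + b = 0
print(predict_probability(5.0, w, b) > 0.5)  # True: a larger line output gives a higher probability
```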

Summary: linear regression predicts a continuous value using the model \( \hat{y} = wx + b \), and fitting consists of selecting the parameters \( w \) and \( b \) that minimize the mean squared error — the mean of the squared residuals between the observed values and the model's predictions.