
Module 8 — Honest Evaluation

Capstone · the question that decides if any model is real · about 35 minutes.

Every model in this course had a knob: k in k-NN, depth in a tree, the degree of a curve. Turn it toward more flexibility (smaller k, deeper tree, higher degree) and training accuracy climbs toward perfect. This capstone is about the uncomfortable truth behind that climb: fitting the training data better and better eventually makes the model worse on everything else. Learning to measure that honestly is what separates a real model from a fooled one.

The trap: memorizing isn’t learning

A model’s job is to generalize — to do well on data it has never seen. But it only ever sees the training set, so a flexible enough model can ace that set by memorizing every point, noise included. That’s overfitting: brilliant on the data it studied, lost on anything new. The only honest test is to hold out data the model never touched.
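A minimal sketch of memorizing versus learning, under stated assumptions: the data is synthetic (the true rule is the sign of the first feature, with 20% of labels flipped as noise), and 1-nearest-neighbor stands in for "a flexible enough model." None of these choices come from the lesson; they are only there to make the gap visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): the true rule is "x0 > 0",
# then 20% of the labels are flipped to act as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y = np.where(flip, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# k=1 memorizes: every training point is its own nearest neighbor.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = memorizer.score(X_tr, y_tr)
test_acc = memorizer.score(X_te, y_te)

# k=15 is stiffer: it votes over a neighborhood, so it cannot copy the noise.
smoother = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
smooth_train_acc = smoother.score(X_tr, y_tr)
smooth_test_acc = smoother.score(X_te, y_te)

print(f"k=1   train={train_acc:.2f}  test={test_acc:.2f}")
print(f"k=15  train={smooth_train_acc:.2f}  test={smooth_test_acc:.2f}")
```

The memorizer's training score is perfect by construction, so it says nothing about the model; only the held-out score does.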

Train vs. test: the U-curve

In this activity, the same wiggly data is fit by a curve whose flexibility (polynomial degree) you control. Watch two numbers as you turn it up: training error, which keeps falling, and test error, which falls at first and then climbs again, tracing a U.

The bottom of the test curve is the sweet spot. Too far left is underfitting (too stiff to catch the pattern); too far right is overfitting (so bendy it memorizes wiggles that won’t repeat).
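The same experiment can be sketched offline. The data here is an assumption (a noisy sine wave, 30 training points, 1000 test points), as is the degree range 1 to 15; the lesson's activity may use different data. The sketch only shows how the two error numbers are computed side by side.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy sine wave on [-1, 1] (hypothetical stand-in for the lesson's data)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x.reshape(-1, 1), y

X_tr, y_tr = make_data(30)     # small training set
X_te, y_te = make_data(1000)   # large held-out set

degrees = list(range(1, 16))
train_err, test_err = [], []
for d in degrees:
    # Polynomial of degree d fit by ordinary least squares.
    model = make_pipeline(PolynomialFeatures(d, include_bias=False), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_err.append(mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree {d:2d}  train MSE {train_err[-1]:.3f}  test MSE {test_err[-1]:.3f}")

best_degree = degrees[int(np.argmin(test_err))]
print("lowest test error at degree", best_degree)
```

Reading the printed table from top to bottom is the same as turning the knob: the training column shrinks, while the test column has a minimum somewhere in the middle.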


Bias and variance: the two ways to be wrong

The U-curve has a name for each side. Bias is error from a model too simple to capture the truth (the stiff line — it’s wrong the same way every time). Variance is error from a model so flexible it changes wildly with every new sample (the wiggly curve — it chases noise). You can’t kill both at once; tuning a model is choosing the balance point. That’s the bias–variance tradeoff, and it is the single idea underneath every knob in this course.
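Bias and variance can be estimated by simulation: refit the same kind of model on many fresh noisy samples and look at how the predictions behave. The sketch below makes several assumptions that are not in the lesson: the truth is a sine wave, the training x-positions stay fixed while only the noise is redrawn, and degrees 1 and 10 stand in for "stiff" and "bendy."

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)   # fixed training positions (an assumption)
x_grid = np.linspace(-1, 1, 50)    # where predictions are compared
truth = np.sin(3 * x_grid)         # assumed true function

def bias2_and_variance(degree, n_repeats=200, noise=0.3):
    """Refit a degree-`degree` polynomial on many noisy samples of the same truth."""
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        y_train = np.sin(3 * x_train) + rng.normal(scale=noise, size=x_train.size)
        coefs = P.polyfit(x_train, y_train, degree)
        preds[i] = P.polyval(x_grid, coefs)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - truth) ** 2)   # how far the average fit is from the truth
    variance = np.mean(preds.var(axis=0))       # how much fits scatter around their average
    return bias2, variance

stiff_bias2, stiff_var = bias2_and_variance(1)
bendy_bias2, bendy_var = bias2_and_variance(10)
print(f"degree 1 : bias^2={stiff_bias2:.4f}  variance={stiff_var:.4f}")
print(f"degree 10: bias^2={bendy_bias2:.4f}  variance={bendy_var:.4f}")
```

The stiff model's error is mostly bias (its average fit misses the curve), while the bendy model's error is mostly variance (each refit lands somewhere different).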

Cross-validation: don’t trust one split

A single held-out test set can be lucky or unlucky. k-fold cross-validation reduces that luck: split the data into \( k \) parts (this \( k \) is the number of folds, unrelated to the k in k-NN), train on \( k-1 \) of them and test on the one left out, rotate through all \( k \), and average the scores. Every point gets to be test data exactly once, which gives a steadier estimate than any single split.
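The rotation described above can be written out by hand. This is a sketch, not the course's implementation: it assumes scikit-learn's bundled diabetes dataset and a plain linear regression as stand-ins, uses `KFold` to produce the splits, and checks the manual loop against `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)   # stand-in data (an assumption)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
times_tested = np.zeros(len(y), dtype=int)
for train_idx, test_idx in kf.split(X):
    # Train on 4 parts, score on the 1 part left out.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print("per-fold R^2:", np.round(scores, 3))
print("average R^2 :", round(float(np.mean(scores)), 3))

# The library helper does the same rotation in one call.
library_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
```

The spread of the per-fold scores is itself useful: it shows how much a single split could have misled you.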

The same thing in scikit-learn
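from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_diabetes(return_X_y=True)  # stand-in data (an assumption; use your own X, y)
model = LinearRegression()             # stand-in model (an assumption; any estimator works)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # hold out 20%
model.fit(X_tr, y_tr)
print(model.score(X_tr, y_tr))   # optimistic: scored on data the model studied
print(model.score(X_te, y_te))   # the honest number: data it never saw

print(cross_val_score(model, X, y, cv=5).mean())  # 5-fold: average of 5 held-out scores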

.score() on held-out data, never on the training data, is the number that tells you if the model is real. Cross-validation just makes it sturdier. Score the same model on its own training data as well: the gap between that optimistic number and the honest held-out one shows how much the model has memorized.
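To see that gap across a whole range of knob settings, one option is to compare the training score with the cross-validated score at each setting. The sketch below assumes a decision tree on scikit-learn's bundled diabetes dataset and an arbitrary list of depths; all three are illustrative choices, not taken from the lesson.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)   # stand-in data (an assumption)

results = {}
for depth in [1, 2, 3, 5, 8, None]:     # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    train_r2 = tree.fit(X, y).score(X, y)                # optimistic: same data for fit and score
    cv_r2 = cross_val_score(tree, X, y, cv=5).mean()     # held-out: 5-fold average
    results[depth] = (train_r2, cv_r2)
    print(f"max_depth={depth}:  train R^2={train_r2:.2f}  cross-val R^2={cv_r2:.2f}")
```

The depth worth keeping is the one with the best cross-validated score, not the best training score.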

AI anchor: the discipline behind every model that ships

Honest evaluation is the unglamorous backbone of all of machine learning, from a tiny tree to a giant language model. Every benchmark, every "state of the art," every claim that a model "works" rests on held-out data and the bias–variance balance. A team that skips it ships models that dazzle in the demo and fail in production. This is the judgment that turns the algorithms from Modules 1–7 into something trustworthy, and it’s the perfect place to stand before Course 4 scales these ideas up to deep learning.

Capstone check — tie it all together

The final questions span the whole course: the workflow, each model, and the evaluation that judges them all. You will get a score.


You finished the trunk: what’s next

You’ve walked the full machine-learning workflow and met every classic model, each one trained live in your browser, plus the evaluation that keeps them honest. Course 4, Deep Learning, takes the gradient descent of Course 2 and the generalization discipline you just learned and scales them into neural networks: the engines behind modern vision, language, and the AI systems you started with in Course 1.

One-sentence summary: a model is only as good as its performance on data it never saw. Training error keeps falling as flexibility grows, but test error forms a U whose bottom is the bias–variance sweet spot, found honestly with a held-out set and confirmed with cross-validation.
