Module 8 — Honest Evaluation
Every model in this course had a knob — k in k-NN, depth in a tree, the degree of a curve. Crank it and training accuracy climbs toward perfect. This capstone is about the uncomfortable truth behind that climb: fitting the training data better and better eventually makes the model worse on everything else. Learning to measure that honestly is what separates a real model from a fooled one.
The trap: memorizing isn’t learning
A model’s job is to generalize — to do well on data it has never seen. But it only ever sees the training set, so a flexible enough model can ace that set by memorizing every point, noise included. That’s overfitting: brilliant on the data it studied, lost on anything new. The only honest test is to hold out data the model never touched.
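To make the memorization trap concrete, here is a minimal sketch using a hand-rolled 1-nearest-neighbor classifier, the most extreme "memorizer" available. The data generator `make_noisy_blobs`, its 15% label-flip rate, and the helper `predict_1nn` are all assumptions made up for this illustration, not part of the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_blobs(n):
    """Hypothetical data: two overlapping 2-D Gaussian classes, 15% of labels flipped."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
    flip = rng.random(n) < 0.15
    return X, np.where(flip, 1 - y, y)

def predict_1nn(X_train, y_train, X_query):
    """Label each query point with the label of its single nearest training point."""
    sq_dist = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[sq_dist.argmin(axis=1)]

X_tr, y_tr = make_noisy_blobs(200)   # the data the model studies
X_te, y_te = make_noisy_blobs(200)   # fresh data it never touched

# On its own training set, every point is its own nearest neighbor,
# so the model reproduces every label, flipped ones included.
train_acc = (predict_1nn(X_tr, y_tr, X_tr) == y_tr).mean()
test_acc = (predict_1nn(X_tr, y_tr, X_te) == y_te).mean()

print("accuracy on the data it memorized:", train_acc)
print("accuracy on data it never saw:   ", test_acc)
```

The training score here is perfect by construction, which is exactly why it says nothing about generalization; only the second number is informative.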
Train vs. test: the U-curve
Picture the same wiggly data fit by a curve whose flexibility (polynomial degree) you control. Watch two numbers as you turn it up:
- Training error: measured on the points the curve fit. It keeps falling, because a more flexible curve can always fit the seen data at least as well.
- Test error: measured on held-out points. It falls at first, bottoms out, then rises as the curve starts chasing noise.
That bottom of the test curve is the sweet spot. Too far left is underfitting (too stiff to catch the pattern); too far right is overfitting (so bendy it memorizes wiggles that won’t repeat).
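The two curves can be reproduced in a few lines. This is a sketch under assumed settings: the target function `sin(3x)`, the noise level, the sample sizes, and the degree range are all illustrative choices, and `numpy.polynomial.Polynomial.fit` stands in for whatever curve-fitter the activity uses.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    """Hypothetical 'wiggly' data: a sine wave plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_tr, y_tr = sample(30)    # points the curve is fit to
x_te, y_te = sample(200)   # held-out points, never used for fitting

degrees = list(range(1, 13))
train_mse, test_mse = [], []
for d in degrees:
    curve = Polynomial.fit(x_tr, y_tr, deg=d)   # least-squares polynomial of degree d
    train_mse.append(float(np.mean((curve(x_tr) - y_tr) ** 2)))
    test_mse.append(float(np.mean((curve(x_te) - y_te) ** 2)))

best_degree = degrees[int(np.argmin(test_mse))]
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}  train MSE {tr:.4f}  test MSE {te:.4f}")
print("lowest test error at degree", best_degree)
```

Reading down the printed table, the training column should only shrink, while the test column typically dips and then climbs again; the exact degree at the bottom depends on the random seed and the noise level assumed here.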
Bias and variance: the two ways to be wrong
The U-curve has a name for each side. Bias is error from a model too simple to capture the truth (the stiff line: it is wrong the same way every time). Variance is error from a model so flexible that its fit changes wildly with every new training sample (the wiggly curve: it chases noise). You can’t drive both to zero at once, because pushing one down tends to push the other up; tuning a model is choosing the balance point. That’s the bias–variance tradeoff, and it is the single idea underneath every knob in this course.
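Bias and variance can be estimated directly by refitting the same model on many fresh training samples and looking at how the predictions behave. The sketch below does that for a stiff model (degree 1) and a flexible one (degree 8). The true function, the noise level, the fixed training inputs, and the helper `bias_variance` are assumptions chosen for illustration; only the noise is redrawn between repeats.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_SD = 0.2                             # assumed noise level

def truth(x):
    """Hypothetical true pattern the models are trying to recover."""
    return np.sin(3.0 * x)

x_train = np.linspace(-1.0, 1.0, 30)       # fixed inputs; only the noise changes
x_grid = np.linspace(-1.0, 1.0, 101)       # where predictions are compared

def bias_variance(degree, n_repeats=200):
    """Refit on n_repeats noisy samples; return (bias squared, variance), averaged over x_grid."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_train = truth(x_train) + rng.normal(scale=NOISE_SD, size=x_train.size)
        preds[r] = Polynomial.fit(x_train, y_train, deg=degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - truth(x_grid)) ** 2))   # systematic miss
    variance = float(np.mean(preds.var(axis=0)))                 # wobble between refits
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 8)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions the straight line should show the larger bias and the degree-8 curve the larger variance, which is the tradeoff in miniature.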
Cross-validation: don’t trust one split
A single held-out test set can be lucky or unlucky. k-fold cross-validation fixes this: split the data into \( k \) parts, train on \( k-1 \) and test on the one left out, rotate through all \( k \), and average the scores. Every point gets to be test data exactly once — a far more honest estimate than a single split.
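The rotation can be written out by hand to see what the procedure does. This sketch uses scikit-learn's `KFold` to generate the splits and a logistic regression as a stand-in model; the synthetic dataset and the `times_tested` counter are illustrative assumptions. Note that `cross_val_score` with a classifier uses stratified, unshuffled folds by default, so its scores will not match these exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical dataset, used only to have something to split
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
times_tested = np.zeros(len(X), dtype=int)   # how often each point serves as test data
scores = []

for train_idx, test_idx in kfold.split(X):
    # Train on k-1 folds, score on the fold left out
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print("fold scores =", np.round(scores, 3))
print("mean score  =", round(float(np.mean(scores)), 3))
print("every point tested exactly once:", bool((times_tested == 1).all()))
```

A fresh model is fit inside the loop on purpose: reusing one fitted model across folds would let information from earlier test folds leak into later scores.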
```python
from sklearn.model_selection import train_test_split, cross_val_score

# Sketch only: assumes X, y, and an unfitted scikit-learn model already exist.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)  # hold out 20%
model.fit(X_tr, y_tr)
model.score(X_te, y_te)             # the ONLY honest number
cross_val_score(model, X, y, cv=5)  # 5-fold: five held-out scores to average
```

The same idea as a complete, runnable script:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

# One split: train on 80%, judge on the held-out 20%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy on TRAINING data (optimistic) =", round(model.score(X_tr, y_tr), 3))
print("accuracy on HELD-OUT data (honest) =", round(model.score(X_te, y_te), 3))

# 5-fold cross-validation: five honest splits instead of one
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold scores =", np.round(scores, 3))
print("cross-val mean =", round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```
.score() on held-out data — never on the training data — is the number that tells you if the model is real. Cross-validation just makes it sturdier. Hit Run it yourself and compare the optimistic training number against the honest held-out one.
Capstone check — tie it all together
The final questions span the whole course: the workflow, each model, and the evaluation that judges them all. You will get a score.