
Module 8 — Honest Evaluation

Module 8 · Project · evaluating model performance honestly · about 35 minutes.

Every model in this course is governed by a complexity hyperparameter: k in k-NN, depth in a decision tree, the degree of a polynomial. Increasing model complexity (for k-NN, this means decreasing k) drives training accuracy toward 100%. This closing module addresses the central methodological insight that follows: beyond a certain point, the added complexity that keeps improving training accuracy degrades generalization performance. Learning to measure generalization rigorously is what distinguishes a genuine model from one whose apparent performance is an artifact of overfitting to its training data.

The fundamental distinction: memorization is not learning

The objective of a model is to generalize — to perform accurately on observations it has not previously seen. However, the model is fit using only the training data, so a sufficiently flexible model can attain near-perfect training accuracy by memorizing each observation, including its noise component. This phenomenon is overfitting: high performance on the training set, poor performance on new data. The only valid empirical test of generalization is performance on data that was withheld from the training process.
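
The gap between memorizing and generalizing can be made concrete with a small sketch. The dataset below is synthetic and the choice of a 1-nearest-neighbor classifier is an assumption made for illustration; neither comes from the lesson's own activity. With k = 1, every training point is its own nearest neighbor, so the model reproduces each training label, noise included.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): flip_y assigns random
# labels to about 15% of the points, which plays the role of label noise.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# With k=1, each training point is its own nearest neighbor, so the model
# simply looks up the stored label: pure memorization.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = memorizer.score(X_tr, y_tr)
test_acc = memorizer.score(X_te, y_te)

print(f"training accuracy = {train_acc:.3f}")
print(f"held-out accuracy = {test_acc:.3f}")
```

The training score is perfect by construction, so it carries no information about how the model behaves on new observations; only the held-out score does.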

Training versus test error: the characteristic U-curve

Below, a noisy nonlinear dataset is fit by a polynomial whose flexibility (degree) you control. Observe the two quantities as the degree is increased:

Training error — computed on the points used to fit the polynomial. This quantity decreases monotonically: greater flexibility always permits a closer fit to the training data.
Test error — computed on held-out points not used in fitting. This quantity initially decreases, reaches a minimum, then increases as the polynomial begins to fit noise rather than signal.

The minimum of the test-error curve identifies the optimal model complexity. Complexity below this value yields underfitting — the model lacks the flexibility to represent the underlying pattern. Complexity above this value yields overfitting — the model captures random variation in the training sample that does not generalize.
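
The two curves described above can be approximated offline with a short sketch. Everything here is an assumption made for illustration: the sine-shaped signal, the noise level, the sample sizes, and the use of NumPy's `Polynomial.fit` in place of whatever the interactive activity uses internally.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical "signal"; the lesson's actual dataset is not specified.
    return np.sin(2 * np.pi * x)

# Noisy samples: 30 training points and 200 held-out points.
x_tr = rng.uniform(0, 1, 30)
y_tr = true_fn(x_tr) + rng.normal(0, 0.3, x_tr.size)
x_te = rng.uniform(0, 1, 200)
y_te = true_fn(x_te) + rng.normal(0, 0.3, x_te.size)

degrees = list(range(1, 13))
train_mse, test_mse = [], []
for d in degrees:
    poly = Polynomial.fit(x_tr, y_tr, d)  # least-squares polynomial of degree d
    train_mse.append(np.mean((poly(x_tr) - y_tr) ** 2))
    test_mse.append(np.mean((poly(x_te) - y_te) ** 2))

best = degrees[int(np.argmin(test_mse))]
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}  train MSE {tr:.3f}  test MSE {te:.3f}")
print("degree with lowest test MSE:", best)
```

Because each polynomial family contains all lower-degree polynomials, the training error cannot rise as the degree grows. The test error has no such guarantee, which is why its minimum is the quantity worth locating.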


Bias and variance: the two components of generalization error

Each side of the U-curve corresponds to a distinct source of error. Bias is error arising from a model too restricted to represent the underlying function: the model is systematically wrong in the same direction. Variance is error arising from a model so flexible that its fitted form is highly sensitive to the particular training sample: the model captures random fluctuations rather than signal. Bias and variance cannot, in general, be minimized simultaneously; selecting a model's complexity is therefore an exercise in balancing the two. This is the bias–variance trade-off, the unifying principle underlying every complexity hyperparameter encountered in this course.
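
One way to see the two error sources separately is to refit the same model class on many independently drawn training samples and examine how the predictions behave. The sketch below does this under stated assumptions: a hypothetical sine-shaped true function, fixed training inputs with only the noise redrawn, and polynomial degrees 1, 4, and 9 chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical underlying function, chosen for illustration.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # fixed inputs; only the noise is redrawn
x_eval = np.linspace(0, 1, 50)    # points at which predictions are compared
noise_sd, n_datasets = 0.3, 200

def bias_variance(degree):
    # Fit the same model class to many independently drawn training samples.
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        y_train = true_fn(x_train) + rng.normal(0, noise_sd, x_train.size)
        preds[i] = Polynomial.fit(x_train, y_train, degree)(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_eval)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                  # sensitivity to the sample
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 4, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions, the straight line should show the largest squared bias and the smallest variance, and the degree-9 polynomial the reverse, with the intermediate degree sitting between the two.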

Cross-validation: a more reliable estimate of generalization

A single train/test split yields an estimate of generalization that is subject to substantial sampling variability. k-fold cross-validation mitigates this variability: the dataset is partitioned into \( k \) disjoint folds, the model is trained on \( k - 1 \) folds and evaluated on the remaining fold, and this procedure is repeated for each fold. The reported performance is the mean of the \( k \) per-fold scores. Every observation serves as test data exactly once, producing a more reliable estimate of generalization than any single split.
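
The procedure can be written out by hand to make each step visible. This is a minimal sketch rather than a reference implementation: the synthetic dataset and the logistic-regression model are assumptions, and scikit-learn's `KFold` is used only to generate the partition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data and model are assumptions for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)
times_tested = np.zeros(len(X), dtype=int)
scores = []
for train_idx, test_idx in folds.split(X):
    model.fit(X[train_idx], y[train_idx])                 # train on k - 1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the remaining fold
    times_tested[test_idx] += 1

scores = np.array(scores)
print("per-fold scores:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
print("each observation tested exactly once:", bool(np.all(times_tested == 1)))

# The library helper, given the same folds, should reproduce the manual loop.
library_scores = cross_val_score(model, X, y, cv=folds)
```

Passing the same `folds` object to `cross_val_score` makes the helper use the identical partition, so the two sets of scores can be compared directly.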

The equivalent procedure in scikit-learn — executable in the browser

from sklearn.model_selection import train_test_split, cross_val_score

# X, y, and model (any scikit-learn estimator) are assumed to be defined already
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)  # withhold 20% for evaluation
model.fit(X_tr, y_tr)
model.score(X_te, y_te)        # the valid estimate of generalization

scores = cross_val_score(model, X, y, cv=5)  # 5-fold: array of 5 held-out scores
scores.mean(), scores.std()                  # reported estimate and its spread

The score produced by .score() on held-out data — and never on the training data — is the valid estimate of generalization performance. Cross-validation produces a more stable estimate of the same quantity. Click Run it yourself to compare the optimistic training-set score against the unbiased held-out score.

When you run it, the program prints these numbers and draws the chart below them — the left panel shows the optimism gap between the inflated training score and the honest held-out score; the right panel shows the five cross-validation folds scattering around their mean (the red band is \( \text{mean} \pm 1 \) standard deviation), making clear why averaging over folds gives a steadier estimate than any single split:

accuracy on TRAINING data (optimistic estimate) = 0.869
accuracy on HELD-OUT data (unbiased estimate)   = 0.838
5-fold scores  = [0.862 0.912 0.812 0.925 0.775]
cross-val mean = 0.858 +/- 0.057
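
The claim that averaging over folds is steadier than a single split can be checked by repeating both procedures with different random shuffles and comparing how much each estimate moves. The sketch below uses a synthetic dataset and a logistic-regression model, both of which are assumptions; its numbers will not match the output shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data and model are assumptions; the lesson's dataset is not specified.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
model = LogisticRegression(max_iter=1000)

single_split, cv_means = [], []
for seed in range(30):
    # One 80/20 split: a single held-out score.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    single_split.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    # 5-fold cross-validation on the same data, reshuffled with the same seed.
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    cv_means.append(cross_val_score(model, X, y, cv=folds).mean())

print(f"single split: mean {np.mean(single_split):.3f}, std {np.std(single_split):.3f}")
print(f"5-fold mean : mean {np.mean(cv_means):.3f}, std {np.std(cv_means):.3f}")
```

Both procedures estimate the same quantity, so their averages should be close; the difference of interest is the spread across the 30 repetitions.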

AI anchor: the methodological foundation of every deployed model

Rigorous evaluation is the methodological foundation of all of machine learning, from a single decision tree to a large language model. Every benchmark result, every claim of state-of-the-art performance, and every assertion that a model is effective in practice rests on held-out evaluation data and an understanding of the bias–variance trade-off. Practitioners who omit this step deploy models that appear strong in demonstration but fail in production. This methodology is what transforms the algorithms of Modules 1–7 into trustworthy systems, and it provides the necessary foundation before Course 4 extends these principles to deep learning.

Project — synthesis across the course

The following questions integrate material from across the course: the modeling workflow, each model family, and the evaluation methodology by which all models are judged.


Course complete: what follows

You have now traversed the complete machine-learning workflow and the principal classical model families, each trained interactively in the browser, together with the evaluation methodology that establishes their reliability. Course 4, Deep Learning, extends the gradient-descent framework of Course 2 and the generalization principles introduced here to neural networks, the architectures that drive modern computer vision and natural language processing, including the systems introduced in Course 1.

Summary: a model's quality is measured by its performance on data it has not seen. Training error decreases monotonically with model complexity, but test error forms a U-shaped curve whose minimum corresponds to the optimal bias–variance balance — identified using a held-out evaluation set and confirmed by k-fold cross-validation.
