Module 3 — Classification: Drawing Boundaries
Where regression predicts a continuous quantity, classification predicts a categorical outcome drawn from a finite label set — spam or legitimate, benign or malignant, one of several object classes. The learning task is to estimate a decision boundary that partitions the feature space into regions associated with each class; a new observation is then assigned the label of the region into which it falls. In this module you construct the most transparent such classifier by hand.
The simplest classifier: ask the neighbours
k-nearest-neighbours (k-NN) is a non-parametric, instance-based method: it estimates no parameters and performs no optimisation during training, instead storing the labelled examples themselves. To classify a query point, it identifies the \( k \) training instances nearest to it and assigns the plurality class among them, that is, the label with the most votes (a strict majority when there are only two classes and no tie). The hyperparameter \( k \) controls the bias–variance trade-off: \( k = 1 \) reproduces the label of the single closest example (low bias, high variance), whereas a larger \( k \) averages over a wider neighbourhood, suppressing noise at the cost of a smoother, more biased boundary. Proximity is measured by a distance metric, conventionally the Euclidean norm \( \lVert \mathbf{x} - \mathbf{x}_i \rVert_2 \) introduced in Course 2, Module 4.
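The procedure just described can be sketched by hand in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name knn_predict and the six toy points are made up for the example, and ties are resolved in favour of whichever tied label appears first among the neighbours sorted by distance, which is only one of several reasonable conventions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by plurality vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances, closest first
    nearest = np.argsort(distances)[:k]
    # Count the neighbours' labels and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                    [5.0, 5.0], [4.5, 5.5], [5.5, 4.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=3)
label_near_corner = knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3)
print(label_near_origin, label_near_corner)  # prints: 0 1
```

Note that "training" here is nothing more than keeping X_train and y_train around; all of the work happens at query time.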
Click anywhere in the plot to position a query point. The widget marks its \( k \) nearest neighbours, reports their vote, and shades each region by its predicted class. Adjust \( k \) and observe how the prediction at a contested location changes as additional neighbours are enfranchised. (A query point is placed for you at the centre to begin.)
The decision boundary
Evaluating the classifier at every location in the feature space partitions it into class regions; the locus along which the predicted label changes is the decision boundary, visible above as the edge between the shaded regions. For small \( k \) the boundary is highly irregular, contorting around individual training points, which is a symptom of overfitting (high variance). As \( k \) increases, the boundary becomes smoother and more stable, exchanging variance for bias.
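One way to see this trade-off numerically is to compare accuracy on the training points with accuracy on held-out points for several values of \( k \). The sketch below assumes scikit-learn is available; the dataset parameters, the 70/30 split, and the particular values of \( k \) are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with overlapping clusters (illustrative parameters)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for k in (1, 5, 15, 45):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for this k
    results[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"k={k:>2}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With \( k = 1 \) every training point is its own nearest neighbour, so the training accuracy is perfect even though the classes overlap; the gap between that figure and the held-out accuracy is the practical signature of the irregular, high-variance boundary.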
From votes to probabilities
Many classifiers report not only a predicted label but also an estimated posterior probability of class membership. Logistic regression (Course 2, Module 8) forms a linear score \( z = \mathbf{w}\cdot\mathbf{x} + b \) and maps it into \( (0,1) \) through the logistic (sigmoid) function:
\[ \hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}} \]
A decision threshold (conventionally \( 0.5 \)) converts this probability into a label: the positive class is predicted when \( \hat{p} \) exceeds the threshold. Shifting the threshold trades false positives against false negatives — the operating-point choice formalised in Module 8.
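The score-to-probability-to-label chain can be traced with a few lines of NumPy. The weights, bias, and three input points below are hypothetical values picked so the arithmetic is easy to follow; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs, chosen only to illustrate the mechanics
w = np.array([1.5, -0.5])
b = -1.0
X = np.array([[2.0, 1.0],   # z = 3.0 - 0.5 - 1.0 =  1.5
              [1.0, 1.0],   # z = 1.5 - 0.5 - 1.0 =  0.0
              [0.0, 2.0]])  # z = 0.0 - 1.0 - 1.0 = -2.0

z = X @ w + b          # linear scores
p_hat = sigmoid(z)     # estimated probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    # Predict the positive class only when p_hat strictly exceeds the threshold
    labels = (p_hat > threshold).astype(int)
    print(f"threshold={threshold}: labels={labels}")
```

The middle point has \( \hat{p} = 0.5 \) exactly, so it is labelled positive at a threshold of 0.3 but negative at 0.5 and 0.7: lowering the threshold admits more positives (and potentially more false positives), while raising it does the opposite.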
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)             # just memorizes the labeled points
clf.predict([[2.0, 3.1]])             # majority vote of the 5 nearest → 0
clf.predict_proba([[2.0, 3.1]])       # [0.6 0.4] → 3 of the 5 are class 0

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs

# Two labeled clusters — the points you'd be classifying
X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.4, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)                         # just memorizes the labeled points

print("train accuracy =", round(clf.score(X, y), 3))
print("predict [2.0, 3.1] ->", clf.predict([[2.0, 3.1]])[0])
print("predict_proba [2.0,3.1] ->", np.round(clf.predict_proba([[2.0, 3.1]])[0], 2))

# Color every point of the plane by the majority vote of its 5 neighbors
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(5, 3.4))
plt.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=18, edgecolor="k", linewidth=0.3)
plt.title("k-NN decision boundary (k=5)")
plt.tight_layout()
plt.show()
The first snippet above is a preview; the full program follows it and runs live in your browser when you press Run it yourself (no installation needed). On the two-cluster dataset it prints train accuracy = 0.838, predict([2.0, 3.1]) → 0, and predict_proba → [0.6, 0.4], then plots the \( k = 5 \) decision boundary. The training accuracy is well below 100% because the classes overlap and \( k = 5 \) deliberately smooths over individual points. Reduce n_neighbors toward 1 and rerun to see the boundary become more jagged and the training accuracy rise, a sign of overfitting. Swap in LogisticRegression() for a smooth linear boundary with probabilities that are usually well calibrated.
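The suggested swap to logistic regression might look like the following sketch. It reuses the same make_blobs settings as the program above and assumes scikit-learn's default LogisticRegression configuration; the variable names log_reg, query, and manual_p1 are introduced here for illustration, and no particular output values are claimed.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Same synthetic two-cluster data as the k-NN program
X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.4, random_state=0)

log_reg = LogisticRegression()
log_reg.fit(X, y)   # unlike k-NN, fitting estimates parameters: weights and a bias

query = np.array([[2.0, 3.1]])
proba = log_reg.predict_proba(query)[0]
print("train accuracy =", round(log_reg.score(X, y), 3))
print("weights =", log_reg.coef_[0], "bias =", log_reg.intercept_[0])
print("predict_proba ->", np.round(proba, 2))

# For two classes, the positive-class probability is the sigmoid of w.x + b
z = query @ log_reg.coef_[0] + log_reg.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z[0]))
print("sigmoid(w.x + b) ->", round(manual_p1, 2))
```

Whereas the k-NN probabilities with \( k = 5 \) can only take the values 0, 0.2, 0.4, 0.6, 0.8, and 1, the logistic model's probability varies continuously with distance from its straight-line boundary.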
Classify the claims
A few questions on neighbours, boundaries, and thresholds. You will get a score.