
Module 3 — Classification: Drawing Boundaries

Supervised learning · hands-on · about 30 minutes.

Where regression predicts a continuous quantity, classification predicts a categorical outcome drawn from a finite label set — spam or legitimate, benign or malignant, one of several object classes. The learning task is to estimate a decision boundary that partitions the feature space into regions associated with each class; a new observation is then assigned the label of the region into which it falls. In this module you construct the most transparent such classifier by hand.

Reading the plot: features, points, and classes

Before the mechanics, it helps to know what the picture actually shows. Each item being classified is described by a short list of measurements, called features. Here there are exactly two features, and they become the two axes of the plot — horizontal and vertical. Every dot is one example: its position is fixed by its two measurements, and its color records the class it genuinely belongs to (class 0 or class 1). The classifier's task is to learn, from these labeled dots, which areas of the plane belong to which class, so that a new dot dropped anywhere can be labeled by where it lands.

A concrete analogy: suppose you are sorting email into spam and legitimate using two counts — say, the number of links and the number of ALL-CAPS words in each message. Every email becomes a dot positioned by those two numbers; spam tends to gather in one region and legitimate mail in another, forming two clouds; and the line separating the clouds is the rule that flags spam. The two axes in this lesson's demonstration are deliberately generic — they carry no particular units, because what matters is the shape of the data: two groups that partly overlap. The identical procedure applies unchanged when the two measurements are real-world quantities.

The simplest classifier: k-nearest-neighbors

k-nearest-neighbors (k-NN) is a non-parametric, instance-based method: it estimates no parameters and performs no optimization during training, instead storing the labeled examples themselves. To classify a query point, it identifies the \( k \) training instances nearest to it and assigns the plurality (majority) class among them. The hyperparameter \( k \) controls the bias–variance trade-off: \( k = 1 \) reproduces the label of the single closest example (low bias, high variance), whereas a larger \( k \) averages over a wider neighborhood, suppressing noise at the cost of a smoother, more biased boundary. Proximity is measured by a distance metric — conventionally the Euclidean norm \( \lVert \mathbf{x} - \mathbf{x}_i \rVert_2 \) introduced in Course 2, Module 4.
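
The procedure just described can be sketched in a few lines of plain Python. This is a minimal illustration rather than the widget's actual implementation: the function name `knn_predict` and the six training points are hypothetical, and ties are broken by whichever label appears first among the nearest neighbors, which is only one of several reasonable conventions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return (predicted_label, votes) for `query` using Euclidean k-NN."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # Plurality label; on a tie, the label seen first among the neighbors wins.
    predicted = votes.most_common(1)[0][0]
    return predicted, dict(votes)

# Hypothetical toy data: class 0 near the origin, class 1 near (4, 4).
train_points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0),
                (4.0, 4.0), (3.5, 4.5), (4.5, 3.5)]
train_labels = [0, 0, 0, 1, 1, 1]

label, votes = knn_predict(train_points, train_labels, query=(1.0, 1.0), k=3)
print(label, votes)
```

With \( k = 3 \) all three neighbors of the query belong to class 0. Raising \( k \) to 5 pulls in two class 1 points from the far cluster, but class 0 still wins the vote 3 to 2.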

Click anywhere in the plot to position a query point: the new, unlabeled observation you want the classifier to label. Every other point already carries a known class; the query point is the one you are asking about. You can also drag it around to watch the prediction update as it moves. The widget marks its \( k \) nearest neighbors, reports their vote, and shades each region by its predicted class. Adjust \( k \) and observe how the prediction at a contested location changes as more neighbors are allowed to vote. (A query point is placed for you at the center to begin.)


The decision boundary

Evaluating the classifier at every location in the feature space partitions it into class regions; the locus along which the predicted label changes is the decision boundary, visible above as the edge between the shaded regions. For small \( k \) the boundary is highly irregular, contorting around individual training points, which is a symptom of overfitting (high variance). As \( k \) increases, the boundary becomes smoother and more stable, exchanging variance for bias.
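
To make the idea of evaluating the classifier everywhere concrete, the sketch below predicts a label at each point of a coarse integer grid and prints the result as rows of 0s and 1s. The dataset is hypothetical: two clusters plus a single class 1 outlier placed inside the class 0 cluster. The helper names `knn_label` and `decision_grid` are illustrative, not part of any library.

```python
import math
from collections import Counter

def knn_label(points, labels, query, k):
    """Plurality label among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def decision_grid(points, labels, k, size=10):
    """Predicted label at every integer location (x, y) with 0 <= x, y < size."""
    return [[knn_label(points, labels, (x, y), k) for x in range(size)]
            for y in range(size)]

# Hypothetical data: class 0 cluster, class 1 cluster, and one class-1
# outlier at (2.2, 1.9) sitting inside the class 0 cluster.
points = [(1, 1), (1, 3), (3, 1), (3, 3), (2.2, 1.9),
          (6, 6), (6, 8), (8, 6), (8, 8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

for k in (1, 3):
    grid = decision_grid(points, labels, k)   # grid[y][x] is the label at (x, y)
    print(f"k = {k}")
    for row in reversed(grid):                # print with y increasing upward
        print("".join(str(label) for label in row))
```

With \( k = 1 \) the outlier carves a small island of class 1 around itself, so the grid cell at \( (2, 2) \) is labeled 1. With \( k = 3 \) the outlier is outvoted by its class 0 neighbors and the same cell is labeled 0.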

Draw your own dataset

The boundary above came from a fixed dataset. Here you build the data: pick a class, click on the plot to drop training points, and watch the decision boundary redraw itself after every point. Place two clean clusters and the boundary is simple; interleave the classes and watch it contort.


From votes to probabilities

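Many classifiers report not only a predicted label but also an estimated posterior probability of class membership. k-NN already does this in a crude way: the fraction of the \( k \) neighbors voting for a class serves as its probability estimate, so 3 votes out of 5 gives 0.6. Logistic regression (Course 2, Module 8) forms a linear score \( z = \mathbf{w}\cdot\mathbf{x} + b \) and maps it into \( (0,1) \) through the logistic (sigmoid) function: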

\[ \hat{p} \;=\; \sigma(z) \;=\; \frac{1}{1 + e^{-z}} \]

A decision threshold (conventionally \( 0.5 \)) converts this probability into a label: the positive class is predicted when \( \hat{p} \) exceeds the threshold. The threshold is a dial you can turn, and where you set it decides which kind of mistake the classifier makes more often.
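
The score, sigmoid, and threshold steps can be traced in a short sketch. The weights `w` and bias `b` below are hypothetical values chosen only for illustration; in practice they would be learned from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real score into (0, 1)."""
    # Note: math.exp overflows for very large negative z; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, w, b, threshold=0.5):
    """Return (p_hat, label) for feature vector x under weights w and bias b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b   # linear score w . x + b
    p_hat = sigmoid(z)
    return p_hat, int(p_hat > threshold)               # positive if p_hat exceeds threshold

# Hypothetical weights and bias, not fitted to any dataset.
w, b = [1.0, -0.5], -0.2
p_hat, label = predict_label([2.0, 3.1], w, b)
print(f"p_hat = {p_hat:.3f}, label = {label}")
```

Here \( z = 0.25 \), so \( \hat{p} \approx 0.562 \) and the default threshold of 0.5 yields the positive class. Passing `threshold=0.6` to the same call would flip the label to 0 without changing the probability.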

Whenever the truth is yes/no and the model also answers yes/no, two kinds of error are possible:

A false positive — the model says positive but the truth is negative (a false alarm). For example, a spam filter sending a real email to the spam folder.
A false negative — the model says negative but the truth is positive (a missed case). For example, a disease screen clearing a patient who actually has the disease.

A useful summary figure is recall, also called the true positive rate: of all the cases that are truly positive, the fraction the model successfully flags. Catching more true cases — that is, fewer false negatives — means higher recall.
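
As a small worked example, the two error types and recall can be counted directly from paired lists of true and predicted labels. The labels below are made up for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (true pos, false pos, false neg, true neg) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)   # false alarms
    fn = sum(t == 1 and p == 0 for t, p in pairs)   # missed cases
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def recall(y_true, y_pred):
    """Fraction of truly positive cases that the model flags."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Made-up labels: four true positives and four true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))   # (3, 1, 1, 3)
print(recall(y_true, y_pred))             # 0.75
```

Three of the four positive cases are caught, so recall is 0.75; the one missed case is the single false negative.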

Now turn the dial. Lowering the threshold makes the model call things positive more readily, so it catches more true cases — fewer false negatives and higher recall — but it also raises the number of false positives. Raising the threshold does the reverse: fewer false positives, but more false negatives. Moving the threshold cannot lower both kinds of error at once; it trades one against the other.
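
The trade-off can be seen by sweeping the threshold over a fixed set of predictions. The labels and probabilities below are invented for illustration; `errors_at` is a hypothetical helper.

```python
def errors_at(threshold, y_true, p_hat):
    """Return (false_positives, false_negatives) when predicting p > threshold."""
    fp = sum(t == 0 and p > threshold for t, p in zip(y_true, p_hat))
    fn = sum(t == 1 and p <= threshold for t, p in zip(y_true, p_hat))
    return fp, fn

# Invented true labels and predicted probabilities of the positive class.
y_true = [0,    0,    0,    0,    1,    0,    1,    1,    1,    1]
p_hat  = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = errors_at(threshold, y_true, p_hat)
    print(f"threshold = {threshold}: false positives = {fp}, false negatives = {fn}")
```

On this data the three thresholds give (3, 0), (1, 1), and (0, 3) false positives and false negatives respectively: each step that removes one kind of error adds the other.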

Which way to turn it is a design decision driven by the cost of each error. In medical screening, a missed case (a false negative) can be life-threatening, while a false positive merely triggers a follow-up test — so you deliberately lower the threshold to minimize false negatives, accepting more false positives in return. A spam filter often reasons the opposite way: hiding a genuine email (a false positive) is worse than letting some spam through, so it leans toward a higher threshold. The formal methodology for choosing and evaluating this operating point is developed in Module 8.
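
One informal way to express this reasoning is to attach a cost to each error type and pick the threshold with the lowest total. The sketch below assumes made-up costs, made-up predictions, and only three candidate thresholds; it illustrates the idea and is not a substitute for the evaluation methodology of Module 8.

```python
def total_cost(threshold, y_true, p_hat, cost_fp, cost_fn):
    """Total error cost when predicting positive for p > threshold."""
    fp = sum(t == 0 and p > threshold for t, p in zip(y_true, p_hat))
    fn = sum(t == 1 and p <= threshold for t, p in zip(y_true, p_hat))
    return cost_fp * fp + cost_fn * fn

def cheapest_threshold(candidates, y_true, p_hat, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost on this data."""
    return min(candidates,
               key=lambda t: total_cost(t, y_true, p_hat, cost_fp, cost_fn))

# Invented true labels and predicted probabilities of the positive class.
y_true = [0,    0,    0,    0,    1,    0,    1,    1,    1,    1]
p_hat  = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95]
candidates = (0.3, 0.5, 0.7)

# Hypothetical costs: a missed case is 10x worse in screening,
# a false alarm is 10x worse in spam filtering.
screening = cheapest_threshold(candidates, y_true, p_hat, cost_fp=1, cost_fn=10)
spam_filter = cheapest_threshold(candidates, y_true, p_hat, cost_fp=10, cost_fn=1)
print(screening, spam_filter)   # 0.3 0.7
```

The same predictions lead to opposite choices: the screening costs favor the low threshold, the spam-filter costs the high one.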

The equivalent procedure in scikit-learn — executable in the browser

from sklearn.neighbors import KNeighborsClassifier

# X_train (n_samples × 2 features) and y_train (0/1 labels) come from the full program.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)         # stores the labeled training points
clf.predict([[2.0, 3.1]])         # majority vote of the 5 nearest → 0
clf.predict_proba([[2.0, 3.1]])   # [0.6 0.4] → 3 of the 5 are class 0

This cell executes in the browser: the snippet above is a preview; click Run it yourself to execute the full program. On the two-cluster dataset it prints train accuracy = 0.838, predict([2.0, 3.1]) → 0, and predict_proba → [0.6, 0.4], then plots the \( k = 5 \) decision boundary. Training accuracy is below 100% because the classes overlap and \( k = 5 \) smooths the prediction over a neighborhood. Decreasing n_neighbors toward 1 and re-executing yields a more jagged boundary and a higher training accuracy; at \( k = 1 \) every training point is its own nearest neighbor, so training accuracy reaches 100% unless identical points carry different labels. That gain is a manifestation of overfitting, not evidence of a better model. Substituting LogisticRegression() produces a smooth linear boundary and probabilities that are usually reasonably well calibrated.

When you run it, the program prints (and draws the decision-boundary chart):

train accuracy         = 0.838
predict [2.0, 3.1]      -> 0
predict_proba [2.0,3.1] -> [0.6 0.4]
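
The effect of changing n_neighbors, and of substituting LogisticRegression(), can be tried on stand-in data. The sketch below assumes scikit-learn is installed and uses `make_blobs` to generate two overlapping clusters; this is not the lesson's dataset, so the accuracies will differ from the 0.838 reported above.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumed stand-in data: two overlapping 2-D clusters (not the lesson's dataset).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.2, random_state=0)

# Training accuracy for several neighborhood sizes.
train_acc = {}
for k in (1, 5, 25):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_acc[k] = clf.score(X, y)
    print(f"k = {k:2d}: train accuracy = {train_acc[k]:.3f}")

# Logistic regression: a linear boundary with sigmoid-based probabilities.
logreg = LogisticRegression().fit(X, y)
proba = logreg.predict_proba([[1.0, 1.0]])[0]
print(f"logistic regression train accuracy = {logreg.score(X, y):.3f}")
print("predict_proba at [1.0, 1.0] =", proba)
```

With \( k = 1 \) the training accuracy is 1.0 because each point is its own nearest neighbor. That figure says nothing about performance on new data, which is why held-out evaluation is needed.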

AI anchor: classification underlies most automated decision systems

Spam filtering, fraud detection, medical triage, content moderation, and the final layer of an image classifier are all instances of classification. The model partitions the feature space with a decision boundary and assigns each new observation to a class, typically with an associated probability. Selecting the decision threshold is a substantive product decision: a clinical screening tool is typically calibrated to minimize false negatives, even at the cost of additional false positives. That trade-off can be reasoned about rigorously only using the evaluation methodology developed in Module 8.

Check your understanding

Answer a short set of questions on neighbors, decision boundaries, and probability thresholds.


Why this matters next

k-NN classifies by distance in feature space; the next classifier, naive Bayes (Module 4), instead combines class-conditional probabilities using Bayes' rule from Course 2, producing a fully probabilistic spam classifier. Both methods induce decision boundaries; they differ in where the boundary comes from: k-NN derives it from distances to stored examples, naive Bayes from estimated class probabilities. The trustworthiness of any classifier is established only through the rigorous evaluation methodology of Module 8.

Summary: classification predicts a categorical label by inducing a decision boundary in the feature space. k-nearest-neighbors does so via the plurality label of the \( k \) closest labeled points; logistic regression outputs a posterior probability \( \sigma(\mathbf{w}\cdot\mathbf{x} + b) \) that is converted to a class by a decision threshold.