
Module 6 — Clustering: k-means

Unsupervised learning · hands-on · about 30 minutes.

Every model considered so far has required labels — each training example was annotated as "spam" or "not spam," "class A" or "class B." This module removes that requirement. Unsupervised learning identifies structure in unlabeled data. The canonical example is k-means clustering: given a set of observations and an integer \( k \), the algorithm partitions the observations into \( k \) groups without supervision.

The algorithm: iterative refinement of centroids

k-means searches for \( k \) centroids — the centers of \( k \) clusters — by alternating two steps until the centroids cease to move:

Assignment step: assign each observation to its nearest centroid.
Update step: recompute each centroid as the mean of the observations assigned to it.

These two steps constitute the entire algorithm. Neither step can increase the objective function, so the objective is non-increasing from one iteration to the next; and because there are only finitely many ways to partition the observations, the procedure terminates. The objective minimized is the inertia (also called the within-cluster sum of squares): the sum of squared distances from each observation to its assigned centroid.

\[ \text{inertia} \;=\; \sum_{i} \lVert x_i - \mu_{c(i)} \rVert^2 \]

where \( \mu_{c(i)} \) denotes the centroid of the cluster to which observation \( x_i \) is assigned. Lower inertia corresponds to more compact clusters; inertia does not by itself measure how well separated the clusters are from one another.
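The two steps and the inertia formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the playground's implementation: the function name `kmeans`, the choice to initialize centroids at randomly chosen observations, the handling of empty clusters, and the synthetic three-blob dataset are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch; returns centroids, labels, and the inertia per iteration."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centroids at k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # Assignment step: squared distance from every observation to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Inertia for the current centroids and the assignment just computed.
        history.append(d2[np.arange(len(X)), labels].sum())
        # Update step: move each centroid to the mean of its assigned observations.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids have stopped moving
        centroids = new_centroids
    return centroids, labels, history

# Hypothetical data: three Gaussian blobs of 100 points each.
rng = np.random.default_rng(42)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

centroids, labels, history = kmeans(X, k=3)
print("inertia per iteration:", np.round(history, 1))
```

The printed sequence should never increase from one iteration to the next, which is the non-increasing behavior described above.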

Run the algorithm

Set \( k \), then click Step to execute one assignment-update iteration at a time, or Run to iterate to convergence. The centroids (large rings) are initialized at random positions and progressively migrate toward the centers of the underlying clusters. Reset re-initializes the centroids. The final partition can depend on the initialization, because k-means converges to a local optimum of the inertia, which is not necessarily the global one.
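The dependence on initialization can be probed outside the playground with scikit-learn. The sketch below is an illustration under assumptions: the dataset is synthetic (`make_blobs` with arbitrary settings), and `init="random"` with `n_init=1` is chosen so that each fit uses exactly one random initialization. Whether the resulting inertias actually differ depends on the data and the seeds.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-in data, not the playground's dataset.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=0)

inertias = []
for seed in range(5):
    # One random initialization per fit, so each seed plays the role of one "Reset".
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")

best_seed = int(np.argmin(inertias))
print("lowest inertia came from seed", best_seed)
```

A common mitigation is to run several initializations and keep the fit with the lowest inertia, which is what the `n_init` parameter of `KMeans` does when set above 1.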


Selecting k: the elbow method

k-means does not determine the number of clusters; \( k \) must be specified by the practitioner. Choosing \( k \) too small merges distinct clusters, while choosing \( k \) too large fragments coherent clusters into spurious subgroups. The elbow method is a standard heuristic: plot inertia as a function of \( k \). The minimum achievable inertia is non-increasing in \( k \), but the marginal reduction typically diminishes sharply once \( k \) exceeds the number of clusters present in the data. The value of \( k \) at this bend (the "elbow") is a reasonable choice, although on data without clear cluster structure the bend can be ambiguous.
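The elbow computation can be sketched as a loop over \( k \). The example below assumes a synthetic dataset with three well-separated blobs (not the lesson's data) and uses `n_init=10` so that each fit keeps the best of ten initializations; it prints the marginal reduction in inertia in place of a chart.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three Gaussian blobs of 100 points each.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the inertia of each fit.
inertia_by_k = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia_by_k[k] = km.inertia_

# Marginal reduction in inertia for each additional cluster.
for k in range(1, 6):
    drop = inertia_by_k[k] - inertia_by_k[k + 1]
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1f}")
```

On data like this, the reductions up to \( k = 3 \) should dominate the reductions beyond it, which is the bend the elbow method looks for.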

Elbow finder

The chart below runs k-means to convergence for \( k = 1 \) through \( 6 \) on the same dataset as the playground above and plots the resulting inertia. Note the sharp bend: beyond it, additional clusters reduce inertia only marginally.


The equivalent procedure in scikit-learn — executable in the browser

from sklearn.cluster import KMeans

# X is assumed to be an array of shape (n_samples, n_features)
km = KMeans(n_clusters=3)   # the k you set on the slider
km.fit(X)                   # no labels y: that's what "unsupervised" means
km.labels_                  # which cluster each point landed in
km.inertia_                 # the within-cluster sum of squares it minimizes

Note that fit(X) is called without labels. (The scikit-learn signature does include an optional y parameter for API consistency, but KMeans ignores it.) There are no target values for the algorithm to learn; the inferred structure derives entirely from the geometry of the feature distribution. Click Run it yourself, then vary n_clusters and observe the effect on the resulting inertia and centroid positions.

When you run it, the program prints (and draws the clustered scatter chart):

inertia (within-cluster spread) = 536.4
points per cluster            = [ 93 105 102]
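For readers running this outside the browser, a self-contained version might look like the sketch below. The lesson's dataset is not shown here, so `make_blobs` is used as a stand-in and the printed numbers will differ from those above. The manual computation at the end is included only to show that `inertia_` agrees with the inertia formula from earlier in the module.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-in data; the lesson's own dataset is not shown.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Number of observations assigned to each cluster.
counts = np.bincount(km.labels_, minlength=3)

# Inertia by hand: squared distance from each point to its assigned centroid.
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

print(f"inertia (within-cluster spread) = {km.inertia_:.1f}")
print(f"inertia computed by hand        = {manual_inertia:.1f}")
print(f"points per cluster              = {counts}")
print("centroids:")
print(km.cluster_centers_)
```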

AI anchor: discovering structure in unlabeled data

Clustering is one of the main tools for analyzing data that has not been manually annotated. It is widely used for customer segmentation in marketing, grouping similar documents or images, palette reduction in image compression, anomaly detection (an observation far from every centroid is atypical relative to the fitted clusters), and initial exploratory analysis in fields such as genomics. It is a common early step in exploratory data analysis, used to characterize the latent group structure of a dataset before any supervised, labeled model is constructed.
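As one illustration of the anomaly-detection use mentioned above, the sketch below flags new observations whose distance to the nearest centroid is unusually large. The data are synthetic, and the cutoff (the 99th percentile of training distances) is a hypothetical choice made for the example, not a recommended rule.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical reference data: three Gaussian blobs of 100 points each.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() returns the distance from each observation to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
train_dist = km.transform(X).min(axis=1)
threshold = np.percentile(train_dist, 99)  # hypothetical cutoff

# Two new observations: one at a blob center, one far from every blob.
X_new = np.array([[0.0, 5.0], [30.0, 30.0]])
new_dist = km.transform(X_new).min(axis=1)
is_atypical = new_dist > threshold
print(is_atypical)
```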

Check your understanding

Answer a short set of questions on the assignment and update steps, inertia, and the selection of \( k \).


Why this matters next

Clustering groups observations by proximity in feature space. The next method, dimensionality reduction (Module 7), instead identifies the directions in feature space that capture the greatest variance, projecting a high-dimensional feature space onto a low-dimensional subspace while preserving its essential structure. Together, clustering and dimensionality reduction are two principal forms of unsupervised learning: the former identifies group structure; the latter identifies the latent axes along which the data vary.

Summary: k-means is an unsupervised algorithm that partitions unlabeled data into \( k \) clusters by alternating an assignment step (each observation to its nearest centroid) and an update step (each centroid to the mean of its assigned observations), monotonically reducing the inertia \( \sum \lVert x_i - \mu_{c(i)} \rVert^2 \) until convergence to a local optimum.