Module 6 — Clustering: k-means
Every model so far needed labels — someone had to tag each example "spam" or "not," "class A" or "B." Now we cut the labels away. Unsupervised learning finds structure in data nobody has tagged. The most famous example is k-means clustering: hand it a cloud of points and a number \( k \), and it discovers \( k \) groups on its own.
The idea: pick centers, then settle
k-means looks for \( k \) centroids — the centers of \( k \) groups — by repeating two dead-simple steps until nothing moves:
- Assign: color each point by its nearest centroid.
- Update: move each centroid to the average of the points that chose it.
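That's the whole algorithm: assign, update, assign, update. Neither step can increase the total spread, and there are only finitely many ways to split the points into groups, so the loop always settles, though sometimes at a local optimum rather than the best possible grouping. The quantity it drives down is the inertia: the sum of squared distances from every point to its centroid,
\[ J = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2 \]
where the sum runs over all \( n \) points and \( \mu_{c(i)} \) is the centroid of the cluster point \( x_i \) was assigned to. Lower inertia means tighter, cleaner groups.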
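To make the two steps concrete, here is a minimal from-scratch sketch in NumPy. It is an illustration, not how any particular library implements k-means: the function name `kmeans`, the toy data, the choice to start from randomly picked data points, and the rule that an empty cluster keeps its old centroid are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means sketch: returns (centroids, labels, inertia_history)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(n_iters):
        # Assign: squared distance from every point to every centroid,
        # then pick the nearest centroid for each point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Inertia of this assignment: sum of squared distances to the chosen centroid.
        history.append(d2[np.arange(len(X)), labels].sum())
        # Update: move each centroid to the mean of the points that chose it.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # nothing moved: converged
        centroids = new_centroids
    return centroids, labels, history

# Toy data (made up for this sketch): three loose 2-D groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

centroids, labels, history = kmeans(X, k=3)
print("inertia per round:", np.round(history, 1))
```

The printed inertia values should never go up from one round to the next, which is the "it always settles" property in action.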
Run it yourself
Set \( k \), then press Step to watch one assign-and-update round at a time, or Run to let it converge. The centroids (the big rings) start in random spots and walk toward the heart of each cluster. Reset re-seeds them — notice the final groups can change depending on where the centroids started.
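The same sensitivity to starting positions can be checked in code. The sketch below assumes scikit-learn is installed and uses made-up blob data; `init="random"` with `n_init=1` means each fit is a single run from one random start, roughly like pressing Reset once and then Run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data chosen for illustration: four overlapping-ish groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=1)

inertias = []
for seed in range(5):
    # One run from one random starting position per seed.
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```

If the printed values differ, some starts settled into a worse local optimum than others. Setting `n_init` to a larger number makes scikit-learn repeat the run from several starts and keep the result with the lowest inertia.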
Choosing k: the elbow
k-means can't tell you how many groups exist; you pick \( k \). Too few and you merge distinct groups; too many and you split one group into meaningless shards. A common trick is the elbow: plot inertia against \( k \). Inertia keeps falling as \( k \) grows (it reaches zero when every point is its own cluster), so simply minimizing it is no guide. Instead, look for where the drop slows sharply, just past the "true" number of clusters: that bend is a good \( k \).
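A rough way to look for the elbow is to fit k-means for a range of \( k \) values and compare the inertias. The sketch below assumes scikit-learn and uses synthetic blobs with three groups; the range 1 to 8 is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (the model never sees the labels).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

ks = list(range(1, 9))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Print inertia for each k and how much it fell compared with k - 1.
previous = None
for k, inertia in zip(ks, inertias):
    drop = "" if previous is None else f"  drop={previous - inertia:.1f}"
    print(f"k={k}  inertia={inertia:.1f}{drop}")
    previous = inertia
```

On data like this, the drops are typically large up to the number of real groups and small afterwards. Plotting `ks` against `inertias` shows the same thing as a visible bend.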
```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)   # the k you set on the slider
km.fit(X)                   # no labels y: that's what "unsupervised" means
km.labels_                  # which cluster each point landed in
km.inertia_                 # the spread it minimized
```

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three natural groups, but we hand the model NO labels
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)   # no y: unsupervised

print("inertia (spread it minimized) =", round(km.inertia_, 1))
print("points per cluster =", np.bincount(km.labels_))

plt.figure(figsize=(5, 3.4))
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="viridis", s=15)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="X", s=160, edgecolor="k", label="centroids")
plt.legend()
plt.title("k-means, k=3")
plt.tight_layout()
plt.show()
```
Notice fit(X) takes no y. There are no right answers to learn from — the structure comes entirely from the data’s shape. Hit Run it yourself, then change n_clusters and watch the inertia and the centroids move.
Group the claims
A few questions on assignment, inertia, and choosing k. You will get a score.