Module 7 — Dimensionality Reduction: PCA
Real data often has hundreds of features — far too many to see or reason about. Dimensionality reduction squashes those many features down to a handful while keeping as much of the data’s shape as possible. The classic method is Principal Component Analysis (PCA): it finds the directions the data actually varies in, and lets you keep only the strongest few.
The key idea: variance is information
Picture a stretched, tilted oval of points. Most of the spread runs along its long axis; very little runs across the short one. PCA finds those axes — the principal components — ordered by how much the data varies along each:
- PC1 — the single direction of greatest variance (the oval’s long axis).
- PC2 — the next direction, at a right angle to PC1, with the most remaining variance.
If almost all the spread lives along PC1, you can throw PC2 away and describe each point by a single number — its position along PC1 — losing almost nothing. That’s reduction: two features become one.
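One common way to find those axes is to take the eigenvectors of the data's covariance matrix. The sketch below assumes that approach and uses a made-up tilted cloud (the data, seed, and variable names are illustrative, not part of the lesson's dataset). It prints the two directions, the share of variance along each, and confirms they sit at a right angle.

```python
import numpy as np

# Hypothetical toy data: a stretched, tilted cloud of 500 points.
rng = np.random.default_rng(0)
t = rng.normal(0.0, 1.0, size=500)
X = np.c_[2.0 * t, 1.0 * t] + rng.normal(0.0, 0.3, size=(500, 2))

# Centre the data, then take the covariance matrix of the two features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]
share = eigvals / eigvals.sum()  # fraction of total variance along each PC

print("PC1 direction:", np.round(pc1, 3))
print("PC2 direction:", np.round(pc2, 3))
print("variance share:", np.round(share, 3))
print("PC1 . PC2 =", round(float(pc1 @ pc2), 6))  # ~0 means perpendicular
```

The sign of each direction is arbitrary: an axis pointing "up and right" describes the same line as one pointing "down and left".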
Projecting onto a line
"Reducing to one dimension" means projecting every point onto a single line through the centre of the data: each point slides to the nearest spot on the line, moving perpendicular to it. The variance of those projected positions along the line is the retained variance. Project onto PC1 and you keep the most possible; project onto a line in any other direction and you keep less.
Spin the line, watch the variance
Rotate the projection line below. The bar shows what fraction of the data’s total variance survives the projection. Find the angle that maxes it out — you’ve just found PC1 by hand. Then press Snap to PC1 to see the exact answer PCA computes.
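The same hand search can be imitated in code by trying many angles and keeping the best one. This is a brute-force sketch on made-up data, assuming a 0.1 degree step. It is not how PCA computes its answer, which is why the second half compares the result against the top eigenvector of the covariance matrix.

```python
import numpy as np

# Hypothetical toy data: a cloud stretched roughly along the (2, 1) direction.
rng = np.random.default_rng(0)
t = rng.normal(0.0, 1.0, size=500)
X = np.c_[2.0 * t, 1.0 * t] + rng.normal(0.0, 0.3, size=(500, 2))
Xc = X - X.mean(axis=0)

# Try every line from 0 up to (but not including) 180 degrees, in 0.1 degree steps.
angles = np.arange(0.0, 180.0, 0.1)
U = np.c_[np.cos(np.deg2rad(angles)), np.sin(np.deg2rad(angles))]  # one unit vector per row
kept = (Xc @ U.T).var(axis=0) / Xc.var(axis=0).sum()               # retained fraction per angle
best_angle = angles[np.argmax(kept)]

# "Snap to PC1": the top eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC1
pc1_angle = np.rad2deg(np.arctan2(pc1[1], pc1[0])) % 180.0

print(f"best angle by sweeping : {best_angle:.1f} deg, kept {kept.max():.3f}")
print(f"PC1 angle from eigh    : {pc1_angle:.1f} deg, kept {eigvals[-1] / eigvals.sum():.3f}")
```

The two angles should agree to within the step size of the sweep.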
from sklearn.decomposition import PCA

pca = PCA(n_components=1)        # keep just the strongest direction
Z = pca.fit_transform(X)         # each point → its position on PC1
pca.explained_variance_ratio_    # fraction of variance kept, e.g. [0.92]

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# A correlated 2-D cloud: most of its variance lies along one diagonal
rng = np.random.default_rng(0)
t = rng.normal(0, 1, size=200)
X = np.c_[t * 2.0, t * 1.0] + rng.normal(0, 0.3, size=(200, 2))

pca = PCA(n_components=1)
Z = pca.fit_transform(X)              # each point → its position on PC1
recon = pca.inverse_transform(Z)      # put it back in 2-D, on the line

kept = pca.explained_variance_ratio_[0]
print("explained_variance_ratio_ =", np.round(pca.explained_variance_ratio_, 3))
print(f"kept {kept*100:.1f}% of the variance using 1 of 2 dimensions")

plt.figure(figsize=(5, 3.4))
plt.scatter(X[:, 0], X[:, 1], s=12, alpha=0.4, label="original 2-D")
plt.scatter(recon[:, 0], recon[:, 1], s=12, color="crimson", label="projected to PC1")
plt.axis("equal")
plt.legend()
plt.title("PCA: 2-D → 1-D")
plt.tight_layout()
plt.show()
explained_variance_ratio_ is exactly the "retained variance" bar from the activity above: how much of the shape you kept after dropping a dimension. Hit Run it yourself and see how little is lost when the data really lives along one direction.
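To check that claim, you can compute the retained fraction by hand and compare it with what scikit-learn reports. This sketch reuses the same toy cloud as the lesson code. The name `by_hand` is illustrative, and `ddof=1` is used on both variances so they share the same sample-variance convention.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same correlated 2-D cloud as in the lesson code.
rng = np.random.default_rng(0)
t = rng.normal(0, 1, size=200)
X = np.c_[t * 2.0, t * 1.0] + rng.normal(0, 0.3, size=(200, 2))

pca = PCA(n_components=1)
Z = pca.fit_transform(X)  # positions along PC1, shape (200, 1)

# Variance of the 1-D positions, divided by the total variance of X.
by_hand = Z[:, 0].var(ddof=1) / X.var(axis=0, ddof=1).sum()
reported = pca.explained_variance_ratio_[0]

print(f"by hand  : {by_hand:.6f}")
print(f"reported : {reported:.6f}")
```

The two numbers should match up to floating-point rounding.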
Reduce the claims
A few questions on components, variance, and projection. You will get a score.