
Module 7 — Dimensionality Reduction: PCA

Unsupervised learning · hands-on · about 30 minutes.

Real-world datasets frequently contain hundreds of features — far more than can be directly visualized or interpreted. Dimensionality reduction projects high-dimensional data onto a small number of derived features while preserving as much of the data's structure as possible. The classical technique is Principal Component Analysis (PCA), which identifies the orthogonal directions of greatest variance in the data and retains the leading few.

The central principle: variance as information

Consider an elongated, rotated cloud of points. The variance along its principal axis is large; the variance along the orthogonal axis is small. PCA identifies these axes — the principal components — ordered by the variance they account for:

PC1 — the direction of maximum variance (the principal axis of the cloud).
PC2 — the direction orthogonal to PC1 that captures the maximum remaining variance.

The cloud is long in one direction and thin in the other. PC1 points along the spread; PC2 is what little is left. Keep PC1, drop PC2.

When the great majority of the variance is concentrated along PC1, PC2 can be discarded and each observation represented by a single coordinate — its projection onto PC1 — with minimal loss of information. This is the essence of dimensionality reduction: two features are compressed into one with negligible loss of structure.
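The two axes described above can be computed directly. The following is a minimal sketch, assuming a synthetic rotated cloud (the data, seed, and variable names are illustrative and not part of the lesson's dataset): it takes the eigenvectors of the sample covariance matrix as PC1 and PC2 and reports the share of variance along each.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are illustrative only

# Hypothetical elongated cloud: wide along x, thin along y, then rotated by 30 degrees.
raw = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = raw @ R.T

# Centre the data and form the 2x2 sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# so reorder them to put the largest-variance direction first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]
variance_share = eigvals / eigvals.sum()

print("PC1 direction:", pc1)
print("PC2 direction:", pc2)
print("share of variance per component:", variance_share)
```

With a cloud this elongated, nearly all of the variance should fall on PC1, which is the situation in which dropping PC2 costs very little.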

Orthogonal projection onto a line

Reducing the data to one dimension is accomplished by orthogonally projecting each observation onto a single line: the point is dropped perpendicularly onto the line and its position along the line is recorded. The fraction of the original variance preserved by the projection is the retained variance. Projection onto PC1 maximizes the retained variance; projection onto any other line preserves no more, and strictly less whenever the cloud has a single longest axis.

Drop each point straight down onto the line (dashed = perpendicular). Its landing spot is its single new number. Two coordinates become one; the closer the orange dots stay to their original points, the higher the retained variance.
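The projection step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the helper name retained_variance and the four-point dataset are hypothetical, chosen so the result is easy to check by eye (the points lie exactly on the line y = x).

```python
import numpy as np

def retained_variance(X, theta):
    """Fraction of total variance kept when X is projected onto the line at angle theta (radians)."""
    Xc = X - X.mean(axis=0)                        # centre the cloud
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector along the line
    t = Xc @ u                                     # each point's position along the line
    total = Xc.var(axis=0).sum()                   # total variance over both features
    return t.var() / total

# Hypothetical data: four points lying exactly on the line y = x.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])

print(round(retained_variance(X, np.deg2rad(45)), 3))   # line along the points: about 1.0
print(round(retained_variance(X, np.deg2rad(0)), 3))    # horizontal line: about 0.5
print(round(retained_variance(X, np.deg2rad(135)), 3))  # perpendicular line: about 0.0
```

The same function works for any two-feature array; only the angle of the line changes the answer.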

Locate the principal axis interactively

Rotate the projection line using the slider below. The bar reports the fraction of the total variance preserved by the projection. Identify the angle at which this quantity is maximized — that direction is, by definition, PC1. Click Snap to PC1 to display the value computed analytically by PCA from the data's covariance matrix.
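For readers without the interactive slider, the search can be reproduced offline. The sketch below is an assumption-laden stand-in for the activity, not its actual code: it sweeps the line through 180 one-degree positions on a synthetic cloud, records the retained variance at each, and compares the best angle with the one obtained from the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; the data are illustrative only

# Hypothetical cloud whose long axis sits at roughly 30 degrees.
raw = rng.normal(size=(400, 2)) * np.array([2.0, 0.4])
a = np.deg2rad(30)
R = np.array([[np.cos(a), -np.sin(a)],
              [np.sin(a),  np.cos(a)]])
X = raw @ R.T
Xc = X - X.mean(axis=0)
total = Xc.var(axis=0).sum()

# "Slider": try every whole-degree angle from 0 to 179.
angles_deg = np.arange(0, 180)
radians = np.deg2rad(angles_deg)
U = np.column_stack([np.cos(radians), np.sin(radians)])   # one unit vector per angle
retained = (Xc @ U.T).var(axis=0) / total                 # retained variance per angle
best_deg = angles_deg[np.argmax(retained)]

# "Snap to PC1": leading eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]                                      # eigh sorts ascending, so take the last column
pc1_deg = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180    # a line's angle is defined modulo 180

print("best angle from the sweep:", best_deg)
print("angle of PC1 from the covariance matrix:", pc1_deg)
print("maximum retained variance:", retained.max())
```

The sweep can only land on whole degrees, so the two angles should agree to within about a degree rather than exactly.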


The equivalent procedure in scikit-learn — executable in the browser

from sklearn.decomposition import PCA

# X is assumed to be an array of shape (n_samples, n_features), e.g. the 2-D cloud above
pca = PCA(n_components=1)               # keep only the strongest direction
Z = pca.fit_transform(X)                # each point -> its position on PC1
print(pca.explained_variance_ratio_)    # fraction of variance kept, e.g. [0.92]

explained_variance_ratio_ corresponds precisely to the "retained variance" indicator in the activity above: the fraction of the data's variance preserved after the reduction. Click Run it yourself to observe how little variance is lost when the data are approximately low-rank, i.e. when most of the variance is concentrated along a single direction.

When you run it, the program prints (and draws a chart):

explained_variance_ratio_ = [0.981]
kept 98.1% of the variance using 1 of 2 dimensions
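The link between explained_variance_ratio_ and the retained variance can be checked by hand. The following sketch assumes a synthetic pair of strongly correlated features (not the dataset behind the output above, so the numbers will differ): it compares scikit-learn's reported ratio with the variance of the projected coordinates divided by the total variance, and measures what is lost when the 1-D coordinates are mapped back to 2-D.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)  # fixed seed; the data are illustrative only

# Hypothetical data: the second feature is mostly a scaled copy of the first.
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.1 * rng.normal(size=300)])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)                       # shape (300, 1): position on PC1

# Ratio reported by scikit-learn versus the same quantity computed by hand.
ratio_sklearn = pca.explained_variance_ratio_[0]
ratio_manual = Z[:, 0].var(ddof=1) / X.var(axis=0, ddof=1).sum()

# Map the 1-D coordinates back to 2-D and measure the squared error
# as a fraction of the total squared deviation from the mean.
X_back = pca.inverse_transform(Z)
lost = ((X - X_back) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()

print("explained_variance_ratio_:", ratio_sklearn)
print("variance of Z / total variance:", ratio_manual)
print("fraction lost on reconstruction:", lost)
```

The retained and lost fractions should sum to one, which is why a high ratio means the single coordinate reconstructs the original points almost exactly.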

AI anchor: managing high-dimensional data

PCA and related dimensionality-reduction methods are widely used wherever high-dimensional data must be analyzed: visualizing a 50-feature dataset in two dimensions, compressing images, reducing training time by lowering the input dimensionality of downstream models, and removing redundant correlated features prior to training. Modern dense embeddings (the vector representations underlying search and recommendation systems) are reduced and compared using closely related techniques. Any two-dimensional visualization of high-dimensional data is, almost invariably, the product of dimensionality reduction.
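As one concrete instance of the uses listed above, here is a minimal sketch of reducing a 50-feature dataset to two plotting coordinates. Everything about the data is an assumption made for illustration: 200 samples are generated from two hidden factors plus a little noise, and NumPy's SVD of the centred data stands in for a PCA routine.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed; the data are illustrative only

# Hypothetical dataset: 200 samples, 50 correlated features driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

# PCA via the singular value decomposition of the centred data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                 # the two leading principal directions, shape (2, 50)
coords_2d = Xc @ components.T       # 2-D coordinates suitable for a scatter plot
explained = S**2 / (S**2).sum()     # share of variance per component

print("2-D coordinates shape:", coords_2d.shape)
print("variance kept by the first two components:", explained[:2].sum())
```

Because the 50 features here are redundant combinations of two factors, two components keep almost all of the variance; real datasets are rarely this clean, so the kept fraction is worth checking before trusting a 2-D picture.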

Check your understanding

Answer a short set of questions on principal components, variance, and projection.


Why this matters next

You have now encountered the principal model families: regression, classification, naive Bayes, decision trees, clustering, and PCA. The final module addresses the question that determines whether any of these models is trustworthy in practice: does the model generalize to data it has not seen? Module 8 develops the methodology of honest evaluation: overfitting, train/test splits, cross-validation, and the bias–variance trade-off that underlies every complexity hyperparameter in this course.

Summary: PCA performs dimensionality reduction by identifying the orthogonal directions of greatest variance (the principal components) and projecting the data onto the leading few. Among all linear projections onto the same number of dimensions, this one retains the maximum possible variance.