
Module 5 — Decision Trees

Supervised learning · hands-on · about 30 minutes.

A decision tree is among the most interpretable models in machine learning: it classifies by applying a hierarchical sequence of binary tests on the input features — for example, "Is income > $50,000? If yes, is age < 30?" — producing a tree-structured classifier. In this module you will grow a decision tree one level at a time, observe how it partitions the feature space into axis-aligned regions, and identify the depth at which it begins to overfit.

Splits: recursive binary partitioning

At each internal node the algorithm selects a single feature-and-threshold pair — for example, "is $ x_1 < 3.2 $?" — that best separates the classes by some splitting criterion. The data are then partitioned into two subsets and the procedure recurses within each subset, generating successive splits until a stopping condition is reached at a leaf node, which is assigned a class label.
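As a rough illustration of how a finished tree is applied, the sketch below hand-builds a two-test tree as nested nodes and walks an input from the root to a leaf. The `Node` class, the second test, and the leaf labels `"A"` and `"B"` are invented for this example; nothing here is learned from data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a hand-built tree (hypothetical structure for illustration)."""
    feature: Optional[int] = None      # index of the feature tested at an internal node
    threshold: Optional[float] = None  # the test is: x[feature] < threshold
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    label: Optional[str] = None        # class label, set only on leaf nodes


def predict(node: Node, x: Sequence[float]) -> str:
    """Apply binary tests from the root downward until a leaf is reached."""
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label


# Feature index 0 stands for x_1 and index 1 for x_2.
tree = Node(
    feature=0, threshold=3.2,        # root test from the text: is x_1 < 3.2?
    left=Node(label="A"),
    right=Node(
        feature=1, threshold=1.5,    # second test (made up): is x_2 < 1.5?
        left=Node(label="B"),
        right=Node(label="A"),
    ),
)

print(predict(tree, [2.0, 9.9]))  # x_1 < 3.2, so the left leaf is reached: A
print(predict(tree, [4.0, 1.0]))  # x_1 >= 3.2 and x_2 < 1.5: B
print(predict(tree, [4.0, 2.0]))  # x_1 >= 3.2 and x_2 >= 1.5: A
```

Each prediction costs at most one comparison per level, which is part of why a tree's decisions are easy to trace by hand.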

Selecting the optimal split: impurity criteria

A split is good when each resulting subset is as pure as possible — ideally every observation in it belongs to a single class, so the leaf can predict that class with confidence. To choose between candidate splits we first need a number that measures how mixed a subset is. The standard choice is the Gini impurity:

\[ \text{Gini} \;=\; 1 - \sum_{c} p_c^2 \]

where $ p_c $ is the fraction of the subset belonging to class $ c $. One way to read it: the probability you would mislabel a randomly drawn item if you guessed classes in proportion to how often they appear. A few values fix the scale (two classes):

All one class ($ p_1 = 1,\ p_2 = 0 $): $ 1 - (1^2 + 0^2) = 0 $, perfectly pure.
An 80/20 mix: $ 1 - (0.8^2 + 0.2^2) = 0.32 $, mostly pure.
A 50/50 mix: $ 1 - (0.5^2 + 0.5^2) = 0.5 $, maximally impure (a coin flip).
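The three values above can be checked with a few lines of Python. This is a minimal sketch; the function name `gini` and the toy label lists are chosen for illustration only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(round(gini(["a"] * 10), 2))             # all one class
print(round(gini(["a"] * 8 + ["b"] * 2), 2))  # 80/20 mix
print(round(gini(["a"] * 5 + ["b"] * 5), 2))  # 50/50 mix
```

Run as written, this prints 0.0, 0.32, and 0.5, matching the hand calculations.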

A split sends the node's data into a left and a right child. To score it, compare the parent's impurity against its children's — weighting each child by the fraction of points that land in it, since a child holding most of the data should count for more:

\[ \Delta\text{Gini} \;=\; \text{Gini}_{\text{parent}} \;-\; \left( \frac{n_L}{n}\,\text{Gini}_L \;+\; \frac{n_R}{n}\,\text{Gini}_R \right) \]
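As a worked example with made-up counts, suppose a parent node holds 10 points, 5 of class A and 5 of class B, so $ \text{Gini}_{\text{parent}} = 0.5 $. A candidate split sends 4 points (all A) to the left child and the remaining 6 points (1 A, 5 B) to the right child. Then:

\[ \text{Gini}_L = 0, \qquad \text{Gini}_R \;=\; 1 - \left( \left(\tfrac{1}{6}\right)^2 + \left(\tfrac{5}{6}\right)^2 \right) \;=\; \tfrac{10}{36} \approx 0.278 \]

\[ \Delta\text{Gini} \;=\; 0.5 \;-\; \left( \tfrac{4}{10}\cdot 0 \;+\; \tfrac{6}{10}\cdot\tfrac{10}{36} \right) \;=\; \tfrac{1}{2} - \tfrac{1}{6} \;=\; \tfrac{1}{3} \approx 0.333 \]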

This drop $ \Delta\text{Gini} $ is the split's purity gain. The algorithm greedily keeps the feature-and-threshold with the largest gain, then recurses inside each child. Note what is not happening: unlike the linear-regression and neural-network models elsewhere in this course, there is no gradient descent — a discrete feature/threshold choice is not a smooth dial to nudge. The algorithm instead simply enumerates every candidate threshold, scores each by $ \Delta\text{Gini} $, and retains the winner.
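The exhaustive search described above might look roughly like the following. It is a sketch under stated assumptions, not the exact procedure of any particular library: candidate thresholds are taken as midpoints between consecutive distinct feature values (one common convention), and the names `best_split`, `X`, and `y` as well as the toy data are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Return (gain, feature_index, threshold) for the best split, or None.

    Every feature is tried, and for each feature every midpoint between
    consecutive distinct values is scored by the weighted Gini decrease.
    """
    n = len(y)
    parent = gini(y)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] < t]
            right = [label for row, label in zip(X, y) if row[j] >= t]
            children = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent - children
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]]
y = ["a", "a", "a", "b", "b", "b"]

gain, feature, threshold = best_split(X, y)
print(feature, threshold, round(gain, 3))  # prints: 0 4.5 0.5
```

On this toy data the winner is feature 0 at threshold 4.5, which yields two pure children and removes all of the parent's impurity of 0.5. A full tree builder would call `best_split` again inside each child.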

Build the tree by hand

The greedy search above is something you can drive yourself. Click any region marked with a +: the algorithm finds that region's best split — the feature and threshold that most reduce its weighted Gini — and partitions it in two, adding a node to the tree diagram. Keep clicking the mixed regions to recurse, and watch the partition refine while the overall impurity falls toward 0. Regions that are already pure stop offering a split.


Grow the tree

Increase max depth using the slider. At depth 1 the tree applies a single split, producing one axis-aligned boundary. Each additional level enables subsequent splits within each subset, refining the decision boundary into a finer partition. Observe how training accuracy increases with depth, and assess whether the additional regions correspond to genuine structure in the data or instead enclose individual noisy observations.


Depth and the bias–variance trade-off

A shallow tree may underfit — too few splits to represent the underlying pattern (high bias). A sufficiently deep tree can drive training accuracy to 100% by isolating each observation in its own region, but these fine-grained partitions reflect noise in the training data and generalize poorly to new observations — the model has overfit (high variance). The optimal depth represents a balance between bias and variance (Module 8) and is typically selected by evaluating performance on a held-out validation set.
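One way to carry out that selection is sketched below with scikit-learn. The synthetic `make_moons` data, the 70/30 split, the depth range, and all variable names are assumptions made for this example; the module's own dataset is not shown in the text, so the numbers it prints will differ from the lesson's.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with noise (stand-in data, assumed for this sketch).
X, y = make_moons(n_samples=600, noise=0.3, random_state=0)

# Hold out a validation set that is used only for choosing the depth.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = list(range(1, 13))
results = {}
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    val_acc = clf.score(X_val, y_val)
    results[depth] = (train_acc, val_acc)
    print(f"max_depth={depth:<2d} leaves={clf.get_n_leaves():<3d} "
          f"train={train_acc:.3f} val={val_acc:.3f}")

# Keep the depth with the highest validation accuracy.
best_depth = max(results, key=lambda d: results[d][1])
print("selected max_depth:", best_depth)
```

Because the validation set has now influenced a modeling choice, a final performance estimate would normally come from a third split that was never used during selection.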

The equivalent procedure in scikit-learn — executable in the browser

from sklearn.tree import DecisionTreeClassifier

# X_train, y_train, X_test, y_test are assumed to be an existing train/test split.
clf = DecisionTreeClassifier(max_depth=3)  # the slider you just moved
clf.fit(X_train, y_train)                  # greedily picks splits by impurity
print(clf.score(X_test, y_test))           # accuracy on unseen data

max_depth is the principal hyperparameter governing model complexity, and it directly controls the degree of overfitting — precisely the behavior illustrated by the slider above. Click Run it yourself, then increase max_depth and observe how the gap between training and test accuracy widens.

When you run it, the program prints (and draws a chart):

max_depth=1   train=0.805  test=0.822
max_depth=3   train=0.895  test=0.900
max_depth=8   train=0.986  test=0.856
max_depth=15  train=1.000  test=0.833

AI anchor: the dominant family of models for tabular data

A single decision tree is rarely deployed in isolation. Ensembles of many trees, namely random forests and gradient-boosted trees (notably XGBoost and LightGBM), are among the strongest performers on tabular datasets and are widely used in real-world applications such as credit scoring, customer churn prediction, ad ranking, and risk modeling. Tree ensembles often work well with comparatively little hyperparameter tuning and can accommodate mixed numeric and categorical features (natively in some implementations). Through feature importance scores they also provide partial interpretability by indicating which features most influenced the model's predictions, which is of considerable practical value.

Check your understanding

Answer a short set of questions on splits, impurity criteria, and depth.


Why this matters next

Decision trees, k-NN, naive Bayes, and regression are all supervised methods: they require labeled data. Module 6 introduces the unsupervised paradigm, in which the labels are removed. There, k-means clustering finds group structure in unlabeled data; it is the first method in the course that discovers structure from the inputs alone rather than learning a mapping from inputs to labels.

Summary: a decision tree classifies through a hierarchy of binary feature splits, selected greedily to minimize impurity (Gini $ = 1 - \sum p_c^2 $). Deeper trees achieve higher training accuracy but tend to overfit; the optimal depth must therefore be selected by evaluating performance on held-out data.