Module 5 — Decision Trees
A decision tree is one of the most human-readable models in machine learning: it classifies by asking a sequence of yes/no questions about the features, like a flowchart. "Is income > $50k? If yes, is age < 30?" In this module you will grow a tree one level at a time, watch it carve the data into boxes, and then watch it overfit.
Splits: one question at a time
At each step the tree picks the single feature-and-threshold question that best separates the classes — for example "is \( x_1 < 3.2 \)?" That split divides the data into two groups. The tree then repeats inside each group, asking another question, building a flowchart of splits that ends in leaves where it commits to a class.
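To make a single split concrete, here is a minimal sketch in plain Python. The toy points, the labels, and the `split` helper are made up for illustration; only the question "is x1 < 3.2?" comes from the text.

```python
# Toy dataset (made up for illustration): each row is (x1, x2) with a class label.
points = [(1.0, 5.0), (2.5, 1.0), (3.0, 4.0), (3.5, 2.0), (4.0, 6.0), (5.5, 3.0)]
labels = ["A", "A", "A", "B", "B", "B"]


def split(points, labels, feature, threshold):
    """Answer the question "is x[feature] < threshold?" for every point.

    Returns (yes_labels, no_labels): the class labels that land on each side.
    """
    yes, no = [], []
    for x, y in zip(points, labels):
        (yes if x[feature] < threshold else no).append(y)
    return yes, no


# "Is x1 < 3.2?"  (feature index 0 is x1)
yes_side, no_side = split(points, labels, feature=0, threshold=3.2)
print("yes:", yes_side)
print("no: ", no_side)
```

On this hand-picked data the one question separates the classes completely. On real data each side is usually still mixed, which is why the tree goes on to ask further questions inside each group.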
How it picks the "best" split: impurity
"Best" means the split that makes each side as pure as possible, ideally all one class. A common purity score is the Gini impurity of a group:
\[ G = 1 - \sum_c p_c^2 \]
where \( p_c \) is the fraction of the group in class \( c \). It is 0 for a perfectly pure group and 0.5 for a 50/50 mix of two classes. The tree greedily chooses the split that lowers the weighted impurity of the two sides the most. No calculus, no gradient: just "try the splits, keep the best."
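The "try the splits, keep the best" search can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library implementation: the helper names (`gini`, `weighted_gini`, `best_split`) and the one-feature data set are assumptions, and Gini impurity is taken to be 1 minus the sum of squared class fractions. A real tree would run the same search over every feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one group: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity after a split: each side's Gini, weighted by its share of the points."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(values, labels):
    """Try a threshold midway between each pair of neighbouring feature values.

    Returns (threshold, weighted impurity) of the best candidate, or
    (None, parent impurity) if no candidate improves on not splitting.
    """
    xs = sorted(set(values))
    best = (None, gini(labels))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical one-feature data set.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = ["A", "A", "A", "B", "B", "A"]

print(gini(["A", "A", "B", "B"]))  # 50/50 mix of two classes
print(gini(["A", "A", "A"]))       # perfectly pure group
print(best_split(x1, y))           # best threshold and the impurity it leaves
```

Here the search settles on the threshold 3.5, which puts the three leading "A" points in a pure group on one side and leaves a smaller mixed group on the other.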
Grow the tree
Slide max depth up. At depth 1 the tree asks one question — one straight cut. Each extra level lets it ask follow-up questions, bending the boundary into more boxes. Watch training accuracy climb… and keep an eye on whether the deep boxes are catching a real pattern or just lassoing single noisy points.
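For readers without the slider, the same experiment can be sketched from scratch. Everything below is an illustrative assumption rather than the widget's actual code: the helper names (`grow`, `predict`, `count_leaves`), the synthetic data with 10% flipped labels, and the simplification that a node stops splitting when no question strictly lowers the weighted Gini impurity.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] < t]
            right = [label for row, label in zip(X, y) if row[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow(X, y, max_depth):
    """Build a tree recursively. A leaf is a class label; a node is a 4-tuple."""
    majority = Counter(y).most_common(1)[0][0]
    split = best_split(X, y) if max_depth > 0 and len(set(y)) > 1 else None
    if split is None:
        return majority  # leaf: commit to the most common class here
    f, t = split
    left = [(row, label) for row, label in zip(X, y) if row[f] < t]
    right = [(row, label) for row, label in zip(X, y) if row[f] >= t]
    return (f, t,
            grow([r for r, _ in left], [l for _, l in left], max_depth - 1),
            grow([r for r, _ in right], [l for _, l in right], max_depth - 1))


def predict(tree, row):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree


def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[2]) + count_leaves(tree[3])


# Synthetic data: class 1 in the upper-right quadrant, with 10% of labels flipped.
rng = random.Random(0)
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(120)]
y = [int(a > 0.5 and b > 0.5) ^ int(rng.random() < 0.1) for a, b in X]

for depth in [1, 2, 4, 8]:
    tree = grow(X, y, depth)
    acc = sum(predict(tree, row) == label for row, label in zip(X, y)) / len(y)
    print(f"max_depth={depth}  leaves={count_leaves(tree)}  train accuracy={acc:.3f}")
```

Because a deeper tree only subdivides the leaves of a shallower one, training accuracy in this sketch never goes down as depth increases. The true pattern needs only two questions, so extra leaves beyond that are mostly spent on the flipped labels.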
Depth is a double-edged sword
A shallow tree may underfit — too few questions to capture the pattern. A very deep tree can drive training accuracy to 100% by drawing a tiny box around every point, but those boxes won’t survive on new data: it has overfit. The right depth is a balance (the bias–variance tradeoff of Module 8), usually found by checking a held-out set.
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=3)   # the slider you just moved
    clf.fit(X_train, y_train)                   # greedily picks splits by impurity
    clf.score(X_test, y_test)                   # accuracy on unseen data

Run it yourself:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split

    # Two interleaving half-moons: not separable by a straight line
    X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Watch train accuracy climb while test accuracy peaks then drops: overfitting
    for depth in [1, 3, 8, 15]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        print(f"max_depth={depth:<2} train={clf.score(X_train, y_train):.3f} test={clf.score(X_test, y_test):.3f}")

    # Plot the decision regions of a depth-3 tree over the data
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 250),
                         np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 250))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure(figsize=(5, 3.4))
    plt.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=14, edgecolor="k", linewidth=0.3)
    plt.title("Decision tree, max_depth=3")
    plt.tight_layout()
    plt.show()
max_depth is the most direct knob for tree complexity, and so for overfitting, exactly as the slider shows. Hit Run it yourself, then bump max_depth up and watch the train/test gap widen.
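Choosing the depth by checking a held-out set can be sketched as follows. This is one possible setup, not the lesson's prescribed procedure: the three-way split, its proportions, the candidate depths 1 to 15, and the rule that ties go to the shallowest tree are all assumptions made for illustration. The data is the same `make_moons` set used in this lesson.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Three-way split: train to fit, validation to choose the depth,
# test only for the final report.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

candidate_depths = range(1, 16)
val_scores = {}
for depth in candidate_depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_scores[depth] = clf.score(X_val, y_val)

# max() returns the first maximum it meets, so ties go to the shallowest tree.
best_depth = max(val_scores, key=val_scores.get)

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(f"chosen max_depth={best_depth}  "
      f"validation={val_scores[best_depth]:.3f}  "
      f"test={final.score(X_test, y_test):.3f}")
```

Keeping the test set out of the selection loop means the final test score is not biased by the choice of depth. With a data set this small the chosen depth can shift with the random split, so treat it as a rough guide.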