Module 7 — Statistics for Data & Evaluation
Models are built from data and judged by numbers. Statistics is the toolkit for both: summarizing a pile of data into a few honest numbers, and reading the metrics that tell you whether a model is any good. This module covers the summaries — mean, spread, the bell curve — and the traps that fool people who skip them.
Center: mean and median
The mean (average) adds the values and divides by how many there are; it is the balance point of the data:
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]
The median is the middle value when the data is sorted (with an even count, the average of the two middle values). For roughly symmetric data the mean and median agree closely. When a few huge values pull the mean while the median holds steady, that gap is itself information (think incomes, or response times). Edit the dataset below and watch both move.
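To make the gap between mean and median concrete, here is a minimal sketch using Python's standard `statistics` module. The response times are made-up illustrative values, not measurements from any real system.

```python
from statistics import mean, median

# Hypothetical response times in milliseconds (illustrative values only).
response_ms = [120, 125, 130, 135, 140]
typical_mean = mean(response_ms)      # 130
typical_median = median(response_ms)  # 130

# One very slow request joins the dataset.
with_slow = response_ms + [5000]
skewed_mean = mean(with_slow)         # roughly 941.7
skewed_median = median(with_slow)     # 132.5

print("without outlier:", typical_mean, typical_median)
print("with outlier:   ", round(skewed_mean, 1), skewed_median)
```

A single slow request moves the mean by hundreds of milliseconds, while the median shifts by only 2.5 ms.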
Spread: variance and standard deviation
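Center isn't enough: you also need to know how spread out the data is. Variance averages the squared distance from the mean \( \mu \); standard deviation is its square root, back in the original units:
\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2} \]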
Small σ means the data huddles near the mean; large σ means it's scattered. Standard deviation is everywhere in ML: it's how we normalize features, measure noise, and report the spread of a model's errors.
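As a small worked example, the sketch below computes variance and standard deviation for a toy dataset and then uses them to normalize the values (z-scores). It assumes the population form (divide by n), which matches "averages the squared distance"; the sample form divides by n − 1 instead. The dataset is illustrative.

```python
from statistics import fmean, pstdev, pvariance

# Toy dataset chosen so the numbers come out round (illustrative only).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = fmean(values)            # 5.0
variance = pvariance(values)  # 4.0, population variance (divides by n)
sigma = pstdev(values)        # 2.0, back in the original units

# Normalizing a feature: subtract the mean, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

print(mu, variance, sigma)
print(z_scores)
```

After normalization the values have mean 0 and standard deviation 1, which is the usual reason this step appears before training.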
The normal distribution: the bell curve
Many natural quantities pile up symmetrically around a mean — heights, measurement noise, the errors a good model makes. That shape is the normal distribution, described entirely by its mean (where the peak sits) and its standard deviation (how wide). A handy rule: about 68% of data falls within one σ of the mean, 95% within two. The histogram demo overlays this curve so you can compare your data to the ideal bell.
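The 68% and 95% figures can be checked directly from the normal distribution's cumulative distribution function. This sketch uses `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
# The shares below are the same for any mean and sigma.
dist = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return dist.cdf(k) - dist.cdf(-k)

within_one = share_within(1)    # about 0.683
within_two = share_within(2)    # about 0.954
within_three = share_within(3)  # about 0.997

for k, share in [(1, within_one), (2, within_two), (3, within_three)]:
    print(f"within {k} sigma: {share:.1%}")
```

Real data rarely matches these shares exactly; a large mismatch is a hint that the data is not well described by a bell curve.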
Correlation — and why it isn't causation
Correlation measures whether two variables move together, summarized by the coefficient \( r \), which runs from −1 (a perfect inverse linear relationship) through 0 (no linear relationship) to +1 (a perfect direct linear relationship). But correlation is not causation: ice-cream sales and drownings rise together because both are driven by summer heat, yet neither causes the other. Models exploit correlation to predict, and that misleads anyone who reads a prediction as a cause.
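The ice-cream example can be sketched in code. Below, a hand-written Pearson \( r \) is applied to two made-up series that are both built from the same temperature values plus small offsets. All numbers are invented for illustration; `pearson_r` is a hypothetical helper written out here so the formula is visible.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Invented daily temperatures (the hidden common cause).
temps = [18, 22, 26, 30, 34]

# Both series depend on temperature, not on each other.
ice_cream_sales = [10 * t + e for t, e in zip(temps, [4, -6, 2, -3, 3])]
drownings = [0.5 * t + e for t, e in zip(temps, [0.5, -0.5, 0.0, 0.5, -0.5])]

r_confounded = pearson_r(ice_cream_sales, drownings)
r_perfect = pearson_r(temps, [2 * t + 1 for t in temps])
r_opposite = pearson_r(temps, [-t for t in temps])

print(round(r_confounded, 3), round(r_perfect, 3), round(r_opposite, 3))
```

The two series correlate strongly even though neither appears in the other's formula; the shared input `temps` does all the work.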
Why averages mislead
A single number hides a lot. The same mean can come from tightly-clustered data or wildly scattered data; one outlier can drag an average somewhere no actual data point lives. "The average user…" is often a person who doesn't exist. The demo below lets you drop an outlier into a dataset and watch the mean lurch while the median barely flinches — the reason robust reporting shows both.
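As a rough illustration of how much one number can hide, the sketch below builds three made-up datasets that share a mean of 50 but look nothing alike. The names and values are invented for the example.

```python
from statistics import fmean, pstdev

# Three invented datasets with the same mean.
tight = [49, 50, 50, 50, 51]     # huddled near 50
scattered = [0, 25, 50, 75, 100] # spread across the whole range
two_camps = [10, 10, 90, 90]     # nobody is anywhere near 50

for name, data in [("tight", tight), ("scattered", scattered), ("two_camps", two_camps)]:
    print(f"{name:10s} mean={fmean(data):.1f} std={pstdev(data):.2f}")

# The "average" value of two_camps is not one of its data points.
average_is_real_point = fmean(two_camps) in two_camps
print(average_is_real_point)
```

Reporting the standard deviation (or the median and a range) next to the mean is what separates these three cases.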
Don't get fooled
Spot the statistical trap in each scenario. You'll get a score.