Module 7 — Statistics for Data & Evaluation

Pillar 4 · Statistics · hands-on · about 30 minutes.

Models are constructed from data and evaluated by quantitative metrics. Statistics provides the methods for both: summarizing a dataset into a small number of representative quantities, and interpreting the metrics that determine whether a model performs well. This module covers the principal summary statistics — measures of center, measures of spread, and the normal distribution — together with the common errors of interpretation they prevent.

Center: mean and median

The mean (average) is the sum of the values divided by their count; it is the balance point of the data:

\[ \bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i \]

The median is the middle value of the sorted data (with an even number of values, the average of the two middle values). For roughly symmetric data the two measures are close; when a small number of extreme values pull the mean while the median remains stable, the discrepancy is itself informative (as with income or response-time distributions). Edit the dataset below to observe how both measures respond.
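The sensitivity of the mean to extreme values can be sketched in a few lines of Python. The response times here are invented for illustration, and the standard-library `statistics` module is only one convenient way to compute these quantities.

```python
from statistics import mean, median

# Hypothetical response times in milliseconds (illustrative values only).
times = [100, 110, 120, 130, 140]
print(mean(times), median(times))      # both are 120

# Add one extreme value: the mean jumps, the median barely moves.
times_with_outlier = times + [2000]
print(round(mean(times_with_outlier), 1))   # 433.3
print(median(times_with_outlier))           # 125.0 (average of 120 and 130)
```

With six values the median is the average of the two middle ones, which is why it lands on 125 rather than on an observed value.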

Spread: variance and standard deviation

A measure of center is insufficient on its own; the dispersion of the data must also be characterized. Variance is the mean squared deviation from the mean; standard deviation is its square root, expressed in the original units:

\[ \sigma^2 \;=\; \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \sigma \;=\; \sqrt{\sigma^2} \]

Note: this formula divides by \( n \) — the population variance. Spreadsheets and statistics libraries often default to the sample version, which divides by \( n - 1 \), so slightly different numbers from Excel or pandas are expected.
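As a minimal sketch of the difference between the two conventions, Python's standard library exposes both: `pvariance`/`pstdev` divide by \( n \), while `variance`/`stdev` divide by \( n - 1 \). The dataset is an arbitrary illustrative one chosen so that the population results are round numbers.

```python
from statistics import pvariance, variance, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values; the mean is 5

pop_var = pvariance(data)    # sum of squared deviations / n
samp_var = variance(data)    # sum of squared deviations / (n - 1)

print(pop_var, pstdev(data))                      # 4 and 2.0
print(round(samp_var, 3), round(stdev(data), 3))  # 4.571 and 2.138
```

The same distinction appears elsewhere as a `ddof` argument: `numpy.var` defaults to `ddof=0` (population), whereas pandas `Series.var` defaults to `ddof=1` (sample).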

A small σ indicates that the data is concentrated near the mean; a large σ indicates that it is widely dispersed. Standard deviation is used throughout ML: to normalize features, to quantify noise, and to report the dispersion of a model's errors.
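Feature normalization by mean and standard deviation could look roughly like the sketch below. The helper name `standardize` and the sample feature are hypothetical, and the population standard deviation is used to match the formula above; a real pipeline might use the sample version instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize constant data")
    return [(x - mu) / sigma for x in values]

feature = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values: mean 5, sigma 2
z = standardize(feature)
print(z)   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After standardization the feature has mean 0 and standard deviation 1, so each value reads as "how many σ from the mean".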

[Interactive activity: dataset explorer]

The normal distribution

Many quantities cluster symmetrically around a mean in a bell-shaped pattern: heights, measurement noise, and often the errors of a well-specified model. Such data is well approximated by the normal distribution, which is fully characterized by its mean (the location of the peak) and its standard deviation (its width). The empirical rule states that approximately 68% of the data lies within one σ of the mean, approximately 95% within two σ, and approximately 99.7% within three. The figure below illustrates this.

Figure: the empirical rule for a normal distribution. About 68% of values lie within \( 1\sigma \) of the mean, 95% within \( 2\sigma \), and 99.7% within \( 3\sigma \).
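The three percentages in the empirical rule can be checked numerically. For a normal distribution, the probability of falling within \( k\sigma \) of the mean equals \( \operatorname{erf}(k/\sqrt{2}) \); the function name `prob_within` below is just an illustrative choice.

```python
import math

def prob_within(k):
    """P(|X - mean| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

The result does not depend on the particular mean or σ, which is why the rule can be stated once for every normal distribution.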

Correlation and causation

Correlation measures the degree to which two variables vary together, summarized by the coefficient \( r \) ranging from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). Correlation does not imply causation: ice-cream sales and drowning incidents are positively correlated (both driven by warm weather), yet neither causes the other. Models exploit correlations to make predictions, but a correlation must not be interpreted as a causal relationship.
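A minimal sketch of computing \( r \), assuming it denotes the Pearson correlation coefficient (the usual choice when "linear relationship" is meant). The monthly figures are invented purely to mirror the ice-cream example and are not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (sx * sy)

# Invented monthly figures, both rising with temperature.
ice_cream_sales = [20, 35, 50, 70, 90]
drownings = [1, 2, 2, 4, 5]

r = pearson_r(ice_cream_sales, drownings)
print(round(r, 2))   # strongly positive, close to +1
```

The high value of `r` says only that the two series move together. Nothing in the calculation can reveal that a third variable, temperature, drives both.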


The limitations of summary statistics

A single statistic necessarily discards information. The same mean can arise from tightly clustered data or from widely dispersed data; a single outlier can shift the mean to a value far from any observed data point. The notion of an "average user" frequently describes no actual individual. The dataset explorer above allows you to introduce an outlier into a dataset and observe that the mean changes substantially while the median is largely unaffected — which is why both are reported in rigorous analysis.
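The point that one number cannot describe a dataset can be illustrated with two small, made-up datasets that share a mean but differ greatly in spread.

```python
from statistics import mean, pstdev

tight = [49, 50, 51]    # clustered around 50
wide = [0, 50, 100]     # same mean, far more dispersed

print(mean(tight), mean(wide))     # both are 50
print(round(pstdev(tight), 2))     # 0.82
print(round(pstdev(wide), 2))      # 40.82
```

Reporting the mean alone would make these two datasets look identical; adding the standard deviation distinguishes them.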

AI anchor: interpreting evaluation metrics

Every claim about a model is a statistic. Accuracy is a mean (the fraction of predictions that are correct). A model that is "95% accurate" on data where 95% of cases belong to one class may have learned nothing: always guessing the majority class achieves the same score. That is why precision and recall (the conditional probabilities from Module 2) are reported as well. Reporting a metric without its standard deviation across runs hides whether the result is reliable or due to luck. And confusing correlation with causation is how a model that merely predicts gets mistaken for one that explains. Statistics is what keeps model evaluation honest.
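The majority-class trap and the value of reporting spread across runs can be sketched as follows. The labels and the per-run accuracies are hypothetical, and treating an undefined precision as 0.0 is a convention adopted here, not the only option.

```python
from statistics import mean, stdev

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100   # a "model" that always predicts the majority class

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is predicted positive; 0.0 is a convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(accuracy, precision, recall)   # 0.95 0.0 0.0

# Hypothetical accuracies of one model across five training runs.
run_accuracies = [0.91, 0.93, 0.89, 0.94, 0.88]
print(f"{mean(run_accuracies):.3f} +/- {stdev(run_accuracies):.3f}")
```

The sample standard deviation (`stdev`, dividing by \( n - 1 \)) is used for the runs because five runs are a sample of all the runs that could have been performed.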

Check your understanding

Identify the statistical error of interpretation in each scenario.


Why this matters next

Statistics is how you'll evaluate every model in Courses 3 and 4: accuracy, precision/recall, and the spread of results across runs. It's also how you'll prepare data: normalizing features by mean and standard deviation is a standard preprocessing step. The normal distribution returns whenever you reason about noise and uncertainty, and correlation is the raw material of every predictive feature.

One-sentence summary: the mean \( \bar{x} = \frac{1}{n}\sum x_i \) gives the center and the standard deviation σ gives the spread; the normal curve describes data that piles up around a mean; correlation measures co-movement (never causation); and reading these honestly is exactly what model evaluation requires.
