Module 2 — Conditional Probability

Pillar 1 · Probability · hands-on · about 30 minutes.

Probability is the math of uncertainty — and machine learning is uncertainty management at industrial scale. A spam filter never knows an email is spam; it computes how likely it is. This module builds the one idea that powers most of that reasoning: conditional probability — how a probability changes once you learn something new.

The vocabulary: outcomes, events, and probability

A sample space is the set of all possible outcomes — every face of a die, every email that could arrive. An event is a subset of those outcomes we care about: "the die shows an even number," "the email is spam." A probability is a number from 0 to 1 measuring how likely an event is — 0 means impossible, 1 means certain.

Three rules (the axioms) are all you need: every probability sits between 0 and 1; the probability of something happening across the whole sample space is 1; and for events that can't both happen, the probability of either one is the sum of their probabilities.
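
As a quick sanity check, the three rules can be verified on a single roll of a fair die. This is only a sketch: the helper `prob` and the event names are made up for illustration, and it assumes every outcome is equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (assumed equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}
one = {1}

# Rule 1: every probability sits between 0 and 1.
assert 0 <= prob(even) <= 1
# Rule 2: the whole sample space has probability 1.
assert prob(sample_space) == 1
# Rule 3: "even" and "one" cannot both happen, so their probabilities add.
assert even & one == set()
assert prob(even | one) == prob(even) + prob(one)

print(prob(even), prob(one), prob(even | one))  # 1/2 1/6 2/3
```

Using `Fraction` keeps the arithmetic exact, so the additivity check is not disturbed by floating-point rounding.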

Joint, marginal, conditional

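When two things vary at once (say, whether an email contains the word "free" and whether it is spam), three probabilities describe the situation: the joint probability \( P(\text{spam and "free"}) \), the chance that both are true of the same email; the marginal probability \( P(\text{spam}) \), the chance of one regardless of the other; and the conditional probability \( P(\text{spam} \mid \text{"free"}) \), the chance of one among only the emails where the other is true.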

The definition ties them together — and it is just arithmetic on counts:

\[ P(A \mid B) \;=\; \frac{P(A \text{ and } B)}{P(B)} \]

In words: of all the cases where \( B \) is true, what fraction also have \( A \) true? Restrict attention to the \( B \) column, then read off the \( A \) share. The explorer below lets you do exactly that on a population of 1,000 emails.
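
The same "restrict to the column, then read off the share" step can be sketched in a few lines of Python. The counts below are invented for illustration (they are not data from this lesson); only the arithmetic matters.

```python
from fractions import Fraction

# Hypothetical counts for 1,000 emails (made-up numbers, chosen only for illustration).
counts = {
    ("free", "spam"): 240,
    ("free", "not spam"): 60,
    ("no free", "spam"): 60,
    ("no free", "not spam"): 640,
}
total = sum(counts.values())  # 1000

# Joint: the email contains "free" AND is spam.
p_free_and_spam = Fraction(counts[("free", "spam")], total)

# Marginals: one property, regardless of the other.
n_free = counts[("free", "spam")] + counts[("free", "not spam")]
n_spam = counts[("free", "spam")] + counts[("no free", "spam")]
p_free = Fraction(n_free, total)
p_spam = Fraction(n_spam, total)

# Conditional: P(spam | "free") = P(spam and "free") / P("free").
p_spam_given_free = p_free_and_spam / p_free

# The same number straight from counts: restrict to the "free" emails, take the spam share.
assert p_spam_given_free == Fraction(counts[("free", "spam")], n_free)

print(p_free_and_spam, p_free, p_spam, p_spam_given_free)  # 6/25 3/10 3/10 4/5
```

With these invented counts, 30% of all emails are spam, but 80% of the emails containing "free" are. The definition is only usable when \( P(B) > 0 \); here that means at least one email must contain "free".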

Independence: when one event carries no information about another

Two events \( A \) and \( B \) are independent when conditioning on one does not change the probability of the other: \( P(A \mid B) = P(A) \), assuming \( P(B) > 0 \). Equivalently, their joint probability factors into the product of the individual probabilities, \( P(A \text{ and } B) = P(A)\,P(B) \). In plain terms: observing \( B \) gives you no information about \( A \). A fair coin is the classic example: each flip is independent of the last, so the previous result tells you nothing about the next.

Most useful signals, however, are dependent, and that dependence is exactly what makes them useful. The word "free" helps a spam filter only because \( P(\text{spam} \mid \text{"free"}) \) is substantially greater than \( P(\text{spam}) \) — the two are not independent. In short: independence means no information; dependence is information.
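
The product test for independence is easy to try out. The sketch below checks it on two fair coin flips and on a made-up email population; the function name `is_independent` and the email numbers are assumptions for illustration, not part of the lesson's data.

```python
from fractions import Fraction
from itertools import product


def is_independent(p_a, p_b, p_a_and_b):
    """True when the joint probability equals the product of the marginals."""
    return p_a_and_b == p_a * p_b


# Two flips of a fair coin: four equally likely outcomes (HH, HT, TH, TT).
flips = list(product("HT", repeat=2))


def prob(event):
    """Share of the four outcomes for which the predicate `event` holds."""
    return Fraction(sum(1 for outcome in flips if event(outcome)), len(flips))


p_first_heads = prob(lambda o: o[0] == "H")
p_second_heads = prob(lambda o: o[1] == "H")
p_both_heads = prob(lambda o: o == ("H", "H"))
coins_independent = is_independent(p_first_heads, p_second_heads, p_both_heads)

# Hypothetical email probabilities (invented numbers): spam and the word "free".
p_spam = Fraction(300, 1000)
p_free = Fraction(300, 1000)
p_spam_and_free = Fraction(240, 1000)
emails_independent = is_independent(p_spam, p_free, p_spam_and_free)

print(coins_independent, emails_independent)  # True False
```

For the coins, 1/4 equals 1/2 times 1/2, so the flips pass the test. For the invented emails, the joint probability 6/25 is far above the product 9/100, which is exactly the dependence a filter can exploit.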

AI anchor: how a spam filter thinks

A spam filter watches the words in an email and asks: given these words, how likely is spam? It has counted, over thousands of past emails, how often each word showed up in spam versus not. A word like "invoice" might be neutral; "viagra" might appear in 95% of spam and almost no real mail. The filter combines those conditional probabilities into a single score. You will assemble the exact combining rule in Module 3 (Bayes); this module builds the single-clue version first.

Reading the table the way a model does

The clue is only as strong as the gap between \( P(\text{spam} \mid \text{clue}) \) and \( P(\text{spam}) \). A clue that barely moves the conditional probability is nearly useless; a clue that sends it from 30% to 95% is gold. Test your reading of a contingency table below — these are the same judgments a model makes automatically.
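
To make the "gap" concrete, here is a small sketch that compares a weak clue with a strong one. The helper `clue_gap`, the two words, and all counts are hypothetical, picked so that the base rate is 30%.

```python
def clue_gap(n_clue_and_spam, n_clue, n_spam, n_total):
    """Return (P(spam), P(spam | clue), gap between them), computed from raw counts."""
    if n_clue == 0:
        raise ValueError("the clue never occurs, so P(spam | clue) is undefined")
    p_spam = n_spam / n_total
    p_spam_given_clue = n_clue_and_spam / n_clue
    return p_spam, p_spam_given_clue, p_spam_given_clue - p_spam


# Hypothetical population: 1,000 emails, 300 of them spam.
N_TOTAL, N_SPAM = 1000, 300

# word -> (emails containing the word that are spam, emails containing the word)
clues = {
    "hello": (80, 250),   # weak clue: barely moves the probability
    "winner": (95, 100),  # strong clue: moves it a long way
}

for word, (n_clue_and_spam, n_clue) in clues.items():
    base, cond, gap = clue_gap(n_clue_and_spam, n_clue, N_SPAM, N_TOTAL)
    print(f"{word!r}: P(spam)={base:.2f}  P(spam|clue)={cond:.2f}  gap={gap:+.2f}")
```

With these made-up counts, "hello" shifts the probability of spam by about two percentage points, while "winner" shifts it by about sixty-five.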

Why this matters next

Conditional probability is the backbone of naive Bayes (Module 3 here, then a real classifier in Course 3) and of every evaluation metric you'll meet in Module 7: precision is \( P(\text{actually positive} \mid \text{predicted positive}) \), and recall is \( P(\text{predicted positive} \mid \text{actually positive}) \). Same formula, different question. Learn to read \( P(A \mid B) \) and half of model evaluation is already familiar.
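
A minimal sketch of that "same formula, different question" point: both metrics come out of one `conditional` helper applied to the same joint probability. The confusion-matrix counts and the helper name are invented for illustration.

```python
from fractions import Fraction


def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_a_and_b / p_b


# Hypothetical confusion-matrix counts for 1,000 emails (made-up numbers).
tp, fp, fn, tn = 80, 20, 40, 860
total = tp + fp + fn + tn

p_actual_and_predicted = Fraction(tp, total)   # actually positive AND predicted positive
p_predicted_positive = Fraction(tp + fp, total)
p_actually_positive = Fraction(tp + fn, total)

# Precision conditions on the prediction; recall conditions on the truth.
precision = conditional(p_actual_and_predicted, p_predicted_positive)
recall = conditional(p_actual_and_predicted, p_actually_positive)

print(precision, recall)  # 4/5 2/3
```

The numerator is identical in both cases; only the event being conditioned on changes.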

One more way to see it — as a picture. Circle \( A \) is one event, circle \( B \) another, and their overlap is \( A \text{ and } B \). Conditioning on \( B \) means ignoring everything outside circle \( B \); then \( P(A \mid B) \) is simply the share of \( B \) that lies inside \( A \).
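
The circle picture maps directly onto set operations. The sketch below uses a toy universe of the numbers 1 to 20, an assumption made purely for illustration, with each number equally likely.

```python
from fractions import Fraction

# Toy universe: the numbers 1..20, each equally likely (an illustrative assumption).
universe = set(range(1, 21))

circle_a = {n for n in universe if n % 4 == 0}  # A: multiples of 4
circle_b = {n for n in universe if n % 2 == 0}  # B: even numbers

overlap = circle_a & circle_b  # the region "A and B"

# Unconditional: share of the whole universe that lies inside A.
p_a = Fraction(len(circle_a), len(universe))

# Conditional: ignore everything outside B, then take the share of B inside A.
p_a_given_b = Fraction(len(overlap), len(circle_b))

print(p_a, p_a_given_b)  # 1/4 1/2
```

Shrinking the universe down to circle \( B \) doubles the share occupied by \( A \) in this toy example, so learning that a number is even carries real information about whether it is a multiple of 4.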

One-sentence summary: conditional probability \( P(A \mid B) = P(A \text{ and } B) / P(B) \) asks "within the cases where \( B \) holds, how often does \( A \) also hold?" — and the bigger the gap between \( P(A \mid B) \) and \( P(A) \), the more information the clue \( B \) carries.