Module 2 — Conditional Probability
Probability is the math of uncertainty — and machine learning is uncertainty management at industrial scale. A spam filter never knows an email is spam; it computes how likely it is. This module builds the one idea that powers most of that reasoning: conditional probability — how a probability changes once you learn something new.
The vocabulary: outcomes, events, and probability
A sample space is the set of all possible outcomes — every face of a die, every email that could arrive. An event is a subset of those outcomes we care about: "the die shows an even number," "the email is spam." A probability is a number from 0 to 1 measuring how likely an event is — 0 means impossible, 1 means certain.
Three rules (the axioms) are all you need: every probability sits between 0 and 1; the probability that some outcome in the sample space occurs is 1; and for events that can't both happen, the probability that either one happens is the sum of their probabilities.
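To make the three rules concrete, here is a minimal sketch that checks them on a single roll of a die. It assumes the die is fair (every face equally likely), and the helper name `prob` is made up for this illustration.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die.
# Assumption: the die is fair, so every outcome carries equal weight.
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event (a set of outcomes) under equal weights."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}   # "the die shows an even number"
one = {1}          # "the die shows a 1"; cannot happen together with `even`

p_even = prob(even)
p_one = prob(one)
p_all = prob(sample_space)
p_even_or_one = prob(even | one)

print(p_even)          # 1/2  (rule 1: between 0 and 1)
print(p_all)           # 1    (rule 2: the whole sample space has probability 1)
print(p_even_or_one)   # 2/3  (rule 3: equals p_even + p_one)
```

Using `Fraction` keeps the arithmetic exact, so the sum in the third rule can be compared with `==` rather than with a floating-point tolerance.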
Joint, marginal, conditional
When two things vary at once — say, whether an email contains the word "free" and whether it is spam — three probabilities describe the situation:
- Joint \( P(A \text{ and } B) \): the chance both happen — a spam email that also says "free."
- Marginal \( P(A) \): the chance of one event ignoring the other — any email being spam, "free" or not.
- Conditional \( P(A \mid B) \): the chance of \( A \) once you know \( B \) happened — the chance an email is spam given it says "free." This is the one that matters.
The definition ties them together, and it is just arithmetic on counts:
\[ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}, \qquad P(B) > 0 \]
In words: of all the cases where \( B \) is true, what fraction also have \( A \) true? Restrict attention to the \( B \) column, then read off the \( A \) share. The explorer below lets you do exactly that on a population of 1,000 emails.
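The same "restrict to the column, then read off the share" arithmetic can be sketched in a few lines. The counts below are hypothetical numbers invented for this illustration (they sum to 1,000 emails); they are not taken from real data or from the lesson's own explorer.

```python
from fractions import Fraction

# Hypothetical counts for 1,000 emails (illustrative only, not real data).
# Keys are (label, whether the email contains the word "free").
counts = {
    ("spam", "free"): 240,
    ("spam", "no_free"): 60,
    ("ham", "free"): 60,
    ("ham", "no_free"): 640,
}
total = sum(counts.values())  # 1000

# Joint: spam AND contains "free".
p_spam_and_free = Fraction(counts[("spam", "free")], total)

# Marginals: add up over the variable being ignored.
n_spam = counts[("spam", "free")] + counts[("spam", "no_free")]
n_free = counts[("spam", "free")] + counts[("ham", "free")]
p_spam = Fraction(n_spam, total)
p_free = Fraction(n_free, total)

# Conditional: keep only the "free" column, then take the spam share of it.
p_spam_given_free = Fraction(counts[("spam", "free")], n_free)

print(p_spam_and_free)     # 6/25
print(p_spam)              # 3/10
print(p_spam_given_free)   # 4/5
```

Dividing the joint by the marginal, `p_spam_and_free / p_free`, gives the same 4/5: the conditional is the joint rescaled to the cases where the email says "free."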
Independence: when one event carries no information about another
Two events \( A \) and \( B \) are independent when conditioning on one does not change the probability of the other: \( P(A \mid B) = P(A) \). Equivalently, their joint probability factors into the product of the individual probabilities, \( P(A \text{ and } B) = P(A)\,P(B) \). In plain terms: observing \( B \) gives you no information about \( A \). A fair coin is the classic example: each flip is independent of the last, so the previous result tells you nothing about the next.
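The coin example can be checked by brute-force enumeration. This is a minimal sketch assuming two flips of a fair coin; the helpers `prob` and `cond_prob` and the event names are invented for the illustration. It also includes one dependent event for contrast.

```python
from fractions import Fraction
from itertools import product

# All outcomes of two coin flips: ('H','H'), ('H','T'), ('T','H'), ('T','T').
# Assumption: the coin is fair, so the four outcomes are equally likely.
outcomes = list(product("HT", repeat=2))


def prob(event):
    """Share of outcomes for which the predicate `event` is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); assumes P(b) > 0."""
    return prob(lambda o: a(o) and b(o)) / prob(b)


def first_heads(o):
    return o[0] == "H"


def second_heads(o):
    return o[1] == "H"


def any_heads(o):
    return "H" in o


# Independent: knowing the first flip does not move the second.
print(prob(second_heads))                     # 1/2
print(cond_prob(second_heads, first_heads))   # 1/2

# Dependent, for contrast: knowing the first flip is heads settles "any heads".
print(prob(any_heads))                        # 3/4
print(cond_prob(any_heads, first_heads))      # 1
```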
Most useful signals, however, are dependent, and that dependence is exactly what makes them useful. The word "free" helps a spam filter only because \( P(\text{spam} \mid \text{"free"}) \) is substantially greater than \( P(\text{spam}) \) — the two are not independent. In short: independence means no information; dependence is information.
Reading the table the way a model does
A clue is only as strong as the gap between \( P(\text{spam} \mid \text{clue}) \) and \( P(\text{spam}) \). A clue that barely moves the conditional probability is nearly useless; a clue that sends it from 30% to 95% is gold. Test your reading of a contingency table below; these are the same judgments a model makes automatically.
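One way to picture that judgment is to rank clues by how far each one moves the spam probability away from its baseline. The sketch below uses hypothetical clue words and counts made up for illustration, with an assumed baseline of 300 spam emails out of 1,000; a real model weighs evidence in more sophisticated ways than a simple gap.

```python
from fractions import Fraction

# Assumed baseline: 300 of 1,000 emails are spam (hypothetical).
p_spam = Fraction(300, 1000)

# Hypothetical clues: (emails containing the clue, how many of those are spam).
clue_counts = {
    "free": (300, 240),
    "hello": (500, 155),
    "wire transfer": (100, 95),
}

# Gap between P(spam | clue) and P(spam) for each clue.
gaps = {}
for clue, (n_with_clue, n_spam_with_clue) in clue_counts.items():
    p_spam_given_clue = Fraction(n_spam_with_clue, n_with_clue)
    gaps[clue] = p_spam_given_clue - p_spam

# Strongest clue first.
ranked = sorted(gaps, key=gaps.get, reverse=True)
print(ranked)  # ['wire transfer', 'free', 'hello']
```

Here "hello" moves the probability by only one percentage point (31% against a 30% baseline), so it carries almost no information, while "wire transfer" takes it to 95%.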
One more way to see it — as a picture. Circle \( A \) is one event, circle \( B \) another, and their overlap is \( A \text{ and } B \). Conditioning on \( B \) means ignoring everything outside circle \( B \); then \( P(A \mid B) \) is simply the share of \( B \) that lies inside \( A \).
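The picture translates directly into set operations. In this minimal sketch the "universe" of 20 equally likely outcomes and the two circles are arbitrary choices made for illustration.

```python
from fractions import Fraction

# Assumed universe: 20 equally likely outcomes, labelled 1..20.
universe = set(range(1, 21))

circle_a = {n for n in universe if n <= 8}       # event A (arbitrary choice)
circle_b = {n for n in universe if n % 3 == 0}   # event B (arbitrary choice)

# The overlap of the two circles is "A and B".
overlap = circle_a & circle_b

# Unconditional: A's share of the whole universe.
p_a = Fraction(len(circle_a), len(universe))

# Conditional: ignore everything outside B, then take A's share of B.
p_a_given_b = Fraction(len(overlap), len(circle_b))

print(sorted(overlap))   # [3, 6]
print(p_a)               # 2/5
print(p_a_given_b)       # 1/3
```

The denominator changes from the size of the universe to the size of circle \( B \), which is all that conditioning does.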