Module 2 — Conditional Probability

Pillar 1 · Probability · hands-on · about 30 minutes.

Probability is the mathematics of uncertainty, and machine learning is largely the systematic management of uncertainty. A spam filter does not decide with certainty that an email is spam; it computes the probability that it is. This module develops the concept behind most such reasoning: conditional probability, which describes how a probability changes when new information arrives.

The vocabulary: outcomes, events, and probability

A sample space is the set of all possible outcomes — every face of a die, every email that could arrive. An event is a subset of those outcomes we care about: "the die shows an even number," "the email is spam." A probability is a number from 0 to 1 measuring how likely an event is — 0 means impossible, 1 means certain.

Three rules (the axioms) are all you need: every probability lies between 0 and 1; the probability that some outcome in the sample space occurs is 1; and for mutually exclusive events (events that cannot both happen), the probability that either one occurs is the sum of their probabilities.
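The three rules can be checked numerically. The sketch below assumes a fair six-sided die as the sample space; the die, the event names, and the helper `prob` are illustrative choices rather than part of the lesson's activity. It uses exact fractions so the sums are not blurred by floating-point rounding.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, each face equally likely.
sample_space = {1, 2, 3, 4, 5, 6}
p_outcome = {face: Fraction(1, 6) for face in sample_space}

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum((p_outcome[o] for o in event), Fraction(0))

even = {2, 4, 6}
one = {1}

# Rule 1: every probability lies between 0 and 1.
rule_1 = all(0 <= prob({o}) <= 1 for o in sample_space)

# Rule 2: the whole sample space has probability 1.
rule_2 = prob(sample_space) == 1

# Rule 3: for events that cannot both happen, probabilities add.
assert even.isdisjoint(one)
rule_3 = prob(even | one) == prob(even) + prob(one)

print(rule_1, rule_2, rule_3)   # True True True
print(prob(even | one))         # 2/3
```

Rule 3 only applies to events with no outcomes in common, which is why the sketch checks `isdisjoint` before adding.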

Adjust the mix below and watch all three hold at once: each slice stays within 0 and 1, the slices always fill the bar exactly (they sum to 1), and the probability of "either of two outcomes" is just their two slices added together.

Joint, marginal, conditional

When two things vary at once, say whether an email is spam (call this event \( A \)) and whether it contains the word "free" (event \( B \)), three probabilities describe the situation:

Joint \( P(A \text{ and } B) \): the chance that both happen, for example a spam email that also says "free."
Marginal \( P(A) \): the chance of one event regardless of the other, for example any email being spam, whether or not it says "free."
Conditional \( P(A \mid B) \): the probability of \( A \) given that \( B \) has occurred, for example the probability that an email is spam given that it contains "free." This is the quantity of primary interest.

The definition relates these three quantities as a ratio of probabilities, and it applies whenever \( P(B) > 0 \):

\[ P(A \mid B) \;=\; \frac{P(A \text{ and } B)}{P(B)} \]

In words: among all cases in which \( B \) is true, what fraction also have \( A \) true? One restricts attention to the subset where \( B \) holds, then computes the proportion of that subset in which \( A \) also holds. The explorer below performs this computation on a population of 1,000 emails.
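As a worked instance of the definition, the sketch below computes the joint, marginal, and conditional probabilities from a table of counts. The counts are invented for illustration (they are not the explorer's data), and the labels "spam", "not spam", "free", and "no free" are assumed names.

```python
# Hypothetical counts for 1,000 emails (illustrative only).
counts = {
    ("spam", "free"): 180,
    ("spam", "no free"): 120,
    ("not spam", "free"): 70,
    ("not spam", "no free"): 630,
}
total = sum(counts.values())  # 1000

# Joint: spam AND contains "free".
p_spam_and_free = counts[("spam", "free")] / total

# Marginals: add up every cell that matches one label, ignoring the other.
p_spam = sum(n for (label, _), n in counts.items() if label == "spam") / total
p_free = sum(n for (_, word), n in counts.items() if word == "free") / total

# Conditional: the definition, P(A | B) = P(A and B) / P(B).
p_spam_given_free = p_spam_and_free / p_free

# Same number by counting directly inside the emails that contain "free".
free_total = counts[("spam", "free")] + counts[("not spam", "free")]
by_counting = counts[("spam", "free")] / free_total

print(round(p_spam, 2), round(p_free, 2), round(p_spam_and_free, 2))  # 0.3 0.25 0.18
print(round(p_spam_given_free, 2), round(by_counting, 2))             # 0.72 0.72
```

Both routes agree because dividing the joint and the marginal by the same total cancels out: the ratio of probabilities is the same as the ratio of counts within the subset where \( B \) holds.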

The same population, scaled down to 100 individual emails and shown as squares rather than a table of totals. Counting squares is often more intuitive than reading a ratio: conditioning on a clue simply means ignoring the squares that lack it, and \( P(A \mid B) \) becomes the colored fraction of what remains.

Independence: when one event carries no information about another

Two events \( A \) and \( B \) are independent when conditioning on one does not change the probability of the other: \( P(A \mid B) = P(A) \), for \( P(B) > 0 \). Equivalently (substitute the definition of \( P(A \mid B) \) and multiply both sides by \( P(B) \)), their joint probability factors into the product of the individual probabilities, \( P(A \text{ and } B) = P(A)\,P(B) \). In plain terms: observing \( B \) gives you no information about \( A \). A fair coin is the classic example: each flip is independent of the last, so the previous result tells you nothing about the next.

Most useful signals, however, are dependent, and that dependence is exactly what makes them useful. The word "free" helps a spam filter only because \( P(\text{spam} \mid \text{"free"}) \) is substantially greater than \( P(\text{spam}) \) — the two are not independent. In short: independence means no information; dependence is information.
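Both tests for independence can be run by brute-force enumeration. The sketch below assumes two flips of a fair coin; the helper names `prob` and `cond_prob` and the three events are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two flips of a fair coin, four equally likely outcomes.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Fraction of outcomes for which the predicate `event` is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    return prob(lambda o: a(o) and b(o)) / prob(b)

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def both_heads(o):
    return o == ("H", "H")

# Independent: knowing the first flip does not change the second.
print(prob(second_heads), cond_prob(second_heads, first_heads))  # 1/2 1/2
print(prob(lambda o: first_heads(o) and second_heads(o))
      == prob(first_heads) * prob(second_heads))                 # True

# Dependent: learning the first flip is heads doubles the chance of two heads.
print(prob(both_heads), cond_prob(both_heads, first_heads))      # 1/4 1/2
```

The last line is the useful case for a classifier: the conditional probability differs from the marginal, so the observed event carries information.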

A geometric way to picture conditioning: let circle \( A \) represent one event and circle \( B \) another, with their intersection representing \( A \text{ and } B \). Conditioning on \( B \) corresponds to restricting attention to circle \( B \); \( P(A \mid B) \) is then the proportion of \( B \) that lies within \( A \).
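The two-circle picture maps directly onto set operations. A minimal sketch, assuming one roll of a fair die so that every outcome is equally likely and probabilities reduce to counting set members (the particular events are illustrative):

```python
# Assumed example: one roll of a fair six-sided die.
A = {2, 4, 6}      # circle A: the roll is even
B = {4, 5, 6}      # circle B: the roll is greater than 3

overlap = A & B    # the intersection, "A and B"

# Conditioning on B: B becomes the whole world, and we ask how much of it lies in A.
p_a_given_b = len(overlap) / len(B)
p_a = len(A) / 6

print(sorted(overlap))                       # [4, 6]
print(round(p_a, 3), round(p_a_given_b, 3))  # 0.5 0.667
```

Counting members this way is only valid when outcomes are equally likely; otherwise each outcome would need its own weight.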

AI anchor: the probabilistic reasoning of a spam filter

A spam filter watches the words in an email and asks: given these words, how likely is spam? It has counted, over thousands of past emails, how often each word showed up in spam versus not. A word like "invoice" might be neutral; "viagra" might appear in 95% of spam and almost no real mail. The filter combines those conditional probabilities into a single score. You will assemble the exact combining rule in Module 3 (Bayes); this module builds the single-clue version first.
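The counting step for a single clue can be sketched as follows. The six-email corpus, the "spam"/"ham" labels, and the function name `p_spam_given_word` are all hypothetical; a real filter would count over far more mail and would need to handle words it has never seen.

```python
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs, illustrative only.
emails = [
    ("free prize inside", "spam"),
    ("claim your free gift", "spam"),
    ("invoice attached", "ham"),
    ("free lunch on friday", "ham"),
    ("your invoice is overdue", "spam"),
    ("meeting notes attached", "ham"),
]

with_word = Counter()        # number of emails containing each word
spam_with_word = Counter()   # number of spam emails containing each word

for text, label in emails:
    for word in set(text.split()):   # count each word once per email
        with_word[word] += 1
        if label == "spam":
            spam_with_word[word] += 1

def p_spam_given_word(word):
    """Single-clue estimate of P(spam | email contains word)."""
    return spam_with_word[word] / with_word[word]

print(round(p_spam_given_word("free"), 3))   # 0.667 (2 of the 3 emails with "free" are spam)
print(p_spam_given_word("invoice"))          # 0.5
```

This sketch raises `ZeroDivisionError` for a word that never appears in the corpus, which mirrors the requirement that the conditioning event have nonzero probability.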

Interpreting a contingency table

A contingency table cross-tabulates counts for two events, here clue present or absent against spam or not spam. A feature is informative in proportion to the gap between \( P(\text{spam} \mid \text{clue}) \) and \( P(\text{spam}) \). A feature that scarcely changes the conditional probability is nearly uninformative; a feature that raises it from 30% to 95% is highly informative. Test your reading of a contingency table below; these are the same computations a model performs automatically.
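To make the gap concrete, the sketch below ranks three clues by how far each moves the probability of spam away from the base rate. All counts are hypothetical, chosen so that the base rate is 30% and one clue lifts it to 95%; the clue words themselves are illustrative.

```python
# Hypothetical counts out of 1,000 emails, 300 of them spam (illustrative only).
TOTAL, SPAM = 1000, 300

# For each clue: (spam emails containing it, non-spam emails containing it).
clue_counts = {
    "free": (180, 70),
    "invoice": (30, 70),
    "winner": (95, 5),
}

p_spam = SPAM / TOTAL

def p_spam_given(clue):
    """P(spam | clue) = (spam emails with clue) / (all emails with clue)."""
    spam_with, ham_with = clue_counts[clue]
    return spam_with / (spam_with + ham_with)

# Rank clues by how far they move the probability away from the base rate.
gaps = {clue: p_spam_given(clue) - p_spam for clue in clue_counts}
for clue, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{clue:8s} P(spam|clue)={p_spam_given(clue):.2f}  gap={gap:+.2f}")
```

With these numbers "winner" ranks first (0.95 against a base rate of 0.30), "free" second (0.72), and "invoice" last, because 30 of its 100 emails are spam, exactly the base rate.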

Why this matters next

Conditional probability is the foundation of naive Bayes (Module 3 here, and a full classifier in Course 3) and of every evaluation metric introduced in Module 7. Precision is \( P(\text{actually positive} \mid \text{predicted positive}) \) and recall is \( P(\text{predicted positive} \mid \text{actually positive}) \): the same formula applied to different conditioning events. A working command of \( P(A \mid B) \) therefore provides the basis for much of model evaluation.
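Precision and recall can be computed with one conditional-probability helper, changing only the conditioning event. The confusion-matrix counts below are hypothetical, and the helper name `cond` is an illustrative choice.

```python
# Hypothetical confusion-matrix counts for a spam classifier (illustrative only).
tp = 240   # predicted spam, actually spam
fp = 40    # predicted spam, actually not spam
fn = 60    # predicted not spam, actually spam
tn = 660   # predicted not spam, actually not spam
total = tp + fp + fn + tn

def cond(p_joint, p_given):
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_given

p_both = tp / total                   # P(predicted positive and actually positive)
p_pred_pos = (tp + fp) / total        # P(predicted positive)
p_actual_pos = (tp + fn) / total      # P(actually positive)

precision = cond(p_both, p_pred_pos)  # P(actually positive | predicted positive)
recall = cond(p_both, p_actual_pos)   # P(predicted positive | actually positive)

print(round(precision, 3), round(recall, 3))  # 0.857 0.8
```

The numerator is the same joint probability in both cases; only the denominator changes, which is why the two metrics can diverge sharply on the same classifier.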

One-sentence summary: conditional probability \( P(A \mid B) = P(A \text{ and } B) / P(B) \) asks "within the cases where \( B \) holds, how often does \( A \) also hold?" — and the bigger the gap between \( P(A \mid B) \) and \( P(A) \), the more information the clue \( B \) carries.
