
Module 4 — Naive Bayes

Supervised learning · hands-on · about 30 minutes.

Naive Bayes applies Bayes' rule from Course 2, Module 3 to construct a probabilistic classifier; the canonical textbook example is a spam filter. It trains rapidly, requires comparatively little data, and remains a strong baseline for text classification. In this module you will operate a live spam filter, toggling individual words and observing how each contributes to the posterior probability of spam.

Bayes' rule applied to classification

The quantity of interest is the posterior probability \( P(\text{spam} \mid \text{words}) \) — the probability that an email is spam, conditional on the words it contains. Bayes' rule re-expresses this in terms of quantities that can be estimated by counting frequencies in historical, labeled data:

\[ P(\text{spam} \mid \text{words}) \;\propto\; P(\text{spam}) \times P(\text{words} \mid \text{spam}) \]

The symbol \( \propto \) means "is proportional to," not "equals." The exact form of Bayes' rule divides the right-hand side by \( P(\text{words}) \), but that denominator is the same regardless of the class being tested, so it cannot change which class scores highest. Dropping it leaves the two sides equal only up to that shared constant — which is precisely what \( \propto \) records, and all a classifier needs in order to pick the more probable class.

\( P(\text{spam}) \) is the prior — the marginal probability of spam in the overall mail distribution. \( P(\text{words} \mid \text{spam}) \) is the likelihood — the probability of observing those words conditional on the email being spam.

The "naive" conditional-independence assumption

Estimating \( P(\text{whole sentence} \mid \text{spam}) \) directly is intractable, because the space of possible sentences is effectively infinite and most have never been observed in training data. Naive Bayes resolves this by introducing a strong simplifying assumption: the words are treated as conditionally independent given the class label, so the joint likelihood factorizes into a product of per-word likelihoods:

\[ P(\text{words} \mid \text{spam}) \;\approx\; \prod_i P(\text{word}_i \mid \text{spam}) \]

This independence assumption is empirically false — for example, "york" is far more likely to follow "new" than to occur in arbitrary context — yet the classifier performs strongly in practice, because correct classification requires only that the posterior assign the highest probability to the correct class, not that the absolute probability be calibrated.

A worked example, with numbers

Suppose a short email contains just two words, "free money", and that counting a labeled training inbox produced these estimates (ham means legitimate, non-spam mail):

Prior: \( P(\text{spam}) = 0.4 \) and \( P(\text{ham}) = 0.6 \) — spam is the minority of this inbox.
The word free: \( P(\text{free}\mid\text{spam}) = 0.30 \), but \( P(\text{free}\mid\text{ham}) = 0.02 \).
The word money: \( P(\text{money}\mid\text{spam}) = 0.20 \), but \( P(\text{money}\mid\text{ham}) = 0.03 \).

The independence assumption is what lets us simply multiply the per-word likelihoods together; we then multiply by the prior to get a score for each class:

\[ \text{spam score} = \underbrace{0.4}_{P(\text{spam})} \times \underbrace{0.30}_{\text{free}} \times \underbrace{0.20}_{\text{money}} = 0.024 \]

\[ \text{ham score} = \underbrace{0.6}_{P(\text{ham})} \times \underbrace{0.02}_{\text{free}} \times \underbrace{0.03}_{\text{money}} = 0.00036 \]

The spam score is roughly 67 times larger, so the email is classified as spam. Recall that this all comes from Bayes' rule:

\[ P(\text{class} \mid \text{words}) = \frac{P(\text{words} \mid \text{class}) \, P(\text{class})}{P(\text{words})} \propto P(\text{words} \mid \text{class}) \, P(\text{class}) \]

The denominator \( P(\text{words}) \) is the same for every class, so it can be dropped while we compare them — that is the \( \propto \) ("proportional to") step, and each score above is exactly the numerator \( P(\text{words} \mid \text{class}) \, P(\text{class}) \). To turn the two scores back into an actual probability, divide each by their sum (the normalization the \( \propto \) sign hides):

\[ P(\text{spam} \mid \text{words}) = \frac{0.024}{0.024 + 0.00036} \approx 0.985 \quad (\text{about } 98.5\%) \]

Notice we never had to know the true probability of the exact sentence "free money" — only which class scored higher. That is precisely why the unrealistic independence assumption still classifies correctly.
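
The arithmetic of the worked example can be reproduced in a few lines of Python. This is a minimal sketch that hard-codes the estimates quoted above; the names prior, likelihood, class_scores, and posterior are chosen for illustration and do not come from any library.

```python
# Estimates copied from the worked example above.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "money": 0.20},
    "ham": {"free": 0.02, "money": 0.03},
}

def class_scores(words):
    """Unnormalized score per class: prior times the product of per-word likelihoods."""
    scores = {}
    for label, p in prior.items():
        score = p
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    return scores

def posterior(words):
    """Divide each score by the sum of all scores so that they add up to 1."""
    scores = class_scores(words)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

scores = class_scores(["free", "money"])
post = posterior(["free", "money"])
print(scores)                  # spam close to 0.024, ham close to 0.00036
print(round(post["spam"], 3))  # 0.985
```

The division inside posterior is the normalization that the proportionality sign leaves out; the class ranking is already fixed by class_scores alone.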

One hazard of estimating likelihoods by counting deserves mention: if a word never appears in one class of the training data, its estimated likelihood for that class is exactly zero — and a single zero factor drives the class's entire product to zero, so one unseen word would veto the class outright. The standard remedy is add-one (Laplace) smoothing: add 1 to every word count before computing the likelihoods, so no estimate is ever exactly zero. scikit-learn's MultinomialNB applies this correction by default (its alpha=1.0 parameter).
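
The effect of smoothing is easy to see on a single word. The sketch below assumes hypothetical counts (a word that never occurs in spam, 100 word tokens in the spam class, a vocabulary of 50 distinct words); smoothed_likelihood is an illustrative helper, not a scikit-learn function. Adding alpha to every word count also adds alpha times the vocabulary size to the denominator, which keeps the per-class likelihoods summing to 1.

```python
def smoothed_likelihood(word_count, total_words_in_class, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing; alpha=1 is add-one (Laplace)."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# Hypothetical counts, invented for illustration.
count_in_spam = 0        # the word never appears in spam
total_spam_words = 100   # word tokens observed in spam
vocab_size = 50          # distinct words across the training data

unsmoothed = count_in_spam / total_spam_words
smoothed = smoothed_likelihood(count_in_spam, total_spam_words, vocab_size)

print(unsmoothed)  # 0.0, which would zero out the whole spam product
print(smoothed)    # 1/150, small but nonzero
```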

With a real email the product runs over dozens of words, each a small probability, and multiplying them can underflow to zero in floating-point arithmetic. This is why the computation is performed in log space (Course 2, Module 1): taking logarithms turns every product into a sum, so the spam score becomes \( \log(0.4) + \log(0.30) + \log(0.20) \), and the class with the larger total wins — the same decision, computed stably.
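
A short experiment makes the underflow concrete. The sketch assumes a hypothetical email of 400 words, each with likelihood 0.01 under some class, and then recomputes the two scores from the worked example in log space.

```python
import math

# Hypothetical long email: 400 words, each with likelihood 0.01.
likelihoods = [0.01] * 400

product = 1.0
for p in likelihoods:
    product *= p

log_sum = sum(math.log(p) for p in likelihoods)

print(product)            # 0.0: the true value, 1e-800, is too small for a float
print(round(log_sum, 2))  # -1842.07

# The worked example in log space: the larger (less negative) total wins.
log_spam = math.log(0.4) + math.log(0.30) + math.log(0.20)
log_ham = math.log(0.6) + math.log(0.02) + math.log(0.03)
print(log_spam > log_ham)  # True
```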

Construct the classifier

Each word below modifies the spam odds by its likelihood ratio: words characteristic of spam multiply the odds upward, words characteristic of legitimate mail multiply them downward. Toggle the words contained in the email and observe the resulting posterior. Adjust the prior slider to vary the baseline rate of spam.
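
The odds update described here can be sketched with the numbers from the worked example. The function name posterior_from_odds is hypothetical, and the likelihood ratios are computed from the "free" and "money" estimates given earlier rather than taken from the widget.

```python
def posterior_from_odds(prior_spam, likelihood_ratios):
    """Multiply the prior odds by each word's likelihood ratio, then convert odds to a probability."""
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Likelihood ratio = P(word | spam) / P(word | ham), from the worked example.
ratios = {"free": 0.30 / 0.02, "money": 0.20 / 0.03}

both_words = posterior_from_odds(0.4, ratios.values())
free_only = posterior_from_odds(0.4, [ratios["free"]])

print(round(both_words, 3))  # 0.985
print(round(free_only, 3))   # 0.909
```

A ratio above 1 pushes the odds toward spam and a ratio below 1 pushes them toward ham, which is why toggling a word changes the posterior in a fixed direction.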


The equivalent procedure in scikit-learn — executable in the browser

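from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# emails and labels hold the training inbox; tests holds the new messages
vec = CountVectorizer()
X = vec.fit_transform(emails)            # word counts, one row per email
clf = MultinomialNB().fit(X, labels)     # learns P(word | class) and the class priors
clf.predict_proba(vec.transform(tests))  # P(class | words); columns follow clf.classes_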

The word-level probabilities shown in the widget are the same quantities that .fit() estimates by counting word occurrences in labeled spam and ham. Click Run it yourself, then add additional strings to the tests list and re-execute.

When you run it, the program prints a spam/ham probability for each test message:

'free money now'                  P(spam)=0.973  P(ham)=0.027
'see you at the meeting today'    P(spam)=0.014  P(ham)=0.986

The first message — full of free and money — is flagged spam at about 97%; the second is almost certainly legitimate (99% ham). Your numbers will match exactly, because the data and the model are fixed.

emails = [...] labels = [...]
The training data. A tiny labeled inbox: eight messages, four marked spam and four ham (legitimate mail). Everything the model knows comes from these eight examples.

CountVectorizer().fit_transform(emails)
Turn words into numbers. Each email becomes a row of word counts, recording how many times each vocabulary word appears (a "bag of words"). Word order is discarded; only the counts matter.

MultinomialNB().fit(X, labels)
Train. The model counts how often each word occurs in spam versus ham and stores the resulting per-word likelihoods \( P(\text{word}\mid\text{class}) \), the very quantities you toggle in the widget above.

vec.transform(tests)
Encode the new mail. Two unseen messages are converted to the same word-count form. Note transform, not fit: it reuses the vocabulary already learned during training.

clf.predict_proba(...)
Predict. For each message the model combines the prior with the per-word likelihoods, exactly as in the "free money" calculation from earlier, and returns \( P(\text{spam}) \) and \( P(\text{ham}) \), which sum to 1.

for text, p in zip(...): print(...)
Print. The loop simply formats each message alongside its two probabilities.
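
To see what the training and prediction steps amount to, here is a from-scratch sketch in plain Python. It is not scikit-learn's implementation: the four-message inbox is invented for illustration, train and predict_proba are hypothetical helper names, and splitting on whitespace is a simplification of what CountVectorizer does. It counts words per class, applies add-one smoothing, and scores new text in log space.

```python
import math
from collections import Counter

def train(emails, labels, alpha=1.0):
    """Count words per class; return log priors and smoothed log likelihoods."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for text, label in zip(emails, labels):
        word_counts[label].update(text.lower().split())
    vocab = sorted(set().union(*word_counts.values()))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_like

def predict_proba(text, log_prior, log_like):
    """Return P(class | words), skipping words outside the training vocabulary."""
    log_scores = {}
    for c in log_prior:
        log_scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.lower().split() if w in log_like[c]
        )
    # Subtract the maximum before exponentiating to keep the normalization stable.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

# Hypothetical four-message inbox, invented for this sketch.
emails = ["free money now", "win free prize", "meeting at noon", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_like = train(emails, labels)
probs = predict_proba("free money", log_prior, log_like)
print(probs)
```

Because the inbox here differs from the lesson's eight messages, the printed probabilities will not match the output shown earlier; only the procedure is the same.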

See the evidence add up

The same classification, viewed in log space. Practical implementations of Naive Bayes do not multiply probabilities directly; they add their logarithms, so the posterior log-odds is simply the prior log-odds plus one step per word. Toggle the words and watch each contribution slide the running total toward spam or ham; the marker's final position, mapped through the logistic function, is the posterior probability.
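
The running total can be traced by hand for the "free money" example. This sketch reuses the estimates from the worked example; sigmoid, steps, and trace are illustrative names, not part of the widget.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

prior_spam = 0.4
running = math.log(prior_spam / (1.0 - prior_spam))  # prior log-odds

# One additive step per word: the log of its likelihood ratio.
steps = {
    "free": math.log(0.30 / 0.02),
    "money": math.log(0.20 / 0.03),
}

trace = [running]
for word, step in steps.items():
    running += step
    trace.append(running)

print([round(t, 3) for t in trace])  # [-0.405, 2.303, 4.2]
print(round(sigmoid(running), 3))    # 0.985
```

The total starts below zero because spam is the minority class, and each spam-typical word moves it upward by a fixed amount regardless of the order in which the words are added.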


AI anchor: a compact, efficient, and enduring baseline

Naive Bayes underpinned the first practical spam filters and remains in production use for document tagging, sentiment analysis, language identification, and as the standard baseline against which more complex models are evaluated. Its enduring methodological lesson is that a model may rest on an assumption that is demonstrably false (conditional independence of features) and nevertheless yield strong empirical performance, because classification accuracy depends only on the relative ordering of class posteriors, not their absolute calibration. Identifying contexts in which a simplifying assumption is acceptable is a central skill in applied machine learning.

Check your understanding

Answer a short set of questions on priors, conditional independence, and the rationale for the naive Bayes approximation.


Why this matters next

Naive Bayes combines probabilities multiplicatively; the next model, the decision tree (Module 5), instead applies a hierarchical sequence of binary tests on individual features, producing a fundamentally different class of decision boundary. Decision trees are the building block of random forests and gradient-boosted ensembles, which remain state of the art on many tabular datasets. Comparing how different model classes partition the same feature space is central to principled model selection.

Summary: naive Bayes classifies by combining a class prior \( P(\text{spam}) \) with per-word likelihoods \( P(\text{word} \mid \text{spam}) \) under a conditional-independence assumption that factorizes the joint likelihood as a product. The assumption is demonstrably violated by natural language, yet the classifier remains a fast and effective baseline for text classification.