Module 4 — Naive Bayes

Supervised learning · hands-on · about 30 minutes.

Naive Bayes turns Bayes’ rule from Course 2, Module 3 into a working classifier — and it is still the textbook spam filter. It is fast, needs little data, and is shockingly hard to beat on text. In this module you will build a live spam filter: toggle words on and off and watch the spam probability swing.

Bayes’ rule, as a classifier

We want \( P(\text{spam} \mid \text{words}) \) — the chance an email is spam given the words it contains. Bayes flips it into things we can count from past email:

\[ P(\text{spam} \mid \text{words}) \;\propto\; P(\text{spam}) \times P(\text{words} \mid \text{spam}) \]

\( P(\text{spam}) \) is the prior — how common spam is overall. \( P(\text{words} \mid \text{spam}) \) is the likelihood — how typical those words are of spam.
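To make the proportionality concrete, here is a minimal sketch that turns a prior and two likelihoods into an actual probability by normalising over both classes. The function name and all the numbers are hypothetical, chosen only for illustration.

```python
def posterior_spam(prior_spam, lik_spam, lik_ham):
    """Return P(spam | words) by normalising prior x likelihood over both classes."""
    score_spam = prior_spam * lik_spam          # P(spam) * P(words | spam)
    score_ham = (1.0 - prior_spam) * lik_ham    # P(ham)  * P(words | ham)
    return score_spam / (score_spam + score_ham)

# Hypothetical numbers: 20% of mail is spam, and these words are
# ten times more typical of spam than of ham.
p = posterior_spam(prior_spam=0.2, lik_spam=0.05, lik_ham=0.005)
print(round(p, 3))  # 0.714
```

Even with a fairly low prior, a likelihood that favours spam ten to one pushes the posterior above 70%.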

The "naive" trick

Estimating \( P(\text{whole sentence} \mid \text{spam}) \) directly is hopeless: most sentences have never been seen before, so there is nothing to count. Naive Bayes therefore makes one bold simplifying assumption: treat each word as independent given the class, and just multiply the per-word probabilities:

\[ P(\text{words} \mid \text{spam}) \;\approx\; \prod_i P(\text{word}_i \mid \text{spam}) \]

That assumption is technically false — "york" is far more likely after "new" — yet the classifier works remarkably well anyway, because it only needs to get the winner right, not the exact probability. Working in log space (Course 2, Module 1) turns that long product into a stable sum.
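The log-space point can be seen in a few lines. The sketch below uses made-up per-word probabilities for a hypothetical 400-word email; the raw product underflows to zero, while the sum of logs stays finite and still picks a winner.

```python
import math

def log_score(log_prior, word_probs):
    """log P(class) + sum_i log P(word_i | class)."""
    return log_prior + sum(math.log(p) for p in word_probs)

# Hypothetical per-word probabilities for a 400-word email
probs_spam = [0.01] * 400
probs_ham = [0.008] * 400

# The raw product is about 1e-800, far below what a float can hold
naive_product = math.prod(probs_spam)   # underflows to 0.0

# The log scores are ordinary negative numbers and compare safely
spam_score = log_score(math.log(0.5), probs_spam)
ham_score = log_score(math.log(0.5), probs_ham)
winner = "spam" if spam_score > ham_score else "ham"
print(naive_product, winner)
```

Because the logarithm is monotonic, comparing log scores gives the same winner as comparing the products would, if the products could be represented.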

Build the filter

Each word below tilts the odds: a spam-ish word multiplies the spam odds up, a work-ish word multiplies them down. Toggle the words in your "email" and watch the verdict. Slide the prior to make spam rarer or more common.
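One way to read the activity is in odds form: start from the prior odds, then multiply by one likelihood ratio per word that is switched on. The sketch below assumes that reading; the word list and ratio values are hypothetical, not the ones the activity uses.

```python
# Hypothetical likelihood ratios P(word | spam) / P(word | ham).
# Values above 1 tilt toward spam, values below 1 tilt toward ham.
LIKELIHOOD_RATIOS = {"free": 4.0, "winner": 8.0, "meeting": 0.25, "report": 0.5}

def spam_probability(words_on, prior_spam):
    """Posterior P(spam) from prior odds times one ratio per toggled word."""
    odds = prior_spam / (1.0 - prior_spam)   # prior odds of spam vs. ham
    for word in words_on:
        odds *= LIKELIHOOD_RATIOS[word]      # each word tilts the odds
    return odds / (1.0 + odds)               # convert odds back to a probability

p_spammy = spam_probability(["free", "winner"], prior_spam=0.5)   # odds 32
p_mixed = spam_probability(["free", "meeting"], prior_spam=0.5)   # odds 1
p_rare = spam_probability(["free", "winner"], prior_spam=0.2)     # odds 8
```

With these numbers a spam-ish word and a work-ish word can cancel exactly, and lowering the prior pulls the same email's spam probability down.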

The same thing in scikit-learn — run it right here, nothing to install
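from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# emails: list of strings, labels: matching list of classes, new_email: one string
vec = CountVectorizer()
X = vec.fit_transform(emails)              # word counts
clf = MultinomialNB().fit(X, labels)       # learns P(word | class)
X_new = vec.transform([new_email])         # reuse the vocabulary learned above
clf.predict_proba(X_new)                   # P(class | words), columns follow clf.classes_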

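The word probabilities you see below are what .fit() estimates by counting words in known spam vs. known ham, with a small smoothing term (alpha, 1.0 by default) so that a word never seen in one class does not get probability zero. Hit Run it yourself, then add your own lines to tests and rerun.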
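To show what "estimating by counting" means, here is a from-scratch sketch of smoothed word counts. It follows the same add-alpha idea as MultinomialNB but is not scikit-learn's code; the toy corpus and the function name are invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration
train = [
    ("win free money now", "spam"),
    ("free prize winner", "spam"),
    ("meeting notes attached", "ham"),
    ("free lunch at the meeting", "ham"),
]

def fit_word_probs(examples, alpha=1.0):
    """Estimate P(word | class) as (count + alpha) / (class total + alpha * vocab size)."""
    counts = {}
    vocab = set()
    for text, label in examples:
        words = text.split()
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    probs = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values()) + alpha * len(vocab)
        probs[label] = {w: (word_counts[w] + alpha) / total for w in vocab}
    return probs

probs = fit_word_probs(train)
print(probs["spam"]["free"], probs["ham"]["free"])
```

"meeting" never appears in the spam examples, yet its spam probability is small rather than zero, so a single unseen word cannot wipe out the whole product.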

AI anchor: small, fast, and still everywhere

Naive Bayes powered the first practical spam filters and still runs in document tagging, sentiment analysis, and language ID. It also serves as a fast baseline that ML teams check before reaching for anything heavier. Its lesson outlives it: a model can rest on an assumption that is obviously wrong (word independence) and still be useful, because classification only needs the right ranking of classes. Knowing when "wrong but useful" is good enough is real ML judgment.

Check your filter sense

A few questions on priors, independence, and why the trick works. You will get a score.

Why this matters next

Naive Bayes multiplies probabilities; the next model, decision trees (Module 5), instead asks a sequence of yes/no questions. That gives a totally different shape of boundary, and it is the building block of the random forests and gradient-boosted trees that win on tabular data. Comparing how each model carves up the same data is the heart of model selection.

One-sentence summary: naive Bayes classifies by combining a prior \( P(\text{spam}) \) with per-word likelihoods \( P(\text{word} \mid \text{spam}) \), naively assuming the words are independent and multiplying them: a wrong assumption that still makes an excellent, fast text classifier.
