Module 1 — Why Percent-Correct Isn’t Enough

Start here · hands-on · about 25 minutes.

You took a test and got 70%. Is that good? It is impossible to say, because the number depends heavily on which questions you happened to be asked. The same person can score 90% on an easy form and 50% on a hard one. That weakness is the main motivation for Item Response Theory (IRT), and it is the idea this whole course is built on. This module shows you the problem, then the fix.

Classical test theory: the score is the truth

Definition: Classical Test Theory (CTT) is the traditional way of scoring tests. Your score is simply the number of questions you got right, written as a percentage. That’s it — no model, no formula beyond counting.

Under the hood, CTT assumes your observed score is made of two parts: \( \text{observed} = \text{true score} + \text{error} \). The “true score” stands in for what you actually know, as measured by this particular test. The “error” is everything else: lucky guesses, a bad night’s sleep, a misread question. CTT doesn’t try to separate those pieces, and it has no separate term for how easy or hard the test was; it just reports the total.

This works fine when everyone takes the exact same test. But two things break it:

The score is test-dependent. A 70% on an easy test and a 70% on a hard test are not the same achievement — but CTT reports the same number for both.
Item difficulty depends on who took the test. In CTT, a question’s “difficulty” is just “what fraction of this group got it right.” Give the same question to a stronger class and suddenly it looks easier. The question didn’t change — the yardstick did.
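The second point is easy to check by hand. Here is a minimal sketch, using made-up response data for two hypothetical classes answering the same question; the function name `ctt_difficulty` is an illustration, not part of any library.

```python
# Hypothetical data: 1 = correct, 0 = incorrect, one entry per student.
# Both classes answered the SAME question.
average_class = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]    # 4 of 10 correct
stronger_class = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]   # 8 of 10 correct


def ctt_difficulty(responses):
    """CTT item 'difficulty': the fraction of the group that answered correctly."""
    return sum(responses) / len(responses)


p_average = ctt_difficulty(average_class)
p_strong = ctt_difficulty(stronger_class)

print(f"Average class:  {p_average:.1f}")   # 0.4
print(f"Stronger class: {p_strong:.1f}")    # 0.8
```

The question is identical in both runs, yet its CTT "difficulty" doubles from 0.4 to 0.8 because the group changed.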

The bottom line: a CTT score mixes together how able the student is and how hard the test was, and you can never fully separate them. For a fixed paper exam given once, that’s tolerable. For an adaptive test — where different students see different questions on purpose — it is fatal.

What CTT gets right

Before we bury it, credit where it’s due. CTT has been the backbone of educational testing for nearly a century, and for good reason:

It’s simple. Score = right answers / total questions. No model, no parameters, no estimation loop. You can compute it on a napkin.
It works for fixed-form tests. When every student takes the exact same exam, a 70% means the same thing for everyone — because the test difficulty is held constant.
The statistics are easy. Item difficulty is just “what fraction got it right.” Discrimination is just a point-biserial correlation. Reliability is Cronbach’s \( \alpha \). All of these are one-line formulas you can run in a spreadsheet.
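To show how little machinery these statistics need, here is a small sketch in plain Python. The response matrix is invented for illustration, and the discrimination shown is the simple (uncorrected) point-biserial, where the item is correlated with a total score that still includes that item.

```python
import math
from statistics import mean, pvariance

# Hypothetical responses: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

n_items = len(responses[0])
totals = [sum(row) for row in responses]                        # raw scores
items = [[row[j] for row in responses] for j in range(n_items)]  # columns

# Item difficulty: fraction of students who answered each item correctly.
difficulty = [mean(col) for col in items]


def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / math.sqrt(pvariance(x) * pvariance(y))


# Discrimination: correlation between each item and the total score.
discrimination = [pearson(col, totals) for col in items]

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = (n_items / (n_items - 1)) * (
    1 - sum(pvariance(col) for col in items) / pvariance(totals)
)

print("difficulty:    ", difficulty)
print("discrimination:", [round(r, 3) for r in discrimination])
print("alpha:         ", round(alpha, 3))
```

Every number here describes this particular group of five students; rerun it with a different class and the difficulties and discriminations will move.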

Most classroom quizzes, district benchmarks, and even the SAT (in its fixed-form version) ran on CTT for decades. If your test never changes form, CTT is perfectly serviceable.

A concrete example: two forms, one student

Imagine Maria, who knows about 60% of the curriculum. She takes Form A — ten straightforward questions, each well within her reach. She scores \( 8/10 = 80\% \). Then she takes Form B — ten harder questions, several above her level. She scores \( 4/10 = 40\% \).

Maria didn’t change. The curriculum didn’t change. The only thing that changed was which questions she happened to see. Under CTT, Form A says “B student” and Form B says “failing.” Both are wrong — both are mixing Maria’s ability with the test’s difficulty and reporting the blend as if it were a pure measure of Maria.

This is the fundamental problem. CTT’s observed score is a confounded quantity: \( X = T + E \), where \( T \) is the true score and \( E \) is random error. But \( T \) is defined for one particular form: it is the score the student would average on that form, so the form’s difficulty is folded into the score and CTT gives you no way to pull it back out. IRT’s contribution is exactly that: it separates the person from the instrument.

See the flaw yourself

Below is one student with a fixed, unchanging skill. Give them an easy form, then a hard form, and watch their raw percentage swing wildly — even though the student never changed.

The point: the same student’s percent-correct rises on an easy form and falls on a hard form.
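The same experiment can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the student is given a fixed 85% chance on each easy question and a fixed 35% chance on each hard one, and the forms are made unrealistically long (200 questions) so random luck mostly averages out.

```python
import random


def simulate_form(p_correct_per_question, rng):
    """Return percent-correct for one student on one form.

    Each entry is the (assumed) chance this student answers that question
    correctly; the student's skill is the same for every form.
    """
    correct = sum(1 for p in p_correct_per_question if rng.random() < p)
    return 100.0 * correct / len(p_correct_per_question)


rng = random.Random(0)       # fixed seed so the run is repeatable

easy_form = [0.85] * 200     # assumed: questions well within reach
hard_form = [0.35] * 200     # assumed: questions mostly above the student

easy_pct = simulate_form(easy_form, rng)
hard_pct = simulate_form(hard_form, rng)

print(f"Easy form: {easy_pct:.0f}%")
print(f"Hard form: {hard_pct:.0f}%")
```

The student object never changes between the two calls; only the list of questions does, and that alone is enough to move the raw percentage by tens of points.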

The IRT idea: put people and questions on one scale

IRT fixes this with one elegant move. It places every student and every question on a single shared scale of difficulty/ability, written with the Greek letter \( \theta \) (theta):

A student has an ability \( \theta \), a position on that scale. Higher means more skilled. The scale has no hard limits, but by convention it is centered near 0 and most students fall between about \( -3 \) and \( +3 \).
A question has a difficulty \( b \) on the same scale. A question at \( b = 1.5 \) sits to the right of a student at \( \theta = 0 \): it is above their level.

Because they share an axis, you can ask the one question that matters: given this student’s ability and this item’s difficulty, what is the probability they answer correctly? That probability is the engine of everything that follows. When ability is far above difficulty, the chance is high; far below, it is low; right at the difficulty, it’s a coin flip (at least in the simplest version of the model, before guessing is added later in the course). Module 2 draws that relationship as a curve.
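This module does not give a formula for that probability, so the following is only a sketch. It assumes the simplest common choice, a one-parameter logistic curve that depends only on the gap \( \theta - b \); the function name `p_correct` is made up for this example.

```python
import math


def p_correct(theta, b):
    """Assumed one-parameter logistic form: depends only on the gap theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Ability far above difficulty: high chance.
print(round(p_correct(theta=3.0, b=0.0), 2))   # 0.95

# Ability equal to difficulty: a coin flip.
print(round(p_correct(theta=0.0, b=0.0), 2))   # 0.5

# The example from the text: student at 0, question at 1.5.
print(round(p_correct(theta=0.0, b=1.5), 2))   # 0.18
```

Only the difference between the two numbers matters in this form, which is why students and questions have to live on the same axis.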

The payoff: because items live on a fixed scale that does not depend on who took them, a student’s \( \theta \) estimate means the same thing no matter which questions they saw. That is what lets an adaptive test give two people completely different questions and still compare them fairly.
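To make the payoff concrete, here is a hedged sketch that revisits Maria's two forms. It assumes a one-parameter logistic model and assumes the item difficulties are already known; the difficulty values are invented for illustration. Under that model, the maximum-likelihood ability estimate is the \( \theta \) at which the expected number correct equals the raw score, which the code finds by bisection.

```python
import math


def p_correct(theta, b):
    """Assumed one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_theta(difficulties, raw_score, lo=-6.0, hi=6.0):
    """Bisection: find theta where expected number correct equals raw_score.

    Only meaningful when 0 < raw_score < number of items.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, b) for b in difficulties)
        if expected < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical calibrated difficulties for the two forms.
form_a = [-1.5, -1.4, -1.2, -1.1, -1.0, -1.0, -0.9, -0.8, -0.6, -0.5]  # easy
form_b = [0.3, 0.4, 0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 1.2, 1.3]            # hard

theta_a = estimate_theta(form_a, raw_score=8)   # Maria's 8/10 on Form A
theta_b = estimate_theta(form_b, raw_score=4)   # Maria's 4/10 on Form B

print(f"Form A: 80% correct, theta estimate {theta_a:.2f}")
print(f"Form B: 40% correct, theta estimate {theta_b:.2f}")
```

With these invented difficulties, the raw percentages differ by 40 points while the two ability estimates land close together. Real estimates would also carry measurement error, which this sketch ignores.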

Where you’ve met this without knowing it. A chess Elo rating and the matchmaking rank in a video game use the same trick: people and challenges on one shared scale, with a probability of “winning” that depends on the gap between them. Elo is, essentially, a one-parameter IRT model where the “items” are opponents. A credit score is a looser cousin: a single number that stands in for a probability, though with no opposing “item” on the same scale. If you understand this course, you understand the core idea behind all of them.
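The Elo connection can be checked numerically. The sketch below uses the standard Elo expected-score formula and compares it with a one-parameter logistic curve (assumed here as the simplest IRT form); the two agree once ratings are rescaled by \( \ln(10)/400 \). The function names are illustrative.

```python
import math


def elo_expected(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rasch_probability(theta, b):
    """Assumed one-parameter logistic IRT model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Converting Elo points to the logistic scale: multiply by ln(10) / 400.
SCALE = math.log(10) / 400.0

player, opponent = 1600, 1400
p_elo = elo_expected(player, opponent)
p_irt = rasch_probability(player * SCALE, opponent * SCALE)

print(round(p_elo, 3))   # 0.76
print(round(p_irt, 3))   # 0.76
```

In this reading the player's rating plays the role of \( \theta \) and the opponent's rating plays the role of the item difficulty.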

Sort the statements

For each statement, decide whether it describes Classical Test Theory or Item Response Theory.

Why this matters next. Everything in this course hangs off one object: the curve relating ability \( \theta \) to the probability of a correct answer. Module 2 introduces it, the item characteristic curve, and you’ll drag a student along it. Modules 3–5 then add one knob at a time (difficulty, discrimination, guessing) until you’ve built the exact 3-parameter model QuantegyAI uses.

One-sentence summary: a raw percent-correct score is test-dependent and can’t separate student skill from test difficulty, so IRT instead places people (ability \( \theta \)) and questions (difficulty) on one shared scale and models the probability of a correct answer.

Next: The Item Characteristic Curve →