
Module 1 — Why Percent-Correct Isn’t Enough

Start here · hands-on · about 25 minutes.

You took a test and got 70%. Is that good? It is impossible to say, because the number depends heavily on which questions you happened to be asked. The same person might score 90% on an easy form and 50% on a hard one. That weakness is a central motivation for Item Response Theory (IRT), and it is the idea this whole course is built on. This module shows you the problem, then the fix.

Classical test theory: the score is the truth

The old, intuitive approach is classical test theory (CTT). Your score is just the count of right answers, and CTT treats that observed score as a true score plus random error: \( \text{observed score} = \text{true score} + \text{error} \). Simple, and it works fine when everyone takes the exact same test. But two things break it:

So a CTT score mixes together how able the student is and how hard the test was, and you can never fully separate them. For a fixed paper exam given once, that’s tolerable. For an adaptive test — where different students see different questions on purpose — it is fatal.

See the flaw yourself

Below is one student with a fixed, unchanging skill. Give them an easy form, then a hard form, and watch their raw percentage swing wildly — even though the student never changed.

This activity needs JavaScript. The point: the same student’s percent-correct rises on an easy form and falls on a hard form.

The IRT idea: put people and questions on one scale

IRT fixes this with one elegant move. It places every student and every question on a single shared scale of difficulty/ability, written with the Greek letter \( \theta \) (theta):

Because they share an axis, you can ask the one question that matters: given this student’s ability and this item’s difficulty, what is the probability they answer correctly? That probability is the engine of everything that follows. When ability is far above difficulty, the chance is high; far below, it is low; when ability equals difficulty, it’s a coin flip (at least in the simplest model, before guessing is added). Module 2 draws that relationship as a curve.
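
To make that concrete, here is a minimal Python sketch of the probability. It assumes the simplest logistic form, \( P = 1 / (1 + e^{-(\theta - b)}) \), where \( b \) stands for the item’s difficulty. This module does not state a formula, so treat the function name `p_correct` and the symbol `b` as illustrative assumptions rather than the course’s definition.

```python
import math

def p_correct(theta, b):
    """Probability of a correct answer for ability theta and item difficulty b.

    Assumption: one-parameter logistic form (no discrimination or guessing).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

b = 0.0  # hypothetical item difficulty
for theta in (-3.0, 0.0, 3.0):  # far below, equal to, far above the difficulty
    print(f"theta={theta:+.1f}  P(correct)={p_correct(theta, b):.2f}")
# theta=-3.0  P(correct)=0.05
# theta=+0.0  P(correct)=0.50
# theta=+3.0  P(correct)=0.95
```

Only the gap between ability and difficulty enters the calculation, which is why both can sit on one scale.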

The payoff: because items live on a fixed scale that does not depend on who took them, a student’s \( \theta \) estimate means the same thing no matter which questions they saw. That is what lets an adaptive test give two people completely different questions and still compare them fairly.
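
You can check this payoff with a small simulation. The sketch below assumes a one-parameter logistic response model and item difficulties that are already known. The form ranges, the number of items, the seed, and the helper names (`simulate`, `estimate_theta`) are all hypothetical choices made for illustration, and the grid search is just one simple way to find a maximum-likelihood estimate.

```python
import math
import random

def p_correct(theta, b):
    # Assumed one-parameter logistic response model.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate(theta, difficulties, rng):
    """Draw one right (1) or wrong (0) answer per item for a fixed ability."""
    return [1 if rng.random() < p_correct(theta, b) else 0 for b in difficulties]

def estimate_theta(responses, difficulties):
    """Grid-search maximum-likelihood estimate of theta, difficulties known."""
    def log_lik(theta):
        total = 0.0
        for x, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            total += math.log(p) if x else math.log(1.0 - p)
        return total
    grid = [g / 50.0 for g in range(-200, 201)]  # -4.0 to 4.0 in steps of 0.02
    return max(grid, key=log_lik)

rng = random.Random(0)
true_theta = 0.5  # the student's fixed, unchanging skill
forms = {
    "easy": [rng.uniform(-2.0, 0.0) for _ in range(1000)],
    "hard": [rng.uniform(1.0, 3.0) for _ in range(1000)],
}

results = {}
for name, difficulties in forms.items():
    answers = simulate(true_theta, difficulties, rng)
    percent = 100.0 * sum(answers) / len(answers)
    theta_hat = estimate_theta(answers, difficulties)
    results[name] = (percent, theta_hat)
    print(f"{name} form: {percent:.0f}% correct, estimated theta = {theta_hat:+.2f}")
```

If you run it, the raw percentages should land far apart while the two \( \theta \) estimates both sit close to the true value of 0.5. Exact numbers depend on the random seed.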

Where you’ve met this, without knowing

Your credit “score,” a chess Elo rating, and the matchmaking rank in a video game all use essentially the same trick: people and challenges on one shared scale, with a probability of “winning” that depends on the gap between them. Elo is, essentially, a one-parameter IRT model where the “items” are opponents. If you understand this course, you understand the core of all of them.
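
The Elo connection can be checked numerically. The sketch below compares the standard Elo expected-score formula with a one-parameter logistic curve, assuming ratings are moved onto the logistic scale by multiplying by \( \ln(10)/400 \). The function names and the example ratings are illustrative.

```python
import math

def elo_expected(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def p_correct(theta, b):
    """One-parameter logistic probability; the 'item' here is the opponent."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

SCALE = math.log(10.0) / 400.0  # converts Elo points to logistic units

pairs = [(1500, 1500), (1700, 1500), (1500, 1900)]
for player, opponent in pairs:
    elo = elo_expected(player, opponent)
    irt = p_correct(player * SCALE, opponent * SCALE)
    print(f"{player} vs {opponent}: Elo={elo:.4f}  logistic={irt:.4f}")
```

The two columns agree because \( 10^{x} = e^{x \ln 10} \): the Elo formula is the same logistic curve with a different unit of measurement.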

Sort the statements

For each statement, decide whether it describes Classical Test Theory or Item Response Theory.

This activity needs JavaScript.

Why this matters next

Everything in this course hangs off one object: the curve relating ability \( \theta \) to the probability of a correct answer. Module 2 introduces it as the item characteristic curve, and you’ll drag a student along it. Modules 3–5 then add one knob at a time (difficulty, discrimination, guessing) until you’ve built the exact 3-parameter model QuantegyAI uses.

One-sentence summary: a raw percent-correct score is test-dependent and can’t separate student skill from test difficulty, so IRT instead places people (ability \( \theta \)) and questions (difficulty) on one shared scale and models the probability of a correct answer.
