
Module 6 — Estimating Ability

Scoring engine · hands-on · about 25 minutes.

So far you've learned how to build a model that predicts responses from known parameters. Now it's time to run the machinery backwards. In practice, the items' parameters \( a, b, c \) have already been calibrated — and you have a real student who just answered some questions. Your job: figure out where on the \( \theta \) scale that student actually sits. This is ability estimation, and it is the heart of what every adaptive test does after each response.

The likelihood of a response pattern

Suppose a student answers a set of \( n \) items and you record the vector \( \mathbf{u} = (u_1, u_2, \ldots, u_n) \), where \( u_i = 1 \) if the student answered item \( i \) correctly and \( u_i = 0 \) if not. Given a particular ability level \( \theta \), what is the probability of that exact pattern?

Assuming responses to different items are locally independent (given \( \theta \), knowing the student got item 3 right tells you nothing extra about item 7), the probability of the whole pattern is the product of the per-item probabilities. Each item contributes its ICC probability if the response was correct, or one minus that probability if the response was wrong:

\[ L(\theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\,(1 - P_i(\theta))^{1 - u_i} \]

Read this carefully. When \( u_i = 1 \) (correct), the factor for item \( i \) is \( P_i(\theta) \) — the model's probability of a correct answer. When \( u_i = 0 \) (wrong), the factor is \( 1 - P_i(\theta) \) — the model's probability of a wrong answer. Multiply these across all items and you get the likelihood: how probable is this particular response string, as a function of the unknown ability \( \theta \)?

The function \( L(\theta) \) is not the probability that the student has ability \( \theta \); it is the probability of the observed response pattern given that the student has ability \( \theta \). That distinction matters. \( L(\theta) \) is a function of \( \theta \) with the data held fixed, not a probability distribution over \( \theta \), and it need not integrate to 1.
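The product formula can be sketched in a few lines of Python. This is a minimal illustration, not a production scoring routine: it assumes the logistic form of the 3PL curve without the optional 1.7 scaling constant, and the item bank `ITEMS` and response vector `PATTERN` are hypothetical values invented for the example.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no 1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, items, responses):
    """Multiply P_i(theta) for correct answers and 1 - P_i(theta) for wrong ones."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value

# Hypothetical item bank: (a, b, c) triples ordered from easiest to hardest.
ITEMS = [(1.2, -2.0, 0.2), (1.0, -1.0, 0.2), (1.5, 0.0, 0.2),
         (1.1, 1.0, 0.2), (1.3, 2.0, 0.2)]
PATTERN = [1, 1, 1, 0, 0]  # three easiest items correct, two hardest wrong

# Evaluate the same observed pattern under three hypothetical abilities.
for theta in (-2.0, 0.0, 2.0):
    print(theta, likelihood(theta, ITEMS, PATTERN))
```

With these assumed parameters the pattern is more probable at \( \theta = 0 \) than at either \( \theta = -2 \) or \( \theta = 2 \), which is the behavior the next section exploits.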

Maximum Likelihood Estimation (MLE)

The Maximum Likelihood Estimate (MLE) of ability is the value \( \hat{\theta} \) that makes the observed response pattern as probable as possible:

\[ \hat{\theta} = \operatorname*{arg\,max}_{\theta} \; L(\theta) \]

In practice, because \( L(\theta) \) is a product of many small numbers, it can underflow to zero on a computer. The standard fix is to maximize the log-likelihood instead — since the logarithm is monotonically increasing, the \( \theta \) that maximizes \( \log L(\theta) \) is the same as the one that maximizes \( L(\theta) \):

\[ \log L(\theta) = \sum_{i=1}^{n} \bigl[ u_i \log P_i(\theta) + (1 - u_i) \log (1 - P_i(\theta)) \bigr] \]

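For the 1PL and 2PL models the log-likelihood is concave in \( \theta \), so it has a single peak whenever the pattern mixes correct and incorrect answers. With the 3PL's guessing parameter, concavity is no longer guaranteed and local maxima can occur, although most realistic response patterns still give one clear peak. An all-correct or all-incorrect pattern has no finite maximum under any of these models: the likelihood keeps rising as \( \theta \) heads toward \( +\infty \) or \( -\infty \). A grid search over a bounded range copes with all of these cases, and Newton–Raphson iteration converges quickly from a reasonable starting value when the peak is well defined. In the activity below you will do a grid search: compute the likelihood on a dense grid of \( \theta \) values, then find the peak.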
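A grid search over the log-likelihood might look like the sketch below. The grid bounds of \( \pm 4 \), the 0.01 step, the helper name `mle_grid`, and the item bank are all assumptions chosen for illustration; the 3PL curve again uses the plain logistic form.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no 1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Sum of log P_i for correct answers and log(1 - P_i) for wrong ones."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle_grid(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Return the grid point in [lo, hi] with the highest log-likelihood."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical item bank: (a, b, c) triples ordered from easiest to hardest.
ITEMS = [(1.2, -2.0, 0.2), (1.0, -1.0, 0.2), (1.5, 0.0, 0.2),
         (1.1, 1.0, 0.2), (1.3, 2.0, 0.2)]

theta_hat = mle_grid(ITEMS, [1, 1, 1, 0, 0])
print("MLE for a mixed pattern:", theta_hat)

# An all-correct pattern has no finite maximum, so the search stops at the grid edge.
print("MLE for all correct:", mle_grid(ITEMS, [1, 1, 1, 1, 1]))
```

An estimate that lands exactly on a grid boundary is a sign that the likelihood is still rising there, not that the student's ability is really \( \pm 4 \).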

What the likelihood curve tells you

The shape of the likelihood function carries important information beyond just the peak location. A sharp, narrow peak means the data pin down the ability estimate precisely — the student's responses are consistent with one narrow range of abilities. A flat, wide curve means the responses are broadly compatible with many different ability levels, and the estimate is uncertain.

When does the likelihood go flat? One cause is simply having too few items. Another is a surprising response pattern: a student who misses the easiest item but aces the hardest one has behaved inconsistently with any single ability value, so no value of \( \theta \) makes the pattern very probable. The peak is low, and under the 3PL, where a correct answer to a hard item may be a lucky guess, the curve typically spreads across a wide range of \( \theta \). Conversely, a student who gets all easy items right and all hard items wrong (exactly what the model predicts for a person at moderate ability) produces a narrow, confident peak.
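The contrast between a consistent and a surprising pattern can be checked numerically. The sketch below is one way to do it, under stated assumptions: the item bank is hypothetical, the 3PL curve uses the plain logistic form, and "width" is measured crudely as the length of the grid region where the likelihood is at least half its peak (a choice made for this example, truncated at the grid bounds).

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no 1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, items, responses):
    """Product of P_i for correct answers and 1 - P_i for wrong ones."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value

def peak_and_width(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Return (theta at peak, peak height, half-maximum width) on a bounded grid."""
    step = (hi - lo) / (steps - 1)
    grid = [lo + step * k for k in range(steps)]
    curve = [likelihood(t, items, responses) for t in grid]
    peak = max(curve)
    theta_hat = grid[curve.index(peak)]
    # Total length of the grid region where the curve is at least half its peak.
    width = step * sum(1 for v in curve if v >= 0.5 * peak)
    return theta_hat, peak, width

# Hypothetical item bank: (a, b, c) triples ordered from easiest to hardest.
ITEMS = [(1.2, -2.0, 0.2), (1.0, -1.0, 0.2), (1.5, 0.0, 0.2),
         (1.1, 1.0, 0.2), (1.3, 2.0, 0.2)]

consistent = peak_and_width(ITEMS, [1, 1, 1, 0, 0])  # easy right, hard wrong
surprising = peak_and_width(ITEMS, [0, 1, 1, 0, 1])  # easiest wrong, hardest right

print("consistent:", consistent)
print("surprising:", surprising)
```

Both patterns contain three correct answers. With this particular bank the surprising pattern has the lower peak and the wider half-maximum region; other banks will give different numbers, so treat the output as illustrative only.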

Where you've met MLE without knowing it

Fitting a linear regression, training a logistic classifier, or estimating the mean of a Gaussian: all of these are MLE in disguise. With Gaussian errors of fixed variance, the log-likelihood equals the negative sum of squared errors up to a positive scale factor and an additive constant, so "minimize squared error" and "maximize likelihood" pick the same parameters. IRT ability estimation is logistic regression with the roles swapped: instead of fitting item parameters from many students, you fix the item parameters and fit one student's \( \theta \).
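The Gaussian connection can be made explicit with a short derivation. The symbols here are assumptions introduced for the illustration: \( y_i \) are observations, \( \mu_i \) are the model's predictions, and \( \sigma \) is a known, fixed standard deviation.

\[ \log L(\mu) = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right) \right] = -\frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2 \]

The first term does not depend on the predictions, and \( 1/(2\sigma^2) \) is positive, so the predictions that maximize \( \log L \) are the ones that minimize the sum of squared errors.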

Interact: flip responses and watch the MLE move

Below are five items with known parameters. Each chip shows an item's difficulty \( b \). Click a chip to toggle it between correct (lit up) and incorrect. The likelihood curve redraws instantly, and the amber marker lands at the Maximum Likelihood Estimate. Try flipping an inconsistent pattern — easy item wrong, hard item right — and notice how the curve flattens.


Bayesian alternatives: EAP and MAP

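Pure MLE has a weakness: it ignores everything you knew about the student before they started. If a student answers only two items, the likelihood curve may be almost flat, and the MLE can land at an extreme like \( \hat{\theta} = +4 \) or \( -4 \); if both answers are correct or both are wrong, the likelihood has no finite maximum at all. A Bayesian approach multiplies the likelihood by a prior distribution \( \pi(\theta) \), typically a standard normal reflecting the fact that most test-takers are near average ability, and then summarizes the resulting posterior \( p(\theta \mid \mathbf{u}) \propto L(\theta)\,\pi(\theta) \) in one of two ways. MAP (maximum a posteriori) takes the posterior mode:

\[ \hat{\theta}_{\text{MAP}} = \operatorname*{arg\,max}_{\theta} \; L(\theta)\,\pi(\theta) \]

EAP (expected a posteriori) takes the posterior mean:

\[ \hat{\theta}_{\text{EAP}} = \frac{\int \theta \, L(\theta)\,\pi(\theta)\,d\theta}{\int L(\theta)\,\pi(\theta)\,d\theta} \]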

As the number of items grows, the prior becomes negligible and both MAP and EAP converge to the MLE. For a full-length exam of 30–40 items, the three methods give nearly identical answers. For the 5–10 items typical in a short CAT session, the Bayesian correction matters.
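The effect of the prior on a very short test can be sketched on the same kind of grid. This is an illustrative approximation, not a reference implementation: the posterior mean is computed by a simple sum over grid points, the prior is an unnormalized standard normal, and the two-item bank, the grid bounds, and the function name `estimates` are assumptions made for the example.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no 1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, items, responses):
    """Product of P_i for correct answers and 1 - P_i for wrong ones."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value

def estimates(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Return (MLE, MAP, EAP) computed on a bounded grid with a N(0, 1) prior."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    lik = [likelihood(t, items, responses) for t in grid]
    prior = [math.exp(-0.5 * t * t) for t in grid]  # unnormalized standard normal
    post = [l * w for l, w in zip(lik, prior)]      # unnormalized posterior
    mle = grid[lik.index(max(lik))]
    map_est = grid[post.index(max(post))]
    # Posterior mean; the grid spacing cancels between numerator and denominator.
    eap = sum(t * p for t, p in zip(grid, post)) / sum(post)
    return mle, map_est, eap

# Hypothetical two-item session, both answered correctly.
SHORT_TEST = [(1.2, -2.0, 0.2), (1.0, -1.0, 0.2)]
mle, map_est, eap = estimates(SHORT_TEST, [1, 1])
print("MLE:", mle)
print("MAP:", map_est)
print("EAP:", eap)
```

Here the MLE sits on the upper grid boundary because the all-correct likelihood never stops rising, while MAP and EAP stay close to the prior mean. That is the Bayesian correction at work on a short test.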

Sort: interpreting likelihood patterns

For each description, decide whether the result is a sharp (narrow, confident) or flat (wide, uncertain) likelihood curve.


Why this matters next

You can now estimate a student's ability from a static response record. Module 7 takes this further: instead of asking "what does this response pattern tell us about \( \theta \)?", it asks "which item would give us the most information about \( \theta \) if we asked it next?" That question is the engine of computer-adaptive testing, and it is answered by the information function, the subject of Module 7.

One-sentence summary: IRT estimates ability by finding the \( \hat{\theta} \) that maximizes the likelihood \( L(\theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}(1-P_i(\theta))^{1-u_i} \) of the observed responses; a consistent pattern produces a sharp likelihood peak and a confident estimate, while a surprising pattern flattens the curve and increases uncertainty.
