Module 8 — Does It Actually Work? Evaluating Ed-Tech

Capstone · evidence, effect sizes, and honest claims · about 35 minutes.

Every learning product claims to work. Testimonials glow; press releases cite user counts in the millions; a celebrity teacher swears by it; a cheerful dashboard shows steep upward curves. None of that is evidence. After seven modules building the scientific foundations of learning, this final module turns the same critical lens on the tools themselves — including this one. How do you tell whether an ed-tech product actually improves learning, rather than just feeling like it does?

The answer is the same framework scientists use to evaluate medical treatments, economic policies, and any other claim about cause and effect: levels of evidence, effect sizes, and replication. These are not arcane statistics concepts. They are practical reading skills every educator and ed-tech buyer needs.

Levels of evidence: the ESSA tiers

In 2015 the US Every Student Succeeds Act (ESSA) created a four-tier evidence framework for educational interventions. It is the clearest accessible standard for thinking about research quality:

Tier | Label | What it requires
Tier 1 | Strong | At least one well-designed, well-implemented randomized controlled trial (RCT) showing a positive effect on a relevant student outcome, with a statistically significant result.
Tier 2 | Moderate | At least one well-designed, well-implemented quasi-experimental study (e.g., matched comparison groups, regression discontinuity) showing a positive effect.
Tier 3 | Promising | At least one well-designed and well-implemented correlational study with statistical controls showing a positive relationship between the intervention and student outcomes.
Tier 4 | Demonstrates a rationale | A logic model or theory of action based on high-quality research on related topics, but no direct study of the intervention itself.

Most ed-tech products on the market, when pressed for evidence, offer Tier 4 at best — a plausible story, a pilot with no control group, a white paper authored by the vendor. A Tier 4 product is not evidence-free — a sound theory of action is valuable — but it is a very different claim from one backed by an independent RCT. The tiers are not arbitrary gatekeeping; they map directly onto the degree to which alternative explanations (the intervention wasn't what caused improvement; the students were self-selected; the teachers were unusually motivated) can be ruled out.
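To make the tier logic concrete, here is a minimal sketch that maps a product's body of evidence onto the four tiers described above. The `Study` fields and the `essa_tier` function are hypothetical names invented for this illustration, and the rule is a simplification: formal reviews apply further criteria that this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # Hypothetical record of one study; field names are illustrative only.
    design: str             # "rct", "quasi_experimental", or "correlational"
    well_implemented: bool  # judged well designed and well implemented
    positive_effect: bool   # positive result on a relevant student outcome


# Study design determines the best tier a single study can support.
TIER_BY_DESIGN = {"rct": 1, "quasi_experimental": 2, "correlational": 3}


def essa_tier(studies, has_rationale):
    """Return the strongest tier supported (1 is strongest), or None."""
    qualifying = [
        TIER_BY_DESIGN[s.design]
        for s in studies
        if s.design in TIER_BY_DESIGN and s.well_implemented and s.positive_effect
    ]
    if qualifying:
        return min(qualifying)
    # No qualifying study: only a theory of action can earn Tier 4.
    return 4 if has_rationale else None


vendor_pilot = [Study("rct", well_implemented=False, positive_effect=True)]
print(essa_tier(vendor_pilot, has_rationale=True))
```

A poorly implemented trial does not earn Tier 1 in this sketch; the product falls back to whatever its rationale supports.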

Why RCTs are the gold standard

In a randomized controlled trial, participants are assigned at random to an intervention group (which uses the new product) or a control group (which doesn't). Random assignment is the key move: on average, it balances all other variables, measured or unmeasured (student ability, teacher quality, school resources, motivation), between the two groups. A difference in outcomes larger than chance alone would produce can therefore be attributed to the intervention itself, not to pre-existing differences. No other study design fully achieves this, which is why Tier 1 requires it. Quasi-experimental designs (Tier 2) approximate it through clever comparison strategies but can never entirely rule out confounding.
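A small simulation can show why random assignment matters. The sketch below is illustrative only: the score model, the 5-point true effect, and the "motivation" confounder are all assumptions made up for the example. It compares a coin-flip assignment with one where motivated students opt in on their own.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # assumed points added by a hypothetical product


def simulate(assign, n=20000, seed=0):
    """Return (treated mean - control mean) under a given assignment rule."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved confounder
        uses_product = assign(rng, motivation)
        # Scores depend on motivation whether or not the product is used.
        score = 70 + 8 * motivation + rng.gauss(0, 5)
        if uses_product:
            treated.append(score + TRUE_EFFECT)
        else:
            control.append(score)
    return mean(treated) - mean(control)


def randomized(rng, motivation):
    return rng.random() < 0.5  # coin flip, ignores motivation


def self_selected(rng, motivation):
    return motivation > 0  # motivated students opt in


rct_estimate = simulate(randomized)
observational_estimate = simulate(self_selected)
print(f"randomized: {rct_estimate:.1f}  self-selected: {observational_estimate:.1f}")
```

Under these assumptions the randomized comparison should land close to the true 5-point effect, while the self-selected comparison credits the product with the advantage that motivated students already had.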

What RCTs cannot tell you

Randomized trials establish that an intervention worked in a specific context with specific participants — they do not automatically tell you it will work in your classroom, with your students, under your resource constraints. External validity — the degree to which a finding generalizes — requires replication across diverse contexts. A single RCT in a well-funded suburban school district in one state is weak evidence that an intervention will work in a rural district with different demographics and infrastructure. Replication is not a luxury; it is the mechanism by which science becomes reliable knowledge.

Effect size: how big is "it works"?

Statistical significance tells you whether an observed difference is larger than sampling noise alone would plausibly produce. It does not tell you whether the difference is educationally meaningful. For that you need an effect size: a standardized measure of the magnitude of the difference.

The most common effect-size metric in education research is Cohen's \( d \):

\[ d = \frac{\mu_{\text{treatment}} - \mu_{\text{control}}}{\sigma_{\text{pooled}}} \]

where \( \mu \) is the group mean and \( \sigma_{\text{pooled}} \) is the pooled standard deviation of outcomes. Rough conventional benchmarks (Cohen, 1988): \( d = 0.2 \) is small, \( d = 0.5 \) is medium, and \( d = 0.8 \) is large.

The famous reference point is Bloom's two-sigma effect: in a series of studies in the 1980s, Benjamin Bloom found that one-on-one human tutoring produced roughly \( d \approx 2.0 \) compared to conventional classroom instruction — meaning the average tutored student outperformed about 98% of the classroom comparison group. That is the ceiling toward which adaptive learning systems aspire. Most real-world ed-tech interventions achieve \( d \) values in the range of 0.2 to 0.4 — meaningful, but not miraculous.
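The formula for \( d \) and the "outperformed about 98%" reading can both be checked in a few lines. This is a minimal sketch: the score lists are made-up numbers, and the percentile conversion assumes normally distributed outcomes with equal spread in both groups.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def share_of_control_below_average_treated(d):
    """Fraction of the control group scoring below the average treated student.

    Assumes normal outcomes with the same standard deviation in both groups.
    """
    return NormalDist().cdf(d)


# Made-up test scores, for illustration only.
treatment_scores = [78, 85, 90, 72, 88, 81]
control_scores = [70, 75, 82, 68, 80, 74]

print(f"d = {cohens_d(treatment_scores, control_scores):.2f}")
print(f"share below at d = 2.0: {share_of_control_below_average_treated(2.0):.1%}")
```

The same conversion puts the average treated student at a far more modest position for the 0.2 to 0.4 range typical of real-world interventions.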

A critically important point: a result can be statistically significant with a tiny effect size. With a large enough sample, even a \( d = 0.05 \) difference — roughly invisible in practice — will produce \( p < 0.05 \). Always look for effect sizes alongside \( p \)-values, and ask whether the effect is large enough to be worth the cost of the intervention.
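To see how sample size alone drives significance, the sketch below holds the effect fixed at \( d = 0.05 \) and varies the number of students. It uses a simple z-test approximation (two equal-sized groups, a known and shared standard deviation), which is an assumption made for brevity rather than the test a real study would necessarily use.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized difference d.

    Assumes two groups of equal size and a known, shared standard deviation,
    so the test statistic is z = d * sqrt(n / 2).
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


for n in (100, 1000, 5000, 20000):
    print(f"n per group = {n:>6}: p = {two_sided_p(0.05, n):.4f}")
```

Under this approximation the same tiny effect crosses \( p < 0.05 \) once each group has a little over 3,000 students, even though nothing about its practical size has changed.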

Try it: visualize effect sizes

The activity below shows two overlapping distributions — a control group (gray) and a treatment group (blue) — on a test score scale. Adjust Cohen's \( d \) to see what "small," "medium," and "large" effects actually look like. Notice how even a \( d = 0.8 \) large effect leaves substantial overlap between the groups.

[Interactive activity: two normal distributions, one shifted by \( d \) standard deviations, showing the practical meaning of effect sizes from small (\( d = 0.2 \)) to Bloom's two-sigma (\( d = 2.0 \)).]
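The overlap between the two groups can also be computed directly. The sketch below assumes both groups are normally distributed with a standard deviation of 1 and uses `NormalDist.overlap` from Python's standard library; the labels are just the conventional benchmark names.

```python
from statistics import NormalDist


def overlap(d):
    """Shared area under two unit-variance normal curves whose means differ by d."""
    control = NormalDist(mu=0, sigma=1)
    treatment = NormalDist(mu=d, sigma=1)
    return control.overlap(treatment)


for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8), ("two-sigma", 2.0)]:
    print(f"{label:>9} (d = {d}): {overlap(d):.0%} overlap")
```

Even at \( d = 0.8 \) roughly two thirds of the area is shared, which is why a "large" effect still means many control students outscore many treated students.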

The replication problem

Even well-designed studies are not always right. The last decade of social and educational science has been marked by a replication crisis: a substantial fraction of published results fail to replicate when an independent research team runs the same experiment again. Some estimates put the failure rate at 40–60% for psychology studies, and education research may be similarly affected. The causes are multiple: small sample sizes that produce unstable estimates; analytical flexibility (trying many statistical tests and reporting only the one that works, known as "p-hacking"); publication bias (journals publishing positive results but not null results); and, sometimes, outright fraud.
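Two of these causes, small samples and publication bias, combine in a way that is easy to simulate. The sketch below is an illustration under stated assumptions: a true effect of \( d = 0.2 \), 20 students per group, and a "journal" that publishes only positive results passing a normal-approximation significance test. None of these numbers come from a real study.

```python
import random
from math import sqrt
from statistics import mean, variance

TRUE_D = 0.2       # assumed true effect
N_PER_GROUP = 20   # small study
N_STUDIES = 2000   # number of simulated studies


def run_study(rng):
    """Simulate one two-group study and return its estimated Cohen's d."""
    control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    treatment = [rng.gauss(TRUE_D, 1) for _ in range(N_PER_GROUP)]
    # Equal group sizes, so the pooled variance is the average of the two.
    pooled_sd = sqrt((variance(control) + variance(treatment)) / 2)
    return (mean(treatment) - mean(control)) / pooled_sd


rng = random.Random(42)
all_d = [run_study(rng) for _ in range(N_STUDIES)]

# "Published" means positive and significant under a z approximation (z > 1.96).
published_d = [d for d in all_d if d * sqrt(N_PER_GROUP / 2) > 1.96]

print(f"mean d across all studies: {mean(all_d):.2f}")
print(f"mean d across published:   {mean(published_d):.2f}")
print(f"published: {len(published_d)} of {N_STUDIES}")
```

With 20 students per group, an estimate has to exceed about 0.62 to pass the filter, so every published result in this setup overstates the true effect at least threefold. An independent replication that recovers something near 0.2 will look like a failure.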

For ed-tech buyers, the practical implications are:

Vendor-funded research

When a product's developer funds the study evaluating that product, the results are systematically more positive than independently funded research, a pattern documented across medicine, nutrition science, and education technology. This is not necessarily deliberate fraud; it can arise from subtle choices in study design, outcome selection, comparison conditions, and reporting. The solution is not to ignore vendor-funded research entirely, but to weight it appropriately and look for corroboration from independent sources. The What Works Clearinghouse (whatworks.ed.gov) reviews education research, including ed-tech studies, against its own published evidence standards, which align with the ESSA tiers, and has procedures for managing conflicts of interest in its reviews.

Classify the evidence: strength of efficacy claims

Given each claim about an ed-tech product, classify the strength of the evidence provided.

QuantegyAI and the evidence standard

It would be convenient to end this course by claiming QuantegyAI has Tier 1 evidence and a \( d \) of 0.8. We won't make that claim, because we haven't earned it yet with a fully independent, pre-registered RCT. What we can honestly say is this:

Course complete: the full arc

You have traveled the full arc of learning science, from the cellular to the evaluative:

Module 1 — Memory is encoding, storage, and retrieval; forgetting is exponential by default.
Module 2 — Retrieval practice strengthens memory traces; testing is a cause of learning, not just a measure of it.
Module 3 — Spacing practice over time exploits the forgetting curve to build durable memory efficiently.
Module 4 — Mastery learning ensures each foundation is solid before building on it.
Module 5 — Feedback that corrects errors immediately and specifically is one of the highest-leverage instructional interventions known.
Module 6 — Adaptive learning systems estimate per-skill mastery using Bayesian Knowledge Tracing and select content accordingly.
Module 7 — Learning dashboards should surface leading indicators, not vanity metrics, and must handle learner data ethically.
Module 8 — Evidence quality runs from RCT (strong) to rationale (weak); effect size tells you how big; replication tells you how real.

The evidence standard you now hold is not just for evaluating other products. It is the standard QuantegyAI must meet, and the one you should demand from any tool you put in front of learners.

One-sentence summary: the strongest evidence that an ed-tech product actually improves learning is a randomized controlled trial, or at minimum a well-designed quasi-experiment, showing a real and practically meaningful effect size, replicated independently, rather than testimonials, user counts, or vendor-funded studies.
