Module 8 — Does It Actually Work? Evaluating Ed-Tech
Every learning product claims to work. Testimonials glow; press releases cite user counts in the millions; a celebrity teacher swears by it; a cheerful dashboard shows steep upward curves. None of that is evidence. After seven modules building the scientific foundations of learning, this final module turns the same critical lens on the tools themselves — including this one. How do you tell whether an ed-tech product actually improves learning, rather than just feeling like it does?
The answer is the same framework scientists use to evaluate medical treatments, economic policies, and any other claim about cause and effect: levels of evidence, effect sizes, and replication. These are not arcane statistics concepts. They are practical reading skills every educator and ed-tech buyer needs.
Levels of evidence: the ESSA tiers
In 2015 the US Every Student Succeeds Act (ESSA) created a four-tier evidence framework for educational interventions. It is the clearest accessible standard for thinking about research quality:
| Tier | Label | What it requires |
|---|---|---|
| Tier 1 | Strong | At least one well-designed, well-implemented randomized controlled trial (RCT) showing a statistically significant positive effect on a relevant student outcome. |
| Tier 2 | Moderate | At least one well-designed, well-implemented quasi-experimental study (e.g., matched comparison groups, regression discontinuity) showing a statistically significant positive effect. |
| Tier 3 | Promising | At least one well-designed, well-implemented correlational study with statistical controls for selection bias, showing a statistically significant positive relationship between the intervention and student outcomes. |
| Tier 4 | Demonstrates a rationale | A logic model or theory of action based on high-quality research on related topics, but no direct study of the intervention itself. |
Most ed-tech products on the market, when pressed for evidence, offer Tier 4 at best — a plausible story, a pilot with no control group, a white paper authored by the vendor. A Tier 4 product is not evidence-free — a sound theory of action is valuable — but it is a very different claim from one backed by an independent RCT. The tiers are not arbitrary gatekeeping; they map directly onto the degree to which alternative explanations (the intervention wasn't what caused improvement; the students were self-selected; the teachers were unusually motivated) can be ruled out.
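The tier logic can be written down as a small decision rule. The sketch below is a deliberate simplification: the `Study` class, its field names, and the `essa_tier` function are hypothetical, invented for illustration, and the real statute and its guidance add further conditions that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Study:
    """Hypothetical summary of the best available study of a product."""
    design: str                        # "rct", "quasi_experimental", "correlational", or "none"
    well_implemented: bool             # was the design carried out soundly?
    positive_significant_effect: bool  # statistically significant positive result?


# Study design determines the highest tier a qualifying study can reach.
DESIGN_TO_TIER = {"rct": 1, "quasi_experimental": 2, "correlational": 3}


def essa_tier(study: Study, has_logic_model: bool = False) -> Optional[int]:
    """Return a simplified ESSA tier (1-4), or None if no tier applies."""
    qualifies = study.well_implemented and study.positive_significant_effect
    if qualifies and study.design in DESIGN_TO_TIER:
        return DESIGN_TO_TIER[study.design]
    # Without a qualifying study, only a research-based rationale remains.
    return 4 if has_logic_model else None


vendor_pilot = Study(design="none", well_implemented=False,
                     positive_significant_effect=False)
print(essa_tier(vendor_pilot, has_logic_model=True))  # 4
```

Note that a poorly implemented RCT falls through to Tier 4 (or to no tier at all): the design label alone earns nothing if the study was not carried out soundly.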
What RCTs cannot tell you
Randomized trials establish that an intervention worked in a specific context with specific participants — they do not automatically tell you it will work in your classroom, with your students, under your resource constraints. External validity — the degree to which a finding generalizes — requires replication across diverse contexts. A single RCT in a well-funded suburban school district in one state is weak evidence that an intervention will work in a rural district with different demographics and infrastructure. Replication is not a luxury; it is the mechanism by which science becomes reliable knowledge.
Effect size: how big is "it works"?
Statistical significance tells you whether an observed difference is larger than sampling noise alone would plausibly produce. It does not tell you whether the difference is educationally meaningful. For that you need an effect size: a standardized measure of the magnitude of the difference.
The most common effect-size metric in education research is Cohen's \( d \):
\[ d = \frac{\mu_{\text{treatment}} - \mu_{\text{control}}}{\sigma_{\text{pooled}}} \]
where \( \mu \) is the group mean and \( \sigma_{\text{pooled}} \) is the pooled standard deviation of outcomes. Rough conventional benchmarks (Cohen, 1988):
- \( d \approx 0.2 \) — small: about 58% of the treatment group above the control median.
- \( d \approx 0.5 \) — medium: about 69% of the treatment group above the control median.
- \( d \approx 0.8 \) — large: about 79% of the treatment group above the control median.
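To make the definition and the benchmarks concrete, here is a minimal sketch using only the Python standard library. The function names and the sample scores are made up for illustration, and the "share above the control median" figure assumes both groups are normally distributed with equal spread.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(treatment, control):
    """Cohen's d: difference in group means divided by the pooled SD."""
    n_t, n_c = len(treatment), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_t - 1) * variance(treatment)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def share_above_control_median(d):
    """Fraction of the treatment group scoring above the control median,
    assuming normal distributions with equal standard deviations."""
    return NormalDist().cdf(d)


# Hypothetical test scores, invented for this example.
treatment_scores = [12, 14, 16, 18, 20]
control_scores = [10, 12, 14, 16, 18]
print(round(cohens_d(treatment_scores, control_scores), 2))

for d in (0.2, 0.5, 0.8):
    print(d, round(share_above_control_median(d), 2))
```

The loop reproduces the 58%, 69%, and 79% figures in the benchmarks above, which are simply the standard normal cumulative probability evaluated at \( d \).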
The famous reference point is Bloom's two-sigma effect: summarizing a series of studies in 1984, Benjamin Bloom reported that one-on-one human tutoring produced roughly \( d \approx 2.0 \) compared to conventional classroom instruction, meaning the average tutored student outperformed about 98% of the classroom comparison group. That is the ceiling toward which adaptive learning systems aspire. Most real-world ed-tech interventions achieve \( d \) values in the range of 0.2 to 0.4: meaningful, but not miraculous.
A critically important point: a result can be statistically significant with a tiny effect size. With a large enough sample, even a \( d = 0.05 \) difference — roughly invisible in practice — will produce \( p < 0.05 \). Always look for effect sizes alongside \( p \)-values, and ask whether the effect is large enough to be worth the cost of the intervention.
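A quick calculation shows how sample size alone can push a negligible effect past the significance threshold. This sketch assumes two equal-sized groups and uses a normal (z-test) approximation; the function name `two_sided_p` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for an observed effect size d
    with n_per_group students in each arm (z-test approximation)."""
    # The standard error of the mean difference, in SD units, is sqrt(2 / n),
    # so the test statistic is d / sqrt(2 / n) = d * sqrt(n / 2).
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


for n in (100, 1_000, 10_000):
    p = two_sided_p(0.05, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>6}: p = {p:.4f} ({verdict})")
```

The effect size is identical in every row; only the sample size changes. At 10,000 students per group the result clears \( p < 0.05 \) comfortably, yet the practical difference between the groups is as small as it was at 100.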
Try it: visualize effect sizes
The activity below shows two overlapping distributions — a control group (gray) and a treatment group (blue) — on a test score scale. Adjust Cohen's \( d \) to see what "small," "medium," and "large" effects actually look like. Notice how even a \( d = 0.8 \) large effect leaves substantial overlap between the groups.
The idea: two normal distributions, one shifted by d standard deviations, showing the practical meaning of effect sizes from small (d=0.2) to Bloom's two-sigma (d=2.0).
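The same picture can be summarized numerically as the shared area under the two curves. The sketch below is not the interactive activity itself; it assumes both groups are normal with the same standard deviation, and the score scale (mean 70, SD 10) is an arbitrary choice for illustration.

```python
from statistics import NormalDist


def overlap_for(d, control_mean=70.0, sd=10.0):
    """Shared area (0 to 1) between control and treatment score distributions
    when the treatment mean is shifted up by d standard deviations."""
    control = NormalDist(control_mean, sd)
    treatment = NormalDist(control_mean + d * sd, sd)
    # NormalDist.overlap returns the overlapping coefficient of the two densities.
    return control.overlap(treatment)


for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: overlap = {overlap_for(d):.0%}")
```

Under these assumptions, roughly two thirds of the area is still shared at \( d = 0.8 \), and the distributions still overlap noticeably even at \( d = 2.0 \).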
The replication problem
Even well-designed studies are not always right. The last decade of social and educational science has been marked by a replication crisis — a systematic finding that a substantial fraction of published results (some estimates: 40–60% of psychology studies, and a similar proportion in education) fail to replicate when an independent research team runs the same experiment again. The causes are multiple: small sample sizes that produce unstable estimates; analytical flexibility (trying many statistical tests and reporting only the one that works — "p-hacking"); publication bias (journals publishing positive results but not null results); and sometimes, outright fraud.
For ed-tech buyers, the practical implications are:
- Look for independent replication: studies conducted by researchers with no financial relationship to the vendor carry far more weight than vendor-funded research.
- Look for pre-registered studies: a trial registered in a public database (e.g., AEA RCT Registry, OSF) before data collection, with a pre-specified analysis plan, is much harder to p-hack.
- Distrust single studies: a single positive RCT is encouraging, not conclusive. A meta-analysis — a quantitative synthesis of multiple independent studies — is much stronger evidence.
- Weight sample size: a well-designed RCT with 2,000 students is stronger evidence than one with 40 students, even if both reach statistical significance.
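The last two points, combining studies and weighting by sample size, can be illustrated with a simple fixed-effect meta-analysis. This is a sketch under stated assumptions: it uses a common large-sample approximation for the variance of \( d \), the two studies are invented, and real meta-analyses often use random-effects models to allow for genuine differences between contexts.

```python
from math import sqrt


def d_variance(d, n_t, n_c):
    """Approximate sampling variance of Cohen's d for group sizes n_t and n_c."""
    return (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))


def pooled_effect(studies):
    """Inverse-variance weighted mean of effect sizes (fixed-effect model).

    studies: list of (d, n_treatment, n_control) tuples.
    Returns (pooled_d, standard_error).
    """
    weights = [1 / d_variance(d, n_t, n_c) for d, n_t, n_c in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


# Hypothetical studies: a small pilot with a large effect and a
# large trial with a modest one.
studies = [(0.60, 20, 20), (0.15, 1000, 1000)]
pooled, se = pooled_effect(studies)
print(f"pooled d = {pooled:.2f} (SE {se:.3f})")
```

Because each study's weight is the inverse of its variance, the 2,000-student trial dominates and the pooled estimate lands very close to its value. The eye-catching result from the 40-student pilot barely moves the result.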
QuantegyAI and the evidence standard
It would be convenient to end this course by claiming QuantegyAI has Tier 1 evidence and a \( d \) of 0.8. We won't make that claim, because we haven't earned it yet with a fully independent, pre-registered RCT. What we can honestly say is this:
- Every mechanism QuantegyAI implements — mastery gating (Module 4), spaced retrieval scheduling (Modules 2 and 3), adaptive selection by BKT estimate (Module 6), immediate corrective feedback (Module 5) — is backed by decades of independent replication in cognitive science and education research.
- The combination of these principles in a single product is itself a testable claim, and QuantegyAI is committed to ongoing evaluation — including pre-registered studies with independent researchers.
- The goal is Bloom's two-sigma, achieved through technology rather than a 1:1 human tutor. Whether we get there is a question of evidence, not marketing. You now have the tools to evaluate that evidence yourself.
- Module 1: Memory is encoding, storage, and retrieval; forgetting is exponential by default.
- Module 2: Retrieval practice strengthens memory traces; testing is a cause of learning, not just a measure of it.
- Module 3: Spacing practice over time exploits the forgetting curve to build durable memory efficiently.
- Module 4: Mastery learning ensures each foundation is solid before building on it.
- Module 5: Feedback that corrects errors immediately and specifically is one of the highest-leverage instructional interventions known.
- Module 6: Adaptive learning systems estimate per-skill mastery using Bayesian Knowledge Tracing and select content accordingly.
- Module 7: Learning dashboards should surface leading indicators, not vanity metrics, and must handle learner data ethically.
- Module 8: Evidence quality runs from RCT (strong) to rationale (weak); effect size tells you how big; replication tells you how real.
The evidence standard you now hold is not just for evaluating other products. It is the standard QuantegyAI must meet — and that you should demand from any tool you put in front of learners.