Module 8 — Does It Actually Work? Evaluating Ed-Tech

Capstone · evidence, effect sizes, and honest claims · about 35 minutes.

Every learning product claims to work. Testimonials glow; press releases cite user counts in the millions; a celebrity teacher swears by it; a cheerful dashboard shows steep upward curves. None of that is evidence. After seven modules building the scientific foundations of learning, this final module turns the same critical lens on the tools themselves — including this one. How do you tell whether an ed-tech product actually improves learning, rather than just feeling like it does?

The answer is the same framework scientists use to evaluate medical treatments, economic policies, and any other claim about cause and effect: levels of evidence, effect sizes, and replication. These are not arcane statistics concepts. They are practical reading skills every educator and ed-tech buyer needs.

Levels of evidence: the ESSA tiers

In 2015 the US Every Student Succeeds Act (ESSA) created a four-tier evidence framework for educational interventions. It is the clearest accessible standard for thinking about research quality:

Tier	Label	What it requires
Tier 1	Strong	At least one well-designed, well-implemented randomized controlled trial (RCT) showing a statistically significant positive effect on a relevant student outcome.
Tier 2	Moderate	At least one well-designed, well-implemented quasi-experimental study (e.g., matched comparison groups, regression discontinuity) showing a statistically significant positive effect.
Tier 3	Promising	At least one well-designed, well-implemented correlational study with statistical controls for selection bias showing a statistically significant positive relationship between the intervention and student outcomes.
Tier 4	Demonstrates a rationale	A logic model or theory of action based on high-quality research on related topics, but no direct study of the intervention itself.

Most ed-tech products on the market, when pressed for evidence, offer Tier 4 at best — a plausible story, a pilot with no control group, a white paper authored by the vendor. A Tier 4 product is not evidence-free — a sound theory of action is valuable — but it is a very different claim from one backed by an independent RCT. The tiers are not arbitrary gatekeeping; they map directly onto the degree to which alternative explanations (the intervention wasn't what caused improvement; the students were self-selected; the teachers were unusually motivated) can be ruled out.

Why RCTs are the gold standard

In a randomized controlled trial, participants are assigned at random to an intervention group (which uses the new product) or a control group (which doesn't). Random assignment is the key move: on average, it balances every other variable (student ability, teacher quality, school resources, motivation) between the two groups, including variables nobody measured. A difference in outcomes larger than chance alone would produce can therefore be attributed to the intervention itself, not to pre-existing differences. No other study design fully achieves this, which is why Tier 1 requires it. Quasi-experimental designs (Tier 2) approximate it through clever comparison strategies but can never entirely rule out confounding.
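The balancing effect of random assignment can be checked with a small simulation. The sketch below is illustrative only: the prior-ability scores, the 80%/20% opt-in rates, and the function name `simulate_prior_ability_gap` are assumptions invented for this example, not data from any study. It compares the gap in prior ability between groups when a coin flip decides who gets the product versus when stronger students are more likely to opt in.

```python
import random
import statistics


def simulate_prior_ability_gap(n_students=10_000, seed=0):
    """Return (gap under random assignment, gap under self-selection)."""
    rng = random.Random(seed)
    # Hypothetical prior-ability scores: mean 100, standard deviation 15.
    ability = [rng.gauss(100, 15) for _ in range(n_students)]

    # Random assignment: a fair coin flip decides each student's group.
    flips = [rng.random() < 0.5 for _ in ability]
    rand_treat = [a for a, f in zip(ability, flips) if f]
    rand_ctrl = [a for a, f in zip(ability, flips) if not f]

    # Self-selection (assumed rates): above-average students opt in 80% of
    # the time, below-average students 20% of the time.
    opts = [rng.random() < (0.8 if a > 100 else 0.2) for a in ability]
    self_treat = [a for a, o in zip(ability, opts) if o]
    self_ctrl = [a for a, o in zip(ability, opts) if not o]

    return (
        statistics.mean(rand_treat) - statistics.mean(rand_ctrl),
        statistics.mean(self_treat) - statistics.mean(self_ctrl),
    )


random_gap, selected_gap = simulate_prior_ability_gap()
print(f"prior-ability gap, random assignment: {random_gap:+.2f}")
print(f"prior-ability gap, self-selection:    {selected_gap:+.2f}")
```

Under these assumptions the randomized groups start out nearly identical, while the self-selected "treatment" group is already far ahead before anyone touches the product. Any later outcome difference in the second design would mix the product's effect with that head start.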

What RCTs cannot tell you

Randomized trials establish that an intervention worked in a specific context with specific participants — they do not automatically tell you it will work in your classroom, with your students, under your resource constraints. External validity — the degree to which a finding generalizes — requires replication across diverse contexts. A single RCT in a well-funded suburban school district in one state is weak evidence that an intervention will work in a rural district with different demographics and infrastructure. Replication is not a luxury; it is the mechanism by which science becomes reliable knowledge.

Effect size: how big is "it works"?

Statistical significance tells you whether an observed difference is larger than sampling noise alone would plausibly produce. It does not tell you whether the difference is educationally meaningful. For that you need an effect size: a standardized measure of the magnitude of the difference.

The most common effect-size metric in education research is Cohen's \( d \):

\[ d = \frac{\mu_{\text{treatment}} - \mu_{\text{control}}}{\sigma_{\text{pooled}}} \]

where \( \mu \) is the group mean and \( \sigma_{\text{pooled}} \) is the pooled standard deviation of outcomes. Rough conventional benchmarks (Cohen, 1988):

\( d \approx 0.2 \) — small: about 58% of the treatment group above the control median.
\( d \approx 0.5 \) — medium: about 69% of the treatment group above the control median.
\( d \approx 0.8 \) — large: about 79% of the treatment group above the control median.

The famous reference point is Bloom's two-sigma effect: in a 1984 paper summarizing studies run by his graduate students, Benjamin Bloom reported that one-on-one human tutoring produced roughly \( d \approx 2.0 \) compared to conventional classroom instruction, meaning the average tutored student outperformed about 98% of the classroom comparison group. That is the ceiling toward which adaptive learning systems aspire. Most real-world ed-tech interventions achieve \( d \) values in the range of 0.2 to 0.4: meaningful, but not miraculous.
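The formula and the benchmark percentages above can be reproduced in a few lines. This is a minimal sketch: the two score lists are made-up numbers for illustration, the function names are hypothetical, and the percentage calculation assumes both groups are normally distributed with equal spread (so the control mean and median coincide).

```python
import math
import statistics
from statistics import NormalDist


def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)  # sample variance (n - 1)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(
        ((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)
    )
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


def share_above_control_median(d):
    """Fraction of the treatment group scoring above the control median,
    assuming normal distributions with equal variance."""
    return NormalDist().cdf(d)


# Hypothetical test scores, invented for this example.
control = [60, 65, 70, 75, 80]
treatment = [64, 69, 74, 79, 84]
print(f"d = {cohens_d(treatment, control):.2f}")  # d = 0.51

for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: {share_above_control_median(d):.0%} above control median")
# d = 0.2: 58%, d = 0.5: 69%, d = 0.8: 79%, d = 2.0: 98%
```

A four-point gain on this made-up test is a "medium" effect only because the scores spread by about eight points; the same four points on a test with a 20-point spread would be a small effect.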

A critically important point: a result can be statistically significant with a tiny effect size. With a large enough sample, even a \( d = 0.05 \) difference — roughly invisible in practice — will produce \( p < 0.05 \). Always look for effect sizes alongside \( p \)-values, and ask whether the effect is large enough to be worth the cost of the intervention.
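To see how sample size alone can manufacture significance, the sketch below computes an approximate two-sided \( p \)-value for a fixed observed effect of \( d = 0.05 \) at several sample sizes. It assumes two equal-sized groups and uses a normal (z-test) approximation rather than an exact t-test; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for an observed effect size d
    with n_per_group students in each arm (normal approximation)."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def min_n_for_significance(d, alpha=0.05):
    """Smallest per-group sample size at which an observed d reaches p < alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(2 * (z_crit / d) ** 2)


for n in (100, 1_000, 10_000):
    print(f"n per group = {n:>6}: p = {two_sided_p(0.05, n):.4f}")

n_needed = min_n_for_significance(0.05)
print(f"d = 0.05 becomes 'significant' at about {n_needed} students per group")
```

The effect is identical in every row; only the number of students changes. Under this approximation a few thousand students per group is enough to push a practically invisible difference below the 0.05 threshold.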

Try it: visualize effect sizes

The activity below shows two overlapping distributions — a control group (gray) and a treatment group (blue) — on a test score scale. Adjust Cohen's \( d \) to see what "small," "medium," and "large" effects actually look like. Notice how even a \( d = 0.8 \) large effect leaves substantial overlap between the groups.

[Interactive activity: two normal distributions, one shifted by d standard deviations, showing the practical meaning of effect sizes from small (d = 0.2) to Bloom's two-sigma (d = 2.0).]
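The same idea can be explored numerically. The sketch below assumes both groups are normal with a standard deviation of 1 and uses the `overlap` method of Python's `statistics.NormalDist` to compute the shared area under the two curves; the function name `overlap_share` is hypothetical.

```python
from statistics import NormalDist


def overlap_share(d):
    """Shared area under a control curve N(0, 1) and a treatment curve N(d, 1)."""
    control = NormalDist(mu=0.0, sigma=1.0)
    treatment = NormalDist(mu=d, sigma=1.0)
    return control.overlap(treatment)


for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8), ("two-sigma", 2.0)):
    print(f"{label:>9} (d = {d}): {overlap_share(d):.0%} overlap")
# small: 92%, medium: 80%, large: 69%, two-sigma: 32%
```

Even at \( d = 0.8 \) roughly two thirds of the area is shared, so many individual control students still outscore many individual treatment students.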

The replication problem

Even well-designed studies are not always right. The last decade of social and educational science has been marked by a replication crisis — a systematic finding that a substantial fraction of published results (some estimates: 40–60% of psychology studies, and a similar proportion in education) fail to replicate when an independent research team runs the same experiment again. The causes are multiple: small sample sizes that produce unstable estimates; analytical flexibility (trying many statistical tests and reporting only the one that works — "p-hacking"); publication bias (journals publishing positive results but not null results); and sometimes, outright fraud.
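Two of these causes, small samples and publication bias, can be demonstrated together. The simulation below is a sketch under stated assumptions: a true effect of \( d = 0.2 \), 20 students per group, 2,000 hypothetical studies, and a "journal" that prints only positive results with \( p < 0.05 \) (normal approximation). None of these numbers come from real studies.

```python
import math
import random
import statistics
from statistics import NormalDist


def run_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (observed d, approximate p-value)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    # Equal group sizes, so the pooled variance is the simple average.
    pooled_sd = math.sqrt(
        (statistics.variance(control) + statistics.variance(treatment)) / 2
    )
    d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
    z = d * math.sqrt(n_per_group / 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, p


rng = random.Random(42)
results = [run_study(true_d=0.2, n_per_group=20, rng=rng) for _ in range(2_000)]

all_d = [d for d, _ in results]
published_d = [d for d, p in results if p < 0.05 and d > 0]  # the "journal" filter

print(f"true effect:                0.20")
print(f"mean d, all studies:        {statistics.mean(all_d):.2f}")
print(f"mean d, published studies:  {statistics.mean(published_d):.2f}")
print(f"share of studies published: {len(published_d) / len(results):.0%}")
```

Every simulated study here is honest, yet the published subset overstates the true effect several times over, because with 20 students per group only unusually lucky samples clear the significance bar. A faithful replication of one of those published studies would usually find a much smaller effect.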

For ed-tech buyers, the practical implications are:

Look for independent replication: studies conducted by researchers with no financial relationship to the vendor carry far more weight than vendor-funded research.
Look for preregistered studies: a trial registered in a public database (e.g., AEA RCT Registry, OSF) before data collection, with a pre-specified analysis plan, is much harder to p-hack.
Distrust single studies: a single positive RCT is encouraging, not conclusive. A meta-analysis — a quantitative synthesis of multiple independent studies — is much stronger evidence.
Weight sample size: a well-designed RCT with 2,000 students is stronger evidence than one with 40 students, even if both reach statistical significance.
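The last two points, pooling studies and weighting by sample size, can be made concrete with a bare-bones meta-analysis. This sketch uses fixed-effect inverse-variance weighting and the usual large-sample approximation for the variance of \( d \); real syntheses often prefer random-effects models. The three studies are invented for illustration, and the function names are hypothetical.

```python
import math


def d_variance(d, n_treat, n_ctrl):
    """Approximate sampling variance of Cohen's d."""
    return (n_treat + n_ctrl) / (n_treat * n_ctrl) + d**2 / (2 * (n_treat + n_ctrl))


def pool_fixed_effect(studies):
    """Inverse-variance weighted mean of effect sizes and its standard error.

    studies: list of (d, n_treat, n_ctrl) tuples.
    """
    weights = [1 / d_variance(d, n_t, n_c) for d, n_t, n_c in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


# Hypothetical studies: (observed d, treatment n, control n).
studies = [
    (0.60, 20, 20),      # 40 students, large observed effect
    (0.25, 1000, 1000),  # 2,000 students, modest observed effect
    (0.30, 150, 150),
]

for d, n_t, n_c in studies:
    se = math.sqrt(d_variance(d, n_t, n_c))
    print(f"n = {n_t + n_c:>4}: d = {d:.2f}, 95% CI half-width = {1.96 * se:.2f}")

pooled_d, pooled_se = pool_fixed_effect(studies)
print(f"pooled d = {pooled_d:.2f} (SE {pooled_se:.2f})")
```

The 40-student study reports the biggest effect but has a confidence interval so wide it barely excludes zero, so it contributes little to the pooled estimate, which lands close to the 2,000-student result.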

Vendor-funded research

When a product's developer funds the study evaluating that product, the results tend to be more positive than those of independently funded research, a pattern documented across medicine, nutrition science, and education technology. This is not necessarily deliberate fraud; it can arise from subtle choices in study design, outcome selection, comparison conditions, and reporting. The solution is not to ignore vendor-funded research entirely, but to weight it appropriately and look for corroboration from independent sources. The What Works Clearinghouse (whatworks.ed.gov) reviews studies of educational interventions against published design standards, which makes it a useful independent check on the evidence a vendor cites.

Classify the evidence: strength of efficacy claims

Given each claim about an ed-tech product, classify the strength of the evidence provided.

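As a stand-in for practice, the classification logic can be written out as a small decision rule. This is a simplified sketch of the four tiers described in this module, not an official ESSA tool: the `Evidence` fields, the design labels, and the sample claims are all hypothetical, and a real review would also weigh sample size, independence, and implementation quality.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    design: str                    # "rct", "quasi_experimental", "correlational_with_controls", or "none"
    well_designed: bool = False    # study judged well-designed and well-implemented
    positive_result: bool = False  # study meets the tier's positive-result requirement
    has_logic_model: bool = False  # theory of action grounded in related research


TIER_BY_DESIGN = {
    "rct": "Tier 1 (Strong)",
    "quasi_experimental": "Tier 2 (Moderate)",
    "correlational_with_controls": "Tier 3 (Promising)",
}


def essa_tier(evidence: Evidence) -> str:
    """Map a description of the evidence to the highest tier it could support."""
    if evidence.well_designed and evidence.positive_result:
        tier = TIER_BY_DESIGN.get(evidence.design)
        if tier is not None:
            return tier
    if evidence.has_logic_model:
        return "Tier 4 (Demonstrates a rationale)"
    return "No qualifying evidence"


# Hypothetical claims, invented for practice.
claims = {
    "Independent RCT with 2,000 students found a positive effect":
        Evidence("rct", well_designed=True, positive_result=True),
    "Matched-comparison study across two districts found a positive effect":
        Evidence("quasi_experimental", well_designed=True, positive_result=True),
    "Vendor white paper: pilot with no control group, design based on spacing research":
        Evidence("none", has_logic_model=True),
    "Five million users and glowing testimonials":
        Evidence("none"),
}

for claim, evidence in claims.items():
    print(f"{essa_tier(evidence):<34} {claim}")
```

Note that the last claim does not reach even Tier 4: popularity and testimonials say nothing about why the product should work.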

QuantegyAI and the evidence standard

It would be convenient to end this course by claiming QuantegyAI has Tier 1 evidence and a \( d \) of 0.8. We won't make that claim, because we haven't earned it yet with a fully independent, preregistered RCT. What we can honestly say is this:

Every mechanism QuantegyAI implements is backed by decades of independent replication in cognitive science and education research: mastery gating (Module 4), spaced retrieval scheduling (Modules 2 and 3), adaptive selection by BKT estimate (Module 6), and immediate corrective feedback (Module 5).
The combination of these principles in a single product is itself a testable claim, and QuantegyAI is committed to ongoing evaluation, including preregistered studies with independent researchers.
The goal is Bloom's two-sigma, achieved through technology rather than a 1:1 human tutor. Whether we get there is a question of evidence, not marketing. You now have the tools to evaluate that evidence yourself.

Course complete: the full arc

You have traveled the full arc of learning science, from the cellular to the evaluative:

Module 1 — Memory is encoding, storage, and retrieval; forgetting is exponential by default.
Module 2 — Retrieval practice strengthens memory traces; testing is a cause of learning, not just a measure of it.
Module 3 — Spacing practice over time exploits the forgetting curve to build durable memory efficiently.
Module 4 — Mastery learning ensures each foundation is solid before building on it.
Module 5 — Feedback that corrects errors immediately and specifically is one of the highest-leverage instructional interventions known.
Module 6 — Adaptive learning systems estimate per-skill mastery using Bayesian Knowledge Tracing and select content accordingly.
Module 7 — Learning dashboards should surface leading indicators, not vanity metrics, and must handle learner data ethically.
Module 8 — Evidence quality runs from RCT (strong) to rationale (weak); effect size tells you how big; replication tells you how real.

The evidence standard you now hold is not just for evaluating other products. It is the standard QuantegyAI must meet — and that you should demand from any tool you put in front of learners.

One-sentence summary: the strongest evidence that an ed-tech product actually improves learning is a randomized controlled trial — or at minimum a well-designed quasi-experiment — showing a real and practically meaningful effect size, replicated independently, rather than testimonials, user counts, or vendor-funded studies.

Capstone reflection — evaluate your own product

You have spent this course studying the science of how people learn. Now apply that lens to something you actually built or designed. Think about any learning tool, lesson, activity, or product you have created — in this course or another context.

Answer both questions below in your own words. Be honest. The goal is not to sell your product — it is to think like an evaluator.

Question 1 — What is one thing your product does well, grounded in learning science?
Identify a specific design choice (e.g. how you give feedback, how you sequence content, how you check for understanding) and explain why it is likely to support learning based on at least one principle from this course (Modules 1–7).

Question 2 — What is one thing you would improve, and how would you evaluate whether the improvement worked?
Identify a specific weakness or gap. Then describe what change you would make and how you would measure its effect — what outcome would you look at, and what kind of study (ESSA tier) would you aim for?
