GPT-4 Made Students 127% Better at Homework and 17% Worse at Math
A field experiment gave ~1,000 Turkish high schoolers GPT-4 for math practice. Performance soared 127%. Then researchers took the AI away, and exam scores dropped 17% below students who never used it at all. Purpose-built tutoring AI avoided the decline entirely.
One hundred and twenty-seven percent. That was the performance improvement when Turkish high schoolers used a GPT-4 tutor on math practice problems. Researchers at the University of Pennsylvania's Wharton School ran the experiment across four 90-minute sessions covering roughly 15% of the math curriculum. Two versions of GPT-4 were deployed: GPT Base, which mimicked standard ChatGPT, and GPT Tutor, which included pedagogical guardrails designed to guide students through problems rather than solve them outright.
GPT Tutor produced the 127% gain. GPT Base managed 48%. Both groups were doing dramatically better on their practice sets.
Then the researchers took the AI away and gave everyone the same exam.
GPT Tutor students held steady. GPT Base students scored 17% lower than the control group that never had AI access at all. Not 17% lower than their AI-assisted performance. Seventeen percent lower than students who practiced with nothing but a textbook.
False Mastery Is Measurable Now
Education researchers have a name for this: the performance-learning distinction. It's the gap between what a student can do with a tool and what they actually internalized. Calculators created a mild version of this decades ago. Generative AI has made the effect large enough to measure in a controlled experiment.
The Bastani et al. study (published in PNAS, 2025) showed that GPT Base functioned as a cognitive crutch. Students didn't use it to understand the math. They used it to get through the practice sets. Analysis of student behavior logs showed GPT Base users copying answers at significantly higher rates than GPT Tutor users, who were guided through step-by-step reasoning instead.
The results amount to a "guardrail premium":
| AI Condition | Practice Performance | Post-Removal Exam | Net Learning Effect |
|---|---|---|---|
| No AI (control) | Baseline | Baseline | 0% |
| GPT Base (raw ChatGPT) | +48% | -17% | Negative |
| GPT Tutor (guardrails) | +127% | No decline | Positive |
Raw ChatGPT access produced a net negative learning outcome. Not neutral. Negative. Students were worse off than if they'd never used AI at all. Adding pedagogical guardrails didn't just prevent the decline; it produced the highest practice performance too. The tutor that refuses to give you the answer makes you better at finding it yourself.
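The table reduces to simple arithmetic; here is a minimal Python sketch of it (the dictionary and helper names are mine, the percentages are the study's):

```python
# Outcomes from the Bastani et al. conditions, expressed relative to
# the no-AI control group (0.0 = baseline).
conditions = {
    "control":   {"practice": 0.0,  "exam": 0.0},
    "gpt_base":  {"practice": 0.48, "exam": -0.17},  # raw ChatGPT
    "gpt_tutor": {"practice": 1.27, "exam": 0.0},    # pedagogical guardrails
}

def net_learning_effect(cond):
    """Exam score vs. control is the only number that survives
    once the AI is taken away."""
    return conditions[cond]["exam"]

for name in conditions:
    practice = conditions[name]["practice"]
    exam = net_learning_effect(name)
    print(f"{name:10s} practice {practice:+.0%}  exam {exam:+.0%}")
```

The sketch makes the asymmetry explicit: the practice column is what students and teachers see day to day, while the exam column is the net learning effect that remains after the tool is gone.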
Khanmigo's Data at Scale
If the Bastani experiment was the controlled proof, Khan Academy's Khanmigo is the at-scale deployment. By late 2025, Khanmigo had reached 5.1 million students across 110 countries, logging over 300 million interactions. A randomized controlled trial published in Educational Technology Research and Development found that students using Khanmigo three times per week for math support showed a 0.34 standard deviation improvement in algebra scores over one semester.
To put 0.34 SD in context: Benjamin Bloom's famous 1984 finding showed that one-on-one human tutoring produced a 2-sigma (2.0 SD) effect. Every attempt to replicate that result at scale has landed between 0.4 and 0.8 SD. Khanmigo's 0.34 SD sits just below that range, at a cost of roughly $44 per student per year (Khanmigo's current district pricing), compared to the $2,000-$4,000 per student annual cost of hiring a human tutor for comparable hours.
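The cost comparison in that paragraph works out as follows; a quick sketch using the per-student figures above (variable names are mine):

```python
# Annual per-student costs cited above.
khanmigo_cost = 44                                 # USD, district pricing
human_tutor_low, human_tutor_high = 2_000, 4_000   # USD, comparable hours

ratio_low = human_tutor_low / khanmigo_cost    # ~45x
ratio_high = human_tutor_high / khanmigo_cost  # ~91x
print(f"Human tutoring costs {ratio_low:.0f}x to {ratio_high:.0f}x more per student")
```

So the human tutor's roughly 0.4-0.8 SD advantage over Khanmigo's 0.34 SD comes at roughly 45 to 91 times the price.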
Khanmigo's most-used feature: step-by-step math problem deconstruction. Students asking the AI to explain the problem, not solve it. That usage pattern is the exact behavior GPT Tutor's guardrails forced. Khanmigo bakes it into the product design.
OECD Confirms the Pattern Across 14 Countries
In January 2026, the OECD published its Digital Education Outlook, drawing on classroom data from 14 member countries. Their finding echoed Bastani's result at international scale: students with access to general-purpose AI chatbots produced higher-quality work, but the advantage "disappeared and sometimes reversed" when the AI was removed for assessments.
Education-specific tools designed with pedagogical intent showed sustained improvements. The OECD report highlighted several findings:
| Finding | Data Point | Source |
|---|---|---|
| Teacher time savings | 31% reduction in lesson planning (science teachers, England) | OECD Digital Education Outlook 2026 |
| Student self-reported benefit | 80% say AI improved academic performance | Coursera Higher Ed Report, Feb 2026 |
| Actual exam improvement (pedagogical AI) | 0.34 SD in algebra | Khanmigo RCT |
| Actual exam decline (raw AI) | -17% vs. control | Bastani et al., PNAS 2025 |
| Global AI-in-education adoption | 60% of GenAI use in high-income countries; <1% in low-income | OECD 2026 |
Compare the second and fourth rows. Eighty percent of students say AI helps them. The experimental data says it depends entirely on which AI. The distinction between perception and measured outcome is the story.
Why Raw ChatGPT Fails as a Tutor
Standard ChatGPT optimizes for helpfulness. Ask it to solve a quadratic equation, and it solves the quadratic equation. A good human tutor does the opposite: asks you what you've tried, where you got stuck, and nudges you toward the next step without revealing it.
GPT Tutor, in the Bastani experiment, was prompted to never give direct answers. Instead, it asked follow-up questions: "What approach would you try first?" and "Can you identify which formula applies here?" This Socratic constraint is exactly what Khanmigo implements at the product level, and what Khan Academy founder Sal Khan has called the difference between "AI that does the work for you and AI that helps you do the work yourself."
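The Socratic constraint can be expressed as nothing more than a system prompt. A hypothetical sketch follows; this is not the study's actual prompt, and the wording is illustrative, borrowing only the follow-up questions quoted above:

```python
# Illustrative Socratic-tutor system prompt in the spirit of GPT Tutor.
# NOT the actual prompt from Bastani et al.; the constraints are inferred
# from the paper's description of the tutor's behavior.
TUTOR_SYSTEM_PROMPT = """You are a math tutor. NEVER give the final answer
or complete a solution step for the student. Instead:
- Ask what the student has tried so far.
- Ask which formula or approach might apply, e.g. "What approach would
  you try first?" or "Can you identify which formula applies here?"
- Confirm or gently correct the student's reasoning one step at a time.
If the student asks you to just solve the problem, decline and respond
with a guiding question instead."""

def build_messages(student_question, history=()):
    """Compose a chat-message list with the guardrail prompt pinned
    as the system message, ahead of any prior conversation turns."""
    return [{"role": "system", "content": TUTOR_SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": student_question}]

msgs = build_messages("Solve x^2 - 5x + 6 = 0 for me")
print(msgs[0]["role"], "->", msgs[-1]["content"])
```

The design point is that the guardrail lives in the system role, so it applies to every turn of the conversation rather than depending on the student's phrasing.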
Building these guardrails is not technically difficult. The prompting difference between GPT Base and GPT Tutor was modest. But the learning outcome difference was enormous: GPT Tutor's +127% practice performance versus GPT Base's +48% is a 79 percentage-point gap on practice. On post-removal exams, GPT Tutor held at baseline while GPT Base fell 17% below it. The total swing between the two conditions on real learning: 17 percentage points on the metric that matters.
Strongest Counterargument
The Bastani study tested ~1,000 students in one Turkish school, covering one subject (math), over four 90-minute sessions representing 15% of the curriculum. That is a narrow base for sweeping conclusions about AI and learning. Math has unusually clear right-and-wrong answers that make crutch behavior easy to detect; subjects like writing or history might show different patterns. Short exposure (6 hours total) could also mean students hadn't yet developed effective AI usage strategies, and that the 17% decline reflects a learning curve rather than a fundamental limitation.
Khanmigo's 0.34 SD could reflect selection effects: districts that adopt AI tutoring tools may have more engaged administrators and higher-performing students to begin with. The RCT design should control for this, but the full methodology details remain limited in publicly available reporting.
Both counterarguments are real. Neither invalidates the core finding that tool design determines whether AI helps or harms learning. The question is whether the effect sizes generalize, not whether the direction of the effect is wrong.
Limitations
This analysis relies on a small number of rigorous studies. The Bastani experiment, while well-designed (randomized, controlled, pre-registered), tested only one subject in one school in one country. Khanmigo's RCT is promising but details of randomization and control conditions are limited in public documentation. The OECD data is observational and cross-country, meaning confounders abound. Long-term effects of AI tutoring beyond one semester remain unmeasured. And 16% of U.S. households with school-age children still lack reliable internet access (Pew Research, 2025), making those students invisible in every study cited here.
The Bottom Line
Giving students raw ChatGPT for schoolwork is like giving them a calculator that also hides the multiplication tables. Performance goes up. Learning goes down. The fix isn't complicated: AI that asks questions instead of answering them preserves learning while boosting performance even more than the unrestricted version. Khanmigo charges $44/student/year. A human tutor costs roughly 45 to 90 times more. The economics favor AI tutoring at any school budget. But only the right kind of AI. Every school district rushing to adopt generative AI needs to understand that "which AI" matters more than "whether AI." Bastani's experiment proved it. Now it's a question of whether anyone reads the data before buying the product.