Balancing the Equation: Why AI in Math Teaching Cuts Both Ways
The evidence for AI-assisted math learning is among the strongest out there, but it also provides the clearest proof that when AI solves the problem, the students don’t.
Thanks to our Presenting Sponsors Hire Education, Tuck Advisors, and Cooley, for making Edtech Insiders possible.
Sponsored by Overdeck Family Foundation
By: Lin Ler
Lin Ler is a Stanford MBA and MA in Education candidate working at the intersection of business strategy, learning science, and frontier AI. With a background spanning management consulting, EdTech startups, and the social sector, Lin translates complex AI capabilities into practical insight for the future of learning.
Special thanks to Overdeck Family Foundation for sponsoring this article in our AI & Efficacy Editorial Research Series diving into key research findings from Stanford’s AI Hub for Education Research Repository (a project by Stanford’s SCALE Initiative). Stay tuned for more!
Picture this: a student in the West African country of Ghana, where the average annual income is $235 USD, sits down during study hall, opens WhatsApp, and starts working through math problems with a chatbot named Rori. When she answers a question incorrectly, Rori does not simply give her the right answer: it offers a hint, surfaces a misconception, and gently guides her in the right direction. Only if she is still stuck does it walk her through the worked solution.
In a randomized trial across eleven schools, roughly 1,000 students in grades 3-9 used Rori for two short sessions a week and pulled ahead of their peers. Rori showed an effect size of 0.36, which in education research is close to an extra year of learning, at a cost of about five dollars per student. The results are real, and the intervention is affordable, built on existing technology, and, as with most tech-delivered solutions, scalable.
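For readers unfamiliar with effect sizes, here is a minimal sketch of how a standardized mean difference can be computed. It assumes the reported 0.36 is a Cohen's d-style measure; the study's own estimate may come from a regression model instead. The function name and the scores below are hypothetical, invented purely for illustration, and are not data from the Rori trial.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical test scores (NOT study data): the treatment group scores
# three points higher on a test whose scores spread by about eight points.
control = [50, 55, 60, 65, 70]
treatment = [53, 58, 63, 68, 73]

d = cohens_d(treatment, control)
print(f"effect size d = {d:.2f}")
```

The point of the standardization is that a three-point gain means little on its own; dividing by the spread of scores makes results comparable across tests and studies.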
Research suggests that the detail that best explains these results is the order in which Rori, like other math tutors designed with pedagogical intent, delivers its help: hints and guidance first, answer last. The strongest evidence for math and science learning consistently shows that the AI that helps is the AI that refuses to simply hand over the answer and instead engages students in metacognition and productive struggle. Here’s the catch: the thing that general-purpose AI does best when asked a math question (producing an instant, correct solution) turns out to be the thing that most reliably reduces learning.
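The "hints first, answer last" ordering can be sketched in a few lines of code. This is an illustrative sketch only, not Rori's actual implementation: the `Problem` fields, the `respond` function, and the three-step escalation are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    answer: str
    hint: str                 # first nudge, reveals nothing about the answer
    misconception_probe: str  # question aimed at a likely misunderstanding
    worked_solution: str      # released only as a last resort


def respond(problem: Problem, student_answer: str, failed_attempts: int) -> str:
    """Return the tutor's reply, escalating support only as attempts fail.

    failed_attempts counts the wrong answers given BEFORE this one.
    """
    if student_answer.strip() == problem.answer:
        return "Correct!"
    if failed_attempts == 0:
        return problem.hint
    if failed_attempts == 1:
        return problem.misconception_probe
    return problem.worked_solution


# Hypothetical problem used only to exercise the ladder.
problem = Problem(
    prompt="Solve 2x + 3 = 11",
    answer="4",
    hint="What could you do to both sides to get 2x alone?",
    misconception_probe="Did you add 3 instead of subtracting it?",
    worked_solution="2x + 3 = 11, so 2x = 8, so x = 4.",
)

print(respond(problem, "7", failed_attempts=0))  # hint
print(respond(problem, "7", failed_attempts=1))  # misconception probe
print(respond(problem, "7", failed_attempts=2))  # worked solution
```

A general-purpose chatbot, by contrast, behaves as if the last branch were the only one.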
Why Math Is Built for This
Ask an AI to grade an essay and you are asking it to make a judgment call. Ask whether a student correctly solved for x and the answer is simply right or wrong. That difference helps explain why math, not writing or history, has been the easiest place for edtech products to show a measurable win, which is why so many products focus on the domain. Math is also the domain in which the evidence for AI in education is strongest. A math problem has objective answers, decomposable steps, and diagnosable errors, so software, and now AI models, can tell a student exactly where her reasoning broke and what to try next, instantly, on every step.
According to a 2025 systematic review of 258 K-12 STEM studies, mathematics was the single largest subject of study, and research on intelligent tutoring systems (ITSs), tools built to give this kind of step-by-step guidance, accounted for nearly half of all the work. Many researchers point to one key mechanism in ITSs: feedback density. ITSs and AI tutors can create a learning loop that closes after every step, so students get immediate, consistent, and iterative feedback instead of waiting for a teacher who has to reach, and grade the work of, an entire class.
That said, the same review found that only 34% of its studies reported a standardized effect size. The evidence base is wide but shallow, yet it offers tantalizing insights into some of the core design principles that could make AI-driven education technology a game-changer for math education.
The Active Ingredient Is What You Withhold
Picture two AI tutors working with the same student on the same hard problem. One, when she gets stuck, explains the answer. The other refuses, asks her a question, and points at the next step she has to take herself. A Carnegie Mellon study of a university discrete-math course ran almost exactly that contrast. Its system had two parts: a chatbot that would give the answer if a student pushed for it, and a proof-reviewer engineered never to release the solution, only to flag where the reasoning went wrong.
The result is unsurprising; it is the same order that made Rori work: hints, Socratic questions, carefully designed guidance, all before solution. Students who used the answer-giving chatbot more often scored worse on exams, while students who used the answer-withholding reviewer more often scored better. This finding is correlational, not a clean randomized comparison, so it is best read as evidence consistent with a pattern rather than proof of one. But the pattern keeps recurring.
It is built into the physics tutor whose designers prompted it to “never reveal the full solution,” only the next step. And it shows up when an interactive, step-by-step math method was tested against a standard worked-solution explanation: on the harder problems, students judged the method that made them answer at each step more useful by a margin of 47.5% to 26.7%, and 78% preferred working that way.
This is one of the core paradoxes of education technology: the instructional design that best helps students learn is often the design that frustrates them in the moment. That is difficult to navigate for education technology companies, which are often incentivized to optimize for tool usage, user satisfaction, or reduced friction.
Carefully designed tools (like ITSs) are often less popular than the ones that simply give answers, and the tension is exacerbated in an era when students can opt out of such designs and into commercially available “homework help” tools, prevalent in the Apple and Google app stores, that reveal the answer outright. That puts the evidence on a collision course with how most of these products are built and sold.
Confident, and Wrong
A student finishes a physics problem and gets feedback that is specific, fluent, and authoritative. It is also wrong, and she has no way to tell. The physics feedback system from the previous section, the one its designers engineered never to hand over the full solution, was grounded carefully in an expert’s map of what a correct solution should contain, and yet its feedback was still wrong in about one case in five. Worse, users often do not notice when the feedback is incorrect. In this case, the students testing it were Physics Olympiad participants, among the most capable in the country, and they caught those errors in only two of thirty-eight cases.
This is the catch hidden inside the feedback-density story. A loop that closes instantly is only worth having if the AI’s responses are actually correct, and the confident, expert-sounding register is exactly what stops a student from questioning them. Part of this gap is temporary and will narrow as models get more accurate; we’ve already seen enormous jumps in accuracy in the last few years, especially in math. But part of the issue is structural: the polished, authoritative tone that makes AI feedback feel trustworthy is intrinsic to how these models write, and it will keep masking errors even as the errors grow rarer. Pair this with the previous section and the risk compounds. An AI that hands over an answer is bad for learning, and an AI that hands over a wrong answer with full confidence is worse. The tighter the loop, the more convincing the mistake.
Performance Is Not Learning
One of the most consistent, and important, findings in AI research is the gap between performance (how students do on assessments or homework when assisted) and genuine learning, including retention of material and transfer to other environments. The gap shows up when the exam comes and the AI is switched off. In the Carnegie Mellon study, homework scores rose significantly while only the treatment group had the tutor; on exams, the two groups were indistinguishable. The students who leaned hardest on the answer-giving chatbot were the ones who started with the least confidence, the students who could least afford to offload the thinking in the first place.
The sharpest version of this comes from a field experiment with nearly a thousand high-school math students, which randomized them between two GPT-4 tutor experiences: one was a plain, ChatGPT-style interface that would hand over answers, and the other was prompted to safeguard learning by withholding answers. While students had access, both lifted performance sharply: a 48% gain for the answer-giving version and 127% for the safeguarded one. But when the AI was switched off, the students who had used the answer-giving tutor scored 17% worse than classmates who had never touched it. The researchers’ explanation for what happened is a concept running through this whole series: when students used the model as a “crutch,” their assisted performance did not predict deeper, sustained learning, but intelligently designed safeguards can keep the crutch from forming.
A caveat belongs here. Even the Ghana result is not a clean test of durable learning, because students were still using Rori when they were assessed, and there is no high-quality causal study of durable, unassisted learning in US K-12 classrooms anywhere in this evidence. What the research can say is narrower and still important: AI reliably improves student performance while students have access to it, but whether it improves students’ understanding, retention and transfer after the AI is removed is an area of active study.
That said, there are some early encouraging signals on transfer. One UK-based RCT in math, conducted by Google and Eedi with the LearnLM model, found that “students who received support from LearnLM were 5.5 percentage points more likely to solve novel problems on subsequent topics (with a success rate of 66.2%) than those who received tutoring from human tutors alone (rate of 60.7%).” Even more noteworthy from the same study was the fact that the human tutors overseeing the AI tutor approved more than three-quarters of the AI-generated messages either without changes or with only minimal edits; the implication is that AI tutors, when designed carefully, may already be generating pedagogically sound content the majority of the time.
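Because this piece mixes percentage-point gains with percent gains, it helps to see the difference worked out on the two success rates quoted above. The sketch below is plain arithmetic on those two figures; the function name is hypothetical, and the relative gain it prints is derived here rather than reported by the study.

```python
def point_and_relative_gain(treated_rate: float, baseline_rate: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent) for two rates given in percent."""
    point_gain = treated_rate - baseline_rate
    relative_gain = point_gain / baseline_rate * 100
    return point_gain, relative_gain


# Success rates quoted in the text: 66.2% with LearnLM support, 60.7% with human tutors alone.
points, relative = point_and_relative_gain(66.2, 60.7)
print(f"{points:.1f} percentage points, or about {relative:.0f}% more in relative terms")
```

The same 5.5-point difference corresponds to a relative improvement of roughly 9% over the human-only rate, which is why the unit attached to any reported gain matters.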
Evening the Playing Field: The Expertise-Reversal Effect
In a study of 661 high-school physics students, students who started with weaker skills gained the most from richer, multi-format feedback from AI, while the advantage shrank for students who were already strong. Education researchers call this the expertise-reversal effect: the same scaffold that helps a novice can slow down an expert who no longer needs it. The interactive-math study found the same shape from the other direction, with step-by-step prompting helping on hard problems but acting as “a distraction” on easy ones.
The same pattern, in which support is concentrated where it is needed most, appears when the AI coaches the tutor instead of the student. In a randomized trial with more than 700 tutors and 1,000 students from underserved communities, a Stanford-designed tool called Tutor CoPilot fed expert moves to tutors in real time. Students whose tutors used it were four percentage points more likely to master a topic, but for those assigned the lowest-rated tutors, the gain jumped to nine percentage points. In other words, lower-performing tutors got more of a boost from AI than others: the expertise-reversal effect in action. An analysis of 350,000 messages showed the mechanism: the tool pushed tutors toward more probing questions and away from generic praise, nudging humans, at about twenty dollars per tutor a year, toward exactly the answer-withholding pedagogy the rest of this evidence rewards.
If the students using AI support end up being the ones it helps most, not least, this complicates the negative findings about AI for learning. But for lower-performing students, the harder problem may be getting them to use it at all: in the physics study, students left the elaborated feedback unopened on two-thirds of their attempts. So the real question is not whether AI helps “students” as a group. It is which students, on which problems, and whether those being helped are building understanding or quietly leaning on a crutch they will not have on test day.
The Question Isn’t Whether AI Can Do the Math
The question of whether AI can handle K-12 math is effectively closed; it clearly can. The real challenge is determining whether any given AI math tool actually helps the student. Does it solve problems for them and deprive them of productive struggle? Is its feedback both accurate and actionable? Are students genuinely learning or merely leaning on the tool to perform? While current evidence consistently favors guidance over direct answers, there is a research gap: the cleanest test we have, the GPT Base versus GPT Tutor experiment described earlier, pitted a bundle of learning safeguards against none, rather than isolating the single variable of withholding answers. That clean trial still doesn’t exist, so our conclusions remain well-supported hypotheses rather than settled facts.
For developers, this creates a difficult paradox: the most marketable products (and those dominating the App Stores) provide immediate answers, yet the most effective educational design withholds them, admits uncertainty, and truly challenges the student at the right level. These are fundamentally different products, and the latter is both harder to build and harder to sell.
Ultimately, the challenge of AI in education transcends the math itself: it rests on whether the EdTech community can internalize this emerging research and design products that prioritize productive struggle over convenience, replacing the illusion of learning with true, resilient student understanding.
Studies Cited:
Adewumi, T., Liwicki, F. S., Liwicki, M., Gardelli, V., Alkhaled, L., & Mokayed, H. (2025). Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning. arXiv preprint arXiv:2507.12079.
Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26).
Chen, E., Judicke, S., Beigh, K., Tang, X., Xiao, Z., Li, C., ... & Koedinger, K. (2025). Generative AI alone may not be enough: Evaluating AI Support for Learning Mathematical Proof. arXiv preprint arXiv:2509.16778.
Henkel, O., Horne-Robinson, H., Kozhakhmetova, N., & Lee, A. (2024). Effective and scalable math support: Evidence on the impact of an AI-tutor on math achievement in Ghana. arXiv preprint arXiv:2402.09809.
Maus, H., Tschisgale, P., Kieser, F., Petersen, S., & Wulff, P. (2025). Developing and Evaluating a Large Language Model-Based Automated Feedback System Grounded in Evidence-Centered Design for Supporting Physics Problem Solving. arXiv preprint arXiv:2512.10785.
Memari, M., & Ruggles, K. (2025). Artificial Intelligence in Elementary STEM Education: A Systematic Review of Current Applications and Future Challenges. arXiv preprint arXiv:2511.00105.
Revenga-Lozano, N., Avila, K. E., Steinert, S., Schweinberger, M., Gómez-Pérez, C. E., Kuhn, J., & Küchemann, S. (2026). Personalized Multimodal Feedback Using Multiple External Representations: Strategy Profiles and Learning in High School Physics. arXiv preprint arXiv:2601.09470.
Wang, R. E., Ribeiro, A. T., Robinson, C. D., Loeb, S., & Demszky, D. (2024). Tutor CoPilot: A Human-AI approach for scaling real-time expertise. arXiv preprint arXiv:2410.03017.
Top Edtech Headlines
1. Anthropic Launches 1,000-Fellowship Program to Support AI for Public Good
Anthropic has committed $150 million to launch the Claude Corps Fellowship Program, a new initiative that will support 1,000 fellows working on AI applications in education, government, research, and nonprofit sectors.
2. Lawsuit Against Social Media Companies Advances as Debate Over Youth Mental Health Intensifies
A major lawsuit brought on behalf of families and school districts is moving forward against several social media companies, alleging that platform design choices contributed to harms experienced by young people. The case comes amid increasing scrutiny of technology’s impact on youth well-being and could have significant implications for how digital platforms are regulated, designed, and used by children and adolescents.
3. New AI Literacy Framework Aims to Define Essential Skills for the AI Era
A coalition of education, research, and technology organizations has launched the AI Literacy Framework, a new resource designed to help educators, students, and institutions build a shared understanding of the knowledge and skills needed to engage with AI responsibly. The framework outlines key competencies ranging from technical understanding to critical thinking and ethical decision-making, providing a roadmap for integrating AI literacy into teaching and learning.
4. SuperCharger Ventures Opens Applications for Malta-Based EdTech Accelerator
SuperCharger Ventures is accepting applications for Cohort 7.0 of its Malta Accelerator, a 12-week, in-person program designed to help EdTech and Future of Work startups expand into Europe. Participants gain access to mentorship, investor connections, corporate partners, and up to €1.5 million in non-dilutive funding opportunities, with select companies also eligible for direct investment from the SuperCharger Ventures Fund.
Preparing Students for 2076: Ben Kornell on AI, Assessment, and the Future of Learning
In this special crossover episode, Edtech Insiders host Ben Kornell joins Allison Salisbury, author of The Humanist, for a wide-ranging conversation about AI, education, workforce development, and the future of learning. Together, they explore how AI could reshape assessment, personalized learning, school design, and the skills students will need to thrive in a rapidly changing world.
5 Things You’ll Learn in This Episode
Why AI should be used to redesign education and not just make existing systems more efficient.
How assessment could shift from annual testing to continuous, actionable feedback.
The durable skills students need to succeed in a world of constant technological change.
What the personalized learning movement got right and wrong.
How AI, school choice, and new funding models could reshape the future of K-12 education.