AI scores a ‘C–’ on its hardest math test yet
The second batch of “First Proof” problems is meant to evaluate AI’s usefulness for research-level math. The best model got six or seven of the 10 questions basically right
The best-yet test of artificial intelligence’s mathematical mettle has released its first official round of results. The verdict is that large language models (LLMs) are emerging as useful—albeit deeply flawed—assistants for math research.
Organized by a team of top mathematicians, the “First Proof” project is a response to AI companies’ growing fixation on using advanced math as a benchmark for their products, regardless of whether those metrics reflect the problems professional mathematicians actually care about. Results of a pilot round in February were mixed, with companies’ opaque, internal efforts vastly outperforming their public models.
This latest batch of tests involves a broader range of math problems and more rigorous protocols for its participants—to which only OpenAI and a trio of academic groups agreed. The results were again mixed, with six to seven of the 10 problems answered essentially correctly by at least one AI. Although peak performance continues to improve, the models also churn out copious amounts of garbage as a by-product, requiring heroic interventions to sift sense from slop.
“We felt very strongly that if we’re going to be doing a public service for the greater community, we need to test publicly available models,” says Lauren Williams, a mathematician at Harvard University and member of the First Proof team. That limited the entrants to OpenAI’s ChatGPT-5.5 Pro and three models built by academic groups: one at the Swiss Federal Institute of Technology Zurich (ETH Zurich) and Aarhus University in Denmark; one at the University of California, Los Angeles; and one at Princeton University.
The team solicited problems from mathematicians across a great breadth of subject areas. It also employed expert graders who were paid to evaluate the AIs’ responses. “Grading an AI-generated solution is kind of a painful, thankless task,” Williams says. The graders assembled last week at Harvard’s Center of Mathematical Sciences and Applications for two days of intensive “peer” review—accelerating a process that, for a typical math proof, takes half a year or more.
The team considered a proof basically correct if its flaws were minor and likely to be easily patched—a standard commonly applied by math journals under the phrase “accept with minor revisions.” Some answers, though, fell on the edge of this somewhat murky threshold—thus the slight toss-up in final scores.
