AI scores a ‘C–’ on its hardest math test yet

Scientific American — 10 June 2026

The second batch of “First Proof” problems is meant to evaluate AI’s usefulness for research-level math. The best model got six or seven of the 10 questions basically right

The best-yet test of artificial intelligence’s mathematical mettle has released its first official round of results. The verdict is that large language models (LLMs) are emerging as useful—albeit deeply flawed—assistants for math research.

Organized by a team of top mathematicians, the “First Proof” project is a response to AI companies’ growing fixation on using advanced math as a benchmark for their products, regardless of whether those metrics reflect the problems professional mathematicians actually care about. Results of a pilot round in February were mixed, with companies’ opaque, internal efforts vastly outperforming their public models.

This latest batch of tests involves a broader range of math problems and more rigorous protocols for its participants—to which only OpenAI and a trio of academic groups agreed. The results were again mixed, with six to seven of the 10 problems answered essentially correctly by at least one AI. Although peak performance continues to improve, the models also churn out copious amounts of garbage as a by-product, requiring heroic interventions to sift sense from slop.

“We felt very strongly that if we’re going to be doing a public service for the greater community, we need to test publicly available models,” says Lauren Williams, a mathematician at Harvard University and member of the First Proof team. That limited the entrants to OpenAI’s ChatGPT-5.5 Pro and three models built by groups at the Swiss Federal Institute of Technology Zurich (ETH Zurich) and Aarhus University in Denmark, the University of California, Los Angeles, and Princeton University.

The team solicited problems from mathematicians across a great breadth of subject areas. It also employed expert graders who were paid to evaluate the AIs’ responses. “Grading an AI-generated solution is kind of a painful, thankless task,” Williams says. The graders assembled last week at Harvard’s Center of Mathematical Sciences and Applications for two days of intensive “peer” review—accelerating a process that, for a typical math proof, takes half a year or more.

The team considered a proof basically correct if its flaws were minor and likely to be easily patched—a standard commonly applied by math journals under the phrase “accept with minor revisions.” Some answers, though, fell on the edge of this somewhat murky threshold—thus the slight toss-up in final scores.
