When Lauren Williams, a MacArthur fellow at Harvard, asks AI systems questions about her own research in algebraic combinatorics, something shifts. The answers that seemed helpful and confident moments before start showing cracks. Mistakes appear. Papers she never wrote get cited. The further the question moves into genuine expertise, the less reliable the system becomes.
This gap—between AI's ability to solve existing problems and its capacity to do what mathematicians actually do—is what a team of 11 elite mathematicians decided to measure.
Setting up the test
Williams and her collaborators, including Fields Medal winner Martin Hairer, launched the "First Proof" project to create the first rigorous benchmark for evaluating AI on research-level math. Not competition math. Not textbook problems. The kind of work that defines a career.
They assembled 10 unsolved problems they'd recently cracked themselves but hadn't yet published. The problems span number theory, algebraic combinatorics, spectral graph theory, symplectic topology, and numerical linear algebra—the messy frontiers where mathematicians actually work. The solutions were encrypted and locked away. On February 5th, the problems went public. On February 13th, the answers would be revealed, and the real comparison would begin: how do the proofs mathematicians created stack up against what AI produces?
When they ran preliminary tests with the best publicly available AI systems, the results were clear: the models solved 2 of the 10 problems.
What the gap actually means
Hairer put it bluntly: "To push back a bit on the narrative that 'math has been solved' just because some LLM managed to solve a bunch of Math Olympiad problems." Those competition problems are well-defined puzzles with known solution methods. Research math is different. It requires something that current AI systems haven't demonstrated: the ability to ask good questions and develop frameworks for attacking them. Those first two steps—the conceptual work—remain distinctly human territory.
"As of now, this idea of mathematicians being replaced by AI is complete nonsense in my opinion," Hairer said. "Maybe this will change in the future, but I find it hard to believe that the type of models we're seeing at the moment will suddenly start producing genuinely new insights."
What makes this project valuable isn't the score. It's the clarity. For years, AI progress has been measured in headlines—systems that beat humans at Go, that pass exams, that write code. But those benchmarks often measure what's easiest to quantify, not what matters most. The First Proof project does something harder: it takes actual human expertise, the kind that takes years to develop, and asks whether machines can match it on its own terms.
The team's work suggests that for now, they can't. But more importantly, it gives us a way to watch what happens next.