When Lauren Williams, a MacArthur fellow at Harvard, asks AI systems questions about her own research in algebraic combinatorics, something shifts. The answers that seemed helpful and confident moments before start showing cracks. Mistakes appear. Papers she never wrote get cited. The further the question moves into genuine expertise, the less reliable the system becomes.
This gap—between AI's ability to solve existing problems and its capacity to do what mathematicians actually do—is what a team of 11 elite mathematicians decided to measure.
Setting up the test
Williams and her collaborators, including Fields Medal winner Martin Hairer, launched the "First Proof" project to create the first rigorous benchmark for evaluating AI on research-level math. Not competition math. Not textbook problems. The kind of work that defines a career.
They assembled 10 unsolved problems they'd recently cracked themselves but hadn't yet published. The problems span number theory, algebraic combinatorics, spectral graph theory, symplectic topology, and numerical linear algebra—the messy frontiers where mathematicians actually work. The solutions were encrypted and locked away. On February 5th, the problems went public. On February 13th, the answers would be revealed, and the real comparison would begin: how do the proofs mathematicians created stack up against what AI produces?
When they ran preliminary tests with the best publicly available AI systems, the results were clear: the models solved 2 of the 10 problems.
What the gap actually means
Hairer put it bluntly: "To push back a bit on the narrative that 'math has been solved' just because some LLM managed to solve a bunch of Math Olympiad problems." Those competition problems are well-defined puzzles with known solution methods. Research math is different. It requires something that current AI systems haven't demonstrated: the ability to ask good questions and develop frameworks for attacking them. Those first two steps—the conceptual work—remain distinctly human territory.
"As of now, this idea of mathematicians being replaced by AI is complete nonsense in my opinion," Hairer said. "Maybe this will change in the future, but I find it hard to believe that the type of models we're seeing at the moment will suddenly start producing genuinely new insights."
What makes this project valuable isn't the score. It's the clarity. For years, AI progress has been measured in headlines—systems that beat humans at Go, that pass exams, that write code. But those benchmarks often measure what's easiest to quantify, not what matters most. The First Proof project does something harder: it takes actual human expertise, the kind that takes years to develop, and asks whether machines can match it on its own terms.
The team's work suggests that for now, they can't. But more importantly, it gives us a way to watch what happens next.