When GPT-4o scored 2.7% on a new exam, it wasn't a failure of the test—it was proof the test was working. For years, AI systems have been acing benchmarks designed by humans for humans. Then the benchmarks stopped meaning anything.
Older tests like the Massive Multitask Language Understanding (MMLU) benchmark became too easy. Top AI models were hitting near-perfect scores, which created a problem: if a test can't distinguish genuine progress from clever pattern-matching, how do you actually know what these systems can do?
So nearly 1,000 researchers from around the world—computer scientists, historians, physicists, linguists, medical researchers—built something different. "Humanity's Last Exam" is 2,500 questions covering everything from translating ancient Palmyrene inscriptions to identifying anatomical structures in birds to parsing Biblical Hebrew phonetics. The goal wasn't to be impossibly hard for its own sake. It was to measure, precisely and systematically, where AI stops being able to fake it.
What the gaps reveal
The results are telling. Claude 3.5 Sonnet scored 4.1%. OpenAI's o1 reached 8%. Even the most recent systems—Gemini 3.1 Pro and Claude Opus 4.6—have climbed to around 40-50% accuracy. Still nowhere near human mastery.
Dr. Tung Nguyen, an instructional associate professor at Texas A&M who contributed 73 of the questions, explains why this matters beyond the numbers. "When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," he said. "But HLE reminds us that intelligence isn't just about pattern recognition—it's about depth, context, and specialized expertise."
This distinction is crucial for how we actually use these tools. A system that scores 40% on a specialized exam isn't the same as one that understands what it's doing. High scores on human-designed tests can mask something important: the test was designed for humans, not machines. A system might be matching patterns it's seen in training data rather than reasoning through a problem from first principles.
Here's what makes this project different from previous benchmarks: every single question was tested against leading AI systems before being included. If an AI got it right, the question was removed. This created a moving target—a test that stays just beyond what current systems can do, even as they improve.
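For readers who want to picture that filtering step, here is a rough sketch in Python. It is not the organizers' actual code: the ask and grade callables are hypothetical stand-ins for whatever model APIs and grading procedure the HLE team used, and the question format is invented for illustration.

```python
from typing import Callable, Iterable

def filter_questions(
    candidates: Iterable[dict],
    models: list[str],
    ask: Callable[[str, str], str],     # (model name, prompt) -> model's answer
    grade: Callable[[str, str], bool],  # (answer, reference answer) -> correct?
) -> list[dict]:
    """Keep only candidate questions that every listed model answers incorrectly."""
    kept = []
    for question in candidates:
        # A single correct answer from any frontier model disqualifies the question.
        solved_by_any = any(
            grade(ask(model, question["prompt"]), question["reference_answer"])
            for model in models
        )
        if not solved_by_any:
            kept.append(question)
    return kept
```

Rerunning a filter like this against newer models is what keeps the benchmark a moving target: questions that today's systems solve would fall out of any future revision, while the unsolved ones mark where the gaps remain.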
The diversity of contributors is what makes HLE actually measure intelligence rather than just specialized knowledge. When physicists, historians, and linguists all contribute questions from their fields, you're not testing whether an AI learned to pattern-match on one type of problem. You're testing whether it can handle genuine depth across human knowledge.
Why this matters for what comes next
Despite the slightly ominous name, Humanity's Last Exam isn't a doomsday prediction. It's the opposite. Nguyen emphasizes: "This isn't a race against AI. It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters."
Policymakers, developers, and users need accurate information about what AI can actually do. Misinterpreting capabilities leads to misuse. Overestimating what a system understands can lead to deploying it in situations where it will fail dangerously. Underestimating it wastes potential.
The exam will remain a long-term, transparent benchmark—with most questions kept private so systems can't simply memorize answers. As AI improves, researchers will be able to see exactly where progress is happening and where the fundamental gaps remain.
What's striking about this project is that it took nearly 1,000 humans working together across disciplines to reveal what machines still cannot do. That's not a weakness of the research. It's the whole point.