Why AI Benchmarks Keep Failing Us
Remember the headlines when ChatGPT “passed” the medical licensing exam? I remember thinking: impressive. Then I read more, and the feeling got complicated.
Here’s the core issue in one sentence: many AI tests live on the internet, and AIs are trained on the internet. That makes benchmarks unreliable in a way that’s easy to miss.
What actually happens
– Models are trained on massive web data. If a benchmark or its answers exist online, a model can learn them directly.
– Labs then tune models to perform well on those benchmarks, a practice sometimes called “benchmarketing”: optimizing for a score rather than for robust ability.
– A once-hard test can be solved months after release, not because intelligence improved radically, but because the test leaked into training data or people tailored models to it. Even a crude overlap check, like the one sketched below, can flag that kind of leakage.
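To make the leakage point concrete, here is a minimal sketch of a contamination check: it flags benchmark items that share a long word n-gram with documents in a training corpus. This is only a toy version of the word-overlap heuristics labs have described; the function names, the 13-gram threshold, and the example data are my own illustrative assumptions, not anyone’s actual pipeline.

```python
from typing import Iterable

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text (13 words is a commonly cited overlap threshold)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: Iterable[str],
                       training_docs: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark items that share at least one n-gram with the training corpus."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    items = list(benchmark_items)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(items) if items else 0.0

# Toy example: the first "benchmark" question appears verbatim inside a training document.
train_docs = ["exam prep site: the patient presents with acute chest pain radiating to the "
              "left arm which of the following is the most appropriate next step in management"]
bench_items = ["the patient presents with acute chest pain radiating to the left arm which of "
               "the following is the most appropriate next step in management",
               "a short novel question that never appeared online"]
print(f"contaminated fraction: {contamination_rate(bench_items, train_docs):.0%}")
```

A check like this only catches verbatim or near-verbatim leaks; paraphrased versions of a question slip right past it, which is part of why contamination is so hard to rule out.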
A good example is the so-called “Humanity’s Last Exam,” which was designed to be brutally hard. At first, models scored around 10%. Then GPT-5 came out and the score jumped to 25%. That sounds like progress, but it raises a different question: did the model genuinely understand more, or did it just get better at this particular exam?
Why that matters
If benchmarks measure memorization more than reasoning, they mislead everyone. Policymakers, funders, and the public will think systems are more capable than they really are. Researchers chase score improvements that don’t generalize to real-world tasks. Startups tweak models to win contests instead of fixing hard problems.
What better testing looks like
I don’t have a silver bullet, but here are practical directions that feel more honest:
– Hidden and rotating datasets: keep evaluation sets out of public reach and rotate them often so models can’t memorize them (a rough sketch of this idea, combined with continuous evaluation, follows this list).
– Adversarial and real-world tasks: test on messy, practical problems, not sanitized multiple-choice questions.
– Human-in-the-loop checks: combine automatic metrics with human judgment, prioritizing reliability over neat numbers.
– Continuous evaluation: treat benchmarks as ongoing challenges, not fixed milestones. That shows whether improvements stick.
– Transparency about training data: knowing whether a dataset leaked into training helps interpret scores.
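Here is a minimal sketch of what “hidden and rotating” plus “continuous” evaluation could look like in practice: a privately held question pool, a fresh random slice each round, and an append-only log so you can see whether gains persist over time. Everything in it (private_pool, model_fn, the eval_log.jsonl file) is a hypothetical interface I made up for illustration, not a real evaluation framework.

```python
import hashlib
import json
import random
import time

def item_fingerprint(question: str) -> str:
    """Stable hash of an item, so organizers can later prove which items were used without revealing them."""
    return hashlib.sha256(question.strip().lower().encode()).hexdigest()[:16]

def run_round(model_fn, private_pool: list[dict], k: int = 50) -> dict:
    """Score model_fn on a fresh random slice of a privately held pool and append the result to a log,
    so repeated rounds show whether an improvement persists rather than being a one-off score."""
    eval_slice = random.sample(private_pool, k)  # rotate: a different slice each round
    correct = sum(1 for item in eval_slice if model_fn(item["question"]) == item["answer"])
    record = {
        "timestamp": time.time(),
        "n_items": k,
        "accuracy": correct / k,
        "fingerprints": [item_fingerprint(item["question"]) for item in eval_slice],
    }
    with open("eval_log.jsonl", "a") as log:  # continuous evaluation: one line per round
        log.write(json.dumps(record) + "\n")
    return record

# Toy usage: a stand-in "model" that always answers "B", run against a placeholder pool.
pool = [{"question": f"placeholder question {i}", "answer": "B" if i % 2 else "A"}
        for i in range(200)]
print(run_round(lambda question: "B", pool, k=20))
```

The fingerprints are the small design choice I like here: outsiders can audit after the fact which items a score was based on, without the questions themselves ever becoming public training data.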
What you can do as a reader
If you read a headline about an AI “passing” some exam, ask a few quick questions: was the test public? Could the model have trained on it? Is the improvement consistent across other tasks? Those few questions cut through a lot of hype.
I still find AI progress exciting. But I also think we need to be tougher about how we measure it. Numbers are tempting because they’re clear. They comfort us. But if the metric is broken, that comfort is false.
If we want trustworthy AI, we need trustworthy tests. That means more careful design, more honesty about limitations, and a slower breath before we declare victory. I’m curious to see how the community adapts — and whether benchmarks become tools for real assessment again, rather than just targets to optimize for.