AI Benchmarks Are Measuring the Wrong Thing — Here's What to Look for Instead
Every major AI release arrives with a table. MMLU, HumanEval, MATH, ARC-Challenge — columns of percentages, often with a few cells highlighted in green to show where this model beats the last one. The numbers spread fast: they anchor comparison threads on Reddit, fill the "why we switched" posts on LinkedIn, and quietly govern which model your company buys a seat on.
I want to make an uncomfortable argument: for most of what you actually use AI for, benchmark numbers are close to useless — and relying on them as a primary signal is making you a worse judge of the tools you use. Not because the benchmarks are faked (mostly they aren't), but because they're measuring something real and narrow and calling it general. The gap between "best on the table" and "best for your work" is wide, frequently misunderstood, and systematically papered over by every company competing on those tables.
By the end of this piece, you'll understand what benchmarks actually measure, why the numbers drift further from usefulness over time, and how to evaluate a model for the work you actually do — a framework you can use right now, before the next release cycle.
TL;DR
- Benchmarks measure narrow, static, well-defined tasks. They are useful proxies for a model's general capability floor but poor predictors of how it will handle your specific tasks.
- Models get trained with awareness of the benchmarks that will evaluate them. High scores partly reflect optimization for the benchmark, not just general intelligence.
- Benchmark saturation means the top models often score within noise of each other, so the headline number stops discriminating in the range that matters most.
- The right signal isn't a leaderboard position. It's systematic self-testing on your actual tasks, at your actual volumes, in your actual context.
- The skills that make you a good AI evaluator are the same ones that make you a good thinker: ask whether the model is actually right, not just confidently detailed.
Steelman first: benchmarks are genuinely useful
I should concede what's true, because the weak version of this argument would be wrong.
Benchmarks did something important in the early years of the LLM era: they gave researchers a shared vocabulary for measuring capability. Before standardized evaluation, comparing models meant picking anecdotes — mine passed this prompt, yours failed that one. Benchmarks forced rigor. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects. HumanEval measures code generation. MATH tests mathematical reasoning. These aren't arbitrary; they were built by smart people trying to probe real capabilities.
And they work, roughly. A model that scores 40% on MMLU will almost certainly be worse at most knowledge tasks than one that scores 80%. The benchmarks discriminate at the extremes — they correctly rank a capable model above a weak one. If you're trying to choose between a well-funded frontier model and a two-year-old open-source model with half the parameters, the benchmark table will probably point you right.
The problem is that most people aren't in that comparison. They're comparing the top three models on a leaderboard where the scores are 89%, 87%, and 86% — and treating those numbers like they're the only evidence they need.
What benchmarks actually measure
Every benchmark is a fixed test. It was designed at a specific moment, on a specific set of tasks, and it stays fixed. The tasks on MMLU are multiple-choice questions on topics from high school biology to professional law. HumanEval is 164 specific Python coding problems. They're curated, labeled, and frozen.
This creates a fundamental limitation: a benchmark measures how well a model performs on the tasks in that benchmark, not on tasks in general. Those two things overlap meaningfully — but they're not the same.
| What benchmarks measure well | What benchmarks measure poorly |
|---|---|
| Broad knowledge coverage (does the model know facts across domains?) | Instruction-following on ambiguous, multi-step real tasks |
| Reasoning on well-defined problems with known answers | Judgment quality on tasks where "correct" is subjective |
| Code generation on isolated, self-contained functions | Code generation in a large existing codebase |
| Mathematical reasoning with clean problem statements | Mathematical reasoning embedded in messy natural language |
| Performance at a fixed snapshot in time | Consistency over thousands of uses in varying contexts |
The tasks you actually use AI for rarely look like the benchmark tasks. You're writing a memo in your company's voice. You're debugging a function in a 50-file project. You're summarizing a document for an audience the model has never seen. You're asking a question where the right answer requires knowing context that lives in your notes. Benchmarks weren't designed for any of this.
The Goodhart problem: when the measure becomes the target
There's a principle from economics that machine learning keeps rediscovering: when a measure becomes a target, it stops being a good measure. This is Goodhart's Law, and it applies squarely to AI benchmarks.
Model developers know which benchmarks reviewers will run. They know the benchmark datasets, the scoring methodology, and roughly what the top score looks like. This creates an incentive, conscious or not, to optimize for the benchmark during training and fine-tuning. It's not fraud. It's the natural result of judging models against fixed, public test sets: every data mix, fine-tuning recipe, or checkpoint that gets kept because it moved the metric nudges the model a little closer to the benchmark.
The result is that high benchmark scores partially reflect optimization for the benchmark, not just general capability. A model with exceptional benchmark scores might perform worse than a lower-scoring competitor on tasks that weren't in the training signal.
Framework: The "Goodhart distance" question
Before you trust a benchmark number, ask how close the test task is to the training signal:
1. Is this benchmark widely known and used in training evaluations? The more it's used, the more the scores reflect optimization.
2. Are the test examples publicly available? If yes, data contamination (accidentally training on test data) becomes plausible.
3. How far is this benchmark task from my actual use case? The further, the less the number tells you.
4. Does the model developer publish training details? Benchmark scores from developers who are opaque about training should get more skepticism.
Benchmark saturation: when the numbers stop discriminating
There's a specific failure mode that makes the leaderboard especially misleading right now: saturation. When the best models are all scoring in the 85–92% range on a benchmark, a few things become true simultaneously:
- The differences between scores can fall within run-to-run noise: prompt format, sampling settings, and the finite number of test items all move the result.
- The benchmark isn't hard enough to reveal capability differences among the leaders.
- New benchmarks get created to reveal new gaps — but quickly become targets too.
This is why there's a constant arms race of new benchmarks: GPQA, ARC-AGI, FrontierMath, Humanity's Last Exam. Researchers keep raising the difficulty because the old tests stopped discriminating. But each new benchmark has a short useful life before the same dynamics apply.
What this means practically: if you're choosing between the top three frontier models based on benchmark leaderboards where the scores are all above 85%, you are making your decision in the noise band. The numbers aren't telling you which model is better for you. They're telling you the models are roughly similar on the tests they were optimized for.
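To make the noise band concrete, here is a rough sketch of the sampling error on a small benchmark. It uses HumanEval's 164 problems and the 89/87/86 scores from the earlier example; the model names are placeholders, and the normal-approximation interval is my simplification. It captures only the uncertainty from having a finite number of test items, not prompt or decoding differences between runs.

```python
import math


def score_interval(accuracy: float, n_items: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n_items questions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    # Standard error of a proportion (normal approximation).
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_ITEMS = 164  # number of problems in HumanEval
scores = {"model_a": 0.89, "model_b": 0.87, "model_c": 0.86}  # placeholder names
intervals = {name: score_interval(acc, N_ITEMS) for name, acc in scores.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {scores[name]:.0%} (roughly {low:.1%} to {high:.1%})")

print("top and bottom overlap:", intervals_overlap(intervals["model_a"], intervals["model_c"]))
```

On 164 items, each score carries roughly five points of uncertainty in either direction, so a three-point gap between first and third place is not a ranking you can lean on. Benchmarks with thousands of items have much tighter sampling error, but run-to-run differences in prompting and decoding still apply to them.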
How to actually evaluate a model for your work
The alternative isn't mysticism. It's systematic self-testing — and it's faster than it sounds.
Step 1: Write down your five most common AI tasks. These should be specific: "summarize an earnings call transcript into five bullet points for my team," not "summarize text." The more specific, the more diagnostic the test.
Step 2: Create a small test set — 5–10 examples per task. Use real examples from your work if you have them. Synthetic examples are fine if you don't. The key is that they should represent the actual distribution of difficulty you encounter, not just the easy cases.
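If it helps to keep the test set somewhere more durable than a notes file, Steps 1 and 2 can be sketched as a small data structure. Everything here (`Example`, `Task`, `check_test_set`, the `hard` flag) is a hypothetical layout of my own, not a standard format; the thresholds simply mirror the five-tasks, 5–10-examples guidance above.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    prompt: str          # the exact text you would send to the model
    hard: bool = False   # mark edge cases and known-difficult inputs


@dataclass
class Task:
    name: str            # keep it specific, e.g. "earnings call summary, five bullets"
    examples: list[Example] = field(default_factory=list)


def check_test_set(tasks: list[Task]) -> list[str]:
    """Return warnings where the set falls short of the guidance above."""
    warnings = []
    if len(tasks) < 5:
        warnings.append(f"only {len(tasks)} task(s); aim for 5")
    for task in tasks:
        n = len(task.examples)
        if not 5 <= n <= 10:
            warnings.append(f"{task.name}: {n} example(s); aim for 5 to 10")
        if not any(ex.hard for ex in task.examples):
            warnings.append(f"{task.name}: no hard cases flagged")
    return warnings


# A deliberately thin draft: one task, two easy examples.
draft = [
    Task("earnings call summary", [Example("first transcript"), Example("second transcript")])
]
for warning in check_test_set(draft):
    print(warning)
```

The `hard` flag is there to keep you honest about difficulty: a set with no flagged hard cases is probably a set of easy cases.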
Step 3: Run the same prompts on each model you're evaluating. Score the outputs. Scoring doesn't need to be elaborate:
| Score | Meaning |
|---|---|
| 3 — Good | I'd use this output with minor edits |
| 2 — Acceptable | I'd use it, but I'd rewrite significant parts |
| 1 — Poor | I'd discard this and start over |
| 0 — Wrong | The output is factually or logically wrong in a way that matters |
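
Step 3 can be as simple as a loop. The sketch below assumes each model is wrapped in a plain function that takes a prompt and returns text, and that `grade` is you reading the output and assigning a score from the rubric above. The stub models and stub grader are stand-ins so the example runs; none of this is a real provider API.

```python
from typing import Callable

ModelFn = Callable[[str], str]        # prompt in, output text out
GradeFn = Callable[[str, str], int]   # (prompt, output) in, rubric score out

RUBRIC = {3: "Good", 2: "Acceptable", 1: "Poor", 0: "Wrong"}


def run_comparison(
    prompts: list[str], models: dict[str, ModelFn], grade: GradeFn
) -> dict[str, list[int]]:
    """Send identical prompts to every model and collect one score per prompt."""
    results: dict[str, list[int]] = {name: [] for name in models}
    for prompt in prompts:
        for name, model in models.items():
            output = model(prompt)
            score = grade(prompt, output)
            if score not in RUBRIC:
                raise ValueError(f"score must be one of {sorted(RUBRIC)}")
            results[name].append(score)
    return results


# Stand-ins for illustration only; replace with real model calls and human grading.
def stub_model_a(prompt: str) -> str:
    return prompt.upper()


def stub_model_b(prompt: str) -> str:
    return ""


def stub_grade(prompt: str, output: str) -> int:
    return 3 if output else 1


results = run_comparison(
    ["Summarize this transcript in five bullets."],
    {"model_a": stub_model_a, "model_b": stub_model_b},
    stub_grade,
)
print(results)
```

The important property is that every model sees exactly the same prompts in the same order, so differences in the scores come from the models rather than from how you asked.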
Step 4: Look at variance, not just average. A model with an average score of 2.5 that occasionally scores 0 (catastrophically wrong) is worse for your work than a model that consistently scores 2. Reliability matters more than peak performance on most professional tasks.
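Here is a minimal sketch of that comparison using made-up scores: one model that averages 2.5 but fails badly now and then, and one that scores 2 every time. The numbers are illustrative, not measurements.

```python
from statistics import mean, pstdev


def summarize(scores: list[int]) -> dict[str, float]:
    """Report the average alongside the numbers an average hides."""
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),                      # population standard deviation
        "worst": min(scores),
        "wrong_rate": scores.count(0) / len(scores),   # share of outputs scored 0
    }


spiky = [3, 3, 3, 3, 3, 3, 3, 3, 1, 0]   # higher average, occasional failure
steady = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # lower average, no surprises

for name, scores in {"spiky": spiky, "steady": steady}.items():
    print(name, summarize(scores))
```

The spiky model wins on the mean and loses on everything else: one output in ten is wrong in a way that matters, and you only find out which one by checking all ten.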
Step 5: Test context sensitivity. Provide the same task with and without relevant context (a document, a template, a set of constraints). The best models use context accurately; the weaker ones hallucinate confidently either way. (See also: [You Don't Have a Prompting Problem. You Have a Context Problem.](/content/published/ai-literacy/you-dont-have-a-prompting-problem) for a full treatment of why context is the real variable.)
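One way to run this test is to build each prompt in two versions and grade both with the same 0–3 rubric. The sketch below is an assumed layout: the `Context:` / `Task:` prompt template, the `PairedResult` record, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    example_id: str
    score_without: int   # rubric score when the model saw only the task
    score_with: int      # rubric score when the context was included


def build_prompts(task: str, context: str) -> tuple[str, str]:
    """Return the same task twice: bare, and with the context prepended."""
    return task, f"Context:\n{context}\n\nTask:\n{task}"


def context_report(results: list[PairedResult]) -> dict[str, object]:
    """Summarize how much the context changed the graded quality."""
    gains = [r.score_with - r.score_without for r in results]
    return {
        "mean_gain": sum(gains) / len(gains),
        # Examples worth rereading: the context made no measurable difference.
        "no_improvement": [r.example_id for r in results if r.score_with <= r.score_without],
    }


bare, grounded = build_prompts("Draft the Q3 update.", "Template: three short sections.")
report = context_report([PairedResult("memo-1", 1, 3), PairedResult("memo-2", 2, 2)])
print(report)
```

An example that lands in `no_improvement` is not automatically a failure, since the task may not have needed the context. It is the place to check whether the model actually read what you gave it.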
Your evaluation checklist
- [ ] I have at least 5 real tasks I've run through this model
- [ ] I've tested each task on 5–10 examples, not just one
- [ ] I've looked at failure cases, not just successes
- [ ] I've tested how the model handles my actual context (documents, notes, constraints)
- [ ] I've compared the model's confident-sounding answer against ground truth at least once
- [ ] I've run the same task on the comparison model with identical prompts
Common mistakes
Trusting the benchmark table as a proxy for your task. The benchmark measures its own tasks. Only a test on your task measures yours.
Comparing score percentages across different benchmarks. A 90% on one benchmark and a 90% on another are not the same thing. The benchmarks have wildly different difficulty ceilings and scoring methodologies.
Testing only on easy examples. Cherry-picked demos — including your own casual testing — systematically over-represent cases where the model performs well. If you want a real signal, test on the hard cases, the edge cases, and the cases where you've seen other tools fail.
Treating "better at coding" as general superiority. A model that dominates HumanEval may be fine-tuned specifically for Python function generation. That's not the same as being generally better at reasoning, writing, or following complex instructions.
Updating too fast on each new release. Every release cycle resets the benchmark table. If you switch models based on each new leaderboard, you're optimizing for benchmark performance with all the limitations that implies. A model that's slightly lower on the table but handles your specific tasks reliably is almost always the right choice.
The takeaway
Benchmarks are not lies. They're narrow measurements that got promoted to a role they weren't designed for — because the industry needed a fast, comparable, publishable signal, and benchmark numbers are fast, comparable, and publishable.
The skill worth developing is not learning to read benchmark tables better. It's learning to test models on your actual work, to notice when a confident answer is wrong, and to build a small personal evaluation set that compounds in value the more you use it. That's harder than reading a leaderboard. It's also the only thing that actually tells you which tool is right for the work you do.
Start with five tasks. Build ten examples. Score them honestly. You'll know more in an afternoon of testing than you'd learn from a year of benchmark releases.
Related: [You Don't Have a Prompting Problem. You Have a Context Problem.](/content/published/ai-literacy/you-dont-have-a-prompting-problem) — why the context you give a model matters more than how you phrase your prompt.