METR reports that on 18 real tasks from two open-source repos, agents often produce functionally correct code that still can’t be used due to missing tests, lint/format issues, and weak code quality. Automatic scoring inflates performance relative to what teams can actually ship.
— If headline scores overstate agent reliability, media, investors, and policymakers should temper automation claims and demand holistic, real‑world evals before deploying agents in critical workflows.
msmash
2025.10.14
62% relevant
Its core point, that stacked frameworks and unnoticed overhead produce systems that fail under real use (e.g., a calculator app leaking 32GB of memory), echoes the gap between glossy benchmarked outputs and shippable, maintainable software agents highlighted in that idea.
msmash
2025.09.10
78% relevant
By spotlighting Taco Bell’s failed drive‑thru bot, Replit’s 'vibe coding' database wipe, and McDonald’s hiring chatbot that exposed data on 64 million applicants behind a '123456' password, the awards illustrate the gap between impressive AI demos and fragile, unsafe production systems.
Alexander Kruel
2025.08.14
100% relevant
The “METR Research Update” item in the roundup notes that 'automatic scoring used by many benchmarks may overestimate AI agent real-world performance.'
Ethan Mollick
2025.04.20
72% relevant
Mollick argues that 'benchmarks aren’t everything,' shows that prompting can swing scores, and demonstrates o3 accomplishing a multi‑step business task (research, strategy, logo, and website) from a single vague prompt, an example of agent usability outperforming headline test metrics.