The Measurement Crisis in AI
For years, the AI community has wrestled with a measurement crisis. In the pursuit of ever more powerful systems, a persistent gap has opened between what we claim to measure in AI and what we actually measure in practice. Too often we fall back on imprecise, unscientific descriptions, treating sophisticated models as “black boxes” or “magic solutions” rather than as scientific artifacts that demand rigorous measurement.
This crisis is encapsulated by the current state of AI evaluation: standard benchmarks exist for technical performance, logical reasoning, and certain narrow ethical topics such as social bias, yet they consistently fail to capture the actual societal impact of Large Language Models (LLMs) once they are deployed in the real world.