What Unquantified Measurement Noise Actually Enables
The Messing paper's most consequential finding is not that evaluation scores are noisy, which has long been suspected, but that the noise is exploitable in a specific direction. When variance goes unreported, the party best positioned to probe the measurement surface is the model developer, not the evaluator. Labs that run repeated evals against the same benchmark can identify which prompt phrasings, judge configurations, and temperature settings favor their model's outputs, then ship the version that scores highest under those conditions. The benchmark community cannot detect this because it has built no mechanism to do so: standard confidence intervals are designed for sampling variance, not for the multi-source variance that prompt and judge sensitivity introduce.
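The gap between sampling variance and condition-to-condition variance is easier to see in a toy simulation. The sketch below is not the Messing paper's method. It assumes a hypothetical model whose true accuracy shifts by a small random offset under each eval condition (a prompt, judge, and temperature combination), and every parameter value (`n_items`, `n_conditions`, `base_acc`, `condition_sd`) is invented for illustration. It compares the textbook binomial confidence interval for a single run with the spread across conditions, and shows how far the best condition sits above the average.

```python
import math
import random
import statistics


def simulate_condition_scores(n_items=1000, n_conditions=24, base_acc=0.70,
                              condition_sd=0.02, seed=0):
    """Simulate benchmark accuracy under several eval conditions.

    Assumption (illustrative only): each condition shifts the model's
    true accuracy by a Gaussian offset, and items are then scored as
    independent Bernoulli trials.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_conditions):
        # Condition-level shift: the variance source a per-run CI ignores.
        acc = min(max(base_acc + rng.gauss(0.0, condition_sd), 0.0), 1.0)
        # Item-level sampling noise: the only source a per-run CI covers.
        correct = sum(rng.random() < acc for _ in range(n_items))
        scores.append(correct / n_items)
    return scores


def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a single run."""
    return z * math.sqrt(p * (1.0 - p) / n)


scores = simulate_condition_scores()
mean_score = statistics.mean(scores)
best_score = max(scores)
sampling_halfwidth = binomial_ci_halfwidth(mean_score, 1000)
across_condition_sd = statistics.stdev(scores)

print(f"mean over conditions:      {mean_score:.3f}")
print(f"best single condition:     {best_score:.3f}")
print(f"per-run 95% CI half-width: {sampling_halfwidth:.3f}")
print(f"stdev across conditions:   {across_condition_sd:.3f}")
```

A developer who reports only the best condition together with its per-run interval publishes a number that is formally correct for that one run, while the interval says nothing about how the score moves when the conditions change.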
The open-weights community faces a version of this problem that it cannot solve with transparency alone. Publishing model weights does not publish the eval conditions under which a score was produced. A Hugging Face leaderboard entry showing a Qwen or Mistral model at a given MMLU score carries no attached distribution of results across prompt variants and judge choices. The Open LLM Leaderboard numbers that sparked the original Falcon benchmark debate showed precisely this problem years before the Messing analysis formalized it: the community intuited the fragility before it had a statistical account of why. What the community has now is proof that the fragility is structural, not incidental, and that more data makes it worse rather than better.
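One way to picture what an "attached distribution" could look like is a record that carries per-condition results alongside the headline number. This is a minimal sketch, not a format used by any existing leaderboard: the `ConditionResult` type, the `summarize` helper, the condition names, and the scores are all hypothetical placeholders rather than real results.

```python
from dataclasses import dataclass
import statistics


@dataclass(frozen=True)
class ConditionResult:
    """One score under one eval condition (all field names hypothetical)."""
    prompt_variant: str
    judge: str
    score: float


def summarize(results):
    """Summarize the spread of scores across eval conditions."""
    scores = [r.score for r in results]
    return {
        "n_conditions": len(scores),
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        # Sample stdev needs at least two points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


# Placeholder values for illustration only, not measured results.
results = [
    ConditionResult("prompt_a", "judge_x", 0.60),
    ConditionResult("prompt_a", "judge_y", 0.64),
    ConditionResult("prompt_b", "judge_x", 0.62),
    ConditionResult("prompt_b", "judge_y", 0.66),
]
summary = summarize(results)
print(summary)
```

Publishing the minimum, maximum, and spread next to the headline score would let a reader see at a glance whether a reported number is typical of the model or the top of its range.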