What Connoisseurship Demands That Generation Benchmarks Cannot
The institutional consequence of CArtBench is that it makes visible a gap the field has been papering over: generation benchmarks measure output, but they cannot measure whether a model understands what it is producing. CArtBench's CATALOGCAPTION task demands structured four-section expert-style appreciations ; REINTERPRET asks for defensible reinterpretations rated by specialists . Neither task has a correct answer a model can retrieve — both require the kind of reasoning art historians use when arguing about attribution. The benchmark's museum-grounded authentication approach, aligning Palace Museum objects with authoritative catalog pages, means the evidentiary standard is archival rather than crowd-sourced. That makes it resistant to the benchmark saturation that has made MMLU-Pro and similar tests increasingly unreliable signals of frontier capability. The creative AI field has been building tools that can produce; CArtBench is the first evaluation infrastructure that asks whether those tools can judge — and labs that optimize for generation scores will not reach it.