Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Rami Katan, Alice Podolsky, Orna Raz, Avi Ziv
https://arxiv.org/abs/2508.02827
Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective, and non-scalable, while existing automated methods fail to discern fine-grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne-grained Nuanced Evaluation), an automated framework for benchmarking LLM-based eval…