Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
https://arxiv.org/abs/2507.21028
NFL evaluators have seen decline in Steelers' Minkah Fitzpatrick, reportedly leading to him being available
https://www.cbssports.com/nfl/ne…
Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation
Minju Kim, Dongje Yoo, Yeonjun Hwang, Minseok Kang, Namyoung Kim, Minju Gwak, Beong-woo Kwak, Hyungjoo Chae, Harim Kim, Yunjoong Lee, Min Hee Kim, Dayi Jung, Kyong-Mee Chung, Jinyoung Yeo
https://arxiv.org/abs/2507.19643
LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests
Zachariah Sollenberger, Rahul Patel, Saieda Ali Zada, Sunita Chandrasekaran
https://arxiv.org/abs/2507.21447
Music Arena: Live Evaluation for Text-to-Music
Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue
https://arxiv.org/abs/2507.20900
How to Evaluate the Accuracy of Online and AI-Based Symptom Checkers: A Standardized Methodological Framework
Marvin Kopka, Markus A. Feufel
https://arxiv.org/abs/2506.22379
Evaluating Differentially Private Generation of Domain-Specific Text
Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Warren Del-Pinto, Goran Nenadic, Siew-Kei Lam, Jie Zhang, Anil A Bharath
https://arxiv.org/abs/2508.20452
Multilingual Self-Taught Faithfulness Evaluators
Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico
https://arxiv.org/abs/2507.20752
ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan
https://arxiv.org/abs/2508.20973
Evaluating Scoring Bias in LLM-as-a-Judge
Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu
https://arxiv.org/abs/2506.22316