AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam
https://arxiv.org/abs/2506.00641
FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
Hideaki Joko, Faegheh Hasibi
https://arxiv.org/abs/2506.00314
From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation
Serry Sibaee, Omer Nacar, Adel Ammar, Yasser Al-Habashi, Abdulrahman Al-Batati, Wadii Boulila
https://arxiv.org/abs/2506.01920
Evaluating Robot Policies in a World Model
Julian Quevedo, Percy Liang, Sherry Yang
https://arxiv.org/abs/2506.00613
Regionalized Metric Framework: A Novel Approach for Evaluating Multimodal Multi-Objective Optimization Algorithms
Jintai Chen, Fangqing Liu, Xueming Yan, Han Huang
https://arxiv.org/abs/2506.00468
CiteEval: Principle-Driven Citation Evaluation for Source Attribution
Yumo Xu, Peng Qi, Jifan Chen, Kunlun Liu, Rujun Han, Lan Liu, Bonan Min, Vittorio Castelli, Arshit Gupta, Zhiguo Wang
https://arxiv.org/abs/2506.01829
MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li
https://arxiv.org/abs/2506.01776
CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions
Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, Yi Zhang
https://arxiv.org/abs/2506.01859
Human-Centric Evaluation for Foundation Models
Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
https://arxiv.org/abs/2506.01793
RewardBench 2: Advancing Reward Model Evaluation
Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert
https://arxiv.org/abs/2506.01937