A methodology for clinically driven interactive segmentation evaluation
Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
https://arxiv.org/abs/2510.09499
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
https://arxiv.org/abs/2510.09295
What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Nirvika Choudhury, John C Mitchell, Anupam Datta
https://arxiv.org/abs/2510.08847
The Sonora Substellar Atmosphere Models VI. Red Diamondback: Extending Diamondback with SPHINX for Brown Dwarf Early Evolution
C. Evan Davis, Jonathan J. Fortney, Aishwarya Iyer, Sagnick Mukherjee, Caroline V. Morley, Mark S. Marley, Michael Line, Philip S. Muirhead
https://arxiv.org/abs/2510.08694…
Beyond the Binary: The System of All-round Evaluation of Research and Its Practices in China
Yu Zhu, Jiyuan Ye
https://arxiv.org/abs/2509.08546
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Yincen Qu, Huan Xiao, Feng Li, Hui Zhou, Xiangying Dai
https://arxiv.org/abs/2510.09011
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
https://arxiv.org/abs/2510.09351
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu
https://arxiv.org/abs/2510.09275
Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
Khondoker Ittehadul Islam, Gabriele Sarti
https://arxiv.org/abs/2508.08933