A methodology for clinically driven interactive segmentation evaluation
Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
https://arxiv.org/abs/2510.09499
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
https://arxiv.org/abs/2510.09295
What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Nirvika Choudhury, John C Mitchell, Anupam Datta
https://arxiv.org/abs/2510.08847
The Sonora Substellar Atmosphere Models VI. Red Diamondback: Extending Diamondback with SPHINX for Brown Dwarf Early Evolution
C. Evan Davis, Jonathan J. Fortney, Aishwarya Iyer, Sagnick Mukherjee, Caroline V. Morley, Mark S. Marley, Michael Line, Philip S. Muirhead
https://arxiv.org/abs/2510.08694…
The Secular Evolution of Planetary Nebula IC 418 and Its Implications for Carbon Star Formation: https://iopscience.iop.org/article/10.3847/2041-8213/adf62b -> HKU Astrophysics Research Captures 130 Years of Evolution of a Dying Star: https://www.hku.hk/press/news_detail_28550.html
Park Service orders changes to staff ratings, a move experts call illegal
A top National Park Service official has instructed park superintendents to limit the number of staff who get top marks in performance reviews,
a move that experts say violates federal code and could make it easier to lay off staff.
Parks leadership generally evaluates individual employees annually on a five-point scale,
with a three rating given to those who are successful in achieving their go…
Automated Evolutionary Optimization for Resource-Efficient Neural Network Training
Ilia Revin, Leon Strelkov, Vadim A. Potemkin, Ivan Kireev, Andrey Savchenko
https://arxiv.org/abs/2510.09566
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Yincen Qu, Huan Xiao, Feng Li, Hui Zhou, Xiangying Dai
https://arxiv.org/abs/2510.09011
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
https://arxiv.org/abs/2510.09351
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu
https://arxiv.org/abs/2510.09275