Tootfinder

Opt-in global Mastodon full-text search.

No exact results. Similar results found.
@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:53:11

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
arxiv.org/abs/2507.02856

@arXiv_csCV_bot@mastoxiv.page
2025-06-04 14:53:46

This arxiv.org/abs/2505.19028 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCV_…

@arXiv_csCY_bot@mastoxiv.page
2025-06-04 13:34:40

This arxiv.org/abs/2506.00095 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCY_…

@arXiv_csCY_bot@mastoxiv.page
2025-06-05 09:38:07

This arxiv.org/abs/2506.00095 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCY_…

@arXiv_csAI_bot@mastoxiv.page
2025-07-01 09:54:43

MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
arxiv.org/abs/2506.22992

@arXiv_astrophGA_bot@mastoxiv.page
2025-06-03 07:45:53

Millimeter-wave observations of Euclid Deep Field South using the South Pole Telescope: A data release of temperature maps and catalogs
M. Archipley, A. Hryciuk, L. E. Bleem, K. Kornoelje, M. Klein, A. J. Anderson, B. Ansarinejad, M. Aravena, L. Balkenhol, P. S. Barry, K. Benabed, A. N. Bender, B. A. Benson, F. Bianchini, S. Bocquet, F. R. Bouchet, E. Camphuis, M. G. Campitiello, J. E. Carlstrom, J. Cathey, C. L. Chang, S. C. Chapman, P. Chaubal, P. M. Chichura, A. Chokshi, T. -L. Chou…

@arXiv_csCY_bot@mastoxiv.page
2025-06-03 07:24:04

ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
Yuchong Li, Xiaojun Zeng, Chihua Fang, Jian Yang, Lei Zhang
arxiv.org/abs/2506.00095

@arXiv_eessSP_bot@mastoxiv.page
2025-07-01 11:48:23

Automatic Phase Calibration for High-resolution mmWave Sensing via Ambient Radio Anchors
Ruixu Geng, Yadong Li, Dongheng Zhang, Pengcheng Huang, Binquan Wang, Binbin Zhang, Zhi Lu, Yang Hu, Yan Chen
arxiv.org/abs/2506.23472

@arXiv_csCC_bot@mastoxiv.page
2025-07-02 07:30:59

Sensitivity and Query Complexity under Uncertainty
Deepu Benson, Balagopal Komarath, Nikhil Mande, Sai Soumya Nalli, Jayalal Sarma, Karteek Sreenivasaiah
arxiv.org/abs/2507.00148

@arXiv_csCL_bot@mastoxiv.page
2025-06-27 09:59:49

Potemkin Understanding in Large Language Models
Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
arxiv.org/abs/2506.21521 arxiv.org/pdf/2506.21521 arxiv.org/html/2506.21521
arXiv:2506.21521v1 Announce Type: new
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.