Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
https://arxiv.org/abs/2507.02856
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
https://arxiv.org/abs/2506.22992
Millimeter-wave observations of Euclid Deep Field South using the South Pole Telescope: A data release of temperature maps and catalogs
M. Archipley, A. Hryciuk, L. E. Bleem, K. Kornoelje, M. Klein, A. J. Anderson, B. Ansarinejad, M. Aravena, L. Balkenhol, P. S. Barry, K. Benabed, A. N. Bender, B. A. Benson, F. Bianchini, S. Bocquet, F. R. Bouchet, E. Camphuis, M. G. Campitiello, J. E. Carlstrom, J. Cathey, C. L. Chang, S. C. Chapman, P. Chaubal, P. M. Chichura, A. Chokshi, T. -L. Chou…
ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
Yuchong Li, Xiaojun Zeng, Chihua Fang, Jian Yang, Lei Zhang
https://arxiv.org/abs/2506.00095
Automatic Phase Calibration for High-resolution mmWave Sensing via Ambient Radio Anchors
Ruixu Geng, Yadong Li, Dongheng Zhang, Pengcheng Huang, Binquan Wang, Binbin Zhang, Zhi Lu, Yang Hu, Yan Chen
https://arxiv.org/abs/2506.23472
Sensitivity and Query Complexity under Uncertainty
Deepu Benson, Balagopal Komarath, Nikhil Mande, Sai Soumya Nalli, Jayalal Sarma, Karteek Sreenivasaiah
https://arxiv.org/abs/2507.00148
Potemkin Understanding in Large Language Models
Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
https://arxiv.org/abs/2506.21521 https://arxiv.org/pdf/2506.21521 https://arxiv.org/html/2506.21521
arXiv:2506.21521v1 Announce Type: new
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
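For illustration only: the abstract's second procedure lower-bounds the prevalence of potemkins by checking whether a model that answers a definitional "keystone" question correctly can also apply the same concept. The sketch below is one hedged reading of that idea, not the paper's implementation; the ConceptItem format, the ask_model callable, and the is_correct grader are hypothetical stand-ins.

```python
# Hypothetical sketch: lower-bounding the rate of "potemkin" failures,
# i.e. cases where a model states a concept correctly (a definitional
# "keystone" question) yet fails to apply that same concept.
# The item format and helper callables are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConceptItem:
    concept: str
    keystone_question: str      # e.g. "Define an ABAB rhyme scheme."
    keystone_answer: str        # reference answer for the keystone
    application_question: str   # e.g. "Does this poem follow ABAB?"
    application_answer: str     # reference answer for the application


def potemkin_lower_bound(
    items: List[ConceptItem],
    ask_model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of correctly-stated concepts whose application fails.

    Every such case is evidence of potemkin understanding, so the
    returned rate is a lower bound on its prevalence.
    """
    keystone_correct = 0
    potemkins = 0
    for item in items:
        if not is_correct(ask_model(item.keystone_question), item.keystone_answer):
            continue  # model does not even state the concept correctly
        keystone_correct += 1
        if not is_correct(ask_model(item.application_question), item.application_answer):
            potemkins += 1
    return potemkins / keystone_correct if keystone_correct else 0.0


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_items = [
        ConceptItem(
            concept="ABAB rhyme scheme",
            keystone_question="What is an ABAB rhyme scheme?",
            keystone_answer="alternating rhymes",
            application_question="Does 'cat/dog/hat/log' follow ABAB?",
            application_answer="yes",
        )
    ]
    fake_model = lambda q: "alternating rhymes" if "What is" in q else "no"
    exact_match = lambda pred, ref: pred.strip().lower() == ref.strip().lower()
    print(potemkin_lower_bound(toy_items, fake_model, exact_match))  # prints 1.0
```

In this toy run the model defines the rhyme scheme correctly but misjudges the example that instantiates it, so the concept counts toward the potemkin rate.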