Tootfinder

No exact results. Similar results found.

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:53:11

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
https://arxiv.org/abs/2507.02856

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:22:49

OntoRAG: Enhancing Question-Answering through Automated Ontology Derivation from Unstructured Knowledge Bases
Yash Tiwari, Owais Ahmad Lone, Mayukha Pal
https://arxiv.org/abs/2506.00664

OntoRAG: Enhancing Question-Answering through Automated Ontology Derivation from Unstructured Knowledge Bases
Ontologies are pivotal for structuring knowledge bases to enhance question answering (QA) systems powered by Large Language Models (LLMs). However, traditional ontology creation relies on manual efforts by domain experts, a process that is time intensive, error prone, and impractical for large, dynamic knowledge domains. This paper introduces OntoRAG, an automated pipeline designed to derive ontologies from unstructured knowledge bases, with a focus on electrical relay documents. OntoRAG integr…

@arXiv_csIR_bot@mastoxiv.page
2025-06-03 07:30:23

Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering
Linhao Ye, Lang Yu, Zhikai Lei, Qin Chen, Jie Zhou, Liang He
https://arxiv.org/abs/2506.00491

Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering
Retrieval-augmented generation (RAG) is usually integrated into large language models (LLMs) to mitigate hallucinations and knowledge obsolescence. Whereas,conventional one-step retrieve-and-read methods are insufficient for multi-hop question answering, facing challenges of retrieval semantic mismatching and the high cost in handling interdependent subquestions. In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). …

@arXiv_csCV_bot@mastoxiv.page
2025-06-04 14:53:46

This https://arxiv.org/abs/2505.19028 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understand…

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:21:05

Self-ensemble: Mitigating Confidence Distortion for Large Language Models
Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman
https://arxiv.org/abs/2506.01951

Self-ensemble: Mitigating Confidence Distortion for Large Language Models
Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to a substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into se…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:26:17

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
https://arxiv.org/abs/2506.00781 ht…

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This …

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:20:10

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu
https://arxiv.org/abs/2506.01784 https://

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning pose…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 18:09:48

This https://arxiv.org/abs/2505.18492 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions
Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical competitions, in particular, present two challenging problem types: theorem-proving, requiring rigorous proofs of stated conclusions, and answer-construction, involving hypothesizing and formally verifying mathematical objects. Large Language Models (LLMs) effectively generate creative candidate answers but str…

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:44:51

SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun
https://arxiv.org/abs/2507.02822

SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost …

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:21:03

Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu, Bingbing Wen, Bin Han, Robert Wolfe, Lucy Lu Wang, Bill Howe
https://arxiv.org/abs/2506.00582

Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: less sensitive to task difficulty, and when prompted to answer based on different personas -- e.g., expert vs laym…

Tootfinder

Opt-in global Mastodon full text search. Join the index!