Tootfinder

@arXiv_csCL_bot@mastoxiv.page
2025-08-22 10:09:21

RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, Luping Zhou
https://arxiv.org/abs/2508.15464 https://

Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces h…

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:11:01

Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics
Reza Sanayei, Srdjan Vesic, Eduardo Blanco, Mihai Surdeanu
https://arxiv.org/abs/2509.15739

Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatte…

@Techmeme@techhub.social
2025-11-18 16:30:55

Google says Gemini 3 Pro scores 1,501 on LMArena, above 2.5 Pro, and demonstrates PhD-level reasoning with top scores on Humanity's Last Exam and GPQA Diamond (Abner Li/9to5Google)
https://9to5google.com/2025/11/18/gemini-3-launch/

Google launches Gemini 3 with state-of-the-art reasoning, ‘generative UI’ for responses, more
Google today announced Gemini 3 with the goal of bringing “any idea to life.” The first model available in this family is Gemini 3 Pro...

@cosmos4u@scicomm.xyz
2025-11-17 07:46:18

Is #AI really just dumb statistics? "Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles," says the (abstract of the) paper https://arxiv.org/abs/2511.10515: "The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods." Oops ...?

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:19:51

Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions
Frederic Kirstein, Sonu Kumar, Terry Ruas, Bela Gipp
https://arxiv.org/abs/2509.15901

Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace …

@arXiv_csAI_bot@mastoxiv.page
2025-10-09 09:56:51

Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
Minju Gwak, Guijin Son, Jaehyung Kim
https://arxiv.org/abs/2510.06953 https://

The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across the experiments on six diffe…

@arXiv_csIR_bot@mastoxiv.page
2025-10-14 10:33:38

Comparative Explanations via Counterfactual Reasoning in Recommendations
Yi Yu, Zhenxing Hu
https://arxiv.org/abs/2510.10920 https://arxiv.org/pdf/2510.109…

Explainable recommendation through counterfactual reasoning seeks to identify the influential aspects of items in recommendations, which can then be used as explanations. However, state-of-the-art approaches, which aim to minimize changes in product aspects while reversing their recommended decisions according to an aggregated decision boundary score, often lead to factual inaccuracies in explanations. To solve this problem, in this work we propose a novel method of Comparative Counterfactual E…

@arXiv_statML_bot@mastoxiv.page
2025-10-07 10:51:32

Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning
Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald
https://arxiv.org/abs/2510.04970 https…

We present FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. It pairs fast parent selection with iterative Cholesky-based score updates, cutting run-times over prior algorithms. This makes it feasible to fully embrace discrete search, enabling iterated local search with principled order initialization to find graphs with scores at or close to the global optimum. The resulting structures are highly accurate across benchmarks, with near-perfect…

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:09:41

Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, Dmitrii Ustiugov
https://arxiv.org/abs/2509.13990 htt…

Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited …
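
The majority-voting step of Self-Consistency mentioned in this abstract can be illustrated with a minimal sketch. It assumes the final answers have already been extracted from each sampled reasoning chain as strings; the function name `self_consistency_vote` and the sample answers are hypothetical, and nothing here reflects the pruning method of Slim-SC itself.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Pick the most frequent final answer among sampled reasoning chains.

    Returns the winning answer and the fraction of chains that agree with it.
    Ties are broken in favor of the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers extracted from five sampled chains.
sampled = ["42", "42", "41", "42", "40"]
answer, agreement = self_consistency_vote(sampled)
print(answer, agreement)
```

Every chain in `sampled` has to be generated in full before the vote, which is the computational overhead the abstract refers to.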

@arXiv_csCL_bot@mastoxiv.page
2025-10-03 10:45:51

What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
https://arxiv.org/abs/2510.01719

Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information …
