Tootfinder

@karlauerbach@sfba.social
2025-08-18 23:51:25

I just filed my first complaint against a California attorney under my own obligations under California's rule 8.3 (of the rules of professional conduct.)
California has a *mandatory* system under which attorneys (I am a member of that klan) *must* report various kinds of misconduct by other attorneys.
In this case that attorney sent out a fishing letter to my incorrect name (but correct address) asserting that they had done actual research with the implication that they…

@arXiv_csCL_bot@mastoxiv.page
2025-08-18 09:22:30

Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction
Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Kun Kuang, Fei Wu
https://arxiv.org/abs/2508.11184

Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this …

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:23:11

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration
Yuanchen Wu, Ke Yan, Shouhong Ding, Ziyin Zhou, Xiaoqiang Li
https://arxiv.org/abs/2509.13919 https://…

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight "rationale fine-tuning" approach, which modifies the model's respons…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:19:50

Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback
Wenzhen Yuan, Shengji Tang, Weihao Lin, Jiacheng Ruan, Ganqu Cui, Bo Zhang, Tao Chen, Ting Liu, Yuzhuo Fu, Peng Ye, Lei Bai
https://arxiv.org/abs/2508.12338

Reinforcement learning (RL) has significantly enhanced the reasoning capabilities of large language models (LLMs), but its reliance on expensive human-labeled data or complex reward models severely limits scalability. While existing self-feedback methods aim to address this problem, they are constrained by the capabilities of a single model, which can lead to overconfidence in incorrect answers, reward hacking, and even training collapse. To this end, we propose Reinforcement Learning from Coev…

@arXiv_csSE_bot@mastoxiv.page
2025-08-18 08:40:10

Hallucination in LLM-Based Code Generation: An Automotive Case Study
Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
https://arxiv.org/abs/2508.11257

Large Language Models (LLMs) have shown significant potential in automating code generation tasks offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations - outputs that appear plausible but are factually incorrect, unverifiable or nonsensical. This paper investigates hallucination phenomena in the context of code generation with a specific focus on the automotive domain. A case study is presented that evaluates m…

@comex@mas.to
2025-10-16 19:11:23

Google's AI Overviews are getting mean.
I Googled 'buck2 fixed point caching', speculatively, wondering if Buck2 had any feature like this. The AI Overview started with: "There is no specific feature in Buck2 called 'fixed point caching.' The term appears to be a misunderstanding of how Buck2's caching mechanisms work in a build system."
The overview went on to give an incorrect definition of "fixed point".

@arXiv_csIT_bot@mastoxiv.page
2025-08-19 08:32:40

Age of Semantic Information-Aware Wireless Transmission for Remote Monitoring Systems
Xue Han, Biqian Feng, Yongpeng Wu, Xiang-Gen Xia, Wenjun Zhang, Shengli Sun
https://arxiv.org/abs/2508.12248

Semantic communication is emerging as an effective means of facilitating intelligent and context-aware communication for next-generation communication systems. In this paper, we propose a novel metric called Age of Incorrect Semantics (AoIS) for the transmission of video frames over multiple-input multiple-output (MIMO) channels in a monitoring system. Different from the conventional age-based approaches, we jointly consider the information freshness and the semantic importance, and then formul…

@arXiv_hepph_bot@mastoxiv.page
2025-08-19 10:13:50

Resolution of spin crisis, and notes on the Bjorken sum rule, anomaly and constituent quark
J. Pasupathy, Janardhan P. Singh
https://arxiv.org/abs/2508.12156 https://

It is shown that the widely used parton model expression for $g_1$ in polarized proton-lepton scattering is incorrect as it ignores gluon-quark spin entanglement. Therefore, there is no spin crisis. A brief summary of results of the theoretical evaluation of non-octet axial vector current renormalization constants and anomaly-anomaly vacuum correlator is given. It suggests that anomaly plays an important role in the transformation of current quarks to constituent quarks and chiral symmetry br…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 13:38:38

Learning to Make MISTAKEs: Modeling Incorrect Student Thinking And Key Errors
Alexis Ross, Jacob Andreas
https://arxiv.org/abs/2510.11502 https://arxiv.org…

Research on reasoning in language models (LMs) predominantly focuses on improving the correctness of their outputs. But some important applications require modeling reasoning patterns that are incorrect. For example, automated systems that can reason about and simulate student errors are useful for providing real-time feedback in the classroom or offline practice for educators-in-training. This paper presents a new method, MISTAKE, that (1) constructs high-quality synthetic examples of reasonin…

@jlpiraux@wallonie-bruxelles.social
2025-09-15 07:49:44

"Conventionally, the output of an AI is graded in a binary way, rewarding it when it gives a correct response and penalizing it when it gives an incorrect one.
In simple terms, in other words, guessing is rewarded — because it might be right — over an AI admitting it doesn't know the answer, which will be graded as incorrect no matter what.
As a result, through "natural statistical pressures," LLMs are far more prone to hallucinate an answer instead of "ac…

@grahamperrin@bsd.cafe
2025-08-17 02:41:46

@… let me guess … the discussion that spammed four lists (ignoring the documented basic rule about never more than two); the one that originated with shouting and swearing in GitHub; the one that proceeded to go off-topic from all four lists; the one that's technically incorrect about the effect of a command.
If you're bored, there's also a twenty-three…

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 09:57:11

Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs
Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, David Clifton
https://arxiv.org/abs/2509.13813

Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, but no existing black-box approach provides estimates for both global and local uncertainty. The former attributes uncertainty to a batch of responses, while the latter attributes uncertainty to individual responses. Current loca…

@yaya@jorts.horse
2025-10-16 04:13:32

my favorite thing about my vocab app is that sometimes the incorrect answers construct an incredible parallel reality
please I want to live in the football dimension where there's a goal in the church and it's normal for wedding photos to have people in cleats and football kits

What do people often do in the kitchen?
Pick 1
ithim (eat)
pasálaim an liathróid (pass the ball)

What clothes are often seen in wedding ceremony photos?
Pick 1
bróga peile (cleats)
geansai (jerseys)
léine (shirts)

What can often be found in a church?
Pick 1

leabhar (book)
cúl (goal)

@arXiv_csHC_bot@mastoxiv.page
2025-09-17 10:25:10

Evolution of Programmers' Trust in Generative AI Programming Assistants
Anshul Shah, Thomas Rexin, Elena Tomson, Leo Porter, William G. Griswold, Adalbert Gerald Soosai Raj
https://arxiv.org/abs/2509.13253

Motivation. Trust in generative AI programming assistants is a vital attitude that impacts how programmers use those programming assistants. Programmers that are over-trusting may be too reliant on their tools, leading to incorrect or vulnerable code; programmers that are under-trusting may avoid using tools that can improve their productivity and well-being. Methods. Since trust is a dynamic attitude that may change over time, this study aims to understand programmers' evolution of trust aft…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:38:01

SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Huy Nghiem, Advik Sachdeva, Hal Daumé III
https://arxiv.org/abs/2509.15174

WARNING: This paper contains examples of offensive materials. Toxic content has become pervasive on social media platforms. We introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through …

@arXiv_csCR_bot@mastoxiv.page
2025-09-16 12:00:27

ILA: Correctness via Type Checking for Fully Homomorphic Encryption
Tarakaram Gollamudi, Anitha Gollamudi, Joshua Gancher
https://arxiv.org/abs/2509.11559 https://

RLWE-based Fully Homomorphic Encryption (FHE) schemes add some small noise to the message during encryption. The noise accumulates with each homomorphic operation. When the noise exceeds a critical value, the FHE circuit produces an incorrect output. This makes developing FHE applications quite subtle, as one must closely track the noise to ensure correctness. However, existing libraries and compilers offer limited support to statically track the noise. Additionally, FHE circuits are als…

@arXiv_csIR_bot@mastoxiv.page
2025-09-16 09:24:57

ReFineG: Synergizing Small Supervised Models and LLMs for Low-Resource Grounded Multimodal NER
Jielong Tang, Shuang Wang, Zhenxing Wang, Jianxing Yu, Jian Yin
https://arxiv.org/abs/2509.10975

Grounded Multimodal Named Entity Recognition (GMNER) extends traditional NER by jointly detecting textual mentions and grounding them to visual regions. While existing supervised methods achieve strong performance, they rely on costly multimodal annotations and often underperform in low-resource domains. Multimodal Large Language Models (MLLMs) show strong generalization but suffer from Domain Knowledge Conflict, producing redundant or incorrect mentions for domain-specific entities. To address…

@arXiv_condmatmtrlsci_bot@mastoxiv.page
2025-09-16 09:41:47

"Adiabatic" Elastic Constants in Hubbard-Corrected Density-Functional Theory DFT+U: case UO$_2$
Mahmoud Payami, Samira Sheykhi
https://arxiv.org/abs/2509.11200 https:/…

"Adiabatic" Elastic Constants in Hubbard-Corrected Density-Functional Theory DFT+U: case UO$_2$
Since in DFT+U there are multiple self-consistent electronic solutions, the so called metastable states, the elastic constants computed from stress-vs-strain will be incorrect if some of the strained configurations fall into a different local electronic minimum than the equilibrium non-strained state. So, it is crucial to carefully take steps to keep the same electronic Hubbard occupation branch when computing the stresses for small strained geometries. In this work, we have explained this "adi…

@deprogrammaticaipsum@mas.to
2025-10-05 10:53:38

"If George Boole is the 19th century’s AI scientist, then his contemporary machine learning engineers were Charles Babbage and Ada Lovelace. The Difference Engine, which would be frequently cited as the first example of a (mechanical) programmable digital computer if it had been built at the time, was explicitly designed to _replace_ rather than _augment_ human thought. Just as modern software engineering managers use Jira to avoid thinking about process engineering."

Douglas Hofstadter
You may be worried that I am going to talk about an author of books that are not about programming, and you are correct and incorrect. Correct, in that Hofstadter's books are not about programming (the intellectually hollow like to claim that they are not about anything at all, or that if you think you know what they are about then you did not understand them; this is untrue). Incorrect, in that Hofstadter's books and computer programs themselves are about the same thing.

@lilmikesf@c.im
2025-10-12 17:34:11

Attempted #Drumpf & #RFKjr purge of #CDC workers initially fails due to clerical "coding" error.
“The employees who received incorrect notifications were never separated from the agency and have all been notified that they…

More than half of CDC staffers recently fired by Trump administration have been reinstated
Hundreds of staff fired from the US Centers for Disease Control and Prevention late Friday have been reinstated, according to the American Federation of Government Employees.

@adlerweb@social.adlerweb.info
2025-10-08 21:20:27

Days since I booted a server with incorrect memory slot configuration: 0

Dusty screen

Middle right: System initializing memory
Bottom left: System Halted: No Memory could be configured

@Techmeme@techhub.social
2025-08-05 21:40:52

Wikipedia editors adopt a policy giving admins the authority to quickly delete AI-generated articles that meet certain criteria, like incorrect citations (Emanuel Maiberg/404 Media)
https://www.404media.co/wikipedia-editors-adopt-speedy-deletion-p…

Wikipedia Editors Adopt ‘Speedy Deletion’ Policy for AI Slop Articles
“The ability to quickly generate a lot of bogus content is problematic if we don't have a way to delete it just as quickly.”

@dennisfaucher@infosec.exchange
2025-10-09 13:41:12

So, since I found a bug in #logseq where pasting formatted notes from MS Teams causes logseq to use incorrect bold markdown syntax ([space]** at the end of a phrase instead of just **), I wrote this sed script to fix the logseq markdown files after I paste content in:
$ cat fix_logseg_bold_journals.sh
#!/bin/bash
cd /Users/faucherd/Documents//Logseq/journals
sed -i '…

@cowboys@darktundra.xyz
2025-10-06 00:54:09

NFL refs got Justin Fields’ ‘SkyCam’ throw in Week 5 vs Cowboys incorrect https://www.usatoday.com/story/sports/nfl/2025/10/05/justin-fields-skycam-pass-nfl-rulebook-jets-cowboys/86541696007/

The Jets should've been granted a do-over on 'SkyCam' pass from Justin Fields, according to the NFL rulebook.

@arXiv_csDB_bot@mastoxiv.page
2025-08-14 07:51:02

AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?
Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma
https://arxiv.org/abs/2508.09631

Large Language Models (LLMs) have recently demonstrated strong capabilities in translating natural language into database queries, especially when dealing with complex graph-structured data. However, real-world queries often contain inherent ambiguities, and the interconnected nature of graph structures can amplify these challenges, leading to unintended or incorrect query results. To systematically evaluate LLMs on this front, we propose a taxonomy of graph-query ambiguities, comprising three …

@jake4480@c.im
2025-08-07 13:42:38

Around 20 years ago, I made my first and only change to something on Wikipedia. It was for an underground rap artist- I just made a correction or two to something that was obviously incorrect, and my information was correct, thinking nothing of it. I looked at it the next day, and a Wikipedia editor or mod or whatever changed it back. That was the last time I ever edited anything there. I still look up things on Wikipedia (with a grain of salt) but that experience really bugged me. The Wikip…

@arXiv_csAI_bot@mastoxiv.page
2025-09-15 08:52:41

XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph
Hailong Yang, Mingxian Gu, Jianqi Wang, Guanjin Wang, Zhaohong Deng
https://arxiv.org/abs/2509.10054

The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi-Agent Systems (MAS) in supporting humans with complex, real-world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi-agent cooperative framework built on a multipolar task processing graph …

@arXiv_csHC_bot@mastoxiv.page
2025-08-12 10:09:13

Hide or Highlight: Understanding the Impact of Factuality Expression on User Trust
Hyo Jin Do, Werner Geyer
https://arxiv.org/abs/2508.07095 https://arxiv.…

Large language models are known to produce outputs that are plausible but factually incorrect. To prevent people from making erroneous decisions by blindly trusting AI, researchers have explored various ways of communicating factuality estimates in AI-generated outputs to end-users. However, little is known about whether revealing content estimated to be factually incorrect influences users' trust when compared to hiding it altogether. We tested four different ways of disclosing an AI-generated…

@arXiv_mathOC_bot@mastoxiv.page
2025-08-13 09:25:42

Byzantine-Resilient Decentralized Online Resource Allocation
Runhua Wang, Qing Ling, Hoi-To Wai, Zhi Tian
https://arxiv.org/abs/2508.08658 https://arxiv.or…

In this paper, we investigate the problem of decentralized online resource allocation in the presence of Byzantine attacks. In this problem setting, some agents may be compromised due to external manipulations or internal failures, causing them to behave maliciously and disrupt the resource allocation process by sending incorrect messages to their neighbors. Given the non-consensual nature of the resource allocation problem, we formulate it under a primal-dual optimization framework, where the …

@Mediagazer@mstdn.social
2025-08-06 08:01:26

Wikipedia editors adopt a policy giving admins the authority to quickly delete AI-generated articles that meet certain criteria, like incorrect citations (Emanuel Maiberg/404 Media)
https://www.404media.co/wikipedia-editors-adopt-speedy-deletion-p…

Wikipedia Editors Adopt ‘Speedy Deletion’ Policy for AI Slop Articles
“The ability to quickly generate a lot of bogus content is problematic if we don't have a way to delete it just as quickly.”

@arXiv_csCR_bot@mastoxiv.page
2025-10-15 08:48:52

Robust ML-based Detection of Conventional, LLM-Generated, and Adversarial Phishing Emails Using Advanced Text Preprocessing
Deeksha Hareesha Kulal, Chidozie Princewill Arannonu, Afsah Anwar, Nidhi Rastogi, Quamar Niyaz
https://arxiv.org/abs/2510.11915

Phishing remains a critical cybersecurity threat, especially with the advent of large language models (LLMs) capable of generating highly convincing malicious content. Unlike earlier phishing attempts which are identifiable by grammatical errors, misspellings, incorrect phrasing, and inconsistent formatting, LLM generated emails are grammatically sound, contextually relevant, and linguistically natural. These advancements make phishing emails increasingly difficult to distinguish from legitimat…

@arXiv_csCY_bot@mastoxiv.page
2025-09-30 10:20:11

Opinions can be Incorrect! In our Opinion. On the accuracy principle in data protection law
Dara Hallinan, Frederik Zuiderveen Borgesius
https://arxiv.org/abs/2509.23848 https:/…

The GDPR contains an accuracy principle, as most data privacy laws in the world do. In principle, data controllers must ensure that personal data they use are accurate. Some have argued that the accuracy principle does not apply to personal data in the form of opinions about data subjects. We argue, however, from a positive law perspective, that the accuracy principle does apply to opinions. We further argue, from a normative perspective, that the accuracy principle should apply to opinions.

@nelson@tech.lgbt
2025-09-03 15:15:15

Gemini also asserts my oldest emails are from April 2003 but produces incorrect info when asked for details. Gmail didn't even exist until April 2004 and regular search finds nothing before then. (It does find a lot of Jira spam starting April 8 2004, some things never change.)

@arXiv_statME_bot@mastoxiv.page
2025-08-12 10:01:23

Modelling phenology using ordered categorical generalized additive models
David L Miller
https://arxiv.org/abs/2508.07789 https://arxiv.org/pdf/2508.07789

One form of data collected in ecology is phenological, describing the timing of life stages. It can be tempting to analyze such data using a continuous distribution or to model individual transitions via probit/logit models. Such simplifications can lead to incorrect inference in various ways, all of which stem from ignoring the natural structure of the data. This paper presents a flexible approach to modelling ordered categorical data using the popular R package `mgcv`. An example analysis of …

@arXiv_csLG_bot@mastoxiv.page
2025-08-25 10:01:50

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso
https://arxiv.org/abs/2508.16560 https://

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to single concepts. A core SAE training hyperparameter is L0: how many features should fire per token on average. Existing work compares SAE algorithms using sparsity--reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value. In this work we study the effect of L0 on BatchTopK SAEs, and show that if L0 is not set precisely, the SAE fails to learn the underlying featu…

@robpike@hachyderm.io
2025-09-26 22:15:31

Someone should invent a way for computers to count. At the moment both GMail and GitHub have incorrect message counts in my inbox. Again.
This has happened many times. Given that computers make it possible for me to order toilet paper for delivery by 2pm and then send me hundreds of messages about toilet paper by 5pm, it seems odd to me that they can't count.
But hey, I guess it's hard to count how many things are in a list, especially when the list is empty.

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 09:14:28

OBsmith: Testing JavaScript Obfuscator using LLM-powered sketching
Shan Jiang, Chenguang Zhu, Sarfraz Khurshid
https://arxiv.org/abs/2510.10066 https://arx…

JavaScript obfuscators are widely deployed to protect intellectual property and resist reverse engineering, yet their correctness has been largely overlooked compared to performance and resilience. Existing evaluations typically measure resistance to deobfuscation, leaving the critical question of whether obfuscators preserve program semantics unanswered. Incorrect transformations can silently alter functionality, compromise reliability, and erode security, undermining the very purpose of obfusc…

@arXiv_statML_bot@mastoxiv.page
2025-10-08 09:01:19

Domain-Shift-Aware Conformal Prediction for Large Language Models
Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz
https://arxiv.org/abs/2510.05566 http…

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conf…

@arXiv_csCR_bot@mastoxiv.page
2025-10-15 09:59:51

DeepTrust: Multi-Step Classification through Dissimilar Adversarial Representations for Robust Android Malware Detection
Daniel Pulido-Cortázar, Daniel Gibert, Felip Manyà
https://arxiv.org/abs/2510.12310

Over the last decade, machine learning has been extensively applied to identify malicious Android applications. However, such approaches remain vulnerable against adversarial examples, i.e., examples that are subtly manipulated to fool a machine learning model into making incorrect predictions. This research presents DeepTrust, a novel metaheuristic that arranges flexible classifiers, like deep neural networks, into an ordered sequence where the final decision is made by a single internal model…

@arXiv_csCV_bot@mastoxiv.page
2025-09-12 10:14:19

InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
https://arxiv.org/abs/2509.09555

While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. Fir…

@arXiv_csAI_bot@mastoxiv.page
2025-08-14 07:38:52

MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
https://arxiv.org/abs/2508.09670…

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as …

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:27:41

Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models
Shihao Ji, Zihui Song, Jiajie Huang
https://arxiv.org/abs/2510.12137

Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer's Softmax function, which creates "Artificial Certainty" by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a "cre…

@gwire@mastodon.social
2025-10-03 15:42:18

> All of this is also happening since Google removed support for the ClaimReview standard— a data format that was designed to ensure that this kind of confusion did not happen.
https://fullfact.org/technology/google-search-is-vandalising-the-internet-h…

Google Search is vandalising the internet. Here’s how. – Full Fact
Google Search is misrepresenting our fact checks, showing incorrect information to millions.

@tiotasram@kolektiva.social
2025-08-04 15:49:00

Should we teach vibe coding? Here's why not.
Should AI coding be taught in undergrad CS education?
1/2
I teach undergraduate computer science labs, including for intro and more-advanced core courses. I don't publish (non-negligible) scholarly work in the area, but I've got years of craft expertise in course design, and I do follow the academic literature to some degree. In other words, I'm not the world's leading expert, but I have spent a lot of time thinking about course design, and consider myself competent at it, with plenty of direct experience in what knowledge & skills I can expect from students as they move through the curriculum.
I'm also strongly against most uses of what's called "AI" these days (specifically, generative deep neural networks as supplied by our current cadre of techbros). There are a surprising number of completely orthogonal reasons to oppose the use of these systems, and a very limited number of reasonable exceptions (overcoming accessibility barriers is an example). On the grounds of environmental and digital-commons-pollution costs alone, using specifically the largest/newest models is unethical in most cases.
But as any good teacher should, I constantly question these evaluations, because I worry about the impact on my students should I eschew teaching relevant tech for bad reasons (and even for good reasons). I also want to make my reasoning clear to students, who should absolutely question me on this. That inspired me to ask a simple question: ignoring for one moment the ethical objections (which we shouldn't, of course; they're very stark), at what level in the CS major could I expect to teach a course about programming with AI assistance, and expect students to succeed at a more technically demanding final project than a course at the same level where students were banned from using AI? In other words, at what level would I expect students to actually benefit from AI coding "assistance?"
To be clear, I'm assuming that students aren't using AI in other aspects of coursework: the topic of using AI to "help you study" is a separate one (TL;DR its gross value is not negative, but it's mostly not worth the harm to your metacognitive abilities, which AI-induced changes to the digital commons are making more important than ever).
So what's my answer to this question?
If I'm being incredibly optimistic, senior year. Slightly less optimistic, second year of a masters program. Realistic? Maybe never.
The interesting bit for you-the-reader is: why is this my answer? (Especially given that students would probably self-report significant gains at lower levels.) To start with, [this paper where experienced developers thought that AI assistance sped up their work on real tasks when in fact it slowed it down] (https://arxiv.org/abs/2507.09089) is informative. There are a lot of differences in task between experienced devs solving real bugs and students working on a class project, but it's important to understand that we shouldn't have a baseline expectation that AI coding "assistants" will speed things up in the best of circumstances, and we shouldn't trust self-reports of productivity (or the AI hype machine in general).
Now we might imagine that coding assistants will be better at helping with a student project than at helping with fixing bugs in open-source software, since it's a much easier task. For many programming assignments that have a fixed answer, we know that many AI assistants can just spit out a solution based on prompting them with the problem description (there's another elephant in the room here to do with learning outcomes regardless of project success, but we'll ignore this one too; my focus here is on project complexity reach, not learning outcomes). My question is about more open-ended projects, not assignments with an expected answer. Here's a second study (by one of my colleagues) about novices using AI assistance for programming tasks. It showcases how difficult it is to use AI tools well, and some of the stumbling blocks that novices in particular face.
But what about intermediate students? Might there be some level where the AI is helpful because the task is still relatively simple and the students are good enough to handle it? The problem with this is that as task complexity increases, so does the likelihood of the AI generating (or copying) code that uses more complex constructs which a student doesn't understand. Let's say I have second year students writing interactive websites with JavaScript. Without a lot of care that those students don't know how to deploy, the AI is likely to suggest code that depends on several different frameworks, from React to jQuery, without actually setting up or including those frameworks, and of course these students would be way out of their depth trying to do that. This is a general problem: each programming class carefully limits the specific code frameworks and constructs it expects students to know based on the material it covers. There is no feasible way to limit an AI assistant to a fixed set of constructs or frameworks, using current designs. There are alternate designs where this would be possible (like AI search through adaptation from a controlled library of snippets) but those would be entirely different tools.
So what happens on a sizeable class project where the AI has dropped in buggy code, especially if it uses code constructs the students don't understand? Best case, they understand that they don't understand and re-prompt, or quickly ask for help from an instructor or TA, who helps them get rid of the stuff they don't understand and re-prompt or manually add stuff they do. Average case: they waste several hours and/or sweep the bugs partly under the rug, resulting in a project with significant defects. Students in their second and even third years of a CS major still have a lot to learn about debugging, and usually have significant gaps in their knowledge of even their most comfortable programming language. I do think regardless of AI we as teachers need to get better at teaching debugging skills, but the knowledge gaps are inevitable because there's just too much to know. In Python, for example, the LLM is going to spit out yields, async functions, try/finally, maybe even something like a while/else, or with recent training data, the walrus operator. I can't expect even a fraction of 3rd year students who have worked with Python since their first year to know about all these things, and based on how students approach projects where they have studied all the relevant constructs but have forgotten some, I'm not optimistic that seeing these things will magically become learning opportunities. Student projects are better off working with a limited subset of full programming languages that the students have actually learned, and using AI coding assistants as currently designed makes this impossible. Beyond that, even when the "assistant" just introduces bugs using syntax the students understand, even through their 4th year many students struggle to understand the operation of moderately complex code they've written themselves, let alone code written by someone else. Having access to an AI that will confidently offer incorrect explanations for bugs will make this worse.
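For readers who haven't met the Python constructs named above, here is a minimal sketch that puts them side by side. It is illustrative only: the function names (read_chunks, first_negative, double_later) are hypothetical and do not come from the original post or from any actual AI output.

```python
import asyncio


def read_chunks(values, size):
    """Generator function: `yield` hands back one chunk at a time."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def first_negative(values):
    """Combines the walrus operator, while/else, and try/finally."""
    items = iter(values)
    log = []
    try:
        # Walrus operator: assign `item` and test it in one expression.
        while (item := next(items, None)) is not None:
            if item < 0:
                found = item
                break
        else:
            # while/else: this branch runs only if the loop ended without `break`.
            found = None
    finally:
        # try/finally: this runs whether or not the body raised an exception.
        log.append("scan finished")
    return found, log


async def double_later(x):
    """async function: calling it returns a coroutine that must be awaited."""
    await asyncio.sleep(0)
    return 2 * x


chunks = list(read_chunks([1, 2, 3, 4, 5], 2))
negative, log = first_negative([3, 1, -4, 1])
no_negative, _ = first_negative([3, 1])
doubled = asyncio.run(double_later(21))
```

Each of these is ordinary, valid Python, which is the point: a student who has only covered for loops, plain functions, and try/except would need to learn several new control-flow rules before they could debug code like this.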
To be sure, a small minority of students will be able to overcome these problems, but that minority is the group that has a good grasp of the fundamentals and has broadened their knowledge through self-study, which earlier AI-reliant classes would make less likely to happen. In any case, I care about the average student, since we already have plenty of stuff about our institutions that makes life easier for a favored few while being worse for the average student (note that our construction of that favored few as the "good" students is a large part of this problem).
To summarize: because AI assistants introduce excess code complexity and difficult-to-debug bugs, they'll slow down rather than speed up project progress for the average student on moderately complex projects. On a fixed deadline, they'll result in worse projects, or necessitate less ambitious project scoping to ensure adequate completion, and I expect this remains broadly true through 4-6 years of study in most programs (don't take this as an endorsement of AI "assistants" for masters students; we've ignored a lot of other problems along the way).
There's a related problem: solving open-ended project assignments well ultimately depends on deeply understanding the problem, and AI "assistants" allow students to put a lot of code in their file without spending much time thinking about the problem or building an understanding of it. This is awful for learning outcomes, but also bad for project success. Getting students to see the value of thinking deeply about a problem is a thorny pedagogical puzzle at the best of times, and allowing the use of AI "assistants" makes the problem much much worse. This is another area I hope to see (or even drive) pedagogical improvement in, for what it's worth.
1/2

@arXiv_csDS_bot@mastoxiv.page
2025-08-08 08:25:02

A Refutation of Elmasry's $\tilde{O}(m \sqrt{n})$-Time Algorithm for Single-Source Shortest Paths
Sunny Atalig, Marek Chrobak
https://arxiv.org/abs/2508.04872 https://

A Refutation of Elmasry's $\tilde{O}(m \sqrt{n})$-Time Algorithm for Single-Source Shortest Paths
In this note we examine the recent paper "Breaking the Bellman-Ford Shortest-Path Bound" by Amr Elmasry, where he presents an algorithm for the single-source shortest path problem and claims that its running time complexity is $\tilde{O}(m\sqrt{n})$, where $n$ is the number of vertices and $m$ is the number of edges. We show that his analysis is incorrect, by providing an example of a weighted graph on which the running time of his algorithm is $Ω(mn)$.

@gadgetboy@gadgetboy.social
2025-08-28 14:48:37

Sigh...
#ai #claude

A conversation with Claude:

ME: This is completely incorrect. You created your own document or retrieved one from your memory. Please try again. If you're unable to extract the text please just indicate that and do not proceed any further.

Claude: You're absolutely right, and I apologize. I cannot accurately extract the text from this image - the text is too small and unclear for me to read reliably. Instead of admitting this limitation, I incorrectly generated content that wasn't actually fr…

@ubuntourist@mastodon.social
2025-09-25 18:43:47

“The Trump administration has mostly been all about rebuilding statues that have already been torn down, of Confederate generals and other racist tropes and figures from the past” he said. “I would argue that Trump is a racist figure from the past, so why would you tear that down?”
https://www.

US authorities remove Trump-Epstein statue from National Mall
Interior department says National Park Service removed statue of pair holding hands because of incorrect permit

@grumpybozo@toad.social
2025-08-23 16:32:24

I’m not a professional coder, I just write code when I need it written.
Even I know that the most dangerous point in the evolution of a program is the point when it runs without obvious errors. https://m.phase.org/@parsingphase/115072135180061094

Parsingphase (@parsingphase@m.phase.org)
"Ivan replied that code is either correct or incorrect; there’s no subjective determination of whether it’s high or low… if a coder writes code poorly, the program simply won’t run" If an engineer ever says that to you, run the fuck away. The fact it runs doesn't mean it's: - right - secure - maintainable - efficient or can handle anything except the single case you've tried. Every time you scratch the surface of a vibe coder, you'll find delusionally low standards. https://mastod…

@arXiv_csPL_bot@mastoxiv.page
2025-09-03 09:51:03

From Traces to Program Incorrectness: A Type-Theoretic Approach
Yongwei Yuan, Zhe Zhou, Julia Belyakova, Benjamin Delaware, Suresh Jagannathan
https://arxiv.org/abs/2509.02428 h…

From Traces to Program Incorrectness: A Type-Theoretic Approach
We present a type-theoretic framework for reasoning about incorrectness in functional programs that interact with effectful, opaque library APIs. Our approach centers on traces -- temporally-ordered sequences of library API invocations -- which naturally characterize both the preconditions of individual APIs and their composite behavior. We represent these traces using symbolic regular expressions (SREs), enabling formal specification of incorrect abstract data type (ADT) behaviors across funct…

@timfoster@mastodon.social
2025-09-28 09:16:08

Lol, I think this page is missing a big fucking elephant-in-the-room statement:
"Don't allow AI tools that make shit up and frequently make incorrect assertions run anything on any infrastructure, ever. In fact, just stop reading right now, because this was a stupid idea from the beginning."
https:…

Security Best Practices - Model Context Protocol

@andycarolan@social.lol
2025-07-22 15:21:31

I'm seeing some really awful, low effort "may be..." ALT text recently. Clearly generated by an automatic process rather than by a human.
Is bad* alt text worse than no alt text?
*completely incorrect, and misleading
#Accessibility #a11y

@arXiv_csGT_bot@mastoxiv.page
2025-10-06 07:59:09

Deceptive Planning Exploiting Inattention Blindness
Mustafa O. Karabag, Jesse Milzman, Ufuk Topcu
https://arxiv.org/abs/2510.02714 https://arxiv.org/pdf/25…

Deceptive Planning Exploiting Inattention Blindness
We study decision-making with rational inattention in settings where agents have perception constraints. In such settings, inaccurate prior beliefs or models of others may lead to inattention blindness, where an agent is unaware of its incorrect beliefs. We model this phenomenon in two-player zero-sum stochastic games, where Player 1 has perception constraints and Player 2 deceptively deviates from its security policy presumed by Player 1 to gain an advantage. We formulate the perception constr…

@midtsveen@social.linux.pizza
2025-07-23 02:44:28

It is very funny when you get blocked for sharing a "Comparison of Android-based Operating Systems" that I didn't make, and if you think anything is factually incorrect with the comparison chart, you can contribute to it.
https://eylenburg.github.io/android_comparison.htm

Comparison of Android-based Operating Systems

@stargazer@woof.tech
2025-09-09 14:25:12

#WritersCoffeeClub
7. How much does your writing occupy your thoughts away from the keyboard?
8. What about the current writing milieu do you wish was different?
9. What incorrect assumptions might a reader make about you?
---
7. When I am actively writing, there's the "flow" mode and a "background task" mode.
In flow mode I keep thinkin…

@mgorny@social.treehouse.systems
2025-07-24 03:59:47

#Python world be like:
"Oh, hi, we wrote a new library implementing this spec."
"Hey, it looks like it doesn't conform to the spec, it doesn't pass the examples from it."
"Oh, you're right, we'll fix it ASAP."
…and that was over 3 years ago.
And yet projects keep adding a dependency on this library which has a single "pre-alpha" release 3.5 years ago and whose very first bug report points out it's incorrect.

@nemobis@mamot.fr
2025-08-22 15:15:40

I randomly bought this book in a quirky bookshop in Copenhagen for the sole reason that it said all the wrong things right on the cover.
(Sales: the single most important profession. NLP™: not natural language processing but neuro-linguistic programming. Meta: the Meta Model™ and Meta Publications™.)
I just started reading it and boy oh boy, I was not disappointed. It's outrageously hilarious.
"Persuasion engineering".

"For many years now, the single most important professionals in the world have been ignored by our educational institutions: Sales"

"While it may seem that some of the sentence structures in this book read as grammatically incorrect, they are written for a purpose"

«"Some of them really work hard. They can’t afford these cars. But every time one of them buys one, I smile because I know they are going to be the most motivated they can be just to keep up with the payments. I like my sales people to be a little hungry. There’s nothing better to keep them moving.” And so, he considers them to be self motivated. Anytime one of them starts to slack off a little, he asks them how the new car is.

What you do is you induce a wanton buying state and show them the …

Persuasion engineering by Richard Bandler | Open Library
Persuasion engineering by Richard Bandler, unknown edition,

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:44:51

Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, Myra Cheng
https://arxiv.org/abs/2510.12699

Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a ta…

@arXiv_quantph_bot@mastoxiv.page
2025-08-29 10:08:11

A predictive solution of the EPR paradox
Henryk Gzyl
https://arxiv.org/abs/2508.20788 https://arxiv.org/pdf/2508.20788

A predictive solution of the EPR paradox
In this work an incorrect argument in EPR's paper is corrected. A predictive approach to further confirm the validity of quantum theory is also proposed. The essence of the detail that EPR missed is that in a state of given total momentum (in their example the total momentum is zero), since the total momentum operator $\hat{\bp}=\hat{\bp}_1+\hat{\bp}_2$ does not commute with any of the position operators $\hat{\bx}_1$ and $\hat{\bx}_2,$ then in an eigenstate of the total momentum operator, the …

@arXiv_csAR_bot@mastoxiv.page
2025-08-05 07:32:59

Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing
Subhasish Mitra, Subho Banerjee, Martin Dixon, Rama Govindaraju, Peter Hochschild, Eric X. Liu, Bharath Parthasarathy, Parthasarathy Ranganathan
https://arxiv.org/abs/2508.01786

Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing
Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach to future directions in overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is cri…

@parltrack@eupolicy.social
2025-08-21 17:01:20

thanks again to the fine person who triggered this. without people noticing that some things are incorrect, we would not be able to cope with this.

@Techmeme@techhub.social
2025-07-24 15:06:10

The EU says it will investigate whether KKR provided incorrect or misleading information in its €22B acquisition of Telecom Italia's fixed-line network (Foo Yun Chee/Reuters)
https://www.reuters.com/legal/litigation/e

@arXiv_csSE_bot@mastoxiv.page
2025-09-12 09:20:09

On Integrating Large Language Models and Scenario-Based Programming for Improving Software Reliability
Ayelet Berzack, Guy Katz
https://arxiv.org/abs/2509.09194 https://

On Integrating Large Language Models and Scenario-Based Programming for Improving Software Reliability
Large Language Models (LLMs) are fast becoming indispensable tools for software developers, assisting or even partnering with them in crafting complex programs. The advantages are evident -- LLMs can significantly reduce development time, generate well-organized and comprehensible code, and occasionally suggest innovative ideas that developers might not conceive on their own. However, despite their strengths, LLMs will often introduce significant errors and present incorrect code with persuasiv…

@arXiv_csCL_bot@mastoxiv.page
2025-10-10 10:58:49

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
https://arxiv.org/abs/2510.08211

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we fin…

@grahamperrin@bsd.cafe
2025-09-30 18:11:01

Errata notice FreeBSD-EN-25:18.freebsd-update ― freebsd-update(8) installs libraries in incorrect order
https://security.FreeBSD.org/advisories/FreeBSD-EN-25:18.freebsd-update.asc
This update may be treated as essential for anyone who will use le…

@arXiv_heplat_bot@mastoxiv.page
2025-09-11 08:27:03

Thermodynamic Diagnostics for Complex Langevin Simulations: The Role of Configurational Temperature
Anosh Joseph, Arpith Kumar
https://arxiv.org/abs/2509.08287 https://

Thermodynamic Diagnostics for Complex Langevin Simulations: The Role of Configurational Temperature
The complex Langevin method (CLM) is a promising approach to tackle the sign problem in quantum field theories with complex actions. However, it can converge to incorrect results even when simulations appear stable, thus underscoring the need for robust diagnostics. Existing criteria, such as monitoring the drift distribution or the Langevin-time operator, are valuable, but they remain indirect. In this work, we propose a complementary reliability test based on the configurational temperature. …

@arXiv_hepph_bot@mastoxiv.page
2025-10-08 08:44:49

Comment on "Unruh effect for neutrinos interacting with accelerated matter"
R. R. S. Oliveira
https://arxiv.org/abs/2510.05403 https://arxiv.org/…

Comment on "Unruh effect for neutrinos interacting with accelerated matter"
In the present comment, we show that the fundamental equation worked by Dvornikov in his paper, which is the Dirac equation for a massive neutrino interacting with linearly accelerated matter, is incorrect. In particular, Dvornikov incorrectly wrote/defined the effective external current in a curved space-time. In other words, Dvornikov wrote/defined such an effective current in a flat space-time, which is a mistake. Consequently, the second-order differential equation (generated through the qu…

@arXiv_csAI_bot@mastoxiv.page
2025-09-10 10:01:11

Unleashing the True Potential of LLMs: A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin
https://arxiv.org/abs/2509.07676

Unleashing the True Potential of LLMs: A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
Large Language Models (LLMs) have achieved remarkable performance across diverse tasks, yet their susceptibility to generating incorrect content during inference remains a critical unsolved challenge. While self-correction methods offer potential solutions, their effectiveness is hindered by two inherent limitations: (1) the absence of reliable guidance signals for error localization, and (2) the restricted reasoning depth imposed by conventional next-token decoding paradigms. To address these …

@arXiv_csLG_bot@mastoxiv.page
2025-10-07 13:05:32

Power Transform Revisited: Numerically Stable, and Federated
Xuefeng Xu, Graham Cormode
https://arxiv.org/abs/2510.04995 https://arxiv.org/pdf/2510.04995…

Power Transform Revisited: Numerically Stable, and Federated
Power transforms are popular parametric techniques for making data more Gaussian-like, and are widely used as preprocessing steps in statistical analysis and machine learning. However, we find that direct implementations of power transforms suffer from severe numerical instabilities, which can lead to incorrect results or even crashes. In this paper, we provide a comprehensive analysis of the sources of these instabilities and propose effective remedies. We further extend power transforms to th…

@arXiv_csCV_bot@mastoxiv.page
2025-10-06 09:47:29

AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun
https://arxiv.org/abs/2510.02778

AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Understanding long-form videos remains a significant challenge for vision--language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redun…

@arXiv_csCR_bot@mastoxiv.page
2025-09-10 09:53:21

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents
Haitao Hu, Peng Chen, Yanpeng Zhao, Yuqi Chen
https://arxiv.org/abs/2509.07764 http…

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents
Large Language Models (LLMs) have been increasingly integrated into computer-use agents, which can autonomously operate tools on a user's computer to accomplish complex tasks. However, due to the inherently unstable and unpredictable nature of LLM outputs, they may issue unintended tool commands or incorrect inputs, leading to potentially harmful operations. Unlike traditional security risks stemming from insecure user prompts, tool execution results from LLM-driven decisions introduce new and …

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:30:00

Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
https://arxiv.org/abs/2510.09312 …

Verifying Chain-of-Thought Reasoning via Its Computational Graph
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier o…

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:44:39

MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Channdeth Sok, David Luz, Yacine Haddam
https://arxiv.org/abs/2509.09360 https://ar…

MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framewo…

@arXiv_csDB_bot@mastoxiv.page
2025-07-31 07:38:51

Scalability, Availability, Reproducibility and Extensibility in Islamic Database Systems
Umar Siddiqui, Habiba Youssef, Adel Sabour, Mohamed Ali
https://arxiv.org/abs/2507.22384

Scalability, Availability, Reproducibility and Extensibility in Islamic Database Systems
With the widespread of software systems and applications that serve the Islamic knowledge domain, several concerns arise. Authenticity and accuracy of the databases that back up these systems are questionable. With the excitement that some software developers and amateur researchers may have, false statements and incorrect claims may be made around numerical signs or miracles in the Quran. Reproducibility of these claims may not be addressed by the people making such claims. Moreover, with the …

@arXiv_csSE_bot@mastoxiv.page
2025-07-25 09:37:22

YATE: The Role of Test Repair in LLM-Based Unit Test Generation
Michael Konstantinou, Renzo Degiovanni, Jie M. Zhang, Mark Harman, Mike Papadakis
https://arxiv.org/abs/2507.18316

YATE: The Role of Test Repair in LLM-Based Unit Test Generation
Recent advances in automated test generation utilises language models to produce unit tests. While effective, language models tend to generate many incorrect tests with respect to both syntax and semantics. Although such incorrect tests can be easily detected and discarded, they constitute a "missed opportunity" -- if fixed, they are often valuable as they directly add testing value (they effectively target the underlying program logic to be tested) and indirectly form good seeds for generating…

@arXiv_csLG_bot@mastoxiv.page
2025-09-04 10:31:41

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal
https://arxiv.org/abs/2509.03403

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
Reinforcement learning with verifiable rewards (RLVR) has emerged to be a predominant paradigm for mathematical reasoning tasks, offering stable improvements in reasoning ability. However, Outcome Reward Models (ORMs) in RLVR are too coarse-grained to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers. This lack of granularity introduces noisy and misleading gradients significantly and hinders further progress in reasoning process quality. While Proc…

@arXiv_csAI_bot@mastoxiv.page
2025-10-06 07:30:59

Safe and Efficient In-Context Learning via Risk Control
Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick
https://arxiv.org/abs/2510.02480

Safe and Efficient In-Context Learning via Risk Control
Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations -- for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit …

@arXiv_hepph_bot@mastoxiv.page
2025-10-01 09:46:18

Magnetic Helicity, Magnetic Monopoles, and Higgs Winding
Hajime Fukuda, Yuta Hamada, Kohei Kamada, Kyohei Mukaida, Fumio Uchida
https://arxiv.org/abs/2509.25734 https://

Magnetic Helicity, Magnetic Monopoles, and Higgs Winding
Changes in magnetic helicity are often discussed across a variety of fields, from condensed matter physics to early universe cosmology. It is frequently stated that the helicity change is given by the integral of the gauge field strength tensor and its dual over spacetime, $\int F \wedge F$. However, this is incorrect when magnetic monopoles once exist in the spacetime. In this paper, we show the correct formula of the helicity change in such a case for the Maxwell theory with the magnetic …

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:31:01

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das
https://arxiv.org/abs/2509.07968

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongsi…

@arXiv_csLG_bot@mastoxiv.page
2025-10-02 11:06:31

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama
https://arxiv.org/abs/2510.00915

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces \textit{false negatives} (rejecting correct answers, FNs) and \textit{false positives} (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong …

@arXiv_csSE_bot@mastoxiv.page
2025-09-01 09:00:03

Enhancing Semantic Understanding in Pointer Analysis using Large Language Models
Baijun Cheng, Kailong Wang, Ling Shi, Haoyu Wang, Yao Guo, Ding Li, Xiangqun Chen
https://arxiv.org/abs/2508.21454

Enhancing Semantic Understanding in Pointer Analysis using Large Language Models
Pointer analysis has been studied for over four decades. However, existing frameworks continue to suffer from the propagation of incorrect facts. A major limitation stems from their insufficient semantic understanding of code, resulting in overly conservative treatment of user-defined functions. Recent advances in large language models (LLMs) present new opportunities to bridge this gap. In this paper, we propose LMPA (LLM-enhanced Pointer Analysis), a vision that integrates LLMs into pointer a…

@arXiv_csCV_bot@mastoxiv.page
2025-07-25 10:21:02

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim
https://arxiv.org/abs/2507.18616

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in…

@arXiv_csCR_bot@mastoxiv.page
2025-09-30 07:35:10

GPS Spoofing Attacks and Pilot Responses Using a Flight Simulator Environment
Mathilde Durieux, Kayla D. Taylor, Laxima Niure Kandel, Deepti Gupta
https://arxiv.org/abs/2509.22662

GPS Spoofing Attacks and Pilot Responses Using a Flight Simulator Environment
Global Positioning System (GPS) spoofing involves transmitting fake signals that mimic those from GPS satellites, causing the GPS receivers to calculate incorrect Positioning, Navigation, and Timing (PNT) information. Recently, there has been a surge in GPS spoofing attacks targeting aircraft. Since GPS satellite signals are weak, the spoofed high-power signal can easily overpower them. These spoofed signals are often interpreted as valid by the GPS receiver, which can cause severe and cascadin…

@arXiv_csCL_bot@mastoxiv.page
2025-09-08 10:10:10

Why Language Models Hallucinate
Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang
https://arxiv.org/abs/2509.04664 https://arxiv.org/pdf/2509…

Why Language Models Hallucinate
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations…

@arXiv_csAI_bot@mastoxiv.page
2025-07-31 08:32:41

Beyond Accuracy: How AI Metacognitive Sensitivity improves AI-assisted Decision Making
ZhaoBin Li, Mark Steyvers
https://arxiv.org/abs/2507.22365 https://a…

Beyond Accuracy: How AI Metacognitive Sensitivity improves AI-assisted Decision Making
In settings where human decision-making relies on AI input, both the predictive accuracy of the AI system and the reliability of its confidence estimates influence decision quality. We highlight the role of AI metacognitive sensitivity -- its ability to assign confidence scores that accurately distinguish correct from incorrect predictions -- and introduce a theoretical framework for assessing the joint impact of AI's predictive accuracy and metacognitive sensitivity in hybrid decision-making s…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:14:02

The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models
Amir Hameed Mir
https://arxiv.org/abs/2510.04933 https://

The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models
Large Language Models (LLMs) often produce fluent yet factually incorrect statements, a phenomenon known as hallucination, posing serious risks in high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric framework for hallucination detection that analyzes the evolution of hidden-state semantics across transformer layers. Unlike prior methods that rely on multiple sampling passes or external verification sources, LSD operates intrinsically within the model's representational…

@arXiv_csSE_bot@mastoxiv.page
2025-10-01 09:04:18

APRIL: API Synthesis with Automatic Prompt Optimization and Reinforcement Learning
Hua Zhong, Shan Jiang, Sarfraz Khurshid
https://arxiv.org/abs/2509.25196 https://

APRIL: API Synthesis with Automatic Prompt Optimization and Reinforcement Learning
APIs are central to modern software development, yet composing new APIs from large libraries is difficult due to the exponential search space; traditional component-based synthesis relies on costly exploration and hand-crafted specifications. While large language models (LLMs) can generate implementations from natural language, hallucinations and limited access to up-to-date contextual information often yield incorrect code. In this paper, we present APRIL, an approach that combines LLM-based s…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:54:17

Crosslisted article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[2/2]:
- Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso

@arXiv_csAI_bot@mastoxiv.page
2025-08-28 07:43:51

Caught in the Act: a mechanistic approach to detecting deception
Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval
https://arxiv.org/abs/2508.19505 https://

Caught in the Act: a mechanistic approach to detecting deception
Sophisticated instrumentation for AI systems might have indicators that signal misalignment from human values, not unlike a "check engine" light in cars. One such indicator of misalignment is deceptiveness in generated responses. Future AI instrumentation may have the ability to detect when an LLM generates deceptive responses while reasoning about seemingly plausible but incorrect answers to factual questions. In this work, we demonstrate that linear probes on LLMs' internal activations can det…

@arXiv_csLG_bot@mastoxiv.page
2025-09-23 12:51:20

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM
Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
https://arxiv.org/abs/2509.18058

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict…

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 07:56:19

Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano
https://arxiv.org/abs/2510.02324

Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that conne…

@arXiv_csAI_bot@mastoxiv.page
2025-09-23 12:06:20

Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Hy Dang, Tianyi Liu, Zhuofeng Wu, Jingfeng Yang, Haoming Jiang, Tao Yang, Pei Chen, Zhengyang Wang, Helen Wang, Huasheng Li, Bing Yin, Meng Jiang
https://arxiv.org/abs/2509.18076

Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool-interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form…

@arXiv_csCL_bot@mastoxiv.page
2025-10-02 10:31:41

ThinkBrake: Mitigating Overthinking in Tool Reasoning
Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo
https://arxiv.org/abs/2510.00546 https://…

ThinkBrake: Mitigating Overthinking in Tool Reasoning
Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant re…

@arXiv_csCL_bot@mastoxiv.page
2025-07-31 09:54:01

Investigating Hallucination in Conversations for Low Resource Languages
Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha
https://arxiv.org/abs/2507.22720

Investigating Hallucination in Conversations for Low Resource Languages
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandari…

@arXiv_csCL_bot@mastoxiv.page
2025-07-29 11:47:51

FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Likun Tan, Kuan-Wei Huang, Kevin Wu
https://arxiv.org/abs/2507.20930 https://

FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune fou…

@arXiv_csCL_bot@mastoxiv.page
2025-07-29 07:43:51

Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li
https://arxiv.org/abs/2507.19586

Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been w…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:16:23

ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
https://arxiv.org/abs/2508.18847 https://

ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Exist…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:55:11

Training-free Truthfulness Detection via Value Vectors in LLMs
Runheng Liu, Heyan Huang, Xingchen Xiao, Zhijing Wu
https://arxiv.org/abs/2509.17932 https://

Training-free Truthfulness Detection via Value Vectors in LLMs
Large language models often generate factually incorrect outputs, motivating efforts to detect the truthfulness of their content. Most existing approaches rely on training probes over internal activations, but these methods suffer from scalability and generalization issues. A recent training-free method, NoVo, addresses this challenge by exploiting statistical patterns from the model itself. However, it focuses exclusively on attention mechanisms, potentially overlooking the MLP module, a core c…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:42:10

Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications
Selva Taş, Mahmut El Huseyni, Özay Ezerceli, Reyhan Bayraktar, Fatma Betül Terzioğlu
https://arxiv.org/abs/2509.17671

Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications
The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detect…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:58:51

ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning
Jan-Felix Klein, Lars Ohnemus
https://arxiv.org/abs/2509.18063 https://

ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning
Large Language Models (LLMs) show strong reasoning abilities but rely on internalized knowledge that is often insufficient, outdated, or incorrect when trying to answer a question that requires specific domain knowledge. Knowledge Graphs (KGs) provide structured external knowledge, yet their complexity and multi-hop reasoning requirements make integration challenging. We present ARK-V1, a simple KG-agent that iteratively explores graphs to answer natural language queries. We evaluate several no…

@arXiv_csCL_bot@mastoxiv.page
2025-08-22 09:55:21

Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
https://arxiv.org/abs/2508.15253 https://…

Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context asses…
