
2025-06-18 08:05:39
A very nice article about LLM architecture; a bit too complicated for me, but probably not for others.
https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
New study on the effects of LLM use (in this case on essay writing):
https://arxiv.org/abs/2506.08872
Quote:
"LLM users also struggled to accurately quote their own work. While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four month…
Scientists have found: anyone who uses ChatGPT or other bullshit generators turns stupid within a short time.
#LLM
LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data - Slashdot
https://slashdot.org/story/25/08/17/0331217/llm-found-transmitting-behavioral-traits-to-student-llm-via-hidden-signals-in-data
[Thread] An OpenAI researcher says the company's latest experimental reasoning LLM achieved gold medal-level performance on the 2025 International Math Olympiad (Alexander Wei/@alexwei_)
https://x.com/alexwei_/status/1946477742855532918
Detecting LLM-generated Code with Subtle Modification by Adversarial Training
Xin Yin, Xinrui Li, Chao Ni, Xiaodan Xu, Xiaohu Yang
https://arxiv.org/abs/2507.13123
Watermarking LLM-Generated Datasets in Downstream Tasks
Yugeng Liu, Tianshuo Cong, Michael Backes, Zheng Li, Yang Zhang
https://arxiv.org/abs/2506.13494 ht…
ProfiLLM: An LLM-Based Framework for Implicit Profiling of Chatbot Users
Shahaf David, Yair Meidan, Ido Hersko, Daniel Varnovitzky, Dudu Mimran, Yuval Elovici, Asaf Shabtai
https://arxiv.org/abs/2506.13980
SO close to 10x programmer! #agiAnytimeSoon https://infosec.exchange/@adamshostack/115050433640095929
📝🗃️ 𝗿𝗱𝗼𝗰𝗱𝘂𝗺𝗽: Dump ‘R’ Package Source, Documentation, and Vignettes into One File for use in LLMs #rstats #LLM is on CRAN https://www.ekotov.pro/rdocdum…
An LLM Agent-Based Complex Semantic Table Annotation Approach
Yilin Geng, Shujing Wang, Chuan Wang, Keqing He, Yanfei Lv, Ying Wang, Zaiwen Feng, Xiaoying Bai
https://arxiv.org/abs/2508.12868
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
Aaron Councilman, David Fu, Aryan Gupta, Chengxiao Wang, David Grove, Yu-Xiong Wang, Vikram Adve
https://arxiv.org/abs/2507.13290
Software Engineer Will Larson unpacks a lot in this July 2025 post. Key takeaway use cases of agentic AI include:
1. Using an LLM to evaluate a context window and get a result.
2. Using an LLM to suggest tools relevant to the context window, then enrich it with the tool’s response.
3. Managing flow control for tool usage.
4. Doing anything software can do to build better context windows to pass on to LLMs.
"What can agents actually do?"
"Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task"
https://doi.org/10.48550/arXiv.2506.08872
"[…] While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four mont…
To add a single example here (feel free to chime in with your own):
Problem: editing code is sometimes tedious because external APIs require boilerplate.
Solutions:
- Use LLM-generated code. Downsides: energy use, code theft, potential for legal liability, makes mistakes, etc. Upsides: popular among some peers, seems easy to use.
- Pick a better library (not always possible).
- Build internal functions to centralize boilerplate code, then use those (benefits: you get a better understanding of the external API, and a more-unit-testable internal code surface; probably less amortized effort).
- Develop a non-LLM system that actually reasons about code at something like the formal semantics level and suggests boilerplate fill-ins based on rules, while foregrounding which rules it's applying so you can see the logic behind the suggestions (needs research).
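To make the third option concrete, here is a minimal sketch of centralizing boilerplate behind an internal function. `ExternalClient` and `api_call` are hypothetical names standing in for a real third-party API; the point is the shape, not the names.

```python
# Hypothetical stand-in for a third-party API that demands per-call
# boilerplate (setup, unwrapping, error mapping at every call site).
class ExternalClient:
    def request(self, endpoint: str, payload: dict) -> dict:
        return {"status": "ok", "data": {"endpoint": endpoint, **payload}}

def api_call(endpoint: str, **payload) -> dict:
    """Internal wrapper: one place for client setup, response
    unwrapping, and error handling. Call sites shrink to one line,
    and this function is easy to unit-test with a fake client."""
    client = ExternalClient()
    response = client.request(endpoint, payload)
    if response.get("status") != "ok":
        raise RuntimeError(f"{endpoint} failed: {response}")
    return response["data"]
```

A call site then reads `api_call("users/create", name="Ada")` instead of repeating the setup-call-unwrap dance, which is the "more-unit-testable internal code surface" the bullet describes.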
Obviously LLM use in coding goes beyond this single issue, but similar analyses apply to each potential use of LLMs in coding. In all cases there are:
1. Existing practical solutions that require more effort (or in many cases just seem to but are less-effort when amortized).
2. Near-term researchable solutions that directly address the problem and which would be much more desirable in the long term.
Thus, in addition to disastrous LLM effects on the climate, on data laborers, and on the digital commons, they tend to suck us into cheap-seeming but ultimately costly design practices while crowding out better long-term solutions. Next time someone suggests how useful LLMs are for some task, try asking yourself (or them) what an ideal solution for that task would look like, and whether LLM use moves us closer to or farther from a world in which that solution exists.
Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis
Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Xin Sun, Junxiao Wang
https://arxiv.org/abs/2508.11398
"LLM group's participants performed worse than their counterparts in the Brain-only group at all levels: neural, linguistic, scoring."
Brain scans confirmed significantly fewer neural connections for LLM users
Stop using LLMs if you value your brain
https://arxiv.org/pdf/2506.08872…
Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
Harbin Hong, Sebastian Caldas, Liu Leqi
https://arxiv.org/abs/2506.14997
LLM Agent for Hyper-Parameter Optimization
Wanzhe Wang, Jianqiu Peng, Menghao Hu, Weihuang Zhong, Tong Zhang, Shuai Wang, Yixin Zhang, Mingjie Shao, Wanli Ni
https://arxiv.org/abs/2506.15167
ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu, TingTing Gao
https://arxiv.org/abs/2508.11356
I just saw an all-caps instruction file that someone uses to 'instruct' an LLM to help with coding, and it's just "don't hallucinate", "check your work", "don't say you did something when you didn't" with multiple exclamation marks.
So, basically the whole 'vibe coding,' or having "AI" "help" with coding just devolves into shouting at your computer.
Which reminded me of something, and then it hit me!
#ai #llm #vibecoding
https://www.youtube.com/watch?v=q8SWMAQYQf0
Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems
Zhongzhi Yu, Mingjie Liu, Michael Zimmer, Yingyan Celine Lin, Yong Liu, Haoxing Ren
https://arxiv.org/abs/2506.13905
osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning
Fujing Xie, Sören Schwertfeger, Hermann Blum
https://arxiv.org/abs/2507.12753
TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation
Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao
https://arxiv.org/abs/2508.11468 …
LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?
Muhammad Atta Ur Rahman, Melanie Schranz
https://arxiv.org/abs/2506.14496 https://
LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO
Patrick Sutanto, Jonathan Kenrick, Max Lorenz, Joan Santoso
https://arxiv.org/abs/2506.13785
Prompt-Induced Linguistic Fingerprints for LLM-Generated Fake News Detection
Chi Wang, Min Gao, Zongwei Wang, Junwei Yin, Kai Shu, Chenghua Lin
https://arxiv.org/abs/2508.12632 …
Explain First, Trust Later: LLM-Augmented Explanations for Graph-Based Crypto Anomaly Detection
Adriana Watson
https://arxiv.org/abs/2506.14933 https://
[Thread] A new US paper shows the best frontier LLM models achieve 0% on hard real-life Programming Contest problems, domains where expert humans still excel (Rohan Paul/@rohanpaul_ai)
https://x.com/rohanpaul_ai/status/1934751145400111572
@… @… That’s one of the reasons I don’t use an LLM!
I did try it once, for coding. It lied to me. So I didn’t use it again.
People keep saying “it’s like an intern”. If an intern repeatedly lies to your face, they are bad at the on…
LLMs and the Model Context Protocol (MCP) are the Yang to the Semantic Web Project's Yin.
We now have a solution to the final hurdle—visualization.
Years of Linked Data work now come alive. I explain this, with demonstrations, in a new newsletter post.
www.linkedin.com/pulse/semant...
#MCP
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
Madjid G. Tehrani, Eldar Sultanow, William J. Buchanan, Mahkame Houmani, Christel H. Djaha Fodja
https://arxiv.org/abs/2506.15212
This GitHub repository conveniently lists and categorizes prime examples of LLM-based agent applications. Each example application has its own repository folder with its source code (Python) and a helpful README.md describing its installation and use.
Categories include:
1. Starter AI Agents
2. Advanced AI Agents
3. Autonomous Game Playing Agents
4. Multi-Agent Teams
5. Voice AI Agents
6. RAG-Based Agents
"awesome-llm-apps"
Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric
Hao Chen, Cong Tian, Zixuan He, Bin Yu, Yepang Liu, Jialun Cao
https://arxiv.org/abs/2508.11269 …
Impact of a Deployed LLM Survey Creation Tool through the IS Success Model
Peng Jiang, Vinicius Cezar Monteiro de Lira, Antonio Maiorino
https://arxiv.org/abs/2506.14809
Tibber has been tweaking its customer-service LLM, and although the reply last time was totally useless and unusable, this time it actually helped a lot. It's just a little disconcerting when I write an email, immediately get a response, and think "ah, there's the acknowledgment of receipt with a ticket number"; no, that is the completely correct answer, and the case is thereby closed. Wild.
LLM-Based Config Synthesis requires Disambiguation
Rajdeep Mondal, Nikolaj Bjorner, Todd Millstein, Alan Tang, George Varghese
https://arxiv.org/abs/2507.12443
Things almost impossible to do without good LLM software (in one minute):
I hear a song on the radio. Google's music search gives me "Robbie Williams - Forbidden Road". But I know the lyrics are somewhat different, and I want to know what movie I have in mind.
Gemini says it is, in fact, a song similar to "I Got a Name"; then my brain clicks and connects it with Quentin Tarantino.
Bingo - it's Django.
Just published 🚀: When LLMs Remember Instead of Reason
#llm
@… This comment to the video seems on spot:
It seems McKinsey aren't aware that "agentic AI" is just an LLM that can utter some magic incantations that do stuff. It's like a difference between a chimpanzee with a typewriter vs a chimpanzee with a typewriter and a gun.
Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions
Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar
https://arxiv.org/abs/2506.13510
ADRD: LLM-Driven Autonomous Driving Based on Rule-based Decision Systems
Fanzhi Zeng, Siqi Wang, Chuzhao Zhu, Li Li
https://arxiv.org/abs/2506.14299 https:…
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
Yuhui Shi, Yehan Yang, Qiang Sheng, Hao Mi, Beizhe Hu, Chaoxi Xu, Juan Cao
https://arxiv.org/abs/2506.15683
Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset
Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li
https://arxiv.org/abs/2508.11958
RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
Yuchuan Fu, Xiaohan Yuan, Dongxia Wang
https://arxiv.org/abs/2506.15253
StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution
Xinkui Zhao, Yifan Zhang, Zhengyi Zhou, Yueshen Xu
https://arxiv.org/abs/2508.11665 https…
FACET: Teacher-Centred LLM-Based Multi-Agent Systems - Towards Personalized Educational Worksheets
Jana Gonnermann-Müller, Jennifer Haase, Konstantin Fackeldey, Sebastian Pokutta
https://arxiv.org/abs/2508.11401
Watching the frustratingly fruitless fights over the USEFULNESS of LLM-based coding helpers, I've come down to 3 points that explain why people seem to live in different realities:
Most programmers:
1) Write inconsequential remixes of trivial code that has been written many times before.
2) Lack the taste for good design & suck at code review in general (yours truly included).
3) Lack the judgement to differentiate between 1) & FOSS repos of nontrivial code, …
Bias is a Math Problem, AI Bias is a Technical Problem: 10-year Literature Review of AI/LLM Bias Research Reveals Narrow [Gender-Centric] Conceptions of 'Bias', and Academia-Industry Gap
Sourojit Ghosh, Kyra Wilson
https://arxiv.org/abs/2508.11067
Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks
Rui Bao, Nan Xue, Yaping Sun, Zhiyong Chen
https://arxiv.org/abs/2508.11291 https://
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning
William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
https://arxiv.org/abs/2506.14387
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts
Marc Brinner, Sina Zarriess
https://arxiv.org/abs/2507.13105
Uncovering Intention through LLM-Driven Code Snippet Description Generation
Yusuf Sulistyo Nugroho, Farah Danisha Salam, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto
https://arxiv.org/abs/2506.15453
deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses
Georgios Androutsopoulos, Antonio Bianchi
https://arxiv.org/abs/2506.15648
AIM-Bench: Evaluating Decision-making Biases of Agentic LLM as Inventory Manager
Xuhua Zhao, Yuxuan Xie, Caihua Chen, Yuxiang Sun
https://arxiv.org/abs/2508.11416 https://
BIPOLAR: Polarization-based granular framework for LLM bias evaluation
Martin Pavlíček, Tomáš Filip, Petr Sosík
https://arxiv.org/abs/2508.11061 https:…
CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems
Xuran Liu, Nan Xue, Rui Bao, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Shuguang Cui
https://arxiv.org/abs/2508.11287
From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem
Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu
https://arxiv.org/abs/2506.15170
Unified Software Engineering agent as AI Software Engineer
Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, Abhik Roychoudhury
https://arxiv.org/abs/2506.14683 …
NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting
Kuangshi Ai, Kaiyuan Tang, Chaoli Wang
https://arxiv.org/abs/2507.12621
Malicious LLM-Based Conversational AI Makes Users Reveal Personal Information
Xiao Zhan, Juan Carlos Carrillo, William Seymour, Jose Such
https://arxiv.org/abs/2506.11680
An LLM ASP Workflow for Joint Entity-Relation Extraction
Trang Tran, Trung Hoang Le, Huiping Cao, Tran Cao Son
https://arxiv.org/abs/2508.12611 https://a…
LLM Compression: How Far Can We Go in Balancing Size and Performance?
Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar
https://arxiv.org/abs/2508.11318
The Foundation Cracks: A Comprehensive Study on Bugs and Testing Practices in LLM Libraries
Weipeng Jiang, Xiaoyu Zhang, Xiaofei Xie, Jiongchi Yu, Yuhan Zhi, Shiqing Ma, Chao Shen
https://arxiv.org/abs/2506.12320
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Wenhao Li, Selvakumar Manickam, Yung-wey Chong, Shankar Karuppayah
https://arxiv.org/abs/2506.15656 …
AviationLLM: An LLM-based Knowledge System for Aviation Training
Jia'ang Wan, Feng Shen, Fujuan Li, Yanjin Sun, Yan Li, Shiwen Zhang
https://arxiv.org/abs/2506.14336
Reference Points in LLM Sentiment Analysis: The Role of Structured Context
Junichiro Niimi
https://arxiv.org/abs/2508.11454 https://arxiv.org/pdf/2508.1145…
SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System
Truong Thanh Hung Nguyen, Tran Diem Quynh Nguyen, Hoang Loc Cao, Thi Cam Thanh Tran, Thi Cam Mai Truong, Hung Cao
https://arxiv.org/abs/2508.11873
LLM-Powered Quantum Code Transpilation
Nazanin Siavash, Armin Moin
https://arxiv.org/abs/2507.12480 https://arxiv.org/pdf/2507.12480
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
https://arxiv.org/abs/2508.13152
Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
Daewon Kang, YeongHwan Shin, Doyeon Kim, Kyu-Hwan Jung, Meong Hi Son
https://arxiv.org/abs/2506.14539
Multimodal "Puppeteer": An Exploration of Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality
Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic
https://arxiv.org/abs/2506.13189…
RUM: Rule LLM-Based Comprehensive Assessment on Testing Skills
Yue Wang, Zhenyu Chen, Yuan Zhao, Chunrong Fang, Ziyuan Wang, Song Huang
https://arxiv.org/abs/2508.12922 https://…
SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
https://arxiv.org/abs/2508.11290
ImpReSS: Implicit Recommender System for Support Conversations
Omri Haller, Yair Meidan, Dudu Mimran, Yuval Elovici, Asaf Shabtai
https://arxiv.org/abs/2506.14231
Hallucination in LLM-Based Code Generation: An Automotive Case Study
Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll
https://arxiv.org/abs/2508.11257
SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis
Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, Yujun Zhang
https://arxiv.org/abs/2508.11343
An LLM's Apology: Outsourcing Awkwardness in the Age of AI
Twm Stone, Anna Soligo
https://arxiv.org/abs/2506.13685 https://arxiv.…
LinkAnchor: An Autonomous LLM-Based Agent for Issue-to-Commit Link Recovery
Arshia Akhavan, Alireza Hosseinpour, Abbas Heydarnoori, Mehdi Keshani
https://arxiv.org/abs/2508.12232
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
Tyler Loakman, William Thorne, Chenghua Lin
https://arxiv.org/abs/2507.13335
Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models
Haonan Yin, Shai Vardi, Vidyanand Choudhary
https://arxiv.org/abs/2506.14092 h…
How Does LLM Reasoning Work for Code? A Survey and a Call to Action
Ira Ceka, Saurabh Pujar, Irene Manotas, Gail Kaiser, Baishakhi Ray, Shyam Ramji
https://arxiv.org/abs/2506.13932
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
https://arxiv.org/abs/2508.12726
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang
https://arxiv.org/abs/2508.11222
Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar
https://arxiv.org/abs/2508.13124
Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality
Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, Yu Takagi
https://arxiv.org/abs/2506.14681
Analyzing Information Sharing and Coordination in Multi-Agent Planning
Tianyue Ou, Saujas Vaduguru, Daniel Fried
https://arxiv.org/abs/2508.12981 https://a…
FIRESPARQL: A LLM-based Framework for SPARQL Query Generation over Scholarly Knowledge Graphs
Xueli Pan, Victor de Boer, Jacco van Ossenbruggen
https://arxiv.org/abs/2508.10467 …
The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
Avinash Baidya, Kamalika Das, Xiang Gao
https://arxiv.org/abs/2506.12266
Personalized LLM Decoding via Contrasting Personal Preference
Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim
https://arxiv.org/abs/2506.12109 https:…
Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Ana Davila, Jacinto Colan, Yasuhisa Hasegawa
https://arxiv.org/abs/2507.12370
Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Lukas Ellinger, Miriam Anschütz, Georg Groh
https://arxiv.org/abs/2507.11981