
2025-09-19 10:36:41
TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
Dan Zhang, Min Cai, Jonathan Li, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang
https://arxiv.org/abs/2509.15110
On small, local language models: 'In a world increasingly dominated by massive models and opaque APIs, we believe there’s still room for small, transparent, controllable systems. Models you can fine-tune, understand and run on your own terms' https://www.turing.ac.uk…
Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction
Xinhe Li, Jiajun Liu, Peng Wang
https://arxiv.org/abs/2508.13037
OpenAI and Apollo Research trained o3 and o4-mini versions to not engage in "scheming", or secretly pursuing undesirable goals, reducing "covert actions" ~30X (Radhika Rajkumar/ZDNET)
https://www.zdnet.com/article/ai-models-kn
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani
https://arxiv.org/abs/2509.15172
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie
https://arxiv.org/abs/2508.13073
Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
Gustavo Sandoval, Denys Fenchenko, Junyao Chen
https://arxiv.org/abs/2509.14271
Lepton models from non-holomorphic $A^{\prime}_{5}$ modular flavor symmetry
Cai-Chang Li, Gui-Jun Ding
https://arxiv.org/abs/2509.15183
Mind & Motion: Opportunities and Applications of Integrating Biomechanics and Cognitive Models in HCI
Arthur Fleig, Florian Fischer, Markus Klar, Patrick Ebel, Miroslav Bachinski, Per Ola Kristensson, Roderick Murray-Smith, Antti Oulasvirta
https://arxiv.org/abs/2508.13788
Calibration-Aware Prompt Learning for Medical Vision-Language Models
Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
https://arxiv.org/abs/2509.15226
I've already seen AI lying and omitting things... The only thing that surprises me is that we use it for anything at all, or rather, that we let ourselves be used by it for anything at all.
@… https://mas.to/@carnage4life/115228191
On Instantons in Gross-Neveu and Gross-Neveu-Yukawa models
A. Imaanpur, S. E. Sadati
https://arxiv.org/abs/2508.12080 https://arxiv.org/pdf/2508.12080
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
Tian Wu, Liming Wang, Zijian Wen, Xiaoxi Zhang, Jingpu Duan, Xianwei Zhang, Jinhang Zuo
https://arxiv.org/abs/2508.12851
Real-Time, Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols
Yiqun Lin, Haoran Sun, Yongqing Li, Rabia Aslam, Lung Fung Tse, Tiange Cheng, Chun Sing Chui, Wing Fung Yau, Victorine R. Le Meur, Meruyert Amangeldy, Kiho Cho, Yinyu Ye, James Zou, Wei Zhao, Xiaomeng Li
https://arxiv.org/abs/2508.13947
What Matters in LLM-Based Feature Extractor for Recommender? A Systematic Analysis of Prompts, Models, and Adaptation
Kainan Shi (Xi'an Jiaotong University), Peilin Zhou (Hong Kong University of Science and Technology), Ge Wang (Xi'an Jiaotong University), Han Ding (Xi'an Jiaotong University), Fei Wang (Xi'an Jiaotong University)
https://
HPD: Hybrid Projection Decomposition for Robust State Space Models on Analog CIM Hardware
Yuannuo Feng, Wenyong Zhou, Yuexi Lyu, Hanjie Liu, Zhengwu Liu, Ngai Wong, Wang Kang
https://arxiv.org/abs/2508.11935
Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset
Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, Shanping Li
https://arxiv.org/abs/2508.11958
Statistical Inference for Subgraph Frequencies of Exchangeable Hyperedge Models
Ayoushman Bhattacharya, Nilanjan Chakraborty, Robert Lunde
https://arxiv.org/abs/2508.13258
Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
Bing Han, Anbai Jiang, Xinhu Zheng, Wei-Qiang Zhang, Jia Liu, Pingyi Fan, Yanmin Qian
https://arxiv.org/abs/2508.12230
In case you don't want any Microsoft AI to be trained ("improved") on your personal data and content from LinkedIn, you should probably set the corresponding switch in your LinkedIn privacy settings to "Off". The default is "On".
#generativeAI #dataprivacy
From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou
https://arxiv.org/abs/2508.13491
Self-Improving Embodied Foundation Models
Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Igor Mordatch
https://arxiv.org/abs/2509.15155
Large Language Models in the Data Science Lifecycle: A Systematic Mapping Study
Sai Sanjna Chintakunta, Nathalia Nascimento, Everton Guimaraes
https://arxiv.org/abs/2508.11698
Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models
Daniel Klötzl, Ozan Tastekin, David Hägele, Marina Evers, Daniel Weiskopf
https://arxiv.org/abs/2508.13990
HARNESS: Lightweight Distilled Arabic Speech Foundation Models
Vrunda N. sukhadia, Shammur Absar Chowdhury
https://arxiv.org/abs/2509.14689
GLAMs and humanities folk deserve better AI models, number 5654 in a series: 'the subjectivity of a task or instance negatively affects the performance of a model – an observation that is particularly worrying for humanities research'.
From Kaspar Beelen's report, Small Language Models for libraries and computational humanities
AI instead of models: Otto saves on product photography
With its own AI tool, Otto aims to bring new collections to market faster and more cheaply. Otto is following a general trend in the industry.
https://www…
Sources: Apple is making all four iPhone 17 models in India ahead of debut, a first, as it expands to five factories to cut reliance on China for US-bound units (Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-19/…
Playing telephone with generative models: "verification disability," "compelled reliance," and accessibility in data visualization
Frank Elavsky, Cindy Xiong Bearfield
https://arxiv.org/abs/2508.12192
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
https://arxiv.org/abs/2508.13968
ChronoLLM: Customizing Language Models for Physics-Based Simulation Code Generation
Jingquan Wang, Andrew Negrut, Harry Zhang, Khailanii Slaton, Shu Wang, Radu Serban, Jinlong Wu, Dan Negrut
https://arxiv.org/abs/2508.13975
Degenerate kinks and kink-instantons in two-dimensional scalar field models with $\mathcal{N}=1$ and $\mathcal{N}=2$ supersymmetry
Evgenii Ievlev, Mikhail Shifman
https://arxiv.org/abs/2509.14324
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
https://arxiv.org/abs/2509.15194
Beyond Data Privacy: New Privacy Risks for Large Language Models
Yuntao Du, Zitao Li, Ninghui Li, Bolin Ding
https://arxiv.org/abs/2509.14278
Stringy Constraints on Modular Flavor Models
Keiya Ishiguro, Takafumi Kai, Tatsuo Kobayashi, Hajime Otsuka
https://arxiv.org/abs/2508.12392
FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
https://arxiv.org/abs/2507.12496
Identification and Estimation of Multi-order Tensor Factor Models
Zetai Cen
https://arxiv.org/abs/2508.13418 https://arxiv.org/pdf/2508.13418
CodeLSI: Leveraging Foundation Models for Automated Code Generation with Low-Rank Optimization and Domain-Specific Instruction Tuning
Huy Le, Phong Nguyen, Hao Do, Tuan Nguyen, Thien Pham, Anh Nguyen-Duc, Tho Quan
https://arxiv.org/abs/2509.14373
Compressed Models are NOT Trust-equivalent to Their Large Counterparts
Rohit Raj Rai, Chirag Kothari, Siddhesh Shelke, Amit Awekar
https://arxiv.org/abs/2508.13533
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Elena Izzo, Luca Parolari, Davide Vezzaro, Lamberto Ballan
https://arxiv.org/abs/2508.12919
Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory
Ming Li, Nan Zhang, Chenrui Fan, Hong Jiao, Yanbin Fu, Sydney Peters, Qingshu Xu, Robert Lissitz, Tianyi Zhou
https://arxiv.org/abs/2509.14662
Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models
Wenxuan Ye, Xueli An, Onur Ayan, Junfan Wang, Xueqiang Yan, Georg Carle
https://arxiv.org/abs/2508.13625
Sources: Anthropic refused federal law enforcement requests to use its AI models for some tasks, such as the surveillance of US citizens, irking the White House (Reed Albergotti/Semafor)
https://www.semafor.com/article/09/17/2025
Word Meanings in Transformer Language Models
Jumbly Grindrod, Peter Grindrod
https://arxiv.org/abs/2508.12863 https://arxiv.org/pdf/2508.12863
CARGO: A Framework for Confidence-Aware Routing of Large Language Models
Amine Barrak, Yosr Fourati, Michael Olchawa, Emna Ksontini, Khalil Zoghlami
https://arxiv.org/abs/2509.14899
Consiglieres in the Shadow: Understanding the Use of Uncensored Large Language Models in Cybercrimes
Zilong Lin, Zichuan Li, Xiaojing Liao, XiaoFeng Wang
https://arxiv.org/abs/2508.12622
CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine
https://arxiv.org/abs/2508.13446
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
https://arxiv.org/abs/2509.14571
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo
https://arxiv.org/abs/2509.15188
Transplant-Ready? Evaluating AI Lung Segmentation Models in Candidates with Severe Lung Disease
Jisoo Lee, Michael R. Harowicz, Yuwen Chen, Hanxue Gu, Isaac S. Alderete, Lin Li, Maciej A. Mazurowski, Matthew G. Hartwig
https://arxiv.org/abs/2509.15083
Rationality Check! Benchmarking the Rationality of Large Language Models
Zhilun Zhou, Jing Yi Wang, Nicholas Sukiennik, Chen Gao, Fengli Xu, Yong Li, James Evans
https://arxiv.org/abs/2509.14546
Strengthening Programming Comprehension in Large Language Models through Code Generation
Xiaoning Ren, Qiang Hu, Wei Ma, Yan Li, Yao Zhang, Lingxiao Jiang, Yinxing Xue
https://arxiv.org/abs/2508.12620
Watermarking and Anomaly Detection in Machine Learning Models for LORA RF Fingerprinting
Aarushi Mahajan, Wayne Burleson
https://arxiv.org/abs/2509.15170
Evaluating Large Language Models for Cross-Lingual Retrieval
Longfei Zuo, Pingjun Hong, Oliver Kraus, Barbara Plank, Robert Litschko
https://arxiv.org/abs/2509.14749
Designing Latent Safety Filters using Pre-Trained Vision Models
Ihab Tabbara, Yuxuan Yang, Ahmad Hamzeh, Maxwell Astafyev, Hussein Sibai
https://arxiv.org/abs/2509.14758
Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
Haobo Yang, Minghao Guo, Dequan Yang, Wenyu Wang
https://arxiv.org/abs/2509.15156
PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models
Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao
https://arxiv.org/abs/2508.13021
Efficient Conformal Prediction for Regression Models under Label Noise
Yahav Cohen, Jacob Goldberger, Tom Tirer
https://arxiv.org/abs/2509.15120
Improving Detection of Watermarked Language Models
Dara Bahri, John Wieting
https://arxiv.org/abs/2508.13131 https://arxiv.org/pdf/2508.13131
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models
Zhichen Lou, Kechun Xu, Zhongxiang Zhou, Rong Xiong
https://arxiv.org/abs/2508.11918
Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models
Siyuan Chen, Zhichao Lu, Qingfu Zhang
https://arxiv.org/abs/2509.14265
Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Yiming Cao, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao
https://arxiv.org/abs/2508.13739
Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models
Wei Song, Haonan Zhong, Ziqi Ding, Jingling Xue, Yuekang Li
https://arxiv.org/abs/2508.12566
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang, Junjie Hu, Ming Jiang
https://arxiv.org/abs/2509.14837
CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter
Junyeong Park, Hyeonseo Cho, Sungjin Ahn
https://arxiv.org/abs/2508.13530
Automating Modelica Module Generation Using Large Language Models: A Case Study on Building Control Description Language
Hanlong Wan, Xing Lu, Yan Chen, Karthik Devaprasad, Laura Hinkle
https://arxiv.org/abs/2509.14623
Driving Style Recognition Like an Expert Using Semantic Privileged Information from Large Language Models
Zhaokun Chen, Chaopeng Zhang, Xiaohan Li, Wenshuo Wang, Gentiane Venture, Junqiang Xi
https://arxiv.org/abs/2508.13881
Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models
Jianshu Zeng, Yuxuan Liu, Yutong Feng, Chenxuan Miao, Zixiang Gao, Jiwang Qu, Jianzhang Zhang, Bin Wang, Kun Yuan
https://arxiv.org/abs/2508.12945
MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
https://arxiv.org/abs/2508.13938
Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics
Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo
https://arxiv.org/abs/2509.14225
Reinforced Context Order Recovery for Adaptive Reasoning and Planning
Long Ma, Fangwei Zhong, Yizhou Wang
https://arxiv.org/abs/2508.13070
A Study on Thinking Patterns of Large Reasoning Models in Code Generation
Kevin Halim, Sin G. Teo, Ruitao Feng, Zhenpeng Chen, Yang Gu, Chong Wang, Yang Liu
https://arxiv.org/abs/2509.13758
Ensemble of Pre-Trained Models for Long-Tailed Trajectory Prediction
Divya Thuremella, Yi Yang, Simon Wanna, Lars Kunze, Daniele De Martini
https://arxiv.org/abs/2509.13914
G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Yongxin Guo, Wenbo Deng, Zhenglin Cheng, Xiaoying Tang
https://arxiv.org/abs/2508.13023
Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
Gia Khanh Nguyen, Yifeng Huang, Minh Hoai
https://arxiv.org/abs/2509.13939
Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
Hanna Woloszyn, Benjamin Gagl
https://arxiv.org/abs/2508.13769
Ask Good Questions for Large Language Models
Qi Wu, Zhongqi Lu
https://arxiv.org/abs/2508.14025 https://arxiv.org/pdf/2508.14025
ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim
https://arxiv.org/abs/2508.13680
Leveraging Large Language Models for Predictive Analysis of Human Misery
Bishanka Seal, Rahul Seetharaman, Aman Bansal, Abhilash Nandy
https://arxiv.org/abs/2508.12669
Fair-GPTQ: Bias-Aware Quantization for Large Language Models
Irina Proskurina, Guillaume Metzler, Julien Velcin
https://arxiv.org/abs/2509.15206
Generics and Default Reasoning in Large Language Models
James Ravi Kirkpatrick, Rachel Katharine Sterken
https://arxiv.org/abs/2508.13718
A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
Kian Tohidi, Kia Dashtipour, Simone Rebora, Sevda Pourfaramarz
https://arxiv.org/abs/2509.14922
Cross-Modal Knowledge Distillation for Speech Large Language Models
Enzhi Wang, Qicheng Li, Zhiyuan Tang, Yuhang Jia
https://arxiv.org/abs/2509.14930
ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang
https://arxiv.org/abs/2508.13426
LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models
Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang
https://arxiv.org/abs/2508.12733
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun Li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
https://arxiv.org/abs/2508.12903
MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models
Chalamalasetti Kranti, Sowmya Vajjala
https://arxiv.org/abs/2508.13526
LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu
https://arxiv.org/abs/2509.15218
CLEAR: A Comprehensive Linguistic Evaluation of Argument Rewriting by Large Language Models
Thomas Huber, Christina Niklaus
https://arxiv.org/abs/2509.15027
LLM-OREF: An Open Relation Extraction Framework Based on Large Language Models
Hongyao Tu, Liang Zhang, Yujie Lin, Xin Lin, Haibo Zhang, Long Zhang, Jinsong Su
https://arxiv.org/abs/2509.15089
SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Huy Nghiem, Advik Sachdeva, Hal Daumé III
https://arxiv.org/abs/2509.15174
Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, Timothy Dye
https://arxiv.org/abs/2509.15216
Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney
https://arxiv.org/abs/2509.14926
Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
Maciej Skorski, Alina Landowska
https://arxiv.org/abs/2508.13804
Large Language Model probabilities cannot distinguish between possible and impossible language
Evelina Leivada, Raquel Montero, Paolo Morosi, Natalia Moskvina, Tamara Serrano, Marcel Aguilar, Fritz Guenther
https://arxiv.org/abs/2509.15114
TR-MMLU Benchmark for Large Language Models: Performance Evaluation, Challenges, and Improvement Opportunities (original title in Turkish: Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları)
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Banu Diri, Savaş Yıldırım, Öner Aytaş
https://arxiv.org/abs/2508.13044
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
https://arxiv.org/abs/2508.13144
ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
https://arxiv.org/abs/2509.11963