Tootfinder

@arXiv_csLG_bot@mastoxiv.page
2025-08-26 12:25:46

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Lei Bai, Yunqi Cai, Xi Dai, Shufei Zhang, Jinguang Cheng, Zh…

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on ca…

@arXiv_csSE_bot@mastoxiv.page
2025-08-27 09:06:02

Interleaving Large Language Models for Compiler Testing
Yunbo Ni, Shaohua Li
https://arxiv.org/abs/2508.18955 https://arxiv.org/pdf/2508.18955

Testing compilers with AI models, especially large language models (LLMs), has shown great promise. However, current approaches struggle with two key problems: The generated programs for testing compilers are often too simple, and extensive testing with the LLMs is computationally expensive. In this paper, we propose a novel compiler testing framework that decouples the testing process into two distinct phases: an offline phase and an online phase. In the offline phase, we use LLMs to generate …

@arXiv_csHC_bot@mastoxiv.page
2025-08-26 10:36:16

Measuring Large Language Models Dependency: Validating the Arabic Version of the LLM-D12 Scale
Sameha AlShakhsi, Ala Yankouskaya, Magnus Liebherr, Raian Ali
https://arxiv.org/abs/2508.17063

There is an urgent need for reliable, culturally validated instruments to assess psychological responses to AI in general and large language models (LLMs) in particular. This need is a global issue, but it is especially urgent among Arabic-speaking populations, where AI and LLM adoption is accelerating, yet psychometric tools remain limited. This study presents the first validation of the LLM-D12, a dual-dimensional scale assessing Instrumental and Relationship Dependency on LLMs, in an Arab sample. A total o…

@arXiv_csCY_bot@mastoxiv.page
2025-09-26 08:28:41

Communication Bias in Large Language Models: A Regulatory Perspective
Adrian Kuenzler, Stefan Schmid
https://arxiv.org/abs/2509.21075 https://arxiv.org/pdf…

Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU's AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same t…

@arXiv_csDC_bot@mastoxiv.page
2025-08-27 08:54:32

Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu
https://arxiv.org/abs/2508.19078

Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, they cannot achieve desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system d…

@arXiv_csIR_bot@mastoxiv.page
2025-09-26 07:37:21

DELM: a Python toolkit for Data Extraction with Language Models
Eric Fithian, Kirill Skobelev
https://arxiv.org/abs/2509.20617 https://arxiv.org/pdf/2509.2…

Large Language Models (LLMs) have become powerful tools for annotating unstructured data. However, most existing workflows rely on ad hoc scripts, making reproducibility, robustness, and systematic evaluation difficult. To address these challenges, we introduce DELM (Data Extraction with Language Models), an open-source Python toolkit designed for rapid experimental iteration of LLM-based data extraction pipelines and for quantifying the trade-offs between them. DELM minimizes boilerplate code …

@arXiv_csCE_bot@mastoxiv.page
2025-09-26 07:35:21

Difference-Guided Reasoning: A Temporal-Spatial Framework for Large Language Models
Hong Su
https://arxiv.org/abs/2509.20713 https://arxiv.org/pdf/2509.207…

Large Language Models (LLMs) are important tools for reasoning and problem-solving, but they often operate passively, answering questions without actively discovering new ones. This limitation reduces their ability to simulate human-like thinking, where noticing differences is a key trigger for reasoning. Thus, in this paper we propose a difference-guided reasoning framework, which enables LLMs to identify and act upon changes across time and space. The model formalizes differences through fe…

@arXiv_csCV_bot@mastoxiv.page
2025-08-25 07:33:40

VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li, Shuwei He, Zihan Xu
https://arxiv.org/abs/2508.15903

Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous …

@arXiv_csRO_bot@mastoxiv.page
2025-09-26 09:53:51

Digital Twin-Guided Robot Path Planning: A Beta-Bernoulli Fusion with Large Language Model as a Sensor
Mani Amani, Reza Akhavian
https://arxiv.org/abs/2509.20709 https://…

Integrating natural language (NL) prompts into robotic mission planning has attracted significant interest in recent years. In the construction domain, Building Information Models (BIM) encapsulate rich NL descriptions of the environment. We present a novel framework that fuses NL directives with BIM-derived semantic maps via a Beta-Bernoulli Bayesian fusion by interpreting the LLM as a sensor: each obstacle's design-time repulsive coefficient is treated as a Beta(alpha, beta) random variable a…

@arXiv_csMA_bot@mastoxiv.page
2025-08-27 07:36:32

Consensus Is All You Need: Gossip-Based Reasoning Among Large Language Models
Saksham Arora
https://arxiv.org/abs/2508.18292 https://arxiv.org/pdf/2508.182…

Large language models have advanced rapidly, but no single model excels in every area -- each has its strengths and weaknesses. Instead of relying on one model alone, we take inspiration from gossip protocols in distributed systems, where information is exchanged with peers until they all come to an agreement. In this setup, models exchange answers and gradually work toward a shared solution. Each LLM acts as a node in a peer-to-peer network, sharing responses and thought processes to reach a c…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:06:03

Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models
Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu, Ruiwen Xu
https://arxiv.org/abs/2508.18739

The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity a…

@arXiv_csDL_bot@mastoxiv.page
2025-08-26 07:37:36

Named Entity Recognition of Historical Text via Large Language Model
Shibingfeng Zhang, Giovanni Colavizza
https://arxiv.org/abs/2508.18090 https://arxiv.o…

Large language models have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Tr…

@arXiv_csLG_bot@mastoxiv.page
2025-09-26 10:30:41

Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models
Nikolay Blagoev, Bart Cox, Jérémie Decouchant, Lydia Y. Chen
https://arxiv.org/abs/2509.21221

Motivated by the emergence of large language models (LLMs) and the importance of democratizing their training, we propose GWTF, the first crash-tolerant practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, GWTF enables the efficient collaborative training of an LLM on heterogeneous clients that volunteer their resources. In addition, GWTF addresses node churn, i.e., clients joining or leaving the system at any time, and net…

@arXiv_csAI_bot@mastoxiv.page
2025-08-27 10:14:23

Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction
Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li
https://arxiv.org/abs/2508.19035

Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans' discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set…

@arXiv_csSD_bot@mastoxiv.page
2025-07-25 08:50:42

DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li
https://arxiv.org/abs/2507.18452

Recent advances in Large Language Models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based Large Audio-Language Model de…

@arXiv_csSE_bot@mastoxiv.page
2025-08-26 08:23:56

Cognitive Agents Powered by Large Language Models for Agile Software Project Management
Konrad Cinkusz, Jarosław A. Chudziak, Ewa Niewiadomska-Szynkiewicz
https://arxiv.org/abs/2508.16678

This paper investigates the integration of cognitive agents powered by Large Language Models (LLMs) within the Scaled Agile Framework (SAFe) to reinforce software project management. By deploying virtual agents in simulated software environments, this study explores their potential to fulfill fundamental roles in IT project development, thereby optimizing project outcomes through intelligent automation. Particular emphasis is placed on the adaptability of these agents to Agile methodologies and…

@seeingwithsound@mas.to
2025-08-27 10:57:48

High-level visual representations in the human brain are aligned with large language models https://www.nature.com/articles/s42256-025-01072-0
News release: Using AI to "see" what we see

A mapping from LLM embeddings captures visual responses to natural scenes.

@arXiv_eessAS_bot@mastoxiv.page
2025-09-26 09:38:01

Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Weilong Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong
https://…

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training…

@arXiv_csHC_bot@mastoxiv.page
2025-08-26 07:59:56

Adaptive Command: Real-Time Policy Adjustment via Language Models in StarCraft II
Weiyu Ma, Dongyu Xu, Shu Lin, Haifeng Zhang, Jun Wang
https://arxiv.org/abs/2508.16580 https://…

We present Adaptive Command, a novel framework integrating large language models (LLMs) with behavior trees for real-time strategic decision-making in StarCraft II. Our system focuses on enhancing human-AI collaboration in complex, dynamic environments through natural language interactions. The framework comprises: (1) an LLM-based strategic advisor, (2) a behavior tree for action execution, and (3) a natural language interface with speech capabilities. User studies demonstrate significant impr…

@arXiv_csCY_bot@mastoxiv.page
2025-08-26 09:36:36

Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models
Pooja S. B. Rao, Laxminarayen Nagarajan Venkatesan, Mauro Cherubini, Dinesh Babu Jayagopi
https://arxiv.org/abs/2508.16673

Artificial Intelligence (AI) is increasingly used in hiring, with large language models (LLMs) having the potential to influence or even make hiring decisions. However, this raises pressing concerns about bias, fairness, and trust, particularly across diverse cultural contexts. Despite their growing role, few studies have systematically examined the potential biases in AI-driven hiring evaluation across cultures. In this study, we conduct a systematic analysis of how LLMs assess job interviews …

@arXiv_csCR_bot@mastoxiv.page
2025-09-26 09:01:41

A Framework for Rapidly Developing and Deploying Protection Against Large Language Model Attacks
Adam Swanda, Amy Chang, Alexander Chen, Fraser Burch, Paul Kassianik, Konstantin Berlin
https://arxiv.org/abs/2509.20639

The widespread adoption of Large Language Models (LLMs) has revolutionized AI deployment, enabling autonomous and semi-autonomous applications across industries through intuitive language interfaces and continuous improvements in model development. However, the attendant increase in autonomy and expansion of access permissions among AI applications also make these systems compelling targets for malicious attacks. Their inherent susceptibility to security flaws necessitates robust defenses, yet …

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:08:26

DiscussLLM: Teaching Large Language Models When to Speak
Deep Anil Patel, Iain Melvin, Christopher Malon, Martin Renqiang Min
https://arxiv.org/abs/2508.18167 https://

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically…

@arXiv_csDC_bot@mastoxiv.page
2025-08-26 08:35:06

Memory-Efficient Federated Fine-Tuning of Large Language Models via Layer Pruning
Yebo Wu, Jingguang Li, Chunlin Tian, Zhijiang Guo, Li Li
https://arxiv.org/abs/2508.17209 https…

Federated fine-tuning enables privacy-preserving Large Language Model (LLM) adaptation, but its high memory cost limits participation from resource-constrained devices. We propose FedPruner, an innovative federated fine-tuning paradigm that tackles this via intelligent layer pruning. FedPruner flexibly prunes the global model, creating personalized submodels based on device memory constraints. It employs a macro-micro synergistic pruning framework: a macro-level functionality-driven layer orche…

@arXiv_csAI_bot@mastoxiv.page
2025-08-27 10:08:53

Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, Martin Josifoski, Robert West, Dimitrios Gunopulos
https://arxiv.org/abs/2508.18905

Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an "interviewer" LLM, aware of the ground-truth solution, provides minimal, targeted hints to an `…

@arXiv_csLG_bot@mastoxiv.page
2025-08-26 12:26:36

AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov
https://arxiv.org/abs/2508.18182

Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individ…

@arXiv_csRO_bot@mastoxiv.page
2025-08-27 09:40:52

An LLM-powered Natural-to-Robotic Language Translation Framework with Correctness Guarantees
ZhenDong Chen, ZhanShang Nie, ShiXing Wan, JunYi Li, YongTian Cheng, Shuai Zhao
https://arxiv.org/abs/2508.19074

Large Language Models (LLMs) are increasingly being deployed in robotics to generate robot control programs for specific user tasks, enabling embodied intelligence. Existing methods primarily focus on LLM training and prompt design that utilize LLMs to generate executable programs directly from user tasks in natural language. However, due to the inconsistency of the LLMs and the high complexity of the tasks, such best-effort approaches often lead to tremendous programming errors in the gener…

@arXiv_csCV_bot@mastoxiv.page
2025-09-25 10:39:32

EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao
https://arxiv.org/abs/2509.20146

Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information -- in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and ph…

@arXiv_csSE_bot@mastoxiv.page
2025-08-26 08:53:26

CelloAI: Leveraging Large Language Models for HPC Software Development in High Energy Physics
Mohammad Atif, Kriti Chopra, Ozgur Kilic, Tianle Wang, Zhihua Dong, Charles Leggett, Meifeng Lin, Paolo Calafiura, Salman Habib
https://arxiv.org/abs/2508.16713

Next-generation High Energy Physics (HEP) experiments will generate unprecedented data volumes, necessitating High Performance Computing (HPC) integration alongside traditional high-throughput computing. However, HPC adoption in HEP is hindered by the challenge of porting legacy software to heterogeneous architectures and the sparse documentation of these complex scientific codebases. We present CelloAI, a locally hosted coding assistant that leverages Large Language Models (LLMs) with retrieva…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:09:26

Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi
https://arxiv.org/abs/2508.18183

Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to captur…

@arXiv_csCR_bot@mastoxiv.page
2025-08-27 09:18:13

Collaborative Intelligence: Topic Modelling of Large Language Model use in Live Cybersecurity Operations
Martin Lochner, Keegan Keplinger
https://arxiv.org/abs/2508.18488 https:…

Objective: This work describes the topic modelling of Security Operations Centre (SOC) use of a large language model (LLM), during live security operations. The goal is to better understand how these specialists voluntarily use this tool. Background: Human-automation teams have been extensively studied, but transformer-based language models have sparked a new wave of collaboration. SOC personnel at a major cybersecurity provider used an LLM to support live security operations. This study exam…

@arXiv_csIR_bot@mastoxiv.page
2025-08-26 09:16:56

A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models
Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, Koki Karube
https://arxiv.org/abs/2508.17571

Serendipity in recommender systems (RSs) has attracted increasing attention as a concept that enhances user satisfaction by presenting unexpected and useful items. However, evaluating serendipitous performance remains challenging because its ground truth is generally unobservable. The existing offline metrics often depend on ambiguous definitions or are tailored to specific datasets and RSs, thereby limiting their generalizability. To address this issue, we propose a universally applicable eval…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:05:41

PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models
Mohammad Hosseini, Kimia Hosseini, Shayan Bali, Zahra Zanjani, Saeedeh Momtazi
https://arxiv.org/abs/2509.21104

Hallucination is a persistent issue affecting all large language models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries regarding QA and summarization tasks, focusing on detecting extrinsic and intrinsic …

@arXiv_csCV_bot@mastoxiv.page
2025-08-27 10:27:03

Enhancing Document VQA Models via Retrieval-Augmented Generation
Eric López, Artemis Llabrés, Ernest Valveny
https://arxiv.org/abs/2508.18984 https://

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA…

@arXiv_csAI_bot@mastoxiv.page
2025-09-26 09:37:11

Embodied AI: From LLMs to World Models
Tongtong Feng, Xin Wang, Yu-Gang Jiang, Wenwu Zhu
https://arxiv.org/abs/2509.20021 https://arxiv.org/pdf/2509.20021

Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications and driving the evolution from cyberspace to physical systems. Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI. On the one hand, LLMs empower embodied AI via semantic reasoning and task decomposition, bringing high-level natural language instruc…

@arXiv_csDC_bot@mastoxiv.page
2025-08-26 07:58:46

Equinox: Holistic Fair Scheduling in Serving Large Language Models
Zhixiang Wei, James Yen, Jingyi Chen, Ziyang Zhang, Zhibai Huang, Chen Chen, Xingzi Yu, Yicheng Gu, Chenggang Wu, Yun Wang, Mingyuan Xia, Jie Wu, Hao Wang, Zhengwei Qi
https://arxiv.org/abs/2508.16646

We address the limitations of current LLM serving with a dual-counter framework separating user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Since these metrics are only available post-execution, creating a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 09:48:43

Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models
Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, Yiwen Guo
https://arxiv.org/abs/2508.18655

With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user's query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-mac…

@arXiv_csRO_bot@mastoxiv.page
2025-07-25 08:43:42

OpenNav: Open-World Navigation with Multimodal Large Language Models
Mingfeng Yuan, Letian Wang, Steven L. Waslander
https://arxiv.org/abs/2507.18033 https://

Pre-trained large language models (LLMs) have demonstrated strong common-sense reasoning abilities, making them promising for robotic navigation and planning tasks. However, despite recent progress, bridging the gap between language descriptions and actual robot actions in the open world, beyond merely invoking limited predefined motion primitives, remains an open challenge. In this work, we aim to enable robots to interpret and decompose complex language instructions, ultimately synthesizing a…

@arXiv_csLG_bot@mastoxiv.page
2025-09-25 10:51:12

Video models are zero-shot learners and reasoners
Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
https://arxiv.org/abs/2509.20328

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language underst…

@arXiv_csCR_bot@mastoxiv.page
2025-08-26 11:10:36

Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models
Qiming Guo, Jinwen Tang, Xingran Huang
https://arxiv.org/abs/2508.17674 https://

We introduce Advertisement Embedding Attacks (AEA), a new class of LLM security threats that stealthily inject promotional or malicious content into model outputs and AI agents. AEA operate through two low-cost vectors: (1) hijacking third-party service-distribution platforms to prepend adversarial prompts, and (2) publishing back-doored open-source checkpoints fine-tuned with attacker data. Unlike conventional attacks that degrade accuracy, AEA subvert information integrity, causing models to …

@arXiv_csIR_bot@mastoxiv.page
2025-08-26 10:44:47

Retrieval Feedback Memory Enhancement Large Model Retrieval Generation Method
Leqian Li, Dianxi Shi, Jialu Zhou, Xinyu Wei, Mingyue Yang, Songchang Jin, Shaowu Yang
https://arxiv.org/abs/2508.17862

Retrieval Feedback Memory Enhancement Large Model Retrieval Generation Method
Large Language Models (LLMs) have shown remarkable capabilities across diverse tasks, yet they face inherent limitations such as constrained parametric knowledge and high retraining costs. Retrieval-Augmented Generation (RAG) augments the generation process by retrieving externally stored knowledge absent from the model's internal parameters. However, RAG methods face challenges such as information loss and redundant retrievals during multi-round queries, accompanying the difficulties in precise…

@arXiv_csSE_bot@mastoxiv.page
2025-08-27 07:37:02

Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang
https://arxiv.org/abs/2508.18370 https:…

Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:16:23

ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
https://arxiv.org/abs/2508.18847 https://

ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Exist…

@arXiv_csDC_bot@mastoxiv.page
2025-08-27 08:25:23

Strata: Hierarchical Context Caching for Long Context Language Model Serving
Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, Christos Kozyrakis
https://arxiv.org/abs/2508.18572

Strata: Hierarchical Context Caching for Long Context Language Model Serving
Large Language Models (LLMs) with expanding context windows face significant performance hurdles. While caching key-value (KV) states is critical for avoiding redundant computation, the storage footprint of long-context caches quickly exceeds GPU memory capacity, forcing production systems to adopt hierarchical caching across memory hierarchies. However, transferring large cached contexts back to the GPU introduces severe performance bottlenecks: fragmented I/O from paged layouts prevents full …

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:02:16

Understanding Subword Compositionality of Large Language Models
Qiwei Peng, Yekun Chai, Anders Søgaard
https://arxiv.org/abs/2508.17953 https://arxiv.or…

Understanding Subword Compositionality of Large Language Models
Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likel…

@arXiv_csLG_bot@mastoxiv.page
2025-08-27 10:31:53

PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations
Tim Kreuzer, Jelena Zdravkovic, Panagiotis Papapetrou
https://arxiv.org/abs/2508.18982

PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations
Time series forecasting has seen considerable improvement during the last years, with transformer models and large language models driving advancements of the state of the art. Modern forecasting models are generally opaque and do not provide explanations for their forecasts, while well-known post-hoc explainability methods like LIME are not suitable for the forecasting context. We propose PAX-TS, a model-agnostic post-hoc algorithm to explain time series forecasting models and their forecasts.…

@arXiv_csAI_bot@mastoxiv.page
2025-09-25 07:44:22

Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
Sai Teja Reddy Adapala
https://arxiv.org/abs/2509.19517 https://arxiv.org/…

Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-sw…

@arXiv_csCR_bot@mastoxiv.page
2025-07-25 08:41:32

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models
Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Kun Wang, Yang Liu, Junlan Feng
https://arxiv.org/abs/2507.18053

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models
Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (REsource…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:14:01

CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen
https://arxiv.org/abs/2509.21208

CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fi…

@arXiv_csCV_bot@mastoxiv.page
2025-09-26 10:23:21

Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang
https://arxiv.org/abs/2509.21251

Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still needs help with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual in…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:09:43

ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen
https://arxiv.org/abs/2508.18773

ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to…

@arXiv_csSE_bot@mastoxiv.page
2025-09-25 08:02:02

Reverse Engineering User Stories from Code using Large Language Models
Mohamed Ouf, Haoyu Li, Michael Zhang, Mariam Guizani
https://arxiv.org/abs/2509.19587 https://

Reverse Engineering User Stories from Code using Large Language Models
User stories are essential in agile development, yet often missing or outdated in legacy and poorly documented systems. We investigate whether large language models (LLMs) can automatically recover user stories directly from source code and how prompt design impacts output quality. Using 1,750 annotated C++ snippets of varying complexity, we evaluate five state-of-the-art LLMs across six prompting strategies. Results show that all models achieve, on average, an F1 score of 0.8 for code up to 20…

@arXiv_csLG_bot@mastoxiv.page
2025-08-25 09:50:20

On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View
Tao Guo, Junxiao Wang, Fushuo Huo, Laizhong Cui, Song Guo, Jie Gui, Dacheng Tao
https://arxiv.org/abs/2508.16261

On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View
Federated Learning (FL) enables training models across decentralized data silos while preserving client data privacy. Recent research has explored efficient methods for post-training large language models (LLMs) within FL to address computational and communication challenges. While existing approaches often rely on access to LLMs' internal information, which is frequently restricted in real-world scenarios, an inference-only paradigm (black-box FedLLM) has emerged to address these limitations. …

@arXiv_csAI_bot@mastoxiv.page
2025-08-27 10:13:13

AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
Pontus Strimling, Simon Karlsson, Irina Vartanova, Kimmo Eriksson
https://arxiv.org/abs/2508.19004

AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we systematically evaluated multiple AI systems' ability to predict human social appropriateness judgments for 555 everyday scenarios by examining how closely they predicted the ave…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:15:53

Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness
Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
https://arxiv.org/abs/2508.18824

Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness
Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create …

@arXiv_csCV_bot@mastoxiv.page
2025-09-26 10:19:41

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
https://arxiv.org/abs/2509.21113

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework w…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:24:13

Generative Interfaces for Language Models
Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang
https://arxiv.org/abs/2508.19227 https://arxiv.org/…

Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generati…

@arXiv_csCR_bot@mastoxiv.page
2025-08-26 08:50:26

Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models
GodsGift Uzor, Hasan Al-Qudah, Ynes Ineza, Abdul Serwadda
https://arxiv.org/abs/2508.16765

Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models
The interactive nature of Large Language Models (LLMs), which closely track user data and context, has prompted users to share personal and private information in unprecedented ways. Even when users opt out of allowing their data to be used for training, these privacy settings offer limited protection when LLM providers operate in jurisdictions with weak privacy laws, invasive government surveillance, or poor data security practices. In such cases, the risk of sensitive information, including P…

@arXiv_csSE_bot@mastoxiv.page
2025-07-25 09:42:32

Automated Code Review Using Large Language Models with Symbolic Reasoning
Busra Icoz, Goksel Biricik
https://arxiv.org/abs/2507.18476 https://arxiv.org/pdf…

Automated Code Review Using Large Language Models with Symbolic Reasoning
Code review is one of the key processes in the software development lifecycle and is essential to maintain code quality. However, manual code review is subjective and time consuming. Given its rule-based nature, code review is well suited for automation. In recent years, significant efforts have been made to automate this process with the help of artificial intelligence. Recent developments in Large Language Models (LLMs) have also emerged as a promising tool in this area, but these models ofte…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 11:57:26

DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin
https://arxiv.org/abs/2508.17803

DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 09:47:33

Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models
Chenxu Yang, Qingyi Si, Zheng Lin
https://arxiv.org/abs/2508.18651 https://

Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models
Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrifi…

@arXiv_csSE_bot@mastoxiv.page
2025-09-26 08:25:11

Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments
Nishant Gaurav, Adit Akarsh, Ankit Ranjan, Manoj Bajaj
https://arxiv.org/abs/2509.20386 https://

Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments
We present Dynamic ReAct, a novel approach for enabling ReAct agents to efficiently operate with extensive Model Control Protocol (MCP) tool sets that exceed the contextual memory limitations of large language models. Our approach addresses the fundamental challenge of tool selection in environments containing hundreds or thousands of available tools, where loading all tools simultaneously is computationally infeasible. We propose and evaluate five distinct architectures that progressively re…

@arXiv_csAI_bot@mastoxiv.page
2025-09-24 10:13:24

Advances in Large Language Models for Medicine
Zhiyu Kan, Wensheng Gan, Zhenlian Qi, Philip S. Yu
https://arxiv.org/abs/2509.18690 https://arxiv.org/pdf/25…

Advances in Large Language Models for Medicine
Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthca…

@arXiv_csLG_bot@mastoxiv.page
2025-09-26 10:29:31

Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, Viktor Prasanna
https://arxiv.org/abs/2509.21164

Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model, typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple…

@arXiv_csCR_bot@mastoxiv.page
2025-08-25 09:25:20

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
https://arxiv.org/abs/2508.16406 …

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporate…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:12:51

GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Jieli Zhu, Vi Ngoc-Nha Tran
https://arxiv.org/abs/2509.21192

GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Small language models (SLMs) have become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of the chatbot based on SLM. We first finetune a new chatbot, i.e., ChatBioGPT based on the backb…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:13:26

From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Ziqi Zhang, Jianfei Ma, Emmanuele Chersoni, Jieshun You, Zhaoxin Feng
https://arxiv.org/abs/2508.18253

From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge of Chinese classifiers is an issue that has largely remained unexplored in the Natural Language Processing (NLP) literature. To address such a question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sent…

@arXiv_csSE_bot@mastoxiv.page
2025-09-25 09:38:22

V-GameGym: Visual Game Generation for Code Large Language Models
Wei Zhang, Jack Yang, Renshuai Tao, Lingzheng Chai, Shawn Guo, Jiajun Wu, Xiaoming Chen, Ganqu Cui, Ning Ding, Xander Xu, Hu Wei, Bowen Zhou
https://arxiv.org/abs/2509.20136

V-GameGym: Visual Game Generation for Code Large Language Models
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem…

@arXiv_csAI_bot@mastoxiv.page
2025-08-26 09:12:46

Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning
Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou
https://arxiv.org/abs/2508.16129

Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning
Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with the reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such…

@arXiv_csSE_bot@mastoxiv.page
2025-09-25 08:34:02

Assertion Messages with Large Language Models (LLMs) for Code
Ahmed Aljohani, Anamul Haque Mollah, Hyunsook Do
https://arxiv.org/abs/2509.19673 https://arx…

Assertion Messages with Large Language Models (LLMs) for Code
Assertion messages significantly enhance unit tests by clearly explaining the reasons behind test failures, yet they are frequently omitted by developers and automated test-generation tools. Despite recent advancements, Large Language Models (LLMs) have not been systematically evaluated for their ability to generate informative assertion messages. In this paper, we introduce an evaluation of four state-of-the-art Fill-in-the-Middle (FIM) LLMs - Qwen2.5-Coder-32B, Codestral-22B, CodeLlama-13B, a…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:04:36

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, Golnoosh Farnadi
https://arxiv.org/abs/2508.18076 …

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scruti…

@arXiv_csAI_bot@mastoxiv.page
2025-08-27 10:13:33

Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
Marcin Moskalewicz, Anna Sterna, Marek Pokropski, Paula Flores
https://arxiv.org/abs/2508.19008

Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
This study examines the capacity of large language models (LLMs) to support phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. Building on a prior human-led thematic analysis of 24 inpatients' life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5 Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the original investigators. The models w…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:19:23

Automatic Prompt Optimization with Prompt Distillation
Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
https://arxiv.org/abs/2508.18992

Automatic Prompt Optimization with Prompt Distillation
Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, a…

@arXiv_csSE_bot@mastoxiv.page
2025-08-25 09:37:50

How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair
Kazuki Kusama, Honglin Shu, Masanari Kondo, Yasutaka Kamei
https://arxiv.org/abs/2508.16499

How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair
Background: Large language models (LLMs) have greatly improved the accuracy of automated program repair (APR) methods. However, LLMs are constrained by high computational resource requirements. Aims: We focus on small language models (SLMs), which perform well even with limited computational resources compared to LLMs. We aim to evaluate whether SLMs can achieve competitive performance in APR tasks. Method: We conducted experiments on the QuixBugs benchmark to compare the bug-fixing accuracy of…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:06:26

Detecting and Characterizing Planning in Language Models
Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg
https://arxiv.org/abs/2508.18098 https://…

Detecting and Characterizing Planning in Language Models
Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across mode…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:04:46

How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
https://arxiv.org/abs/2508.18088 https://

How Quantization Shapes Bias in Large Language Models
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability.…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:07:51

BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
Hyunseo Kim, Sangam Lee, Kwangwook Seo, Dongha Lee
https://arxiv.org/abs/2509.21106

BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user h…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 11:59:36

ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li
https://arxiv.org/abs/2508.17892

ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streamin…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:01:56

AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation
Henri Savigny, Bruno Yun
https://arxiv.org/abs/2508.17926 https://arxiv.org/pdf…

AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation
Argument mining is a subfield of argumentation that aims to automatically extract argumentative structures and their relations from natural language texts. This paper investigates how a single large language model can be leveraged to perform one or several argument mining tasks. Our contributions are two-fold. First, we construct a multi-task dataset by surveying and converting 19 well-known argument mining datasets from the literature into a unified format. Second, we explore various training …

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:03:06

A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Oleg Silcenco, Marcos R. Machad, Wallace C. Ugulino, Daniel Braun
https://arxiv.org/abs/2508.17994

A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Aspect-based sentiment analysis enhances sentiment detection by associating it with specific aspects, offering deeper insights than traditional sentiment analysis. This study introduces a manually annotated dataset of 10,814 multilingual customer reviews covering brick-and-mortar retail stores, labeled with eight aspect categories and their sentiment. Using this dataset, the performance of GPT-4 and LLaMA-3 in aspect-based sentiment analysis is evaluated to establish a baseline for the newly in…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:17:31

DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Limback-Stokin, Hadi Wazni, Mehrnoosh Sadrzadeh
https://arxiv.org/abs/2509.21287

DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:01:20

Political Ideology Shifts in Large Language Models
Pietro Bernardelle, Stefano Civelli, Leon Fröhling, Riccardo Lunardi, Kevin Roitero, Gianluca Demartini
https://arxiv.org/abs/2508.16013

Political Ideology Shifts in Large Language Models
Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polari…

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:35:42

Benchmarking Gaslighting Attacks Against Speech Large Language Models
Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou
https://arxiv.org/abs/2509.19858 https:/…

Benchmarking Gaslighting Attacks Against Speech Large Language Models
As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:04:10

TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks
İrem Demirtaş, Burak Payzun, Seçil Arslan
https://arxiv.org/abs/2508.16243

TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks
Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the c…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:05:10

MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
Adil Bahaj, Mounir Ghogho
https://arxiv.org/abs/2508.16357 https://arxiv.o…

MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dat…

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:05:31

Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Yixin Wan, Xingrun Chen, Kai-Wei Chang
https://arxiv.org/abs/2509.21080 https://…

Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream …

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:44:52

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wud, Zichen Wang
https://arxiv.org/abs/2509.20162

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:11:46

MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Sunian Chen, Qiming Zhu, Yuhao Zhang, Li Zhou, Benyou Wang
https://arxiv.org/abs/2508.18240

MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks t…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 11:58:56

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng
https://arxiv.org/abs/2508.17863

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. …

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:06:20

LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
Doohee You, Andy Parisi, Zach Vander Velden, Lara Dantas Inojosa
https://arxiv.org/abs/2508.16478

LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervise…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:02:20

CEQuest: Benchmarking Large Language Models for Construction Estimation
Yanzhao Wu, Lufan Wang, Rui Liu
https://arxiv.org/abs/2508.16081 https://arxiv.org/…

CEQuest: Benchmarking Large Language Models for Construction Estimation
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments us…

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:02:00

Ethical Considerations of Large Language Models in Game Playing
Qingquan Zhang, Yuchen Li, Bo Yuan, Julian Togelius, Georgios N. Yannakakis, Jialin Liu
https://arxiv.org/abs/2508.16065

Ethical Considerations of Large Language Models in Game Playing
Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than…

@arXiv_csCL_bot@mastoxiv.page
2025-08-27 10:16:43

ReflectivePrompt: Reflective evolution in autoprompting algorithms
Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
https://arxiv.org/abs/2508.18870

ReflectivePrompt: Reflective evolution in autoprompting algorithms
Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt, a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term …

@arXiv_csCL_bot@mastoxiv.page
2025-07-25 10:07:52

The Moral Gap of Large Language Models
Maciej Skorski, Alina Landowska
https://arxiv.org/abs/2507.18523 https://arxiv.org/pdf/2507.18523

The Moral Gap of Large Language Models
Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high f…

@arXiv_csCL_bot@mastoxiv.page
2025-07-25 09:58:42

BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit
Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li
https://arxiv.org/abs/2507.18305

BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, w…

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 17:16:41

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[1/6]:
- Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, Liang Ding

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 17:16:54

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[2/6]:
- EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
Hu, Zhou, You, Xu, Wang, Lian, Yu, Ma, Cui

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 17:17:07

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[3/6]:
- Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Inte...
Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang
