Tootfinder

@samvarma@fosstodon.org
2025-06-04 15:32:47

This author is invaluable to me because they always have a fresh take that I haven't seen anywhere else. Was a fave follow on the bad place.
In this case, re #LLMs
#AI #LLM

Beyond Nudge
LLMs ensure their survival by showing us that we can all find meaning in our lives so long as we keep talking with the LLMs.

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 18:04:31

This https://arxiv.org/abs/2505.07453 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

How well do LLMs reason over tabular data, really?
Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning cap…

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:52:11

MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Purbesh Mitra, Sennur Ulukus
https://arxiv.org/abs/2507.02851 https://a…

Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with a…

@arXiv_csCR_bot@mastoxiv.page
2025-07-04 09:57:01

Early Signs of Steganographic Capabilities in Frontier LLMs
Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
https://arxiv.org/abs/2507.02737

Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode sh…

@arXiv_csIR_bot@mastoxiv.page
2025-06-05 09:41:33

This https://arxiv.org/abs/2505.20730 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csIR_…

What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals
User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs' ability to leverage user-item interaction data. We further in…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:24:15

CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking
Neeva Oza, Ishaan Govil, Parul Gupta, Dinesh Khandelwal, Dinesh Garg, Parag Singla
https://arxiv.org/abs/2506.04019

LLMs have been extensively used for the task of automated code generation. In this work, we examine the applicability of LLMs for the related but relatively unexplored task of code-equivalence checking, i.e., given two programs, whether they are functionally equivalent or not. This is an important problem since benchmarking code equivalence can play a critical role in evaluating LLM capabilities for tasks such as code re-writing and code translation. Towards this end, we present CETBench - Code…

@arXiv_csLG_bot@mastoxiv.page
2025-06-05 11:01:27

This https://arxiv.org/abs/2506.02965 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs
Mixture-of-Experts (MoE) has been gaining popularity due to its successful adaptation to large language models (LLMs). In this work, we introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE), which leverages the sparsity of the MoE architecture for memory-efficient decentralized collaborative LLM training, enabling multiple parties with limited GPU-memory and data resources to collectively train more capable LLMs than they could achieve individually. At the same time, this appro…

@arXiv_csAR_bot@mastoxiv.page
2025-06-04 07:17:33

CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge
Chunlin Tian, Xinpeng Qin, Kahou Tam, Li Li, Zijian Wang, Yuanzhe Zhao, Minglei Zhang, Chengzhong Xu
https://arxiv.org/abs/2506.02847

Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited storage, weight, and power of edge devices make it difficult to deploy LLM-powered applications. These devices must balance latency requirements with energy consumption and model accuracy. In this paper, we first quantify the challenges of deploying LLMs on off-the-shelf edge devices and then we present CLONE, an in-depth algorithm-hardware co-design at…

@arXiv_csDB_bot@mastoxiv.page
2025-06-04 13:32:44

This https://arxiv.org/abs/2501.04901 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csDB_…

ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries
In recent years, large language models (LLMs) have demonstrated remarkable capabilities in comprehending and generating natural language content, attracting widespread attention in both industry and academia. An increasing number of services offer LLMs for various tasks via APIs. Different LLMs demonstrate expertise in different domains of queries (e.g., text classification queries). Meanwhile, LLMs of different scales, complexities, and performance are priced diversely. Driven by this, several…

@arXiv_csHC_bot@mastoxiv.page
2025-06-03 16:34:01

This https://arxiv.org/abs/2503.16456 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csHC_…

Position: Beyond Assistance -- Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care
This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured …

@arXiv_csDC_bot@mastoxiv.page
2025-06-04 07:27:34

NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs
Haeun Lee, Omin Kwon, Yeonhong Park, Jae W. Lee
https://arxiv.org/abs/2506.02024

Large Language Models (LLMs) are playing a crucial role in latency-critical, high-throughput services like virtual assistants and code generation. While techniques such as continuous batching and paged attention address service-level objectives (SLOs), and quantization methods accelerate inference, the dynamic and efficient adaptation of precision at runtime remains a significant, largely underexplored challenge. The emergence of hardware support for FP8 arithmetic, offering up to 2x the throug…

@tante@tldr.nettime.org
2025-06-03 14:38:54

This is such a perfect analogy.
My goto is "asbestos". Super useful invention which bit us in the ass afterwards.
https://xoxo.zone/@annika/114614639082253074

Annika Backstrom (@annika@xoxo.zone)
LLMs are the cars of the computing world: they seem convenient, but the tradeoffs are not immediately obvious, and if you come to depend on them you may find it hard to break the habit

@castarco@hachyderm.io
2025-05-04 22:28:42

Does anyone have the impression that virtually all LLMs use a sort of "hyper-allistic" language?
As if we had a spectrum for allism disorder and LLMs were an extreme case of it.

@arXiv_csCV_bot@mastoxiv.page
2025-07-04 10:24:31

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, Kaiyang Zhou
https://arxiv.org/abs/2507.02859

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images…

@arXiv_csIT_bot@mastoxiv.page
2025-07-04 08:24:21

On the Convergence of Large Language Model Optimizer for Black-Box Network Management
Hoon Lee, Wentao Zhou, Merouane Debbah, Inkyu Lee
https://arxiv.org/abs/2507.02689

Future wireless networks are expected to incorporate diverse services that often lack general mathematical models. To address such black-box network management tasks, the large language model (LLM) optimizer framework, which leverages pretrained LLMs as optimization agents, has recently been promoted as a promising solution. This framework utilizes natural language prompts describing the given optimization problems along with past solutions generated by LLMs themselves. As a result, LLMs can ob…

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:36:41

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Ken Tsui
https://arxiv.org/abs/2507.02778 https://

Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework t…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 17:48:46

This https://arxiv.org/abs/2501.07071 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, there still lack evaluations of LLMs values that fulfill three desirable goals. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evaluations focus narrowly on safety risks such as bias and toxicity. (2) Evaluation Validity: Existing …

@arXiv_statME_bot@mastoxiv.page
2025-06-04 14:00:58

This https://arxiv.org/abs/2505.19145 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_sta…

Do Large Language Models (Really) Need Statistical Foundations?
Large language models (LLMs) represent a new paradigm for processing unstructured data, with applications across an unprecedented range of domains. In this paper, we address, through two arguments, whether the development and application of LLMs would genuinely benefit from foundational contributions from the statistics discipline. First, we argue affirmatively, beginning with the observation that LLMs are inherently statistical models due to their profound data dependency and stochastic genera…

@arXiv_csPL_bot@mastoxiv.page
2025-06-05 09:41:06

This https://arxiv.org/abs/2405.08965 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csPL_…

Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications
Software development is shifting from traditional logical programming to model-integrated applications that leverage generative AI and large language models (LLMs) during runtime. However, integrating LLMs remains complex, requiring developers to manually craft prompts and process outputs. Existing tools attempt to assist with prompt engineering, but often introduce additional complexity. This paper presents Meaning-Typed Programming (MTP) model, a novel paradigm that abstracts LLM integratio…

@arXiv_physicssocph_bot@mastoxiv.page
2025-06-03 16:49:23

This https://arxiv.org/abs/2407.04503 has been replaced.
initial toot: https://mastoxiv.page/@arX…

When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings
As large language models (LLMs) start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplifi…

@arXiv_csRO_bot@mastoxiv.page
2025-06-05 10:00:55

This https://arxiv.org/abs/2505.20573 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csRO_…

Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners
Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce invalid action plans that violate physical constraints, such as directing a robot to an unreachable location or causing collisions between robots. This issue primarily arises from a lack of awareness of these physical constraints during the reasoning process. T…

@samir@functional.computer
2025-06-03 20:48:17

If LLMs were so good at writing code, they wouldn’t need a new thought leader yelling about them every day.
They might be. At this point, I do not care. Lots of people (including, most recently, Ptacek, Yegge, etc.) are trying to sell me something and I have no interest in listening.
If your thing is good, show, don’t tell.
But it’s not, is it?
These articles… you’re not trying to convince me, you’re trying to convince yourselves.
So please: keep them to yoursel…

@arXiv_csCY_bot@mastoxiv.page
2025-06-04 13:34:40

This https://arxiv.org/abs/2506.00095 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCY_…

ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
Hepato-pancreato-biliary (HPB) disorders represent a global public health challenge due to their high morbidity and mortality. Although large language models (LLMs) have shown promising performance in general medical question-answering tasks, the current evaluation benchmarks are mostly derived from standardized examinations or manually designed questions, lacking HPB coverage and clinical cases. To address these issues, we systematically establish an HPB disease evaluation benchmark comprising…

@arXiv_csIR_bot@mastoxiv.page
2025-07-04 08:10:01

When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search
William A. Ingram, Bipasha Banerjee, Edward A. Fox
https://arxiv.org/abs/2507.02139 …

Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines, especially in domains lacking human-labeled data. However, different models often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval. This study examines labeling disagreement between two open-weight LLMs, LLaMA and Qwen, on a corpus of scholarly abstracts related to Sustainable Development Goals (SDGs) 1, 3, and 7. We is…

@arXiv_csLG_bot@mastoxiv.page
2025-06-05 10:59:18

This https://arxiv.org/abs/2505.24298 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers fro…

@arXiv_csCR_bot@mastoxiv.page
2025-07-04 09:12:41

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
https://arxiv.org/abs/2507.02332

This paper investigates privacy jailbreaking in LLMs via steering, focusing on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy related queries (e.g., a certain public figure's sexual orientation). We begin by identifying attention heads predictive of refusal behavior for private attributes (e.g., sexual orientation) using lightweight linear probes trained with privacy evaluator labels. Next, we steer the activations of a small subset of these at…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:23:54

VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
Yuansheng Ni, Ping Nie, Kai Zou, Xiang Yue, Wenhu Chen
https://arxiv.org/abs/2506.03930

Large language models (LLMs) often struggle with visualization tasks like plotting diagrams, charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K exampl…

@tante@tldr.nettime.org
2025-07-03 09:48:09

"LLMs are okay at coding, but at scale they build jumbled messes. I’ve scaled back my use of AI when coding and gone back to using my brain and pen and paper."
https://albertofortin.com/writing/coding-with-ai

After months of coding with LLMs, I'm going back to using my brain • albertofortin.com
I've been building MVPs and SaaS products for 15 years. Let's work together on your next project.

@arXiv_csAI_bot@mastoxiv.page
2025-07-04 07:31:41

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs
Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, Peng Cheng, Yunzhou Wang, Pengyi Liao, Hanrui Huang, Bin Wang, Jianye Hao, Mark Coates
https://

Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. However, current models remain inefficient at reasoning: they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a tw…

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:31:41

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
https://arxiv.org/abs/2507.02694

Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGe…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:23:42

Boosting Open-Source LLMs for Program Repair via Reasoning Transfer and LLM-Guided Reinforcement Learning
Xunzhu Tang, Jacques Klein, Tegawendé F. Bissyandé
https://arxiv.org/abs/2506.03921

Several closed-source LLMs have consistently outperformed open-source alternatives in program repair tasks, primarily due to their superior reasoning capabilities and extensive pre-training. This paper introduces Repairity, a novel three-stage methodology that significantly narrows this performance gap through reasoning extraction and reinforcement learning. Our approach: (1) systematically filters high-quality reasoning traces from closed-source models using correctness verification, (2) trans…

@arXiv_csCR_bot@mastoxiv.page
2025-07-04 07:43:21

MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation
Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang
https://arxiv.org/abs/2507.02057

Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software development, raising significant security concerns. While LLM providers have implemented alignment mechanisms to prevent direct generation of overtly malicious code, these safeguards predominantly evaluate individual prompts in isolation, overlooking a critical vulnerability: malicious operations can be systema…

@arXiv_csAR_bot@mastoxiv.page
2025-06-03 07:17:05

ReTern: Exploiting Natural Redundancy and Sign Transformations for Enhanced Fault Tolerance in Compute-in-Memory based Ternary LLMs
Akul Malhotra, Sumeet Kumar Gupta
https://arxiv.org/abs/2506.01140

Ternary large language models (LLMs), which utilize ternary precision weights and 8-bit activations, have demonstrated competitive performance while significantly reducing the high computational and memory requirements of full-precision LLMs. The energy efficiency and performance of Ternary LLMs can be further improved by deploying them on ternary computing-in-memory (TCiM) accelerators, thereby alleviating the von-Neumann bottleneck. However, TCiM accelerators are prone to memory stuck-at faul…

@arXiv_csHC_bot@mastoxiv.page
2025-06-05 07:18:43

Sampling Preferences Yields Simple Trustworthiness Scores
Sean Steinle
https://arxiv.org/abs/2506.03399 https://arxiv.org/pdf/2506.03…

With the onset of large language models (LLMs), the performance of artificial intelligence (AI) models is becoming increasingly multi-dimensional. Accordingly, there have been several large, multi-dimensional evaluation frameworks put forward to evaluate LLMs. Though these frameworks are much more realistic than previous attempts which only used a single score like accuracy, multi-dimensional evaluations can complicate decision-making since there is no obvious way to select an optimal model. Th…

@arXiv_csRO_bot@mastoxiv.page
2025-06-04 14:08:57

This https://arxiv.org/abs/2506.01538 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csRO_…

LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation
Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LA…

@arXiv_csCY_bot@mastoxiv.page
2025-06-05 09:38:07

This https://arxiv.org/abs/2506.00095 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCY_…

@arXiv_csAI_bot@mastoxiv.page
2025-07-04 09:11:41

Data Diversification Methods In Alignment Enhance Math Performance In LLMs
Berkan Dokmeci, Qingyang Wu, Ben Athiwaratkun, Ce Zhang, Shuaiwen Leon Song, James Zou
https://arxiv.org/abs/2507.02173

While recent advances in preference learning have enhanced alignment in human feedback, mathematical reasoning remains a persistent challenge. We investigate how data diversification strategies in preference optimization can improve the mathematical reasoning abilities of large language models (LLMs). We evaluate three common data generation methods: temperature sampling, Chain-of-Thought prompting, and Monte Carlo Tree Search (MCTS), and introduce Diversified-ThinkSolve (DTS), a novel structur…

@arXiv_csLG_bot@mastoxiv.page
2025-06-03 21:51:54

This https://arxiv.org/abs/2505.19433 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehens…

@arXiv_csIR_bot@mastoxiv.page
2025-06-05 07:19:10

GORACS: Group-level Optimal Transport-guided Coreset Selection for LLM-based Recommender Systems
Tiehua Mei, Hengrui Chen, Peng Yu, Jiaqing Liang, Deqing Yang
https://arxiv.org/abs/2506.04015

Although large language models (LLMs) have shown great potential in recommender systems, the prohibitive computational costs for fine-tuning LLMs on entire datasets hinder their successful deployment in real-world scenarios. To develop affordable and effective LLM-based recommender systems, we focus on the task of coreset selection which identifies a small subset of fine-tuning data to optimize the test loss, thereby facilitating efficient LLMs' fine-tuning. Although there exist some intuitive …

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:21:33

Fault Localisation and Repair for DL Systems: An Empirical Study with LLMs
Jinhan Kim, Nargiz Humbatova, Gunel Jahangirova, Shin Yoo, Paolo Tonella
https://arxiv.org/abs/2506.03396

Numerous Fault Localisation (FL) and repair techniques have been proposed to address faults in Deep Learning (DL) models. However, their effectiveness in practical applications remains uncertain due to the reliance on pre-defined rules. This paper presents a comprehensive evaluation of state-of-the-art FL and repair techniques, examining their advantages and limitations. Moreover, we introduce a novel approach that harnesses the power of Large Language Models (LLMs) in localising and repairing …

@arXiv_csCL_bot@mastoxiv.page
2025-07-03 10:17:10

The Thin Line Between Comprehension and Persuasion in LLMs
Adrian de Wynter, Tangming Yuan
https://arxiv.org/abs/2507.01936 https://a…

Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being fast deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts on their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs' ability to maintain a debate--one of the purest yet most complex forms of human communic…

@arXiv_csCR_bot@mastoxiv.page
2025-06-04 07:26:46

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka, Nitesh Saxena
https://arxiv.org/abs/2506.02479

The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs), has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we …

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 16:10:09

This https://arxiv.org/abs/2406.13945 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. There have been some early explorations about the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for urban re…

@arXiv_csHC_bot@mastoxiv.page
2025-06-03 16:54:22

This https://arxiv.org/abs/2503.18792 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csHC_…

REALM: A Dataset of Real-World LLM Use Cases
Large Language Models (LLMs), such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations. However, a comprehensive understanding of their real-world applications remains limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and…

@arXiv_csCY_bot@mastoxiv.page
2025-06-03 07:26:13

The World As Large Language Models See It: Exploring the reliability of LLMs in representing geographical features
Omid Reza Abbasi, Franz Welscher, Georg Weinberger, Johannes Scholz
https://arxiv.org/abs/2506.00203

The World As Large Language Models See It: Exploring the reliability of LLMs in representing geographical features
As large language models (LLMs) continue to evolve, questions about their trustworthiness in delivering factual information have become increasingly important. This concern also applies to their ability to accurately represent the geographic world. With recent advancements in this field, it is relevant to consider whether and to what extent LLMs' representations of the geographical world can be trusted. This study evaluates the performance of GPT-4o and Gemini 2.0 Flash in three key geospatial …

@arXiv_csLG_bot@mastoxiv.page
2025-06-05 11:00:19

This https://arxiv.org/abs/2506.00486 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during train…

@arXiv_csIR_bot@mastoxiv.page
2025-06-05 07:18:47

ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, Jing Li
https://arxiv.org/abs/2506.03487

ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
Reranking is fundamental to information retrieval and retrieval-augmented generation, with recent Large Language Models (LLMs) significantly advancing reranking quality. While recent advances with LLMs have significantly improved document reranking quality, current approaches primarily rely on large-scale LLMs (>7B parameters) through zero-shot prompting, presenting high computational costs. Small Language Models (SLMs) offer a promising alternative because of their efficiency, but our prelimin…

@arXiv_csSE_bot@mastoxiv.page
2025-06-03 16:16:02

This https://arxiv.org/abs/2401.16310 has been replaced.

An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors
Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this study, we conducted an empirical study to explore the potential of LLMs in detecting security defect…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 09:44:15

This https://arxiv.org/abs/2506.02658 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

Computational Thinking Reasoning in Large Language Models
While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they often struggle with complex tasks that require specific thinking paradigms, such as divide-and-conquer and procedural deduction, etc. Previous research integrates external, reliable tools to alleviate logical inconsistencies and hallucinations in LLMs' problem-solving processes. However, we argue that the root challenge is more profound: LLMs lack the complex thinking paradigms (i.e., computational thin…

@arXiv_csCY_bot@mastoxiv.page
2025-06-03 07:20:02

Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
Sina Amirrajab, Volker Vehof, Michael Bietenbeck, Ali Yilmaz
https://arxiv.org/abs/2506.00060

Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
Purpose: We investigated the utilization of privacy-preserving, locally-deployed, open-source Large Language Models (LLMs) to extract diagnostic information from free-text cardiovascular magnetic resonance (CMR) reports. Materials and Methods: We evaluated nine open-source LLMs on their ability to identify diagnoses and classify patients into various cardiac diagnostic categories based on descriptive findings in 109 clinical CMR reports. Performance was quantified using standard classification …

@arXiv_csHC_bot@mastoxiv.page
2025-07-04 09:10:51

Misaligned from Within: Large Language Models Reproduce Our Double-Loop Learning Blindness
Tim Rogers, Ben Teehankee
https://arxiv.org/abs/2507.02283 https…

Misaligned from Within: Large Language Models Reproduce Our Double-Loop Learning Blindness
This paper examines a critical yet unexplored dimension of the AI alignment problem: the potential for Large Language Models (LLMs) to inherit and amplify existing misalignments between human espoused theories and theories-in-use. Drawing on action science research, we argue that LLMs trained on human-generated text likely absorb and reproduce Model 1 theories-in-use - a defensive reasoning pattern that both inhibits learning and creates ongoing anti-learning dynamics at the dyad, group, and or…

@arXiv_csLG_bot@mastoxiv.page
2025-06-03 21:37:23

This https://arxiv.org/abs/2505.03793 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question largely remains nascent: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In …

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:19:57

Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs
Jiandong Shao, Yao Lu, Jianfei Yang
https://arxiv.org/abs/2506.01734 https://

Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs
Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law -- a statistical pattern where lower digits occur more frequently as leading digits -- we hypothesize that the long-tailed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whe…

@arXiv_csCR_bot@mastoxiv.page
2025-06-04 13:33:00

This https://arxiv.org/abs/2404.16873 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained …

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 17:41:46

This https://arxiv.org/abs/2412.13147 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

Are Your LLMs Capable of Stable Reasoning?
The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluat…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 07:31:51

Computational Thinking Reasoning in Large Language Models
Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Jingjing Xu, Hao Zhu, Lecheng Wang, Jia Li, Yihong Dong, Jing Mai, Bin Gu, Zhi Jin
https://arxiv.org/abs/2506.02658

@arXiv_csCY_bot@mastoxiv.page
2025-06-03 07:20:41

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoub far, Seyed Amir Ahmad Safavi-Naini, Ali Soroush
https://arxiv.org/abs/2506.00072

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Exper…

@arXiv_csCL_bot@mastoxiv.page
2025-07-04 09:42:51

Multimodal Mathematical Reasoning with Diverse Solving Perspective
Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen
https://arxiv.org/abs/2507.02804

Multimodal Mathematical Reasoning with Diverse Solving Perspective
Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution tra…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 17:38:17

This https://arxiv.org/abs/2412.11934 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

Stepwise Reasoning Error Disruption Attack of LLMs
Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but their safety and robustness in reasoning processes remain underexplored. Existing attacks on LLM reasoning are constrained by specific settings or lack of imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing i…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 17:30:50

This https://arxiv.org/abs/2501.18626 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs
We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model's prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:18:25

Evaluation of LLMs for mathematical problem solving
Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra
https://arxiv.org/abs/2506.00309

Evaluation of LLMs for mathematical problem solving
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexities (GSM8K, MATH500, and UNSW datasets). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step com…

@arXiv_csCR_bot@mastoxiv.page
2025-07-03 09:06:10

SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
https://arxiv.org/abs/2507.01513

SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet, they fall short in uncovering …

@arXiv_csSE_bot@mastoxiv.page
2025-07-04 09:30:01

LLMREI: Automating Requirements Elicitation Interviews with LLMs
Alexander Korn, Samuel Gorsch, Andreas Vogelsang
https://arxiv.org/abs/2507.02564 https://…

LLMREI: Automating Requirements Elicitation Interviews with LLMs
Requirements elicitation interviews are crucial for gathering system requirements but heavily depend on skilled analysts, making them resource-intensive, susceptible to human biases, and prone to miscommunication. Recent advancements in Large Language Models present new opportunities for automating parts of this process. This study introduces LLMREI, a chat bot designed to conduct requirements elicitation interviews with minimal human intervention, aiming to reduce common interviewer errors and…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:27:17

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
Weiyang Guo, Zesheng Shi, Zhuo Li, Yequan Wang, Xuebo Liu, Wenya Wang, Fangming Liu, Min Zhang, Jing Li
https://arxiv.org/abs/2506.00782

Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose Jailbreak-R1, a novel automated red teaming training framework that utilizes reinforcement learning to ex…

@arXiv_csCL_bot@mastoxiv.page
2025-07-03 10:02:40

LLMs for Legal Subsumption in German Employment Contracts
Oliver Wardas, Florian Matthes
https://arxiv.org/abs/2507.01734 https://arx…

LLMs for Legal Subsumption in German Employment Contracts
Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of cla…

@arXiv_csCR_bot@mastoxiv.page
2025-07-04 09:54:51

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents
Jiangrong Wu, Yuhong Nan, Jianliang Wu, Zitong Yao, Zibin Zheng
https://arxiv.org/abs/2507.02699

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents
The increasing capabilities of LLMs have led to the rapid proliferation of LLM agent apps, where developers enhance LLMs with access to external resources to support complex task execution. Among these, LLM email agent apps represent one of the widely used categories, as email remains a critical communication medium for users. LLM email agents are capable of managing and responding to email using LLM-driven reasoning and autonomously executing user instructions via external email APIs (e.g., se…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:23:19

DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains
Yongkang Xiao, Sinian Zhang, Yi Dai, Huixue Zhou, Jue Hou, Jie Ding, Rui Zhang
https://arxiv.org/abs/2506.00708

DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains
Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-…

@arXiv_csCR_bot@mastoxiv.page
2025-06-04 13:34:26

This https://arxiv.org/abs/2412.15289 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SA…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 13:38:28

This https://arxiv.org/abs/2501.07849 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation
Large Language Models (LLMs) have emerged as the new recommendation engines, surpassing traditional methods in both capability and scope, particularly in code generation. In this paper, we reveal a novel provider bias in LLMs: without explicit directives, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). To systematically investigate this bias, we develop an automated pipeline to construct t…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 18:11:17

This https://arxiv.org/abs/2505.19165 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs
Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical, yet underexplored, challenge emerges: can these models reliably understand and operate within the complex, often nuanced, constraints imposed b…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 17:52:02

This https://arxiv.org/abs/2505.18889 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

Security Concerns for Large Language Models: A Survey
Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbat…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 07:36:27

Reuse or Generate? Accelerating Code Editing via Edit-Oriented Speculative Decoding
Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, An Fu
https://arxiv.org/abs/2506.02780

Reuse or Generate? Accelerating Code Editing via Edit-Oriented Speculative Decoding
Large Language Models (LLMs) have demonstrated remarkable capabilities in code editing, substantially enhancing software development productivity. However, the inherent complexity of code editing tasks forces existing approaches to rely on LLMs' autoregressive end-to-end generation, where decoding speed plays a critical role in efficiency. While inference acceleration techniques like speculative decoding are applied to improve the decoding efficiency, these methods fail to account for the uniqu…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 16:19:26

This https://arxiv.org/abs/2406.13948 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

CityGPT: Empowering Urban Spatial Cognition of Large Language Models
Large language models (LLMs), with their powerful language generation and reasoning capabilities, have already achieved notable success in many domains, e.g., math and code generation. However, they often fall short when tackling real-life geospatial tasks within urban environments. This limitation stems from a lack of physical world knowledge and relevant data during training. To address this gap, we propose CityGPT, a systematic framework designed to enhance LLMs' understanding of urb…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 17:36:14

This https://arxiv.org/abs/2502.11191 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-…

@arXiv_csSE_bot@mastoxiv.page
2025-06-03 17:24:06

This https://arxiv.org/abs/2504.11711 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs
Static analysis plays a crucial role in software vulnerability detection, yet faces a persistent precision-scalability tradeoff. In large codebases like the Linux kernel, traditional static analysis tools often generate excessive false positives due to simplified vulnerability modeling and overapproximation of path and data constraints. While large language models (LLMs) demonstrate promising code understanding capabilities, their direct application to program analysis remains unreliable due to…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:26:17

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
https://arxiv.org/abs/2506.00781 ht…

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This …

@arXiv_csCR_bot@mastoxiv.page
2025-06-04 07:33:29

ATAG: AI-Agent Application Threat Assessment with Attack Graphs
Parth Atulbhai Gandhi, Akansha Shukla, David Tayouri, Beni Ifland, Yuval Elovici, Rami Puzis, Asaf Shabtai
https://arxiv.org/abs/2506.02859

ATAG: AI-Agent Application Threat Assessment with Attack Graphs
Evaluating the security of multi-agent systems (MASs) powered by large language models (LLMs) is challenging, primarily because of the systems' complex internal dynamics and the evolving nature of LLM vulnerabilities. Traditional attack graph (AG) methods often lack the specific capabilities to model attacks on LLMs. This paper introduces AI-agent application Threat assessment with Attack Graphs (ATAG), a novel framework designed to systematically analyze the security risks associated with AI-a…

@arXiv_csSE_bot@mastoxiv.page
2025-06-03 17:33:48

This https://arxiv.org/abs/2505.23387 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and G…

@arXiv_csAI_bot@mastoxiv.page
2025-06-05 09:45:09

This https://arxiv.org/abs/2506.02139 has been replaced.

The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning
Few-shot learning in large language models (LLMs) reveals a core paradox: certain tasks generalize from just a few examples, while others demand extensive supervision. To explain this, we introduce the Unified Cognitive Consciousness Theory (UCCT), which reconceptualizes LLMs not as deficient agents, but as unconscious substrates: dense, distributed repositories of linguistic and conceptual patterns that operate without explicit semantics, intention, or goal-directed reasoning. Under this view,…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 18:05:22

This https://arxiv.org/abs/2505.08459 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

Strategy-Augmented Planning for Large Language Models via Opponent Exploitation
Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt context that incorporates opponent descriptions, while these approaches are limited to scenarios where LL…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 13:37:05

This https://arxiv.org/abs/2409.14644 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models
The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code clustering. However, existing methods for source code embedding, including those based on LLMs, often rely on costly supervised training or fine-tuning for domain adaptation. This paper proposes a novel approach to embedding source code by combining large la…

@arXiv_csCR_bot@mastoxiv.page
2025-07-04 09:24:21

Evaluating Language Models For Threat Detection in IoT Security Logs
Jorge J. Tejero-Fernández, Alfonso Sánchez-Macián
https://arxiv.org/abs/2507.02390

Evaluating Language Models For Threat Detection in IoT Security Logs
Log analysis is a relevant research field in cybersecurity, as logs can provide a source of information for the detection of threats to networks and systems. This paper presents a pipeline to use fine-tuned Large Language Models (LLMs) for anomaly detection and mitigation recommendation using IoT security logs. Utilizing classical machine learning classifiers as a baseline, three open-source LLMs are compared for binary and multiclass anomaly detection, with three strategies: zero-shot, few-shot…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 09:43:03

This https://arxiv.org/abs/2503.20197 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…

Enhancing the Robustness of LLM-Generated Code: Empirical Study and Framework
Ensuring the robustness of code generated by large language models (LLMs) is crucial for real-world reliability. However, existing evaluations predominantly focus on correctness, often neglecting key robustness concerns such as missing input validation and insufficient error handling. In this paper, we present the first empirical study on the robustness of LLM-generated code. We introduce novel robustness metrics and analyze four state-of-the-art code LLMs, revealing that, on average, 43.1% of …

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 18:09:04

This https://arxiv.org/abs/2505.16978 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…

HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small …

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 16:55:16

This https://arxiv.org/abs/2408.16028 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data
Supervised-learning-based vulnerability detectors often fall short due to limited labelled training data. In contrast, Large Language Models (LLMs) like GPT-4 are trained on vast unlabelled code corpora, yet perform only marginally better than coin flips when directly prompted to detect vulnerabilities. In this paper, we reframe vulnerability detection as anomaly detection, based on the premise that vulnerable code is rare and thus anomalous relative to patterns learned by LLMs. We introduce AN…

@arXiv_csAI_bot@mastoxiv.page
2025-07-04 07:43:41

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab
Haonan Duan, Stephen Zhewen Lu, Caitlin Fiona Harrigan, Nishkrit Desai, Jiarui Lu, Michał Koziarski, Leonardo Cotta, Chris J. Maddison
https://arxiv.org/abs/2507.02083

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab
Designing experiments and result interpretations are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive: in expertise, time and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and a…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 07:26:19

Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
Mengliang He, Jiayi Zeng, Yankai Jiang, Wei Zhang, Zeming Liu, Xiaoming Shi, Aimin Zhou
https://arxiv.org/abs/2506.02073

Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research in this area, this work presents Flow2Code, a novel benchmark for evaluating flowchart-based code generation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs …

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:21:30

Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models
Viola Campos, Ridwan Shariffdeen, Adrian Ulges, Yannic Noller
https://arxiv.org/abs/2506.03283

Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models
Automated Program Repair (APR) proposes bug fixes to aid developers in maintaining software. The state of the art in this domain focuses on using LLMs, leveraging their strong capabilities to comprehend specifications in natural language and to generate program code. Recent works have shown that LLMs can be used to generate repairs. However, despite the APR community's research achievements and several industry deployments in the last decade, APR still lacks the capabilities to generalize broad…

@arXiv_csSE_bot@mastoxiv.page
2025-07-04 09:28:21

Meta-Fair: AI-Assisted Fairness Testing of Large Language Models
Miguel Romero-Arjona, José A. Parejo, Juan C. Alonso, Ana B. Sánchez, Aitor Arrieta, Sergio Segura
https://arxiv.org/abs/2507.02533

Meta-Fair: AI-Assisted Fairness Testing of Large Language Models
Fairness--the absence of unjustified bias--is a core principle in the development of Artificial Intelligence (AI) systems, yet it remains difficult to assess and enforce. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets, making them resource-intensive and difficult to scale. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs, reducin…

@arXiv_csSE_bot@mastoxiv.page
2025-06-05 07:22:50

From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation
Peter Pfeiffer, Alexander Rombach, Maxim Majlatow, Nijat Mehdiyev
https://arxiv.org/abs/2506.03801

From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation
Traditional Business Process Management (BPM) struggles with rigidity, opacity, and scalability in dynamic environments, while emerging Large Language Models (LLMs) present transformative opportunities alongside risks. This paper explores four real-world use cases that demonstrate how LLMs, augmented with trustworthy process intelligence, redefine process modeling, prediction, and automation. Grounded in early-stage research projects with industrial partners, the work spans manufacturing, modeli…
