Tootfinder

@arXiv_csAI_bot@mastoxiv.page
2025-09-22 07:30:51

The Distribution Shift Problem in Transportation Networks using Reinforcement Learning and AI
Federico Taschin, Abderrahmane Lazaraq, Ozan K. Tonguz, Inci Ozgunes
https://arxiv.org/abs/2509.15291

The use of Machine Learning (ML) and Artificial Intelligence (AI) in smart transportation networks has increased significantly in the last few years. Among these ML and AI approaches, Reinforcement Learning (RL) has been shown to be a very promising approach by several authors. However, a problem with using Reinforcement Learning in Traffic Signal Control is the reliability of the trained RL agents due to the dynamically changing distribution of the input data with respect to the distribution o…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:26:51

Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
Yujie Zhu, Charles A. Hepburn, Matthew Thorpe, Giovanni Montana
https://arxiv.org/abs/2509.15981

In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We …

@arXiv_csRO_bot@mastoxiv.page
2025-09-22 09:53:01

PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models
Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Tianyu Shao, Guohua Chen, Dominic Kao, Sungeun Hong, Byung-Cheol Min
https://arxiv.org/abs/2509.15607

Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulties in resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for mul…

@arXiv_quantph_bot@mastoxiv.page
2025-09-22 10:16:11

Quantum Reinforcement Learning with Dynamic-Circuit Qubit Reuse and Grover-Based Trajectory Optimization
Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo
https://arxiv.org/abs/2509.16002

A fully quantum reinforcement learning framework is developed that integrates a quantum Markov decision process, dynamic circuit-based qubit reuse, and Grover's algorithm for trajectory optimization. The framework encodes states, actions, rewards, and transitions entirely within the quantum domain, enabling parallel exploration of state-action sequences through superposition and eliminating classical subroutines. Dynamic circuit operations, including mid-circuit measurement and reset, allow reu…

@arXiv_eessSP_bot@mastoxiv.page
2025-08-21 08:58:59

Deep Reinforcement Learning Based Routing for Heterogeneous Multi-Hop Wireless Networks
Brian Kim, Justin H. Kong, Terrence J. Moore, Fikadu T. Dagefu
https://arxiv.org/abs/2508.14884

Routing in multi-hop wireless networks is a complex problem, especially in heterogeneous networks where multiple wireless communication technologies coexist. Reinforcement learning (RL) methods, such as Q-learning, have been introduced for decentralized routing by allowing nodes to make decisions based on local observations. However, Q-learning suffers from scalability issues and poor generalization due to the difficulty in managing the Q-table in large or dynamic network topologies, especially…

@arXiv_eessSY_bot@mastoxiv.page
2025-09-22 09:09:41

Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control
Max Studt, Georg Schildbach
https://arxiv.org/abs/2509.15799 https://arxiv.o…

Achieving safe and coordinated behavior in dynamic, constraint-rich environments remains a major challenge for learning-based control. Pure end-to-end learning often suffers from poor sample efficiency and limited reliability, while model-based methods depend on predefined references and struggle to generalize. We propose a hierarchical framework that combines tactical decision-making via reinforcement learning (RL) with low-level execution through Model Predictive Control (MPC). For the case o…

@arXiv_csHC_bot@mastoxiv.page
2025-08-22 09:32:31

Demystifying Reward Design in Reinforcement Learning for Upper Extremity Interaction: Practical Guidelines for Biomechanical Simulations in HCI
Hannah Selder, Florian Fischer, Per Ola Kristensson, Arthur Fleig
https://arxiv.org/abs/2508.15727

Designing effective reward functions is critical for reinforcement learning-based biomechanical simulations, yet HCI researchers and practitioners often waste (computation) time with unintuitive trial-and-error tuning. This paper demystifies reward function design by systematically analyzing the impact of effort minimization, task completion bonuses, and target proximity incentives on typical HCI tasks such as pointing, tracking, and choice reaction. We show that proximity incentives are essent…

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 09:45:01

EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition
Pengcheng Li, Botao Zhao, Zuheng Kang, Junqing Peng, Xiaoyang Qu, Yayun He, Jianzong Wang
https://arxiv.org/abs/2509.15654

Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs' reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: …

@arXiv_csNI_bot@mastoxiv.page
2025-08-21 09:43:40

Energy-Efficient Routing Algorithm for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach
Parham Soltani, Mehrshad Eskandarpour, Amir Ahmadizad, Hossein Soleimani
https://arxiv.org/abs/2508.14679

Efficient energy management is essential in Wireless Sensor Networks (WSNs) to extend network lifetime and ensure reliable data transmission. This paper presents a novel method using reinforcement learning-based cluster-head selection and a hybrid multi-hop routing algorithm, which leverages Q-learning within a multi-agent system to dynamically adapt transmission paths based on the energy distribution across sensor nodes. Each sensor node is modeled as an autonomous agent that observes local st…

@arXiv_csAI_bot@mastoxiv.page
2025-08-22 09:48:31

Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
https://arxiv.org/abs/2508.15327 https://…

Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited exp…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:25:51

RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation
Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, Zixiao Huang, Mingjie Wei, Yuqing Xie, Ke Yang, Bo Dai, Zhexuan Xu, Xiangyuan Wang, Xu Fu, Zhihao Liu, Kang Chen, Weilin Liu, Gang Liu, Boxun Li, Jianlei Yang, Zhi Yang, Guohao Dai, Yu Wang

Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamicity of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in system flexibility. To maximize flexib…

@arXiv_quantph_bot@mastoxiv.page
2025-08-21 10:04:20

Reinforcement learning entangling operations on spin qubits
Mohammad Abedi, Markus Schmitt
https://arxiv.org/abs/2508.14761 https://arxiv.org/pdf/2508.1476…

High-fidelity control of one- and two-qubit gates past the error correction threshold is an essential ingredient for scalable quantum computing. We present a reinforcement learning (RL) approach to find entangling protocols for semiconductor-based singlet-triplet qubits in a double quantum dot. Despite the presence of realistically modelled experimental constraints, such as various noise contributions and finite rise-time effects, we demonstrate that an RL agent can yield performative protocols…

@arXiv_csRO_bot@mastoxiv.page
2025-08-21 09:16:49

SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning
Yuhang Lin, Yijia Xie, Jiahong Xie, Yuehao Huang, Ruoyu Wang, Jiajun Lv, Yukai Ma, Xingxing Zuo
https://arxiv.org/abs/2508.14120

Generating physically realistic humanoid-object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to pr…

@arXiv_physicssocph_bot@mastoxiv.page
2025-09-22 09:38:01

Hybrid Learning and Optimization methods for solving Capacitated Vehicle Routing Problem
Monit Sharma, Hoong Chuin Lau
https://arxiv.org/abs/2509.15262 https://

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem in logistics. The performance of Augmented Lagrangian Methods (ALM) for solving CVRP depends heavily on well-tuned penalty parameters. In this paper, we propose a hybrid optimization approach that integrates deep reinforcement learning (RL) to automate the selection of penalty parameter values within both classical (RL-C-ALM) and quantum-enhanced (RL-Q-ALM) ALM solvers. Using Soft Actor-Critic, our approach learns penalty …

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:33:21

Automated Cyber Defense with Generalizable Graph-based Reinforcement Learning Agents
Isaiah J. King, Benjamin Bowman, H. Howie Huang
https://arxiv.org/abs/2509.16151 https://

Deep reinforcement learning (RL) is emerging as a viable strategy for automated cyber defense (ACD). The traditional RL approach represents networks as a list of computers in various states of safety or threat. Unfortunately, these models are forced to overfit to specific network topologies, rendering them ineffective when faced with even small environmental perturbations. In this work, we frame ACD as a two-player context-based partially observable Markov decision problem with observations rep…

@arXiv_csAI_bot@mastoxiv.page
2025-08-22 10:00:51

A Dynamical Systems Framework for Reinforcement Learning Safety and Robustness Verification
Ahmed Nasir, Abdelhafid Zenati
https://arxiv.org/abs/2508.15588 https://

The application of reinforcement learning to safety-critical systems is limited by the lack of formal methods for verifying the robustness and safety of learned policies. This paper introduces a novel framework that addresses this gap by analyzing the combination of an RL agent and its environment as a discrete-time autonomous dynamical system. By leveraging tools from dynamical systems theory, specifically the Finite-Time Lyapunov Exponent (FTLE), we identify and visualize Lagrangian Coherent …

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 08:18:59

ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang
https://arxiv.org/abs/2508.13514

Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a re…

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 09:27:11

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
Yiru Zhang, Hang Su, Lichun Fan, Zhenbo Luo, Jian Luan
https://arxiv.org/abs/2509.15612

Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advancement of Large Audio-Language Models (LALMs) has already brought some new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALMs architecture. While Chain of Thoughts (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, whi…

@arXiv_csCV_bot@mastoxiv.page
2025-09-22 14:08:58

Replaced article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[3/4]:
- cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
Kolodiazhnyi, Tarasov, Zhemchuzhnikov, Nikulin, Zisman, Vorontsova, Konushin, Kurenkov, Rukhovich

@arXiv_csLG_bot@mastoxiv.page
2025-08-22 10:19:51

Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space
Kiarash Kazari, Ezzeldin Shereen, György Dán
https://arxiv.org/abs/2508.15764

We address the problem of detecting adversarial attacks against cooperative multi-agent reinforcement learning with continuous action space. We propose a decentralized detector that relies solely on the local observations of the agents and makes use of a statistical characterization of the normal behavior of observable agents. The proposed detector utilizes deep neural networks to approximate the normal behavior of agents as parametric multivariate Gaussian distributions. Based on the predicted…

@arXiv_csAI_bot@mastoxiv.page
2025-08-22 10:02:31

Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning
Ardian Selmonaj, Miroslav Strupl, Oleg Szehr, Alessandro Antonucci
https://arxiv.org/abs/2508.15652

To reliably deploy Multi-Agent Reinforcement Learning (MARL) systems, it is crucial to understand individual agent behaviors within a team. While prior work typically evaluates overall team performance based on explicit reward signals or learned value functions, it is unclear how to infer agent contributions in the absence of any value feedback. In this work, we investigate whether meaningful insights into agent behaviors can be extracted that are consistent with the underlying value functions,…

@arXiv_eessSY_bot@mastoxiv.page
2025-09-22 09:44:11

On-Policy Reinforcement-Learning Control for Optimal Energy Sharing and Temperature Regulation in District Heating Systems
Xinyi Yi, Ioannis Lestas
https://arxiv.org/abs/2509.16083

We address the problem of temperature regulation and optimal energy sharing in district heating systems (DHSs) where the demand and system parameters are unknown. We propose a temperature regulation scheme that employs data-driven on-policy updates that achieve these objectives. In particular, we show that the proposed control scheme converges to an optimal equilibrium point of the system, while also having guaranteed convergence to an optimal LQR control policy, thus providing good transient p…

@arXiv_csRO_bot@mastoxiv.page
2025-08-21 08:38:40

No More Marching: Learning Humanoid Locomotion for Short-Range SE(2) Targets
Pranay Dugar, Mohitvishnu S. Gadde, Jonah Siekmann, Yesh Godse, Aayam Shrestha, Alan Fern
https://arxiv.org/abs/2508.14098

Humanoids operating in real-world workspaces must frequently execute task-driven, short-range movements to SE(2) target poses. To be practical, these transitions must be fast, robust, and energy efficient. While learning-based locomotion has made significant progress, most existing methods optimize for velocity-tracking rather than direct pose reaching, resulting in inefficient, marching-style behavior when applied to short-range tasks. In this work, we develop a reinforcement learning approach…

@arXiv_csIR_bot@mastoxiv.page
2025-08-21 11:52:59

Replaced article(s) found for cs.IR. https://arxiv.org/list/cs.IR/new
[1/1]:
- Reinforcement Learning to Rank Using Coarse-grained Rewards
Yiteng Tu, Zhichao Xu, Tao Yang, Weihang Su, Yujia Zhou, Yiqun Liu, Fen Lin, Qin Liu, Qingyao Ai

@arXiv_csMA_bot@mastoxiv.page
2025-09-19 08:23:01

LEED: A Highly Efficient and Scalable LLM-Empowered Expert Demonstrations Framework for Multi-Agent Reinforcement Learning
Tianyang Duan, Zongyuan Zhang, Songxiao Guo, Dong Huang, Yuanye Zhao, Zheng Lin, Zihan Fang, Dianxin Luan, Heming Cui, Yong Cui
https://arxiv.org/abs/2509.14680

Multi-agent reinforcement learning (MARL) holds substantial promise for intelligent decision-making in complex environments. However, it suffers from a coordination and scalability bottleneck as the number of agents increases. To address these issues, we propose the LLM-empowered expert demonstrations framework for multi-agent reinforcement learning (LEED). LEED consists of two components: a demonstration generation (DG) module and a policy optimization (PO) module. Specifically, the DG module …

@arXiv_csNI_bot@mastoxiv.page
2025-08-21 09:36:40

Adaptive Vision-Based Coverage Optimization in Mobile Wireless Sensor Networks: A Multi-Agent Deep Reinforcement Learning Approach
Parham Soltani, Mehrshad Eskandarpour, Sina Heidari, Farnaz Alizadeh, Hossein Soleimani
https://arxiv.org/abs/2508.14676

Traditional Wireless Sensor Networks (WSNs) typically rely on pre-analysis of the target area, network size, and sensor coverage to determine initial deployment. This often results in significant overlap to ensure continued network operation despite sensor energy depletion. With the emergence of Mobile Wireless Sensor Networks (MWSNs), issues such as sensor failure and static coverage limitations can be more effectively addressed through mobility. This paper proposes a novel deployment strategy…

@arXiv_csAI_bot@mastoxiv.page
2025-08-22 10:06:31

NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha
https://arxiv.org/abs/2508.15693

We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:15:20

PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, Peter Zhiping Zhang
https://arxiv.org/abs/2508.14765

Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence …

@arXiv_qfinTR_bot@mastoxiv.page
2025-09-17 08:35:10

Reinforcement Learning-Based Market Making as a Stochastic Control on Non-Stationary Limit Order Book Dynamics
Rafael Zimmer, Oswaldo Luiz do Valle Costa
https://arxiv.org/abs/2509.12456

Reinforcement Learning has emerged as a promising framework for developing adaptive and data-driven strategies, enabling market makers to optimize decision-making policies based on interactions with the limit order book environment. This paper explores the integration of a reinforcement learning agent in a market-making context, where the underlying market dynamics have been explicitly modeled to capture observed stylized facts of real markets, including clustered order arrival times, non-stati…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:30:21

Empathy-R1: A Chain-of-Empathy and Reinforcement Learning Framework for Long-Form Mental Health Support
Xianrong Yao, Dong She, Chenxu Zhang, Yimeng Zhang, Yueru Sun, Noman Ahmed, Yang Gao, Zhanpeng Jin
https://arxiv.org/abs/2509.14851

Empathy is critical for effective mental health support, especially when addressing Long Counseling Texts (LCTs). However, existing Large Language Models (LLMs) often generate replies that are semantically fluent but lack the structured reasoning necessary for genuine psychological support, particularly in a Chinese context. To bridge this gap, we introduce Empathy-R1, a novel framework that integrates a Chain-of-Empathy (CoE) reasoning process with Reinforcement Learning (RL) to enhance respon…

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:05:10

Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination
Yizhou Liu, Jingwei Wei, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Lihua Zhang
https://arxiv.org/abs/2508.12957

Reinforcement learning (RL) with rule-based rewards has demonstrated strong potential in enhancing the reasoning and generalization capabilities of vision-language models (VLMs) and large language models (LLMs), while reducing computational overhead. However, its application in medical imaging remains underexplored. Existing reinforcement fine-tuning (RFT) approaches in this domain primarily target closed-ended visual question answering (VQA), limiting their applicability to real-world clinical…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:31:31

DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
https://arxiv.org/abs/2509.16117

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new o…

@arXiv_csAI_bot@mastoxiv.page
2025-09-22 08:14:31

CCrepairBench: A High-Fidelity Benchmark and Reinforcement Learning Framework for C++ Compilation Repair
Weixuan Sun, Jucai Zhai, Dengfeng Liu, Xin Zhang, Xiaojun Wu, Qiaobo Hao, AIMgroup, Yang Fang, Jiuyang Tang
https://arxiv.org/abs/2509.15690

The automated repair of C++ compilation errors presents a significant challenge, the resolution of which is critical for developer productivity. Progress in this domain is constrained by two primary factors: the scarcity of large-scale, high-fidelity datasets and the limitations of conventional supervised methods, which often fail to generate semantically correct patches. This paper addresses these gaps by introducing a comprehensive framework with three core contributions. First, we present CCr…

@arXiv_csCL_bot@mastoxiv.page
2025-08-22 10:15:01

End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
https://arxiv.org/abs/2508.15746

Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steer traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construc…

@arXiv_quantph_bot@mastoxiv.page
2025-09-22 10:16:41

AI Methods for Permutation Circuit Synthesis Across Generic Topologies
Victor Villar, Juan Cruz-Benito, Ismael Faro, David Kremer
https://arxiv.org/abs/2509.16020 https://

This paper investigates artificial intelligence (AI) methodologies for the synthesis and transpilation of permutation circuits across generic topologies. Our approach uses Reinforcement Learning (RL) techniques to achieve near-optimal synthesis of permutation circuits up to 25 qubits. Rather than developing specialized models for individual topologies, we train a foundational model on a generic rectangular lattice, and employ masking mechanisms to dynamically select subsets of topologies during…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:22:01

Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds
Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
https://arxiv.org/abs/2509.15915

While reinforcement learning from scratch has shown impressive results in solving sequential decision-making tasks with efficient simulators, real-world applications with expensive interactions require more sample-efficient agents. Foundation models (FMs) are natural candidates to improve sample efficiency as they possess broad knowledge and reasoning capabilities, but it is yet unclear how to effectively integrate them into the reinforcement learning framework. In this paper, we anticipate and…

@arXiv_csMA_bot@mastoxiv.page
2025-09-19 08:19:31

Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity
Yuxiang Mai, Qiyue Yin, Wancheng Ni, Pei Xu, Kaiqi Huang
https://arxiv.org/abs/2509.14276

In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into coopera…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:13:10

Reinforcement Learning-based Adaptive Path Selection for Programmable Networks
José Eduardo Zerna Torres, Marios Avgeris, Chrysa Papagianni, Gergely Pongrácz, István Gódor, Paola Grosso
https://arxiv.org/abs/2508.13806

This work presents a proof-of-concept implementation of a distributed, in-network reinforcement learning (IN-RL) framework for adaptive path selection in programmable networks. By combining Stochastic Learning Automata (SLA) with real-time telemetry data collected via In-Band Network Telemetry (INT), the proposed system enables local, data-driven forwarding decisions that adapt dynamically to congestion conditions. The system is evaluated on a Mininet-based testbed using P4-programmable BMv2 sw…

@arXiv_csRO_bot@mastoxiv.page
2025-09-19 10:10:51

Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Humphrey Munn, Brendan Tidd, Peter Böhm, Marcus Gallagher, David Howard
https://arxiv.org/abs/2509.14816

Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi…

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 09:52:50

Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma
https://arxiv.org/abs/2508.13587

Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that a…

@arXiv_csCV_bot@mastoxiv.page
2025-09-17 10:52:50

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
https://arxiv.org/abs/2509.13031

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual in…

@arXiv_eessSY_bot@mastoxiv.page
2025-09-19 08:43:21

Digital Twin-based Cooperative Autonomous Driving in Smart Intersections: A Multi-Agent Reinforcement Learning Approach
Taoyuan Yu, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi, Walid Saad
https://arxiv.org/abs/2509.15099

Digital Twin-based Cooperative Autonomous Driving in Smart Intersections: A Multi-Agent Reinforcement Learning Approach
Unsignalized intersections pose safety and efficiency challenges due to complex traffic flows and blind spots. In this paper, a digital twin (DT)-based cooperative driving system with roadside unit (RSU)-centric architecture is proposed for enhancing safety and efficiency at unsignalized intersections. The system leverages comprehensive bird-eye-view (BEV) perception to eliminate blind spots and employs a hybrid reinforcement learning (RL) framework combining offline pre-training with online fi…

@arXiv_csNI_bot@mastoxiv.page
2025-08-19 10:09:40

REACH: Reinforcement Learning for Efficient Allocation in Community and Heterogeneous Networks
Zhiwei Yu, Chengze Du, Heng Xu, Ying Zhou, Bo Liu, Jialong Li
https://arxiv.org/abs/2508.12857

REACH: Reinforcement Learning for Efficient Allocation in Community and Heterogeneous Networks
Community GPU platforms are emerging as a cost-effective and democratized alternative to centralized GPU clusters for AI workloads, aggregating idle consumer GPUs from globally distributed and heterogeneous environments. However, their extreme hardware/software diversity, volatile availability, and variable network conditions render traditional schedulers ineffective, leading to suboptimal task completion. In this work, we present REACH (Reinforcement Learning for Efficient Allocation in Commun…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:17:20

Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
Soumyajit Guin, Shalabh Bhatnagar
https://arxiv.org/abs/2508.13963

Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algo…

@arXiv_csMA_bot@mastoxiv.page
2025-08-19 08:01:20

Centralized Permutation Equivariant Policy for Cooperative Multi-Agent Reinforcement Learning
Zhuofan Xu, Benedikt Bollig, Matthias Függer, Thomas Nowak, Vincent Le Dréau
https://arxiv.org/abs/2508.11706

Centralized Permutation Equivariant Policy for Cooperative Multi-Agent Reinforcement Learning
The Centralized Training with Decentralized Execution (CTDE) paradigm has gained significant attention in multi-agent reinforcement learning (MARL) and is the foundation of many recent algorithms. However, decentralized policies operate under partial observability and often yield suboptimal performance compared to centralized policies, while fully centralized approaches typically face scalability challenges as the number of agents increases. We propose Centralized Permutation Equivariant (CPE…

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 11:02:40

Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids
Kaizhe Hu, Haochen Shi, Yao He, Weizhuo Wang, C. Karen Liu, Shuran Song
https://arxiv.org/abs/2508.12252

Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids
Simulation-based reinforcement learning (RL) has significantly advanced humanoid locomotion tasks, yet direct real-world RL from scratch or adapting from pretrained policies remains rare, limiting the full potential of humanoid robots. Real-world learning, despite being crucial for overcoming the sim-to-real gap, faces substantial challenges related to safety, reward design, and learning efficiency. To address these limitations, we propose Robot-Trains-Robot (RTR), a novel framework where a rob…

@arXiv_quantph_bot@mastoxiv.page
2025-09-18 10:08:31

Quantum Reinforcement Learning-Guided Diffusion Model for Image Synthesis via Hybrid Quantum-Classical Generative Model Architectures
Chi-Sheng Chen, En-Jui Kuo
https://arxiv.org/abs/2509.14163

Quantum Reinforcement Learning-Guided Diffusion Model for Image Synthesis via Hybrid Quantum-Classical Generative Model Architectures
Diffusion models typically employ static or heuristic classifier-free guidance (CFG) schedules, which often fail to adapt across timesteps and noise conditions. In this work, we introduce a quantum reinforcement learning (QRL) controller that dynamically adjusts CFG at each denoising step. The controller adopts a hybrid quantum-classical actor-critic architecture: a shallow variational quantum circuit (VQC) with ring entanglement generates policy features, which are mapped by a compact multil…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:19:50

Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback
Wenzhen Yuan, Shengji Tang, Weihao Lin, Jiacheng Ruan, Ganqu Cui, Bo Zhang, Tao Chen, Ting Liu, Yuzhuo Fu, Peng Ye, Lei Bai
https://arxiv.org/abs/2508.12338

Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback
Reinforcement learning (RL) has significantly enhanced the reasoning capabilities of large language models (LLMs), but its reliance on expensive human-labeled data or complex reward models severely limits scalability. While existing self-feedback methods aim to address this problem, they are constrained by the capabilities of a single model, which can lead to overconfidence in incorrect answers, reward hacking, and even training collapse. To this end, we propose Reinforcement Learning from Coev…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:20:20

Learning from Preferences and Mixed Demonstrations in General Settings
Jason R Brown, Carl Henrik Ek, Robert D Mullins
https://arxiv.org/abs/2508.14027

Learning from Preferences and Mixed Demonstrations in General Settings
Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, reward-rational partial orderings over observations, designed …

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 09:50:20

Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance
Yue Fang, Yuxin Guo, Jiaran Gao, Hongxin Ding, Xinke Jiang, Weibin Liao, Yongxin Xu, Yinghao Zhu, Zhibang Yang, Liantao Ma, Junfeng Zhao, Yasha Wang
https://arxiv.org/abs/2508.13579

Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance
Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve merely as frozen prior retrievers while downstream deep learning (DL) models handle prediction, fai…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:07:40

MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning
Maciej Wojtala, Bogusz Stefańczyk, Dominik Bogucki, Łukasz Lepak, Jakub Strykowski, Paweł Wawrzyński
https://arxiv.org/abs/2508.13661

MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning
Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication module that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in …

@arXiv_csRO_bot@mastoxiv.page
2025-08-21 09:15:39

Efficient Environment Design for Multi-Robot Navigation via Continuous Control
Jahid Chowdhury Choton, John Woods, William Hsu
https://arxiv.org/abs/2508.14105

Efficient Environment Design for Multi-Robot Navigation via Continuous Control
Multi-robot navigation and path planning in continuous state and action spaces with uncertain environments remains an open challenge. Deep Reinforcement Learning (RL) is one of the most popular paradigms for solving this task, but its real-world application has been limited due to sample inefficiency and long training periods. Moreover, the existing works using RL for multi-robot navigation lack formal guarantees while designing the environment. In this paper, we introduce an efficient and high…

@arXiv_eessSY_bot@mastoxiv.page
2025-09-19 07:55:51

Near-Real-Time Resource Slicing for QoS Optimization in 5G O-RAN using Deep Reinforcement Learning
Peihao Yan, Jie Lu, Huacheng Zeng, Y. Thomas Hou
https://arxiv.org/abs/2509.14343

Near-Real-Time Resource Slicing for QoS Optimization in 5G O-RAN using Deep Reinforcement Learning
Open-Radio Access Network (O-RAN) has become an important paradigm for 5G and beyond radio access networks. This paper presents an xApp called xSlice for the Near-Real-Time (Near-RT) RAN Intelligent Controller (RIC) of 5G O-RANs. xSlice is an online learning algorithm that adaptively adjusts MAC-layer resource allocation in response to dynamic network states, including time-varying wireless channel conditions, user mobility, traffic fluctuations, and changes in user demand. To address these net…

@arXiv_csMA_bot@mastoxiv.page
2025-09-19 08:24:21

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
Simin Li, Zheng Yuwei, Zihao Mao, Linhao Wang, Ruixiao Xu, Chengdong Ma, Xin Yu, Yuqing Ma, Qi Dou, Xin Wang, Jie Luo, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu
https://arxiv.org/abs/2509.15103

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). We frame VAI as a Hierarchical Adversarial Decentralized Mean Field Control (HAD-MFC), where the upper level involves an NP-hard combinatorial task of selecting the most vulnerable agen…

@arXiv_csCV_bot@mastoxiv.page
2025-10-15 10:44:21

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong
https://arxiv.org/abs/2510.12560

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL an…

@arXiv_csRO_bot@mastoxiv.page
2025-09-22 09:41:01

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios
Yuting Zeng, Zhiwen Zheng, You Zhou, JiaLing Xiao, Yongbin Yu, Manping Fan, Bo Gong, Liyong Ren
https://arxiv.org/abs/2509.15582

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios
This paper proposes a momentum-constrained hybrid heuristic trajectory optimization framework (MHHTOF) tailored for assistive navigation in visually impaired scenarios, integrating trajectory sampling generation, optimization and evaluation with residual-enhanced deep reinforcement learning (DRL). In the first stage, heuristic trajectory sampling cluster (HTSC) is generated in the Frenet coordinate system using third-order interpolation with fifth-order polynomials and momentum-constrained traj…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:53:40

Reinforcement Learning with Rubric Anchors
Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao
https://arxiv.org/abs/2508.12790

Reinforcement Learning with Rubric Anchors
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-end…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:16:20

Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
SM Mazharul Islam, Manfred Huber
https://arxiv.org/abs/2508.13922

Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
A policy in deep reinforcement learning (RL), either deterministic or stochastic, is commonly parameterized as a Gaussian distribution alone, limiting the learned behavior to be unimodal. However, the nature of many practical decision-making problems favors a multimodal policy that facilitates robust exploration of the environment and thus helps address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerb…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:15:40

Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
Thanh Nguyen, Chang D. Yoo
https://arxiv.org/abs/2508.13904

Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoisin…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:00:30

RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards
Rohit Krishnan, Jon Evans
https://arxiv.org/abs/2508.12165 https://arxiv.org/pdf/2508.12…

RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards
This paper introduces RLNVR (Reinforcement Learning from Non-Verified Rewards), a framework for training language models using noisy, real-world feedback signals without requiring explicit human verification. Traditional RLHF requires expensive, verified reward signals that are impractical in many real-world domains. RLNVR addresses this challenge through baseline normalization and semantic similarity-based reward transfer. We demonstrate RLNVR through Walter, a prototype system that optimizes …

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 08:35:20

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Kelin Yu, Sheng Zhang, Harshit Soora, Furong Huang, Heng Huang, Pratap Tokekar, Ruohan Gao
https://arxiv.org/abs/2508.11049

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for tra…

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 10:12:10

ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang
https://arxiv.org/abs/2508.14040

ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks, yet remains challenging due to environmental in…

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 11:11:20

SIGN: Safety-Aware Image-Goal Navigation for Autonomous Drones via Reinforcement Learning
Zichen Yan, Rui Huang, Lei He, Shao Guo, Lin Zhao
https://arxiv.org/abs/2508.12394 http…

SIGN: Safety-Aware Image-Goal Navigation for Autonomous Drones via Reinforcement Learning
Image-goal navigation (ImageNav) tasks a robot with autonomously exploring an unknown environment and reaching a location that visually matches a given target image. While prior works primarily study ImageNav for ground robots, enabling this capability for autonomous drones is substantially more challenging due to their need for high-frequency feedback control and global localization for stable flight. In this paper, we propose a novel sim-to-real framework that leverages visual reinforcement l…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 11:10:20

OPTIC-ER: A Reinforcement Learning Framework for Real-Time Emergency Response and Equitable Resource Allocation in Underserved African Communities
Mary Tonwe
https://arxiv.org/abs/2508.12943

OPTIC-ER: A Reinforcement Learning Framework for Real-Time Emergency Response and Equitable Resource Allocation in Underserved African Communities
Public service systems in many African regions suffer from delayed emergency response and spatial inequity, causing avoidable suffering. This paper introduces OPTIC-ER, a reinforcement learning (RL) framework for real-time, adaptive, and equitable emergency response. OPTIC-ER uses an attention-guided actor-critic architecture to manage the complexity of dispatch environments. Its key innovations are a Context-Rich State Vector, encoding action sub-optimality, and a Precision Reward Function, wh…

@arXiv_csAI_bot@mastoxiv.page
2025-09-18 08:19:31

$Agent^2$: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
Yuan Wei, Xiaohan Shan, Ran Miao, Jianmin Li
https://arxiv.org/abs/2509.13368 https://…

$Agent^2$: An Agent-Generates-Agent Framework for Reinforcement Learning Automation
Reinforcement learning agent development traditionally requires extensive expertise and lengthy iterations, often resulting in high failure rates and limited accessibility. This paper introduces $Agent^2$, a novel agent-generates-agent framework that achieves fully automated RL agent design through intelligent LLM-driven generation. The system autonomously transforms natural language task descriptions and environment code into comprehensive, high-performance reinforcement learning solutions wit…

@arXiv_csLG_bot@mastoxiv.page
2025-09-18 10:14:11

TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao
https://arxiv.org/abs/2509.14172

TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that proposes a tree-s…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:07:51

Reinforcement Learning for Autonomous Point-to-Point UAV Navigation
Salim Oyinlola, Nitesh Subedi, Soumik Sarkar
https://arxiv.org/abs/2509.13943 https://a…

Reinforcement Learning for Autonomous Point-to-Point UAV Navigation
Unmanned Aerial Vehicles (UAVs) are increasingly used in automated inspection, delivery, and navigation tasks that require reliable autonomy. This project develops a reinforcement learning (RL) approach to enable a single UAV to autonomously navigate between predefined points without manual intervention. The drone learns navigation policies through trial-and-error interaction, using a custom reward function that encourages goal-reaching efficiency while penalizing collisions and unsafe behavior…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:17:00

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
https://arxiv.org/abs/2508.14853

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or dir…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:07:50

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward
Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang
https://arxiv.org/abs/2508.11143

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward
Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To mak…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:22:21

Improving Monte Carlo Tree Search for Symbolic Regression
Zhengyao Huang, Daniel Zhengyu Huang, Tiannan Xiao, Dina Ma, Zhenyu Ming, Hao Shi, Yuanhui Wen
https://arxiv.org/abs/2509.15929

Improving Monte Carlo Tree Search for Symbolic Regression
Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, posing a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods for improving search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for sym…

@arXiv_csAI_bot@mastoxiv.page
2025-09-19 09:13:31

RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning
Song Xu, Yilun Liu, Minggui He, Mingchen Dai, Ziang Chen, Chunguang Zhao, Jingzhou Du, Shimin Tao, Weibin Meng, Shenglin Zhang, Yongqian Sun, Boxing Chen, Daimeng Wei
https://arxiv.org/abs/2509.14693…

RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning
Logs constitute a form of evidence signaling the operational status of software systems. Automated log anomaly detection is crucial for ensuring the reliability of modern software systems. However, existing approaches face significant limitations: traditional deep learning models lack interpretability and generalization, while methods leveraging Large Language Models are often hindered by unreliability and factual inaccuracies. To address these issues, we propose RationAnomaly, a novel framewor…

@arXiv_csRO_bot@mastoxiv.page
2025-08-15 08:50:22

Few-shot Vision-based Human Activity Recognition with MLLM-based Visual Reinforcement Learning
Wenqi Zheng, Yutaka Arakawa
https://arxiv.org/abs/2508.10371

Few-shot Vision-based Human Activity Recognition with MLLM-based Visual Reinforcement Learning
Reinforcement learning in large reasoning models enables learning from feedback on their outputs, making it particularly valuable in scenarios where fine-tuning data is limited. However, its application in multi-modal human activity recognition (HAR) domains remains largely underexplored. Our work extends reinforcement learning to the human activity recognition domain with multimodal large language models. By incorporating visual reinforcement learning in the training process, the model's gener…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:17:30

Compute-Optimal Scaling for Value-Based Deep RL
Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar
https://arxiv.org/abs/2508.14881

Compute-Optimal Scaling for Value-Based Deep RL
As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two pr…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:08:50

Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks
Saman Yazdannik, Morteza Tayefi, Shamim Sanisales
https://arxiv.org/abs/2508.14536 https://arxiv.or…

Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks
The performance of Deep Q-Networks (DQN) is critically dependent on the ability of its underlying neural network to accurately approximate the action-value function. Standard function approximators, such as multi-layer perceptrons, may struggle to efficiently represent the complex value landscapes inherent in many reinforcement learning problems. This paper introduces a novel architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework to create…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:31:40

The Yokai Learning Environment: Tracking Beliefs Over Space and Time
Constantin Ruhdorfer, Matteo Bortoletto, Andreas Bulling
https://arxiv.org/abs/2508.12480

The Yokai Learning Environment: Tracking Beliefs Over Space and Time
Developing collaborative AI hinges on Theory of Mind (ToM) - the ability to reason about the beliefs of others to build and maintain common ground. Existing ToM benchmarks, however, are restricted to passive observer settings or lack an assessment of how agents establish and maintain common ground over time. To address these gaps, we introduce the Yokai Learning Environment (YLE) - a multi-agent reinforcement learning (RL) environment based on the cooperative card game Yokai. In the YLE, agents…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:10:41

SEG-Parking: Towards Safe, Efficient, and Generalizable Autonomous Parking via End-to-End Offline Reinforcement Learning
Zewei Yang, Zengqi Peng, Jun Ma
https://arxiv.org/abs/2509.13956

SEG-Parking: Towards Safe, Efficient, and Generalizable Autonomous Parking via End-to-End Offline Reinforcement Learning
Autonomous parking is a critical component for achieving safe and efficient urban autonomous driving. However, unstructured environments and dynamic interactions pose significant challenges to autonomous parking tasks. To address this problem, we propose SEG-Parking, a novel end-to-end offline reinforcement learning (RL) framework to achieve interaction-aware autonomous parking. Notably, a specialized parking dataset is constructed for parking scenarios, which include those without interference…

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 10:22:11

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Zhiyu Mou, Yiqin Lv, Miao Xu, Cheems Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng
https://arxiv.org/abs/2509.15927

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Auto-bidding is an essential tool for advertisers to enhance their advertising performance. Recent progress has shown that AI-Generated Bidding (AIGB), which formulates the auto-bidding as a trajectory generation task and trains a conditional diffusion-based planner on offline data, achieves superior and stable performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still encounter a performance bottleneck due to their negle…

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 11:29:30

Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Yuying Zhang, Joni Pajarinen
https://arxiv.org/abs/2508.13151

Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Mobile manipulation in dynamic environments is challenging due to movable obstacles blocking the robot's path. Traditional methods, which treat navigation and manipulation as separate tasks, often fail in such 'manipulate-to-navigate' scenarios, as obstacles must be removed before navigation. In these cases, active interaction with the environment is required to clear obstacles while ensuring sufficient space for movement. To address the manipulate-to-navigate problem, we propose a reinforcemen…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:15:10

HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents
Thomas Carta, Clément Romac, Loris Gaven, Pierre-Yves Oudeyer, Olivier Sigaud, Sylvain Lamprier
https://arxiv.org/abs/2508.14751

HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents
Open-ended AI agents need to be able to learn efficiently goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond sampling efficiently their own goals, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To address this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on its compos…

@arXiv_csLG_bot@mastoxiv.page
2025-08-21 10:13:50

AFABench: A Generic Framework for Benchmarking Active Feature Acquisition
Valter Schütz, Han Wu, Reza Rezvan, Linus Aronsson, Morteza Haghir Chehreghani
https://arxiv.org/abs/2508.14734

AFABench: A Generic Framework for Benchmarking Active Feature Acquisition
In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning appro…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 11:08:40

Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards
Ting Yang, Li Chen, Huimin Wang
https://arxiv.org/abs/2508.12935

Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards
Emotional Support Conversation (ESC) systems aim to alleviate users' emotional difficulties and provide long-term, systematic support for emotional well-being. However, most large language model (LLM)-based ESC systems rely on predefined strategies, which limits their effectiveness in complex, real-life scenarios. To enable flexible responses to diverse emotional problem scenarios, this paper introduces a novel end-to-end framework (RLFF-ESC) that directly learns enduring emotionally supportive…

@arXiv_csRO_bot@mastoxiv.page
2025-09-16 11:25:26

Quantum deep reinforcement learning for humanoid robot navigation task
Romerik Lokossou, Birhanu Shimelis Girma, Ozan K. Tonguz, Ahmed Biyabani
https://arxiv.org/abs/2509.11388 …

Quantum deep reinforcement learning for humanoid robot navigation task
Classical reinforcement learning (RL) methods often struggle in complex, high-dimensional environments because of their extensive parameter requirements and challenges posed by stochastic, non-deterministic settings. This study introduces quantum deep reinforcement learning (QDRL) to train humanoid agents efficiently. While previous quantum RL models focused on smaller environments, such as wheeled robots and robotic arms, our work pioneers the application of QDRL to humanoid robotics, specific…

@arXiv_csLG_bot@mastoxiv.page
2025-09-18 10:15:41

A Universal Banach-Bregman Framework for Stochastic Iterations: Unifying Stochastic Mirror Descent, Learning and LLM Training
Johnny R. Zhang (Independent Researcher), Xiaomei Mi (University of Manchester), Gaoyuan Du (Amazon), Qianyi Sun (Microsoft), Shiqi Wang (Meta), Jiaxuan Li (Amazon), Wenhua Zhou (Independent Researcher)
https://arx…

A Universal Banach-Bregman Framework for Stochastic Iterations: Unifying Stochastic Mirror Descent, Learning and LLM Training
Stochastic optimization powers the scalability of modern artificial intelligence, spanning machine learning, deep learning, reinforcement learning, and large language model training. Yet, existing theory remains largely confined to Hilbert spaces, relying on inner-product frameworks and orthogonality. This paradigm fails to capture non-Euclidean settings, such as mirror descent on simplices, Bregman proximal methods for sparse learning, natural gradient descent in information geometry, or Kullb…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:09:00

Multi-Group Equivariant Augmentation for Reinforcement Learning in Robot Manipulation
Hongbin Lin, Juan Rojas, Kwok Wai Samuel Au
https://arxiv.org/abs/2508.11204

Multi-Group Equivariant Augmentation for Reinforcement Learning in Robot Manipulation
Sampling efficiency is critical for deploying visuomotor learning in real-world robotic manipulation. While task symmetry has emerged as a promising inductive bias to improve efficiency, most prior work is limited to isometric symmetries -- applying the same group transformation to all task objects across all timesteps. In this work, we explore non-isometric symmetries, applying multiple independent group transformations across spatial and temporal dimensions to relax these constraints. We intr…

@arXiv_csAI_bot@mastoxiv.page
2025-10-15 10:22:21

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
https://arxiv…

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning …

@arXiv_csLG_bot@mastoxiv.page
2025-09-18 10:11:21

Online Bayesian Risk-Averse Reinforcement Learning
Yuhao Wang, Enlu Zhou
https://arxiv.org/abs/2509.14077 https://arxiv.org/pdf/2509.14077

Online Bayesian Risk-Averse Reinforcement Learning
In this paper, we study the Bayesian risk-averse formulation in reinforcement learning (RL). To address the epistemic uncertainty due to a lack of data, we adopt the Bayesian Risk Markov Decision Process (BRMDP) to account for the parameter uncertainty of the unknown underlying model. We derive the asymptotic normality that characterizes the difference between the Bayesian risk value function and the original value function under the true unknown distribution. The results indicate that the Baye…

@arXiv_csLG_bot@mastoxiv.page
2025-08-18 09:41:10

Fusing Rewards and Preferences in Reinforcement Learning
Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser
https://arxiv.org/abs/2508.11363 https://…

Fusing Rewards and Preferences in Reinforcement Learning
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human-annotators (at state-level or trajectory-level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we p…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:08:41

SHaRe-RL: Structured, Interactive Reinforcement Learning for Contact-Rich Industrial Assembly Tasks
Jannick Stranghöner, Philipp Hartmann, Marco Braun, Sebastian Wrede, Klaus Neumann
https://arxiv.org/abs/2509.13949

SHaRe-RL: Structured, Interactive Reinforcement Learning for Contact-Rich Industrial Assembly Tasks
High-mix low-volume (HMLV) industrial assembly, common in small and medium-sized enterprises (SMEs), requires the same precision, safety, and reliability as high-volume automation while remaining flexible to product variation and environmental uncertainty. Current robotic systems struggle to meet these demands. Manual programming is brittle and costly to adapt, while learning-based methods suffer from poor sample efficiency and unsafe exploration in contact-rich tasks. To address this, we prese…

@arXiv_csLG_bot@mastoxiv.page
2025-08-18 09:43:10

On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
https://arxiv.org/abs/2508.11408

On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framewor…

@arXiv_csRO_bot@mastoxiv.page
2025-09-17 10:31:10

GRATE: a Graph transformer-based deep Reinforcement learning Approach for Time-efficient autonomous robot Exploration
Haozhan Ni, Jingsong Liang, Chenyu He, Yuhong Cao, Guillaume Sartoretti
https://arxiv.org/abs/2509.12863

GRATE: a Graph transformer-based deep Reinforcement learning Approach for Time-efficient autonomous robot Exploration
Autonomous robot exploration (ARE) is the process of a robot autonomously navigating and mapping an unknown environment. Recent Reinforcement Learning (RL)-based approaches typically formulate ARE as a sequential decision-making problem defined on a collision-free informative graph. However, these methods often demonstrate limited reasoning ability over graph-structured data. Moreover, due to the insufficient consideration of robot motion, the resulting RL policies are generally optimized to mi…

@arXiv_csLG_bot@mastoxiv.page
2025-08-18 09:39:20

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu, TingTing Gao
https://arxiv.org/abs/2508.11356

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, inclu…

@arXiv_csRO_bot@mastoxiv.page
2025-09-17 10:22:40

Integrating Trajectory Optimization and Reinforcement Learning for Quadrupedal Jumping with Terrain-Adaptive Landing
Renjie Wang, Shangke Lyu, Xin Lang, Wei Xiao, Donglin Wang
https://arxiv.org/abs/2509.12776

Integrating Trajectory Optimization and Reinforcement Learning for Quadrupedal Jumping with Terrain-Adaptive Landing
Jumping constitutes an essential component of quadruped robots' locomotion capabilities, which includes dynamic take-off and adaptive landing. Existing quadrupedal jumping studies mainly focused on the stance and flight phase by assuming a flat landing ground, which is impractical in many real world cases. This work proposes a safe landing framework that achieves adaptive landing on rough terrains by combining Trajectory Optimization (TO) and Reinforcement Learning (RL) together. The RL agent l…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 13:37:08

How Reinforcement Learning After Next-Token Prediction Facilitates Learning
Nikolaos Tsilivis, Eran Malach, Karen Ullrich, Julia Kempe
https://arxiv.org/abs/2510.11495

How Reinforcement Learning After Next-Token Prediction Facilitates Learning
Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next-token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of s…

@arXiv_csLG_bot@mastoxiv.page
2025-08-15 10:12:02

Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning
Davide Guidobene, Lorenzo Benedetti, Diego Arapovic
https://arxiv.org/abs/2508.10608

Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning
Multi-Objective Reinforcement Learning (MORL) is a generalization of traditional Reinforcement Learning (RL) that aims to optimize multiple, often conflicting objectives simultaneously rather than focusing on a single reward. This approach is crucial in complex decision-making scenarios where agents must balance trade-offs between various goals, such as maximizing performance while minimizing costs. We consider the problem of MORL where the objectives are combined using a non-linear scalarizati…

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 08:21:22

GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang
https://arxiv.org/abs/2510.11769

GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a…

@arXiv_csLG_bot@mastoxiv.page
2025-08-12 12:07:03

Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen
https://arxiv.org/abs/2508.07452 https://…

Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
Integrated, end-to-end learning of representations and policies remains a cornerstone of deep reinforcement learning (RL). However, to address the challenge of learning effective features from a sparse reward signal, recent trends have shifted towards adding complex auxiliary objectives or fully decoupling the two processes, often at the cost of increased design complexity. This work proposes an alternative to both decoupling and naive end-to-end learning, arguing that performance can be signif…

@arXiv_csLG_bot@mastoxiv.page
2025-09-12 09:53:39

Quantum Machine Learning, Quantitative Trading, Reinforcement Learning, Deep Learning
Jun-Hao Chen, Yu-Chien Huang, Yun-Cheng Tsai, Samuel Yen-Chi Chen
https://arxiv.org/abs/2509.09176

Quantum Machine Learning, Quantitative Trading, Reinforcement Learning, Deep Learning
The convergence of quantum-inspired neural networks and deep reinforcement learning offers a promising avenue for financial trading. We implemented a trading agent for USD/TWD by integrating Quantum Long Short-Term Memory (QLSTM) for short-term trend prediction with Quantum Asynchronous Advantage Actor-Critic (QA3C), a quantum-enhanced variant of the classical A3C. Trained on data from 2000-01-01 to 2025-04-30 (80% training, 20% testing), the long-only agent achieves 11.87% return over aroun…

@arXiv_csLG_bot@mastoxiv.page
2025-09-16 12:45:07

$K$-Level Policy Gradients for Multi-Agent Reinforcement Learning
Aryaman Reddi, Gabriele Tiboni, Jan Peters, Carlo D'Eramo
https://arxiv.org/abs/2509.12117

$K$-Level Policy Gradients for Multi-Agent Reinforcement Learning
Actor-critic algorithms for deep multi-agent reinforcement learning (MARL) typically employ a policy update that responds to the current strategies of other agents. While being straightforward, this approach does not account for the updates of other agents at the same update step, resulting in miscoordination. In this paper, we introduce the $K$-Level Policy Gradient (KPG), a method that recursively updates each agent against the updated policies of other agents, speeding up the discovery of ef…

@arXiv_csLG_bot@mastoxiv.page
2025-09-16 12:40:37

Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids
Filippo Lazzati, Alberto Maria Metelli
https://arxiv.org/abs/2509.12010

Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids
We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. Inverse Reinforcement Learning (IRL) offers a promising solution by seeking to recover the expert's underlying reward function, which, if used for planning in the new settings, would reproduce the desired behavior. However, IRL is inherently ill-posed: multiple reward functions, forming the so-called feasible set, can explain the same observed beha…

@arXiv_csLG_bot@mastoxiv.page
2025-08-20 10:12:20

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
https://arxiv.org/abs/2508.13755

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth (the hardest problem a model can sample) and Breadth (the number of instances consumed in a single iteration). We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weightin…
