
2025-08-12 12:07:03
Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen
https://arxiv.org/abs/2508.07452 https://…
ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction
Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, Wenjun Mei
https://arxiv.org/abs/2508.08170
CleanQRL: Lightweight Single-file Implementations of Quantum Reinforcement Learning Algorithms
Georg Kruse, Rodrigo Coelho, Andreas Rosskopf, Robert Wille, Jeanette Miriam Lorenz
https://arxiv.org/abs/2507.07593
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
https://arxiv.org/abs/2506.09942 …
L2Calib: $SE(3)$-Manifold Reinforcement Learning for Robust Extrinsic Calibration with Degenerate Motion Resilience
Baorun Li, Chengrui Zhu, Siyi Du, Bingran Chen, Jie Ren, Wenfei Wang, Yong Liu, Jiajun Lv
https://arxiv.org/abs/2508.06330
MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Gaurav Chaudhary, Wassim Uddin Mondal, Laxmidhar Behera
https://arxiv.org/abs/2506.09574
Adversarial Attacks on Reinforcement Learning-based Medical Questionnaire Systems: Input-level Perturbation Strategies and Medical Constraint Validation
Peizhuo Liu
https://arxiv.org/abs/2508.05677
Reinforcement Learning in Vision: A Survey
Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
https://arxiv.org/abs/2508.08189
M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation
Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, Wentao Zhang
https://arxiv.org/abs/2508.06328
QForce-RL: Quantized FPGA-Optimized Reinforcement Learning Compute Engine
Anushka Jha, Tanushree Dewangan, Mukul Lokhande, Santosh Kumar Vishvakarma
https://arxiv.org/abs/2506.07046
MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains
Dewei Wang, Xinmiao Wang, Xinzhe Liu, Jiyuan Shi, Yingnan Zhao, Chenjia Bai, Xuelong Li
https://arxiv.org/abs/2506.08840
Policy-Based Trajectory Clustering in Offline Reinforcement Learning
Hao Hu, Xinqi Wang, Simon Shaolei Du
https://arxiv.org/abs/2506.09202 https://
Joint Scheduling and Resource Allocation in mmWave IAB Networks Using Deep RL
Maryam Abbasalizadeh, Sashank Narain
https://arxiv.org/abs/2508.07604 https://
CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning
Fatmaelzahraa Ali Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Khalid Al-Jalham, Shidin Balakrishnan
https://arxiv.org/abs/2507.04317
Optimizing Cognitive Networks: Reinforcement Learning Meets Energy Harvesting Over Cascaded Channels
Deemah H. Tashman, Soumaya Cherkaoui, Walaa Hamouda
https://arxiv.org/abs/2507.06981
Multi-Task Reward Learning from Human Ratings
Mingkang Wu, Devin White, Evelyn Rose, Vernon Lawhern, Nicholas R Waytowich, Yongcan Cao
https://arxiv.org/abs/2506.09183
Robust Bandwidth Estimation for Real-Time Communication with Offline Reinforcement Learning
Jian Kai, Tianwei Zhang, Zihan Ling, Yang Cao, Can Shen
https://arxiv.org/abs/2507.05785
Hierarchical Learning-Enhanced MPC for Safe Crowd Navigation with Heterogeneous Constraints
Huajian Liu, Yixuan Feng, Wei Dong, Kunpeng Fan, Chao Wang, Yongzhuo Gao
https://arxiv.org/abs/2506.09859
Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning
Haochen Zhang, Zhong Zheng, Lingzhou Xue
https://arxiv.org/abs/2506.04626
Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
https://arxiv.org/abs/2507.07966
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang
https://arxiv.org/abs/2508.03680
CRScore: Reinforcement Learning with Verifiable Tool and AI Feedback for Code Review
Manav Nitin Kapadnis, Atharva Naik, Carolyn Rose
https://arxiv.org/abs/2506.00296
Reinforcement Learning with Action Chunking
Qiyang Li, Zhiyuan Zhou, Sergey Levine
https://arxiv.org/abs/2507.07969 https://arxiv.org/pdf/2507.07969 https://arxiv.org/html/2507.07969
arXiv:2507.07969v1 Announce Type: new
Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
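The core mechanic is easy to sketch. Below is a minimal, hypothetical Python illustration (stand-in names of my own, not the authors' code) of a TD backup in the chunked action space: the critic scores (state, action-chunk) pairs, and because the whole h-action chunk was executed open-loop, the h-step return needs no off-policy correction.

import numpy as np

# Illustrative sketch of an h-step TD target in a chunked action space.
# All names (q_target_net, sample_chunk) are stand-ins, not from the paper.
h, gamma, act_dim = 4, 0.99, 2

def q_target_net(state, chunk):
    # Stand-in target critic: scores a (state, action-chunk) pair.
    return float(np.tanh(state @ np.ones(3) + chunk.sum()))

def sample_chunk(state, rng):
    # Stand-in policy: predicts h future actions at once, mimicking the
    # temporally consistent behaviors found in the offline data.
    return rng.normal(size=h * act_dim)

def chunked_td_target(state, rewards, next_state, done, rng):
    # Unbiased n-step (here h-step) return: no importance correction is
    # needed because the entire chunk was executed open-loop in the env.
    g = sum(gamma**k * rewards[k] for k in range(h))
    if not done:
        g += gamma**h * q_target_net(next_state, sample_chunk(next_state, rng))
    return g

rng = np.random.default_rng(0)
print(chunked_td_target(np.ones(3), [0.0, 0.0, 0.0, 1.0], np.ones(3), False, rng))

Running this prints a single bootstrapped target; in a real agent the online critic would be regressed toward this value.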
Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation
Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu
https://arxiv.org/abs/2508.05011
Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving
Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv
https://arxiv.org/abs/2506.09800
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
https://arxiv.org/abs/2507.06485
Pearl: Automatic Code Optimization Using Deep Reinforcement Learning
Djamel Rassem Lamouri, Iheb Nassim Aouadj, Smail Kourta, Riyadh Baghdadi
https://arxiv.org/abs/2506.01880
Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
https://arxiv.org/abs/2508.05613
EXPO: Stable Reinforcement Learning with Expressive Policies
Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn
https://arxiv.org/abs/2507.07986 https://arxiv.org/pdf/2507.07986 https://arxiv.org/html/2507.07986
arXiv:2507.07986v1 Announce Type: new
Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.
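The action-selection step described above fits in a few lines. The following is a hypothetical Python sketch under my reading of the abstract (function names, shapes, and the candidate count are assumptions, not the authors' implementation): sample from the expressive base policy, let the Gaussian edit policy perturb those samples, and keep whichever candidate the critic scores highest.

import numpy as np

# Illustrative sketch of an EXPO-style on-the-fly policy (names are stand-ins).
act_dim = 2
rng = np.random.default_rng(0)

def base_policy(obs, n):
    # Stand-in for the expressive base policy (e.g., diffusion or
    # flow-matching), trained with a stable imitation objective.
    return rng.normal(size=(n, act_dim))

def edit_policy(obs, actions):
    # Stand-in for the lightweight Gaussian edit policy: nudges base
    # actions toward a higher-value distribution.
    return actions + 0.1 * rng.normal(size=actions.shape)

def q_value(obs, actions):
    # Stand-in critic: scores each candidate action.
    return -(actions ** 2).sum(axis=1)

def on_the_fly_action(obs, n=8):
    base = base_policy(obs, n)
    edited = edit_policy(obs, base)
    candidates = np.concatenate([base, edited])
    # Pick the value-maximizing action from base and edited candidates;
    # the same selection serves both sampling and the TD backup.
    return candidates[q_value(obs, candidates).argmax()]

print(on_the_fly_action(np.zeros(3)))

Because gradients never flow through the long denoising chain toward the value objective, only the small edit policy has to chase the critic, which is what makes the scheme stable.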
Adaptive k-space Radial Sampling for Cardiac MRI with Reinforcement Learning
Ruru Xu, Ilkay Oksuz
https://arxiv.org/abs/2508.04727 https://arxiv.org/pdf/25…
Learning-based primal-dual optimal control of discrete-time stochastic systems with multiplicative noise
Xiushan Jiang, Weihai Zhang
https://arxiv.org/abs/2506.02613
Improving Long-Range Navigation with Spatially-Enhanced Recurrent Memory via End-to-End Reinforcement Learning
Fan Yang, Per Frivik, David Hoeller, Chen Wang, Cesar Cadena, Marco Hutter
https://arxiv.org/abs/2506.05997
BASIL: Best-Action Symbolic Interpretable Learning for Evolving Compact RL Policies
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar
https://arxiv.org/abs/2506.00328
AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor Towards Scaling Workloads
Taiyi Wang, Eiko Yoneki
https://arxiv.org/abs/2507.23084 https://arx…
RL-U$^2$Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation
Jierui Qu, Jianchun Zhao
https://arxiv.org/abs/2508.02557
Policy Newton methods for Distortion Riskmetrics
Soumen Pachal, Mizhaan Prajit Maniyar, Prashanth L. A.
https://arxiv.org/abs/2508.07249 https://arxiv.org/p…
Posterior-GRPO: Rewarding Reasoning Processes in Code Generation
Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu
https://arxiv.org/abs/2508.05170 https://
Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
https://arxiv.org/abs/2507.05619
Learning to Reason for Factuality
Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
https://arxiv.org/abs/2508.05618
Reinforcement Learning for Discrete-time LQG Mean Field Social Control Problems with Unknown Dynamics
Hanfang Zhang, Bing-Chang Wang, Shuo Chen
https://arxiv.org/abs/2507.01420
Self-correcting Reward Shaping via Language Models for Reinforcement Learning Agents in Games
António Afonso, Iolanda Leite, Alessandro Sestini, Florian Fuchs, Konrad Tollmar, Linus Gisslén
https://arxiv.org/abs/2506.23626
A Reinforcement Learning-Based Telematic Routing Protocol for the Internet of Underwater Things
Mohammadhossein Homaei, Mehran Tarif, Agustin Di Bartolo, Oscar Mogollon Gutierrez, Mar Avila
https://arxiv.org/abs/2506.00133
Computationally efficient Gauss-Newton reinforcement learning for model predictive control
Dean Brandner, Sebastien Gros, Sergio Lucia
https://arxiv.org/abs/2508.02441 https://
SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training
Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, Yuanpei Chen, Hao Dong
https://arxiv.org/abs/2507.04452 …
Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games
Yann Kerzreho (ENS Paris Saclay)
https://arxiv.org/abs/2506.21079 https://
Reward Balancing Revisited: Enhancing Offline Reinforcement Learning for Recommender Systems
Wenzheng Shu, Yanxiang Zeng, Yongxiang Tang, Teng Sha, Ning Luo, Yanhua Cheng, Xialong Liu, Fan Zhou, Peng Jiang
https://arxiv.org/abs/2506.22112
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
https://arxiv.org/abs/2507…
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu
https://arxiv.org/abs/2507.05720
Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning
Fangyu Lei, Jinxiang Meng, Yiming Huang, Tinghong Chen, Yun Zhang, Shizhu He, Jun Zhao, Kang Liu
https://arxiv.org/abs/2506.01710
DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics
Yayu Long, Kewei Chen, Long Jin, Mingsheng Shang
https://arxiv.org/abs/2507.04661
Multi-task Offline Reinforcement Learning for Online Advertising in Recommender Systems
Langming Liu, Wanyu Wang, Chi Zhang, Bo Li, Hongzhi Yin, Xuetao Wei, Wenbo Su, Bo Zheng, Xiangyu Zhao
https://arxiv.org/abs/2506.23090
Data-assimilated model-informed reinforcement learning
Defne E. Ozan, Andrea Nóvoa, Georgios Rigas, Luca Magri
https://arxiv.org/abs/2506.01755 https…
CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, Chunhe Xia
https://arxiv.org/abs/2508.02219
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
https://arxiv.org/abs/2506.20512 http…
Agnostic Reinforcement Learning: Foundations and Algorithms
Gene Li
https://arxiv.org/abs/2506.01884 https://arxiv.org/pdf/2506.01884…
MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Purbesh Mitra, Sennur Ulukus
https://arxiv.org/abs/2507.02851 https://a…
SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL
Jiaheng Hu, Peter Stone, Roberto Martín-Martín
https://arxiv.org/abs/2506.04147
Libra: Assessing and Improving Reward Model by Learning to Think
Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
https://arxiv.org/abs/2507.21645
Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation
Yusuke Tanaka, Alvin Zhu, Quanyou Wang, Dennis Hong
https://arxiv.org/abs/2507.00273
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
https://arx…
Reinforcement Learning with Data Bootstrapping for Dynamic Subgoal Pursuit in Humanoid Robot Navigation
Chengyang Peng, Zhihao Zhang, Shiting Gong, Sankalp Agrawal, Keith A. Redmill, Ayonga Hereid
https://arxiv.org/abs/2506.02206
A Forget-and-Grow Strategy for Deep Reinforcement Learning Scaling in Continuous Control
Zilin Kang, Chenyuan Hu, Yu Luo, Zhecheng Yuan, Ruijie Zheng, Huazhe Xu
https://arxiv.org/abs/2507.02712
Multi-Timescale Hierarchical Reinforcement Learning for Unified Behavior and Control of Autonomous Driving
Guizhe Jin, Zhuoren Li, Bo Leng, Ran Yu, Lu Xiong
https://arxiv.org/abs/2506.23771
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
https://arxiv.org/abs/2507.02834
Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
Shaofei Cai, Zhancun Mu, Haiwen Xia, Bowei Zhang, Anji Liu, Yitao Liang
https://arxiv.org/abs/2507.23698
Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks
Vira Joshi, Zifan Xu, Bo Liu, Peter Stone, Amy Zhang
https://arxiv.org/abs/2507.23172 ht…
DiWA: Diffusion Policy Adaptation with World Models
Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, Abhinav Valada
https://arxiv.org/abs/2508.03645
Learning to Explore: An In-Context Learning Approach for Pure Exploration
Alessio Russo, Ryan Welch, Aldo Pacchiano
https://arxiv.org/abs/2506.01876 https:…
DriveMind: A Dual-VLM based Reinforcement Learning Framework for Autonomous Driving
Dawood Wasif, Terrence J Moore, Chandan K Reddy, Jin-Hee Cho
https://arxiv.org/abs/2506.00819
Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic
Molly Wang
https://arxiv.org/abs/2507.22174 https://arxiv.org/pdf/25…
CueLearner: Bootstrapping and local policy adaptation from relative feedback
Giulio Schiavi, Andrei Cramariuc, Lionel Ott, Roland Siegwart
https://arxiv.org/abs/2507.04730
Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Andrii Balashov
https://arxiv.org/abs/2507.17107 https://