
2025-08-25 09:59:30
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa
https://arxiv.org/abs/2508.16546
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianche…
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
https://arxiv.org/abs/2506.20512 http…
From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models
Zeqiang Zhang, Ruxin Chen
https://arxiv.org/abs/2507.18229
Testifying before Congress, Kari Lake said reform at USAGM "was not possible" but the CEOs of RFE/RL, RFA and MBN said she had not met with them even once (Scott Nover/Washington Post)
https://www.washingtonpost.com/style/media/2025/06…
Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Andrii Balashov
https://arxiv.org/abs/2507.17107 https://
Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities
Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, Irwin King, Fakhri Karray, Philip S. Yu
https://arxiv.org/abs/2506.18019
RL-Driven Semantic Compression Model Selection and Resource Allocation in Semantic Communication Systems
Xinyi Lin, Peizheng Li, Adnan Aijaz
https://arxiv.org/abs/2506.18660
Hybrid quantum-classical algorithm for near-optimal planning in POMDPs
Gilberto Cunha, Alexandra Ramôa, André Sequeira, Michael de Oliveira, Luís Barbosa
https://arxiv.org/abs/2507.18606 …
Hierarchical Reinforcement Learning and Value Optimization for Challenging Quadruped Locomotion
Jeremiah Coholich, Muhammad Ali Murtaza, Seth Hutchinson, Zsolt Kira
https://arxiv.org/abs/2506.20036
Breaking Barriers in Software Testing: The Power of AI-Driven Automation
Saba Naqvi, Mohammad Baqar
https://arxiv.org/abs/2508.16025 https://arxiv.org/pdf/…
FCPO: Federated Continual Policy Optimization for Real-Time High-Throughput Edge Video Analytics
Lucas Liebe, Thanh-Tung Nguyen, Dongman Lee
https://arxiv.org/abs/2507.18047 htt…
Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning
Yue Pei, Hongming Zhang, Chao Gao, Martin Müller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin
https://arxiv.org/abs/2508.16420
Partially Observable Residual Reinforcement Learning for PV-Inverter-Based Voltage Control in Distribution Grids
Sarra Bouchkati, Ramil Sabirov, Steffen Kortmann, Andreas Ulbig
https://arxiv.org/abs/2506.19353
Replaced article(s) found for q-bio.QM. https://arxiv.org/list/q-bio.QM/new
[1/1]:
- A PBN-RL-XAI Framework for Discovering a "Hit-and-Run" Therapeutic Strategy in Melanoma
Zhonglin Liu
Replaced article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[5/5]:
- Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards
Li, Zhou, Kazemi, Sun, Ghaddar, Alomrani, Ma, Luo, Li, Wen, Hao, Coates, Zhang
Integrated Noise and Safety Management in UAM via A Unified Reinforcement Learning Framework
Surya Murthy, Zhenyu Gao, John-Paul Clarke, Ufuk Topcu
https://arxiv.org/abs/2508.16440
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Yaqi Li, Peng Chen, Mingyang Han, Bu Pi, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song
https://arxiv.org/abs/2508.18032
HARLF: Hierarchical Reinforcement Learning and Lightweight LLM-Driven Sentiment Integration for Financial Portfolio Optimization
Benjamin Coriat, Eric Benhamou
https://arxiv.org/abs/2507.18560
PPAAS: PVT and Pareto Aware Analog Sizing via Goal-conditioned Reinforcement Learning
Seunggeun Kim, Ziyi Wang, Sungyoung Lee, Youngmin Oh, Hanqing Zhu, Doyun Kim, David Z. Pan
https://arxiv.org/abs/2507.17003
xDiff: Online Diffusion Model for Collaborative Inter-Cell Interference Management in 5G O-RAN
Peihao Yan, Huacheng Zeng, Y. Thomas Hou
https://arxiv.org/abs/2508.15843 https://…
On Zero-Shot Reinforcement Learning
Scott Jeen
https://arxiv.org/abs/2508.16496 https://arxiv.org/pdf/2508.16496 …
Optimizing Token Choice for Code Watermarking: A RL Approach
Zhimeng Guo, Huaisheng Zhu, Siyuan Xu, Hangfan Zhang, Teng Xiao, Minhao Cheng
https://arxiv.org/abs/2508.11925 https…
Integrating Symbolic RL Planning into a BDI-based Autonomous UAV Framework: System Integration and SIL Validation
Sangwoo Jeon, Juchul Shin, YeonJe Cho, Gyeong-Tae Kim, Seongwoo Kim
https://arxiv.org/abs/2508.11890
DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training
Zhixin Wang, Tianyi Zhou, Liming Liu, Ao Li, Jiarui Hu, Dian Yang, Jinlong Hou, Siyuan Feng, Yuan Cheng, Yuan Qi
https://arxiv.org/abs/2507.13833
QForce-RL: Quantized FPGA-Optimized Reinforcement Learning Compute Engine
Anushka Jha, Tanushree Dewangan, Mukul Lokhande, Santosh Kumar Vishvakarma
https://arxiv.org/abs/2506.07046
RL-Guided MPC for Autonomous Greenhouse Control
Salim Msaad, Murray Harraway, Robert D. McAllister
https://arxiv.org/abs/2506.13278 https://
Pragmatic Policy Development via Interpretable Behavior Cloning
Anton Matsson, Yaochen Rao, Heather J. Litman, Fredrik D. Johansson
https://arxiv.org/abs/2507.17056
M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation
Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, Wentao Zhang
https://arxiv.org/abs/2508.06328
A US federal judge orders the USAGM to immediately disburse RFE/RL's May funding of ~$12M, following a similar order last month for its April funding (Radio Free Europe/Radio Liberty)
https://www.rferl.org/a/rfe-rl-order-lamberth-court-funding-/33429…
Can you see how I learn? Human observers' inferences about Reinforcement Learning agents' learning processes
Bernhard Hilpert, Muhan Hou, Kim Baraka, Joost Broekens
https://arxiv.org/abs/2506.13583
CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning
Fatmaelzahraa Ali Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Khalid Al-Jalham, Shidin Balakrishnan
https://arxiv.org/abs/2507.04317
Reinforcement learning entangling operations on spin qubits
Mohammad Abedi, Markus Schmitt
https://arxiv.org/abs/2508.14761 https://arxiv.org/pdf/2508.1476…
Reinforcement Learning in hyperbolic space for multi-step reasoning
Tao Xu, Dung-Yang Lee, Momiao Xiong
https://arxiv.org/abs/2507.16864 https://
Now out in #TMLR:
🍇 GRAPES: Learning to Sample Graphs for Scalable Graph Neural Networks 🍇
There's lots of work on sampling subgraphs for GNNs, but relatively little on making this sampling process _adaptive_. That is, learning to select the data from the graph that is relevant for your task.
We introduce an RL-based and a GFlowNet-based sampler and show that the approach perf…
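As a rough illustration of what an RL-based adaptive sampler looks like, here is a minimal policy-gradient neighbor sampler in PyTorch. All names, dimensions, and the reward signal are hypothetical stand-ins, not the GRAPES implementation.

```python
# Minimal sketch of a learned, adaptive neighbor sampler in the spirit of the
# RL variant described above. Reward is assumed to be a proxy for downstream
# GNN quality (e.g. negative task loss on the sampled subgraph).
import torch
import torch.nn as nn

class NeighborScorer(nn.Module):
    """Scores candidate neighbors from their features; higher score means the
    node is more likely to be kept in the sampled subgraph."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (num_candidates,)

def sample_neighbors(scorer, feats, k):
    """Sample k neighbors without replacement from a learned categorical
    distribution; also return the log-probability of the chosen set."""
    probs = torch.softmax(scorer(feats), dim=0)
    idx = torch.multinomial(probs, num_samples=min(k, feats.size(0)),
                            replacement=False)
    log_prob = torch.log(probs[idx] + 1e-12).sum()
    return idx, log_prob

# REINFORCE-style update on one sampling decision (shapes are illustrative).
feats = torch.randn(100, 16)          # candidate neighbor features
scorer = NeighborScorer(in_dim=16)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

idx, log_prob = sample_neighbors(scorer, feats, k=10)
reward = -torch.rand(1).item()        # stand-in for -GNN_loss(subgraph)
loss = -reward * log_prob             # policy-gradient objective
opt.zero_grad(); loss.backward(); opt.step()
```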
LLaPipe: LLM-Guided Reinforcement Learning for Automated Data Preparation Pipeline Construction
Jing Chang, Chang Liu, Jinbin Huang, Rui Mao, Jianbin Qin
https://arxiv.org/abs/2507.13712
Robots and Children that Learn Together : Improving Knowledge Retention by Teaching Peer-Like Interactive Robots
Imene Tarakli, Samuele Vinanzi, Richard Moore, Alessandro Di Nuovo
https://arxiv.org/abs/2506.18365
Compute-Optimal Scaling for Value-Based Deep RL
Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar
https://arxiv.org/abs/2508.14881
Reasoning with Exploration: An Entropy Perspective
Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
https://arxiv.org/abs/2506.14758
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma
https://arxiv.org/abs/2508.13587
A Survey of Reinforcement Learning for Software Engineering
Dong Wang, Hanmo You, Lingwei Zhu, Kaiwei Lin, Zheng Chen, Chen Yang, Junji Yu, Zan Wang, Junjie Chen
https://arxiv.org/abs/2507.12483
Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis
Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado
https://arxiv.org/abs/2507.16641
SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling
Jinghui Wang, Shaojie Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Xiaojiang Zhang, Minglei Zhang, Jiarong Zhang, Wenhao Zhuang, Yuchen Cao, Wankang Bao, Haimo Li, Zheng Lin, Huiming Wang, Haoyang Huang, Zongxian Feng, Zizheng Zhan, Ken Deng, Wen Xiang, Huaixi Tang, Kun Wu, Mengtong Li, Mengfei Xie, Junyi Peng, Haotian Zhang, Bin Chen, Bing Yu
Optimal Portfolio Construction -- A Reinforcement Learning Embedded Bayesian Hierarchical Risk Parity (RL-BHRP) Approach
Shaofeng Kang, Zeying Tian
https://arxiv.org/abs/2508.11856
RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System
Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast
https://arxiv.org/abs/2508.09186 ht…
Efficient Environment Design for Multi-Robot Navigation via Continuous Control
Jahid Chowdhury Choton, John Woods, William Hsu
https://arxiv.org/abs/2508.14105 https://
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang
https://arxiv.org/abs/2507.13266
HyperSumm-RL: A Dialogue Summarization Framework for Modeling Leadership Perception in Social Robots
Subasish Das
https://arxiv.org/abs/2507.04160 https://…
A Distributed Actor-Critic Algorithm for Fixed-Time Consensus in Nonlinear Multi-Agent Systems
Aria Delshad, Maryam Babazadeh
https://arxiv.org/abs/2507.16520
Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning
Sangwoo Jeon, Juchul Shin, Gyeong-Tae Kim, YeonJe Cho, Seongwoo Kim
https://arxiv.org/abs/2508.10747
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
https://arxiv.org/abs/2508.11408
CRScore: Reinforcement Learning with Verifiable Tool and AI Feedback for Code Review
Manav Nitin Kapadnis, Atharva Naik, Carolyn Rose
https://arxiv.org/abs/2506.00296
Deep Reinforcement Learning Based Routing for Heterogeneous Multi-Hop Wireless Networks
Brian Kim, Justin H. Kong, Terrence J. Moore, Fikadu T. Dagefu
https://arxiv.org/abs/2508.14884
CogniQ-H: A Soft Hierarchical Reinforcement Learning Paradigm for Automated Data Preparation
Jing Chang, Chang Liu, Jinbin Huang, Rui Mao, Jianbin Qin
https://arxiv.org/abs/2507.13710
Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward
Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang
https://arxiv.org/abs/2508.11143
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
https://arxiv.org/abs/2508.15746
NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha
https://arxiv.org/abs/2508.15693
Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
Soumyajit Guin, Shalabh Bhatnagar
https://arxiv.org/abs/2508.13963 https://
BenchRL-QAS: Benchmarking reinforcement learning algorithms for quantum architecture search
Azhar Ikhtiarudin, Aditi Das, Param Thakkar, Akash Kundu
https://arxiv.org/abs/2507.12189
Toward Deployable Multi-Robot Collaboration via a Symbolically-Guided Decision Transformer
Rathnam Vidushika Rasanji, Jin Wei-Kocsis, Jiansong Zhang, Dongming Gan, Ragu Athinarayanan, Paul Asunda
https://arxiv.org/abs/2508.13877
PRATA: A Framework to Enable Predictive QoS in Vehicular Networks via Artificial Intelligence
Federico Mason, Tommaso Zugno, Matteo Drago, Marco Giordani, Mate Boban, Michele Zorzi
https://arxiv.org/abs/2507.14211
Make Your AUV Adaptive: An Environment-Aware Reinforcement Learning Framework For Underwater Tasks
Yimian Ding, Jingzehua Xu, Guanwen Xie, Shuai Zhang, Yi Li
https://arxiv.org/abs/2506.15082
Kevin: Multi-Turn RL for Generating CUDA Kernels
Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti
https://arxiv.org/abs/2507.11948 https:/…
ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction
Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, Wenjun Mei
https://arxiv.org/abs/2508.08170
A Dynamical Systems Framework for Reinforcement Learning Safety and Robustness Verification
Ahmed Nasir, Abdelhafid Zenati
https://arxiv.org/abs/2508.15588 https://
Learning Dexterous Object Handover
Daniel Frau-Alfaro, Julio Castaño-Amoros, Santiago Puente, Pablo Gil, Roberto Calandra
https://arxiv.org/abs/2506.16822
ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru Wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, Bowen Zhou
https://arxiv.org/abs/2508.10308
Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids
Kaizhe Hu, Haochen Shi, Yao He, Weizhuo Wang, C. Karen Liu, Shuran Song
https://arxiv.org/abs/2508.12252
Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
https://arxiv.org/abs/2507.07966
Replaced article(s) found for eess.SY. https://arxiv.org/list/eess.SY/new
[1/2]:
- Maximum Causal Entropy IRL in Mean-Field Games and GNEP Framework for Forward RL
Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi
Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light
Mani Hamidi, Terrence W. Deacon
https://arxiv.org/abs/2507.11482 https://
CLF-RL: Control Lyapunov Function Guided Reinforcement Learning
Kejun Li, Zachary Olkin, Yisong Yue, Aaron D. Ames
https://arxiv.org/abs/2508.09354 https://
Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination
Yizhou Liu, Jingwei Wei, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Lihua Zhang
https://arxiv.org/abs/2508.12957
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
https://arxiv.org/abs/2506.09942 …
Reinforcement Learning in Vision: A Survey
Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
https://arxiv.org/abs/2508.08189
ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang
https://arxiv.org/abs/2508.14040
SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL
Jiaheng Hu, Peter Stone, Roberto Martín-Martín
https://arxiv.org/abs/2506.04147
On a few pitfalls in KL divergence gradient estimation for RL
Yunhao Tang, Rémi Munos
https://arxiv.org/abs/2506.09477 https://…
Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback
Wenzhen Yuan, Shengji Tang, Weihao Lin, Jiacheng Ruan, Ganqu Cui, Bo Zhang, Tao Chen, Ting Liu, Yuzhuo Fu, Peng Ye, Lei Bai
https://arxiv.org/abs/2508.12338
Online Training and Pruning of Deep Reinforcement Learning Networks
Valentin Frank Ingmar Guenter, Athanasios Sideris
https://arxiv.org/abs/2507.11975 http…
Replaced article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[5/8]:
- Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
Tyler Ga Wei Lum, Olivia Y. Lee, C. Karen Liu, Jeannette Bohg
PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, Peter Zhiping Zhang
https://arxiv.org/abs/2508.14765
OPTIC-ER: A Reinforcement Learning Framework for Real-Time Emergency Response and Equitable Resource Allocation in Underserved African Communities
Mary Tonwe
https://arxiv.org/abs/2508.12943
Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop
Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi, Tyler Bonnen, Ken Goldberg, Angjoo Kanazawa
https://arxiv.org/abs/2506.10968
The Yokai Learning Environment: Tracking Beliefs Over Space and Time
Constantin Ruhdorfer, Matteo Bortoletto, Andreas Bulling
https://arxiv.org/abs/2508.12480 https://
Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
SM Mazharul Islam, Manfred Huber
https://arxiv.org/abs/2508.13922 https://
Reinforcement Learning-based Adaptive Path Selection for Programmable Networks
José Eduardo Zerna Torres, Marios Avgeris, Chrysa Papagianni, Gergely Pongrácz, István Gódor, Paola Grosso
https://arxiv.org/abs/2508.13806
Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
https://arxiv.org/abs/2507.05619
Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
Thanh Nguyen, Chang D. Yoo
https://arxiv.org/abs/2508.13904 https://
EXPO: Stable Reinforcement Learning with Expressive Policies
Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn
https://arxiv.org/abs/2507.07986 https://arxiv.org/pdf/2507.07986 https://arxiv.org/html/2507.07986
arXiv:2507.07986v1 Announce Type: new
Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.
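A minimal sketch of EXPO's on-the-fly action selection as the abstract describes it: an expressive base policy proposes actions, a small Gaussian edit policy perturbs them, and the Q-maximizing candidate is used for both sampling and TD backups. Every interface below (policy and critic signatures, sample counts) is an assumption for illustration, not the paper's code.

```python
# Hedged sketch of EXPO's on-the-fly policy. Interfaces are assumptions.
import torch
import torch.nn as nn

class EditPolicy(nn.Module):
    """Lightweight Gaussian policy that outputs an additive action edit."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, act):
        mean = self.mu(torch.cat([obs, act], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

def on_the_fly_action(base_policy, edit_policy, q_fn, obs, n_samples=4):
    """Pick the Q-maximizing action among base samples and their edits."""
    base = [base_policy(obs) for _ in range(n_samples)]            # proposals
    edited = [a + edit_policy(obs, a).sample() for a in base]      # edits
    cands = torch.stack(base + edited)                             # (2n, act_dim)
    q_vals = q_fn(obs.expand(cands.size(0), -1), cands)            # (2n,)
    return cands[q_vals.argmax()]

# Dummy stand-ins just to show the call pattern.
obs_dim, act_dim = 8, 2
base_policy = lambda o: torch.tanh(torch.randn(act_dim))  # e.g. a diffusion-policy sample
q_fn = lambda o, a: -(a ** 2).sum(dim=-1)                 # toy critic
edit = EditPolicy(obs_dim, act_dim)
action = on_the_fly_action(base_policy, edit, q_fn, torch.randn(obs_dim))
```

Note the design point from the abstract: the expressive base policy is never differentiated against the value function; only the small Gaussian edit policy and the candidate selection do the value maximization.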
MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Gaurav Chaudhary, Wassim Uddin Mondal, Laxmidhar Behera
https://arxiv.org/abs/2506.09574
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu
https://arxiv.org/abs/2507.05720
Agnostic Reinforcement Learning: Foundations and Algorithms
Gene Li
https://arxiv.org/abs/2506.01884 https://arxiv.org/pdf/2506.01884…
Reinforcement Learning with Action Chunking
Qiyang Li, Zhiyuan Zhou, Sergey Levine
https://arxiv.org/abs/2507.07969 https://arxiv.org/pdf/2507.07969 https://arxiv.org/html/2507.07969
arXiv:2507.07969v1 Announce Type: new
Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
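A minimal sketch of the core mechanic in the abstract: treat a length-H sequence of actions as one "chunked" action, so a single TD backup spans H environment steps using the actually-observed rewards. The chunk length, network shapes, and function signatures below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of TD learning in a chunked action space (Q-chunking idea).
import torch
import torch.nn as nn

H, GAMMA = 4, 0.99            # chunk length and discount (assumed values)

class ChunkCritic(nn.Module):
    """Q(s, a_{t:t+H}): critic over a state and a whole action chunk."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + H * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, chunk):                # chunk: (batch, H, act_dim)
        return self.net(torch.cat([obs, chunk.flatten(1)], dim=-1)).squeeze(-1)

def chunked_td_target(critic_target, policy, obs_next, rewards, done):
    """H-step TD target: discounted sum of the H observed rewards plus the
    bootstrapped value of the next chunk. rewards: (batch, H); done: (batch,)."""
    discounts = GAMMA ** torch.arange(H, dtype=rewards.dtype)
    ret = (rewards * discounts).sum(dim=-1)       # observed H-step return
    with torch.no_grad():
        next_chunk = policy(obs_next)             # policy emits a full chunk
        boot = critic_target(obs_next, next_chunk)
    return ret + (GAMMA ** H) * (1.0 - done) * boot
```

Because the H rewards are actually observed along the executed chunk, the backup is an unbiased n-step return (n = H) rather than an off-policy-corrected estimate, which is the stability and efficiency benefit the abstract points to.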