Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Siran Yang, Yingshui Tan, …
Improving Long-Range Navigation with Spatially-Enhanced Recurrent Memory via End-to-End Reinforcement Learning
Fan Yang, Per Frivik, David Hoeller, Chen Wang, Cesar Cadena, Marco Hutter
https://arxiv.org/abs/2506.05997
How to craft a deep reinforcement learning policy for wind farm flow control
Elie Kadoche, Pascal Bianchi, Florence Carton, Philippe Ciblat, Damien Ernst
https://arxiv.org/abs/2506.06204
Deep Reinforcement Learning for Investor-Specific Portfolio Optimization: A Volatility-Guided Asset Selection Approach
Arishi Orra, Aryan Bhambu, Himanshu Choudhary, Manoj Thakur, Selvaraju Natarajan
https://arxiv.org/abs/2505.03760
Sequence Modeling for N-Agent Ad Hoc Teamwork
Caroline Wang, Di Yang Shi, Elad Liebman, Ishan Durugkar, Arrasy Rahman, Peter Stone
https://arxiv.org/abs/2506.05527
Delighted to see these two new papers come out in Nature (they've been on bioRxiv for a while).
How does Pavlov's dog learn that the bell predicts the food? One answer is that the bell appears "close" in time to the food and that enables learning. We're certain that dopamine has something to do with learning these kinds of associations. But the definition of "close" in time is actually really difficult to pin down. You can get associations over prett…
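A minimal sketch of one standard way "close in time" gets formalized in RL: tabular TD(λ) with an eligibility trace, where a cue (the bell) that precedes reward (the food) by several steps still receives credit through a decaying trace. This is purely illustrative and is not the model from the Nature papers; the toy episode and all parameter values below are assumptions.

```python
# Illustrative TD(lambda) sketch, not the papers' model: the eligibility trace
# decays geometrically, so earlier cues get progressively less credit for a
# later reward -- one concrete definition of "close" in time.
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.95, lam=0.9):
    """episodes: list of trajectories, each a list of (state, reward) pairs."""
    V = np.zeros(n_states)               # value estimate per state
    for episode in episodes:
        trace = np.zeros(n_states)       # eligibility: how recently was each state visited?
        for t in range(len(episode) - 1):
            s, _ = episode[t]
            s_next, r_next = episode[t + 1]
            delta = r_next + gamma * V[s_next] - V[s]  # TD error (the "dopamine-like" signal)
            trace *= gamma * lam         # older states become less eligible
            trace[s] += 1.0              # the just-visited state is maximally eligible
            V += alpha * delta * trace   # credit flows back along the trace
    return V

# Toy Pavlovian episode: bell at state 0, delay states 1-3, food (reward) on entering state 4.
episode = [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, 1.0)]
print(td_lambda([episode] * 200, n_states=5))
```

With λ near 1 the bell state ends up with substantial value despite the delay; with λ = 0 credit only reaches the state immediately before the food, which is the crux of why pinning down "close" matters.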
Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning
Haochen Zhang, Zhong Zheng, Lingzhou Xue
https://arxiv.org/abs/2506.04626
Another of my forays into AI ethics is just out! This time the focus is on the ethics (or lack thereof) of Reinforcement Learning Feedback (RLF) techniques aimed at increasing the 'alignment' of LLMs.
The paper is the fruit of joint work with a great team of collaborators, among whom @… and @…
Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
https://arxiv.org/abs/2506.06137
Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerabi…
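A minimal sketch of the program-based table reasoning (P-TR) idea described in the abstract, not the Table-r1 pipeline itself: instead of answering in free text, the model emits a small executable program over the table, which offloads the numerical steps to an interpreter. The table, question, and "generated" program string below are hypothetical placeholders standing in for SLM output.

```python
# Illustrative P-TR sketch (assumed example, not Table-r1's actual prompting or training).
import pandas as pd

table = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Hanoi"], "population_m": [0.7, 10.0, 8.1]}
)
question = "Which city has the largest population?"

# Text-based TR (T-TR) would have the model reason over the serialized table directly.
# Program-based TR (P-TR) has the model write code like this instead:
generated_program = "answer = table.loc[table['population_m'].idxmax(), 'city']"

namespace = {"table": table}
exec(generated_program, namespace)   # execute the model-written program
print(namespace["answer"])           # -> "Lima"; arithmetic/comparison handled by pandas, not the SLM
```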
Boosting Open-Source LLMs for Program Repair via Reasoning Transfer and LLM-Guided Reinforcement Learning
Xunzhu Tang, Jacques Klein, Tegawendé F. Bissyandé
https://arxiv.org/abs/2506.03921
Pearl: Automatic Code Optimization Using Deep Reinforcement Learning
Djamel Rassem Lamouri, Iheb Nassim Aouadj, Smail Kourta, Riyadh Baghdadi
https://arxiv.org/abs/2506.01880
BASIL: Best-Action Symbolic Interpretable Learning for Evolving Compact RL Policies
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar
https://arxiv.org/abs/2506.00328
Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning
Fangyu Lei, Jinxiang Meng, Yiming Huang, Tinghong Chen, Yun Zhang, Shizhu He, Jun Zhao, Kang Liu
https://arxiv.org/abs/2506.01710
Modeling human reputation-seeking behavior in a spatio-temporally complex public good provision game
Edward Hughes, Tina O. Zhu, Martin J. Chadwick, Raphael Koster, Antonio García Castañeda, Charles Beattie, Thore Graepel, Matthew M. Botvinick, Joel Z. Leibo
https://arxiv.org/abs/2506.06032…
Crowd-SFT: Crowdsourcing for LLM Alignment
Alex Sotiropoulos, Sulyab Thottungal Valapu, Linus Lei, Jared Coleman, Bhaskar Krishnamachari
https://arxiv.org/abs/2506.04063
A Novel Deep Reinforcement Learning Method for Computation Offloading in Multi-User Mobile Edge Computing with Decentralization
Nguyen Chi Long, Trinh Van Chien, Ta Hai Tung, Van Son Nguyen, Trong-Minh Hoang, Nguyen Ngoc Hai Dang
https://arxiv.org/abs/2506.02458
Interpretable reinforcement learning for heat pump control through asymmetric differentiable decision trees
Toon Van Puyvelde, Mehran Zareh, Chris Develder
https://arxiv.org/abs/2506.01641
Discounting and Drug Seeking in Biological Hierarchical Reinforcement Learning
Vardhan Palod, Pranav Mahajan, Veeky Baths, Boris S. Gutkin
https://arxiv.org/abs/2506.04549
AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim
https://arxiv.org/abs/2506.05980
For the widespread use of #AI, especially in the context of #schools, it must be ensured that #LLMs do not encourage users to engage in self-endangering behavior.
The nonprofit Transluce is working on various…
A Reinforcement Learning-Based Telematic Routing Protocol for the Internet of Underwater Things
Mohammadhossein Homaei, Mehran Tarif, Agustin Di Bartolo, Oscar Mogollon Gutierrez, Mar Avila
https://arxiv.org/abs/2506.00133
CRScore: Reinforcement Learning with Verifiable Tool and AI Feedback for Code Review
Manav Nitin Kapadnis, Atharva Naik, Carolyn Rose
https://arxiv.org/abs/2506.00296
On-board Mission Replanning for Adaptive Cooperative Multi-Robot Systems
Elim Kwan, Rehman Qureshi, Liam Fletcher, Colin Laganier, Victoria Nockles, Richard Walters
https://arxiv.org/abs/2506.06094
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
https://arx…
Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
Weiyang Guo, Zesheng Shi, Zhuo Li, Yequan Wang, Xuebo Liu, Wenya Wang, Fangming Liu, Min Zhang, Jing Li
https://arxiv.org/abs/2506.00782
Learning optimal treatment strategies for intraoperative hypotension using deep reinforcement learning
Esra Adiyeke, Tianqi Liu, Venkata Sai Dheeraj Naganaboina, Han Li, Tyler J. Loftus, Yuanfang Ren, Benjamin Shickel, Matthew M. Ruppert, Karandeep Singh, Ruogu Fang, Parisa Rashidi, Azra Bihorac, Tezcan Ozrazgat-Baslanti
https://
Maximizing the Promptness of Metaverse Systems using Edge Computing by Deep Reinforcement Learning
Tam Ninh Thi-Thanh, Trinh Van Chien, Hung Tran, Nguyen Hoai Son, Van Nhan Vo
https://arxiv.org/abs/2506.02657
Reusing Trajectories in Policy Gradients Enables Fast Convergence
Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli
https://arxiv.org/abs/2506.06178
Towards Language-Augmented Multi-Agent Deep Reinforcement Learning
Maxime Toquebiau, Jae-Yun Jun, Faïz Benamar, Nicolas Bredeche
https://arxiv.org/abs/2506.05236
ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, Jing Li
https://arxiv.org/abs/2506.03487
HMPC-assisted Adversarial Inverse Reinforcement Learning for Smart Home Energy Management
Jiadong He, Liang Yu, Zhiqiang Chen, Dawei Qiu, Dong Yue, Goran Strbac, Meng Zhang, Yujian Ye, Yi Wang
https://arxiv.org/abs/2506.00898
A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization
Muhammed Ustaomeroglu, Guannan Qu
https://arxiv.org/abs/2506.06179
Autonomous Vehicle Lateral Control Using Deep Reinforcement Learning with MPC-PID Demonstration
Chengdong Wu, Sven Kirchner, Nils Purschke, Alois C. Knoll
https://arxiv.org/abs/2506.04040
Federated Deep Reinforcement Learning-Driven O-RAN for Automatic Multirobot Reconfiguration
Faisal Ahmed, Myungjin Lee, Shao-Yu Lien, Suresh Subramaniam, Motoharu Matsuura, Hiroshi Hasegawa, Shih-Chun Lin
https://arxiv.org/abs/2506.00822
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
https://arxiv.org/abs/2506.01713
Reinforcement Learning with Data Bootstrapping for Dynamic Subgoal Pursuit in Humanoid Robot Navigation
Chengyang Peng, Zhihao Zhang, Shiting Gong, Sankalp Agrawal, Keith A. Redmill, Ayonga Hereid
https://arxiv.org/abs/2506.02206
Fine-tuning for Data-enabled Predictive Control of Noisy Systems by Reinforcement Learning
Jinbao Wang, Shiliang Zhang, Jun Liu, Xuehui Ma, Haolin Liu
https://arxiv.org/abs/2505.24572
Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
Mingzhe Du, Luu Tuan Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, See-kiong Ng
https://arxiv.org/abs/2505.23387
Sorrel: A simple and flexible framework for multi-agent reinforcement learning
Rebekah A. Gelpí, Yibing Ju, Ethan C. Jackson, Yikai Tang, Shon Verch, Claas Voelcker, William A. Cunningham
https://arxiv.org/abs/2506.00228
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation
Guobin Zhu, Rui Zhou, Wenkang Ji, Shiyu Zhao
https://arxiv.org/abs/2506.01538
Robust and Safe Multi-Agent Reinforcement Learning Framework with Communication for Autonomous Vehicles
Keshawn Smith, Zhili Zhang, H M Sabbir Ahmad, Ehsan Sabouni, Maniak Mondal, Song Han, Wenchao Li, Fei Miao
https://arxiv.org/abs/2506.00982
R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning
Harsh Goel, Mohammad Omama, Behdad Chalaki, Vaishnav Tadiparthi, Ehsan Moradi Pari, Sandeep Chinchali
https://arxiv.org/abs/2505.24265
DriveMind: A Dual-VLM based Reinforcement Learning Framework for Autonomous Driving
Dawood Wasif, Terrence J Moore, Chandan K Reddy, Jin-Hee Cho
https://arxiv.org/abs/2506.00819
Distributed Neural Policy Gradient Algorithm for Global Convergence of Networked Multi-Agent Reinforcement Learning
Pengcheng Dai, Yuanqiu Mo, Wenwu Yu, Wei Ren
https://arxiv.org/abs/2505.24113
EDEN: Entorhinal Driven Egocentric Navigation Toward Robotic Deployment
Mikolaj Walczak, Romina Aalishah, Wyatt Mackey, Brittany Story, David L. Boothe Jr., Nicholas Waytowich, Xiaomin Lin, Tinoosh Mohsenin
https://arxiv.org/abs/2506.03046
A Hierarchical Bin Packing Framework with Dual Manipulators via Heuristic Search and Deep Reinforcement Learning
Beomjoon Lee, Changjoo Nam
https://arxiv.org/abs/2506.01628
Reactive Aerobatic Flight via Reinforcement Learning
Zhichao Han, Xijie Huang, Zhuxiu Xu, Jiarui Zhang, Yuze Wu, Mingyang Wang, Tianyue Wu, Fei Gao
https://arxiv.org/abs/2505.24396
Disturbance-Aware Adaptive Compensation in Hybrid Force-Position Locomotion Policy for Legged Robots
Yang Zhang, Buqing Nie, Zhanxiang Cao, Yangqing Fu, Yue Gao
https://arxiv.org/abs/2506.00472