Tootfinder

@arXiv_csAI_bot@mastoxiv.page
2025-10-14 12:23:38

From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization
Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu
https://arxiv.org/abs/2510.11457

Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability, requiring task-specific segmentation of the reasoning process. To this en…

@cdarwin@c.im
2025-11-22 17:13:40

Global climate negotiations ended on Saturday in Brazil with a watered-down resolution that makes no mention of fossil fuels, the main driver of global warming.
The final statement included plenty of warnings on the cost of inaction but few provisions for how the world might address dangerously rising global temperatures head-on.
A marathon series of frenetic Friday night meetings ultimately salvaged the talks in Belém, on the edge of the Amazon rainforest.
The …

COP30 Climate Summit Ends With Dire Warnings and Scant Plans for Action
The final agreement, with no direct mention of the fossil fuels that are dangerously heating the planet, was a victory for oil-producing countries.

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:25:21

Transforming Noise Distributions with Histogram Matching: Towards a Single Denoiser for All
Sheng Fu, Junchao Zhang, Kailun Yang
https://arxiv.org/abs/2510.06757 https://…

Supervised Gaussian denoisers exhibit limited generalization when confronted with out-of-distribution noise, due to the diverse distributional characteristics of different noise types. To bridge this gap, we propose a histogram matching approach that transforms arbitrary noise towards a target Gaussian distribution with known intensity. Moreover, a mutually reinforcing cycle is established between noise transformation and subsequent denoising. This cycle progressively refines the noise to be co…
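
The histogram matching idea mentioned in this abstract can be illustrated with a generic rank-based quantile mapping. This is only a minimal sketch under stated assumptions, not the authors' method (the abstract is truncated here): the function name `match_to_gaussian`, its parameters, and the zero-mean target are hypothetical choices made for illustration.

```python
import numpy as np
from statistics import NormalDist


def match_to_gaussian(noise, sigma):
    """Map samples onto N(0, sigma^2) by rank (generic histogram matching).

    Hypothetical helper for illustration; not taken from the paper.
    """
    flat = np.asarray(noise, dtype=np.float64).ravel()
    n = flat.size
    # Rank of every sample in 0..n-1 (stable sort breaks ties by position).
    ranks = np.argsort(np.argsort(flat, kind="stable"), kind="stable")
    # Mid-point positions keep every quantile strictly inside (0, 1).
    quantiles = (ranks + 0.5) / n
    # Inverse CDF of the target Gaussian with the chosen intensity sigma.
    target = NormalDist(mu=0.0, sigma=sigma)
    matched = np.array([target.inv_cdf(q) for q in quantiles])
    return matched.reshape(np.shape(noise))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    uniform_noise = rng.uniform(-1.0, 1.0, size=(64, 64))
    gaussian_like = match_to_gaussian(uniform_noise, sigma=0.1)
    print(gaussian_like.mean(), gaussian_like.std())
```

The mapping preserves the ordering of the input samples and changes only their marginal distribution, which is the general property a Gaussian-trained denoiser would rely on in a setup like this.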

@arXiv_csLG_bot@mastoxiv.page
2025-10-03 11:02:21

ExGRPO: Learning to Reason from Experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
https://arxiv.org/abs/2510.02245 https://…

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we …

@arXiv_csRO_bot@mastoxiv.page
2025-09-29 10:15:27

DemoGrasp: Universal Dexterous Grasping from a Single Demonstration
Haoqi Yuan, Ziye Huang, Ye Wang, Chuan Mao, Chaoyi Xu, Zongqing Lu
https://arxiv.org/abs/2509.22149 https://…

Universal grasping with multi-fingered dexterous hands is a fundamental challenge in robotic manipulation. While recent approaches successfully learn closed-loop grasping policies using reinforcement learning (RL), the inherent difficulty of high-dimensional, long-horizon exploration necessitates complex reward and curriculum design, often resulting in suboptimal solutions across diverse objects. We propose DemoGrasp, a simple yet effective method for learning universal dexterous grasping. We s…

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:57:49

The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
https://arxiv.org/abs/2510.06096

The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple e…

@arXiv_csMA_bot@mastoxiv.page
2025-09-25 07:59:22

The Heterogeneous Multi-Agent Challenge
Charles Dansereau, Junior-Samuel Lopez-Yepez, Karthik Soma, Antoine Fagette
https://arxiv.org/abs/2509.19512 https://…

Multi-Agent Reinforcement Learning (MARL) is a growing research area which gained significant traction in recent years, extending Deep RL applications to a much wider range of problems. A particularly challenging class of problems in this domain is Heterogeneous Multi-Agent Reinforcement Learning (HeMARL), where agents with different sensors, resources, or capabilities must cooperate based on local information. The large number of real-world situations involving heterogeneous agents makes it an…

@arXiv_csLG_bot@mastoxiv.page
2025-09-30 14:43:31

Rethinking Entropy Regularization in Large Reasoning Models
Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, Jing Shao
https://arxiv.org/abs/2509.25133 https://…

Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRM…
