Tootfinder

@mia@hcommons.social
2025-08-19 13:42:42

On small, local language models: 'In a world increasingly dominated by massive models and opaque APIs, we believe there’s still room for small, transparent, controllable systems. Models you can fine-tune, understand and run on your own terms' https://www.turing.ac.uk…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 09:20:39

MAPF-World: Action World Model for Multi-Agent Path Finding
Zhanjiang Yang, Meng Li, Yang Shen, Yueming Li, Lijun Sun
https://arxiv.org/abs/2508.12087

Multi-agent path finding (MAPF) is the problem of planning conflict-free paths from the designated start locations to goal positions for multiple agents. It underlies a variety of real-world tasks, including multi-robot coordination, robot-assisted logistics, and social navigation. Recent decentralized learnable solvers have shown great promise for large-scale MAPF, especially when leveraging foundation models and large datasets. However, these agents are reactive policy models and exhibit limi…
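
To make "conflict-free" concrete, here is a minimal Python sketch, not taken from the paper, that checks a set of grid paths for the two conflict types commonly used in MAPF formulations: vertex conflicts (two agents occupy the same cell at the same timestep) and edge conflicts (two agents swap cells in one step). The function names, the `(row, col)` cell representation, and the assumption that agents wait at their goal after arriving are all illustrative choices.

```python
from itertools import combinations

def position_at(path, t):
    """Return the agent's cell at timestep t; agents are assumed to wait at their goal."""
    return path[min(t, len(path) - 1)]

def find_conflicts(paths):
    """Return a list of (kind, agent_i, agent_j, timestep) conflicts.

    paths: list of paths, each a list of (row, col) cells indexed by timestep.
    """
    conflicts = []
    horizon = max(len(p) for p in paths)
    for i, j in combinations(range(len(paths)), 2):
        for t in range(horizon):
            a_now, b_now = position_at(paths[i], t), position_at(paths[j], t)
            if a_now == b_now:
                conflicts.append(("vertex", i, j, t))
            if t + 1 < horizon:
                a_next = position_at(paths[i], t + 1)
                b_next = position_at(paths[j], t + 1)
                # Edge (swap) conflict: the two agents exchange cells in one step.
                if a_now != b_now and a_now == b_next and b_now == a_next:
                    conflicts.append(("edge", i, j, t))
    return conflicts

# Hypothetical example plans on a small grid.
safe = [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)]]
swap = [[(0, 0), (0, 1)], [(0, 1), (0, 0)]]
print(find_conflicts(safe))
print(find_conflicts(swap))
```

A solver's output would count as a valid MAPF solution under these assumptions only if `find_conflicts` returns an empty list and every path ends at its agent's goal.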

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:06:30

Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, Yahui Zhou
https://arxiv.org/abs/2508.13009

Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models that capture complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we pr…

@arXiv_csLG_bot@mastoxiv.page
2025-09-19 10:39:11

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
https://arxiv.org/abs/2509.15194

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods, confidence minimization, self-consistency, or majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforc…

@arXiv_csSE_bot@mastoxiv.page
2025-09-19 08:23:21

SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems
Xifeng Yao, Dongyu Lang, Wu Zhang, Xintong Guo, Huarui Xie, Yinhao Ni, Ping Liu, Guang Shen, Yi Bai, Dandan Tu, Changzheng Zhang
https://arxiv.org/abs/2509.14281

Significant advancements have been made in the capabilities of code large language models, leading to their rapid adoption and application across a wide range of domains. However, their further advancements are often constrained by the scarcity of real-world coding problems. To bridge this gap, we propose a novel framework for synthesizing code problems that emulate authentic real-world scenarios. This framework systematically integrates domain knowledge, domain skills, and coding skills, all o…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 07:46:02

FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
https://arxiv.org/abs/2507.12496

Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the wo…

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:38:40

Leveraging Large Language Models for Predictive Analysis of Human Misery
Bishanka Seal, Rahul Seetharaman, Aman Bansal, Abhilash Nandy
https://arxiv.org/abs/2508.12669

This study investigates the use of Large Language Models (LLMs) for predicting human-perceived misery scores from natural language descriptions of real-world scenarios. The task is framed as a regression problem, where the model assigns a scalar value from 0 to 100 to each input statement. We evaluate multiple prompting strategies, including zero-shot, fixed-context few-shot, and retrieval-based prompting using BERT sentence embeddings. Few-shot approaches consistently outperform zero-shot base…

@Techmeme@techhub.social
2025-07-19 15:10:58

How an open-source approach helped DeepSeek and other Chinese AI companies; Hugging Face: Alibaba's Qwen is now the world's largest open-source AI ecosystem (South China Morning Post)
https://www.scmp.com/tech/big-tech/article

How open-source AI is helping China win hearts and market share
China’s free-for-all AI models, developed by firms like DeepSeek and Alibaba, present a viable alternative to US closed-source systems.

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 09:03:51

VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
https://arxiv.org/abs/2509.14571

Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remai…

@Mediagazer@mstdn.social
2025-09-19 10:21:04

Q&A with CEO Cristóbal Valenzuela on Runway's "world models" breakthrough, how it differs from typical AI video generation, the Lionsgate partnership, and more (Cristina Criddle/Financial Times)

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:00:30

RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards
Rohit Krishnan, Jon Evans
https://arxiv.org/abs/2508.12165 https://arxiv.org/pdf/2508.12…

This paper introduces RLNVR (Reinforcement Learning from Non-Verified Rewards), a framework for training language models using noisy, real-world feedback signals without requiring explicit human verification. Traditional RLHF requires expensive, verified reward signals that are impractical in many real-world domains. RLNVR addresses this challenge through baseline normalization and semantic similarity-based reward transfer. We demonstrate RLNVR through Walter, a prototype system that optimizes …

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:15:10

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Yiming Cao, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao
https://arxiv.org/abs/2508.13739

Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely ov…

@privacity@social.linux.pizza
2025-08-19 19:16:17

Highlights from FPF’s July 2025 Technologist Roundtable: AI Unlearning and Technical Guardrails
https://fpf.org/blog/highlights-from-fpfs-july-2025-technologist-roundtable-ai-unlearning-and-technical-guardrails/

On July 17, 2025, the Future of Privacy Forum (FPF) hosted the second in a series of Technologist Roundtables with the goal of convening an open dialogue on complex technical questions that impact law and policy, and assisting global data protection and privacy policymakers in understanding the relevant technical basics of large language models (LLMs). In this event, we invited a range of academic technical experts and data protection regulators from around the world to explore machine unlearni…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 09:36:51

ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim
https://arxiv.org/abs/2508.13680

Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multi…

@arXiv_csLG_bot@mastoxiv.page
2025-09-19 10:31:11

Credit Card Fraud Detection
Iva Popova, Hamza A. A. Gardi
https://arxiv.org/abs/2509.15044 https://arxiv.org/pdf/2509.15044

Credit card fraud remains a significant challenge due to class imbalance and fraudsters mimicking legitimate behavior. This study evaluates five machine learning models - Logistic Regression, Random Forest, XGBoost, K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) on a real-world dataset using undersampling, SMOTE, and a hybrid approach. Our models are evaluated on the original imbalanced test set to better reflect real-world performance. Results show that the hybrid method achieves …

@arXiv_csCR_bot@mastoxiv.page
2025-08-20 07:51:40

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols
Yixuan Yang, Daoyuan Wu, Yufan Chen
https://arxiv.org/abs/2508.13220

Large Language Models (LLMs) are increasingly integrated into real-world applications via the Model Context Protocol (MCP), a universal, open standard for connecting AI agents with data sources and external tools. While MCP enhances the capabilities of LLM-based agents, it also introduces new security risks and expands their attack surfaces. In this paper, we present the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces. We introduce MCPSecB…

@arXiv_csSE_bot@mastoxiv.page
2025-08-19 10:03:50

Strengthening Programming Comprehension in Large Language Models through Code Generation
Xiaoning Ren, Qiang Hu, Wei Ma, Yan Li, Yao Zhang, Lingxiao Jiang, Yinxing Xue
https://arxiv.org/abs/2508.12620 …

Large language models (LLMs) have recently shown impressive results on diverse code-related tasks, benefiting from large-scale training and instruction tuning. However, studies reveal that their grasp of fundamental programming concepts, such as data flow and control flow, remains shallow, leading to fragile performance when code requires deeper reasoning. This limitation restricts the practical adoption of LLMs in real-world software development. To address this issue, this work introduces a c…

@arXiv_quantph_bot@mastoxiv.page
2025-08-20 09:37:30

Enhanced Sensitivity and Noise Resilience in Two-Qubit Quantum Magnetometers
S. Nohekhan Shishavan, K. Aghayar Gharehbagh, H. Sedgi Gamichi
https://arxiv.org/abs/2508.13400 http…

We present a novel two-qubit quantum magnetometer Hamiltonian optimized for enhanced sensitivity and noise resilience. Compared to existing models, our formulation offers advantages in accuracy, robustness against noise, and entanglement dynamics. Using analytical methods, we derive the Quantum Fisher Information (QFI) and the Signal-to-Noise Ratio (SNR), highlighting its practical viability for magnetic field sensing. Our approach bridges theoretical insights with real-world applicability. We …

@arXiv_mathOC_bot@mastoxiv.page
2025-08-20 09:22:30

Online Stochastic Packing with General Correlations
Sabri Cetin, Yilun Chen, David A. Goldberg
https://arxiv.org/abs/2508.13458 https://arxiv.org/pdf/2508.…

There has been a growing interest in studying online stochastic packing under more general correlation structures, motivated by the complex data sets and models driving modern applications. Several past works either assume correlations are weak or have a particular structure, have a complexity scaling with the number of Markovian "states of the world" (which may be exponentially large e.g. in the case of full history dependence), scale poorly with the horizon $T$, or make additional continuity …

@Techmeme@techhub.social
2025-07-16 04:55:53

Jensen Huang hailed AI models from DeepSeek, Alibaba, and Tencent as "world class" at a Beijing expo and said US licenses for H20 chips "will come very fast" (Reuters)
https://www.reuters.com/world/china/nvidias-huang-hail…

@arXiv_csIR_bot@mastoxiv.page
2025-08-19 09:30:39

Diagnostic-Guided Dynamic Profile Optimization for LLM-based User Simulators in Sequential Recommendation
Hongyang Liu, Zhu Sun, Tianjun Wei, Yan Wang, Jiajie Zhu, Xinghua Qu
https://arxiv.org/abs/2508.12645

Recent advances in large language models (LLMs) have enabled realistic user simulators for developing and evaluating recommender systems (RSs). However, existing LLM-based simulators for RSs face two major limitations: (1) static and single-step prompt-based inference that leads to inaccurate and incomplete user profile construction; (2) unrealistic and single-round recommendation-feedback interaction pattern that fails to capture real-world scenarios. To address these limitations, we propose D…

@arXiv_csIT_bot@mastoxiv.page
2025-08-19 08:58:30

Deep Semantic Inference over the Air: An Efficient Task-Oriented Communication System
Chenyang Wang, Roger Olsson, Stefan Forsström, Qing He
https://arxiv.org/abs/2508.12748

Empowered by deep learning, semantic communication marks a paradigm shift from transmitting raw data to conveying task-relevant meaning, enabling more efficient and intelligent wireless systems. In this study, we explore a deep learning-based task-oriented communication framework that jointly considers classification performance, computational latency, and communication cost. We adopt ResNets-based models and evaluate them on the CIFAR-10 and CIFAR-100 datasets to simulate real-world classifica…

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 11:28:30

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, Zhi Hou
https://arxiv.org/abs/2508.13103

Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions…

@arXiv_statME_bot@mastoxiv.page
2025-08-19 09:59:00

A Systematic Particle Filter for Estimating Time-Varying Parameters in Advection-Diffusion Equations with Source Terms
Andrea Arnold
https://arxiv.org/abs/2508.12155

Many real-world systems modeled using partial differential equations (PDEs) involve unknown parameters that must be estimated from limited, noisy system observations. While typically assumed to be constants, some of these unobserved parameters may vary with time. This work proposes a two-phase, offline-online numerical procedure for systematically estimating and quantifying uncertainty in time-varying parameters (TVPs) in time-dependent PDEs, specifically focusing on advection-diffusion models …

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 11:16:00

G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Yongxin Guo, Wenbo Deng, Zhenglin Cheng, Xiaoying Tang
https://arxiv.org/abs/2508.13023

Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of va…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 07:40:41

Deploying UDM Series in Real-Life Stuttered Speech Applications: A Clinical Evaluation Framework
Eric Zhang (SSHealth Team, AI for Healthcare Laboratory), Li Wei (SSHealth Team, AI for Healthcare Laboratory), Sarah Chen (SSHealth Team, AI for Healthcare Laboratory), Michael Wang (SSHealth Team, AI for Healthcare Laboratory)
https://arxiv.o…

Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper examines the Unconstrained Dysfluency Modeling (UDM) series, the current state-of-the-art framework developed by Berkeley, which combines modular architecture, explicit phoneme alignment, and interpretable outputs for real-world clini…

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:19:42

Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
https://arxiv.org/abs/2507.13162

Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban …

@arXiv_csGR_bot@mastoxiv.page
2025-09-19 08:18:21

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
https://arxiv.org/abs/2509.15130

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForg…

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:41:00

CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Penge
https://arxiv.org/abs/2508.12769

Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a fr…

@arXiv_csMA_bot@mastoxiv.page
2025-08-20 07:53:10

Self-Organizing Agent Network for LLM-based Workflow Automation
Yiming Xiong, Jian Wang, Bing Li, Yuhan Zhu, Yuqi Zhao
https://arxiv.org/abs/2508.13732

Recent multi-agent frameworks built upon large language models (LLMs) have demonstrated remarkable capabilities in complex task planning. However, in real-world enterprise environments, business workflows are typically composed through modularization and reuse of numerous subprocesses, resulting in intricate workflows characterized by lengthy and deeply nested execution paths. Such complexity poses significant challenges for LLM-driven orchestration, as extended reasoning chains and state-space…

@arXiv_condmatsoft_bot@mastoxiv.page
2025-09-19 09:30:31

A General Model for Static Contact Angles
Carlos E Colosqui
https://arxiv.org/abs/2509.14692 https://arxiv.org/pdf/2509.14692

The problem of contact angle and hysteresis determination has direct implications for engineering applications of wetting, colloid and surface science. Significant technical challenges can arise under real-world operating conditions, because the static contact angle is strongly influenced by contamination at the liquid-solid and liquid-vapor interfaces, chemical aging over long times, and environmental variables such as relative humidity and temperature. Analytical models that account for these…

@arXiv_mathPR_bot@mastoxiv.page
2025-08-19 10:41:40

Benford behavior resulting from stick and box fragmentation processes
Bruce Fang, Steven J. Miller
https://arxiv.org/abs/2508.12915 https://arxiv.org/pdf/2…

Benford's law is the statement that in many real world data sets, the probability of having digit $d$ in base $B$, where $1 \leq d \leq B$, as the first digit is $\log_{B}\!\left(\frac{d+1}{d}\right)$. We sometimes refer to this as weak Benford behavior, and we say that a data set exhibits strong Benford behavior in base $B$ if the probability of having significand at most $s$, where $1 \leq s < B$, is $\log_{B}\!\left(s\right)$. We examine Benford behaviors in two different probabilistic models: sti…
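
The two formulas quoted in this abstract are easy to evaluate directly. The following is a minimal Python sketch, not from the paper, that computes the weak (first-digit) and strong (significand) Benford probabilities; the function names and the base-10 default are assumptions for illustration, and the loop covers the leading digits 1 through 9.

```python
import math

def weak_benford(d, base=10):
    """Probability that the first digit equals d: log_base((d + 1) / d)."""
    return math.log((d + 1) / d, base)

def strong_benford(s, base=10):
    """Probability that the significand is at most s: log_base(s), for 1 <= s < base."""
    return math.log(s, base)

# First-digit probabilities in base 10; they should sum to 1 up to rounding.
probs = [weak_benford(d) for d in range(1, 10)]
for d, p in enumerate(probs, start=1):
    print(f"digit {d}: {p:.4f}")
print(f"total: {sum(probs):.4f}")
```

The strong form implies the weak form: a first digit of $d$ means the significand lies in $[d, d+1)$, so its probability is $\log_B(d+1) - \log_B(d) = \log_B\!\left(\frac{d+1}{d}\right)$.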

@arXiv_eessSP_bot@mastoxiv.page
2025-07-17 07:45:30

Foundation Models for Brain Signals: A Critical Review of Current Progress and Future Directions
Gayal Kuruppu, Neeraj Wagh, Yogatheesan Varatharajah
https://arxiv.org/abs/2507.11783

Patterns of electrical brain activity recorded via electroencephalography (EEG) offer immense value for scientific and clinical investigations. The inability of supervised EEG encoders to learn robust EEG patterns and their over-reliance on expensive signal annotations have sparked a transition towards general-purpose self-supervised EEG encoders, i.e., EEG foundation models (EEG-FMs), for robust and scalable EEG feature extraction. However, the real-world readiness of early EEG-FMs and the rub…

@muz4now@mastodon.world
2025-08-08 14:38:03

Check how your mix sounds through multiple popular headphone models with Kali Audio’s new HP-1 Multi-Reference Headphones
#MusicTech #MusicianTips

Check how your mix sounds through multiple popular headphone models with Kali Audio's new HP-1 Multi-Reference Headphones
Kali Audio has launched its first ever pair of over-ear headphones, the HP-1, which lets you switch between three voicings to check how your work will sound through the most popular headphones in use today.

@arXiv_csAI_bot@mastoxiv.page
2025-08-18 08:39:00

Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps
Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li
https://arxiv.org/abs/2508.11452

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this criti…

@arXiv_csSE_bot@mastoxiv.page
2025-08-20 07:48:50

COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models
James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik
https://arxiv.org/abs/2508.13757

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic …

@arXiv_csRO_bot@mastoxiv.page
2025-09-19 10:12:41

Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale
Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter
https://arxiv.org/abs/2509.14932

Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this w…

@arXiv_csCE_bot@mastoxiv.page
2025-07-16 08:16:01

Data-Driven Differential Evolution in Tire Industry Extrusion: Leveraging Surrogate Models
Eider Garate-Perez, Kerman López de Calle-Etxabe, Susana Ferreiro
https://arxiv.org/abs/2507.11191

The optimization of industrial processes remains a critical challenge, particularly when no mathematical formulation of objective functions or constraints is available. This study addresses this issue by proposing a surrogate-based, data-driven methodology for optimizing complex real-world manufacturing systems using only historical process data. Machine learning models are employed to approximate system behavior and construct surrogate models, which are integrated into a tailored metaheuristic…

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 09:47:51

An Evaluation-Centric Paradigm for Scientific Visualization Agents
Kuangshi Ai, Haichao Miao, Zhimin Li, Chaoli Wang, Shusen Liu
https://arxiv.org/abs/2509.15160 https://…

Recent advances in multi-modal large language models (MLLMs) have enabled increasingly sophisticated autonomous visualization agents capable of translating user intentions into data visualizations. However, measuring progress and comparing different agents remains challenging, particularly in scientific visualization (SciVis), due to the absence of comprehensive, large-scale benchmarks for evaluating real-world capabilities. This position paper examines the various types of evaluation required …

@Techmeme@techhub.social
2025-08-11 22:30:41

Nvidia debuts new Omniverse SDKs and Cosmos world foundation models for robotics devs, including Cosmos Reason, a 7B-parameter reasoning vision language model (Rebecca Szkutak/TechCrunch)
https://techcrunch.com/2025/08/11/nvid

Nvidia unveils new Cosmos world models, infra for robotics and physical uses | TechCrunch
Nvidia on Monday unveiled a set of new world AI models, libraries, and other infrastructure for robotics developers, most notable of which is Cosmos Reason, a 7-billion-parameter "reasoning" vision language model for physical AI applications and robots.

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:05:10

Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination
Yizhou Liu, Jingwei Wei, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Lihua Zhang
https://arxiv.org/abs/2508.12957

Reinforcement learning (RL) with rule-based rewards has demonstrated strong potential in enhancing the reasoning and generalization capabilities of vision-language models (VLMs) and large language models (LLMs), while reducing computational overhead. However, its application in medical imaging remains underexplored. Existing reinforcement fine-tuning (RFT) approaches in this domain primarily target closed-ended visual question answering (VQA), limiting their applicability to real-world clinical…

@arXiv_csIR_bot@mastoxiv.page
2025-08-19 08:21:20

A Large-Scale Web Search Dataset for Federated Online Learning to Rank
Marcel Gregoriadis, Jingwei Kang, Johan Pouwelse
https://arxiv.org/abs/2508.12353

The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynami…

@arXiv_csAI_bot@mastoxiv.page
2025-09-18 09:08:41

From Next Token Prediction to (STRIPS) World Models -- Preliminary Results
Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner
https://arxiv.org/abs/2509.13389 h…

From Next Token Prediction to (STRIPS) World Models -- Preliminary Results
We consider the problem of learning propositional STRIPS world models from action traces alone, using a deep learning architecture (transformers) and gradient descent. The task is cast as a supervised next token prediction problem where the tokens are the actions, and an action $a$ may follow an action sequence if the hidden effects of the previous actions do not make an action precondition of $a$ false. We show that a suitable transformer architecture can faithfully represent propositional STR…
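The validity rule stated in this abstract (an action may follow a sequence only if the earlier actions' effects have not falsified one of its preconditions) can be illustrated with a minimal Python sketch. This is not the paper's transformer method. It is only the standard symbolic STRIPS check that such a model would need to reproduce. The `Action` class, the helper names, and the toy door domain are hypothetical, and positive-only preconditions are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A propositional STRIPS action (positive preconditions only, by assumption)."""
    name: str
    pre: frozenset      # propositions that must hold before the action
    add: frozenset      # propositions made true by the action
    delete: frozenset   # propositions made false by the action


def applicable(state, action):
    # An action may fire only if every precondition holds in the current state.
    return action.pre <= state


def progress(state, action):
    # Standard STRIPS update: remove the delete effects, then add the add effects.
    return (state - action.delete) | action.add


def is_valid_trace(init, trace):
    """Return True if each action in the trace is applicable when its turn comes."""
    state = frozenset(init)
    for action in trace:
        if not applicable(state, action):
            return False
        state = progress(state, action)
    return True


# Hypothetical toy domain: a door must be unlocked before it can be opened.
unlock = Action("unlock", frozenset({"locked"}), frozenset({"unlocked"}), frozenset({"locked"}))
open_door = Action("open", frozenset({"unlocked"}), frozenset({"open"}), frozenset())

print(is_valid_trace({"locked"}, [unlock, open_door]))  # unlock, then open
print(is_valid_trace({"locked"}, [open_door, unlock]))  # open attempted while locked
```

In the next-token framing, the set of actions for which `applicable` holds after a given prefix corresponds to the legal next tokens. The learner sees only the traces, never the state.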

@arXiv_csCL_bot@mastoxiv.page
2025-07-18 07:32:32

Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models
Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy O'Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson
https://arxiv.org/abs/2507.12547

Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a `…

@arXiv_csSE_bot@mastoxiv.page
2025-09-19 09:49:41

CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai
https://arxiv.org/abs/2509.14856

CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python pr…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 09:55:12

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Yiqi Wang, Mrinal Verghese, Jeff Schneider
https://arxiv.org/abs/2507.13340 https:/…

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of huma…

@Techmeme@techhub.social
2025-09-18 15:20:43

Q&A with CEO Cristóbal Valenzuela on Runway's "world models" breakthrough, how it differs from typical AI video generation, the Lionsgate partnership, and more (Cristina Criddle/Financial Times)

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 10:02:09

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
Han Yin, Jung-Woo Choi
https://arxiv.org/abs/2509.13148 https:/…

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate the LALM's audio understanding performance, researchers have proposed different benchmarks. However, key aspects for real-world interactions are underexplored in existing benchmarks, i.e., audio signals typically contain both speech and non-speech components, and energy levels of these components can vary signifi…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:21:31

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
Weihang Wang, Xinhao Li, Ziyue Wang, Yan Pang, Jielei Zhang, Peiyi Li, Qiang Zhang, Longwen Gao
https://arxiv.org/abs/2509.13836

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection a…

@arXiv_csIR_bot@mastoxiv.page
2025-07-17 08:06:40

Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control
Anton Klenitskiy, Konstantin Polev, Daria Denisova, Alexey Vasilev, Dmitry Simakov, Gleb Gusev
https://arxiv.org/abs/2507.12202

Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control
Many current state-of-the-art models for sequential recommendations are based on transformer architectures. Interpretation and explanation of such black box models is an important research question, as a better understanding of their internals can help understand, influence, and control their behavior, which is very important in a variety of real-world applications. Recently sparse autoencoders (SAE) have been shown to be a promising unsupervised approach for extracting interpretable features f…

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:00:21

Large Language Models Discriminate Against Speakers of German Dialects
Minh Duc Bui, Carolin Holtermann, Valentin Hofmann, Anne Lauscher, Katharina von der Wense
https://arxiv.org/abs/2509.13835

Large Language Models Discriminate Against Speakers of German Dialects
Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated wi…

@arXiv_csAI_bot@mastoxiv.page
2025-09-18 07:45:41

Imagined Autocurricula
Ahmet H. Güzel, Matthew Thomas Jackson, Jarek Luca Liesen, Tim Rocktäschel, Jakob Nicolaus Foerster, Ilija Bogunovic, Jack Parker-Holder
https://arxiv.org/abs/2509.13341

Imagined Autocurricula
Training agents to act in embodied environments typically requires vast training data or access to accurate simulation, neither of which exists for many cases in the real world. Instead, world models are emerging as an alternative: leveraging offline, passively collected data, they make it possible to generate diverse worlds for training agents in simulation. In this work, we harness world models to generate imagined environments to train robust agents capable of generalizing to novel task varia…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:08:30

Visuomotor Grasping with World Models for Surgical Robots
Hongbin Lin, Bin Li, Kwok Wai Samuel Au
https://arxiv.org/abs/2508.11200 https://arxiv.org/pdf/25…

Visuomotor Grasping with World Models for Surgical Robots
Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents un…

@arXiv_csSE_bot@mastoxiv.page
2025-07-17 09:34:10

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, Zejun Ma
https://arxiv.org/abs/2507.12415

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within au…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 09:36:30

CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
https://arxiv.org/abs/2508.13650 https://

CRISP: Persistent Concept Unlearning via Sparse Autoencoders
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with …

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:18:40

RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection
Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
https://arxiv.org/abs/2508.13878

RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection
Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes …

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:06:31

PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models
Artem Lykov, Jeffrin Sam, Hung Khang Nguyen, Vladislav Kozlovskiy, Yara Mahmoud, Valerii Serpiva, Miguel Altamirano Cabrera, Mikhail Konenkov, Dzmitry Tsetserukou
https://arxiv.org/abs/2509.13903

PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models
We introduce PhysicalAgent, an agentic framework for robotic manipulation that integrates iterative reasoning, diffusion-based video generation, and closed-loop execution. Given a textual instruction, our method generates short video demonstrations of candidate trajectories, executes them on the robot, and iteratively re-plans in response to failures. This approach enables robust recovery from execution errors. We evaluate PhysicalAgent across multiple perceptual modalities (egocentric, third-p…

@arXiv_csSE_bot@mastoxiv.page
2025-09-18 08:46:51

Crash Report Enhancement with Large Language Models: An Empirical Study
S M Farah Al Fahim, Md Nakhla Rafi, Zeyang Ma, Dong Jae Kim, Tse-Hsun (Peter) Chen
https://arxiv.org/abs/2509.13535

Crash Report Enhancement with Large Language Models: An Empirical Study
Crash reports are central to software maintenance, yet many lack the diagnostic detail developers need to debug efficiently. We examine whether large language models can enhance crash reports by adding fault locations, root-cause explanations, and repair suggestions. We study two enhancement strategies: Direct-LLM, a single-shot approach that uses stack-trace context, and Agentic-LLM, an iterative approach that explores the repository for additional evidence. On a dataset of 492 real-world cras…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:21:40

GraphCogent: Overcoming LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding
Rongzheng Wang, Qizhi Chen, Yihong Huang, Yizhuo Ma, Muquan Li, Jiakai Li, Ke Qin, Guangchun Luo, Shuang Liang
https://arxiv.org/abs/2508.12379

GraphCogent: Overcoming LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding
Large language models (LLMs) show promising performance on small-scale graph reasoning tasks but fail when handling real-world graphs with complex queries. This phenomenon stems from LLMs' inability to effectively process complex graph topology and perform multi-step reasoning simultaneously. To address these limitations, we propose GraphCogent, a collaborative agent framework inspired by the human Working Memory Model that decomposes graph reasoning into specialized cognitive processes: sense, buf…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:23:51

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
https://arxiv.org/abs/2509.14760

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety persp…

@arXiv_csHC_bot@mastoxiv.page
2025-09-16 10:08:56

The Siren Song of LLMs: How Users Perceive and Respond to Dark Patterns in Large Language Models
Yike Shi, Qing Xiao, Qing (Diane) Hu, Hong Shen, Hua Shen
https://arxiv.org/abs/2509.10830

The Siren Song of LLMs: How Users Perceive and Respond to Dark Patterns in Large Language Models
Large language models can influence users through conversation, creating new forms of dark patterns that differ from traditional UX dark patterns. We define LLM dark patterns as manipulative or deceptive behaviors enacted in dialogue. Drawing on prior work and AI incident reports, we outline a diverse set of categories with real-world examples. Using them, we conducted a scenario-based study where participants (N=34) compared manipulative and neutral LLM responses. Our results reveal that recog…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:18:21

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
Xiang Tuo, Xu Xuemiao, Liu Bangzhen, Li Jinyi, Li Yong, He Shengfeng
https://arxiv.org/abs/2509.14958

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
The rapid growth of 3D digital content necessitates expandable recognition systems for open-world scenarios. However, existing 3D class-incremental learning methods struggle under extreme data scarcity due to geometric misalignment and texture bias. While recent approaches integrate 3D data with 2D foundation models (e.g., CLIP), they suffer from semantic blurring caused by texture-biased projections and indiscriminate fusion of geometric-textural cues, leading to unstable decision prototypes a…

@arXiv_csLG_bot@mastoxiv.page
2025-08-15 10:08:12

Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation
Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao
https://arxiv.o…

Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation
Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. Th…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 09:56:31

Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach
Piaopiao Jin, Qi Wang, Guokang Sun, Ziwen Cai, Pinjia He, Yangwei You
https://arxiv.org/abs/2509.13774

Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach
Vision-language-action (VLA) models demonstrate strong generalization in robotic manipulation but face challenges in complex, real-world tasks. While supervised fine-tuning with demonstrations is constrained by data quality, reinforcement learning (RL) offers a promising alternative. We propose a human-in-the-loop dual-actor fine-tuning framework grounded in RL. The framework integrates a primary actor for robust multi-task performance with a refinement actor for latent-space adaptation. Beyond…

@Techmeme@techhub.social
2025-08-17 15:55:49

NYC-based Protege, which prepares and sells real-world datasets like lab results and sports footage for AI training, raised a $25M Series A led by Footwork (Natasha Mascarenhas/The Information)
https://www.theinformation.com/articles/one-year-old…

The One-Year-Old Startup Notching Data Deals For Model Makers
Companies like Scale AI and Surge have proven there’s a market for human-labeled data, like professionals’ answers to complex math or law questions, that AI research labs can use to train their models. Now, a young startup is tapping a similar opportunity, linking large model makers and AI ...

@arXiv_csSE_bot@mastoxiv.page
2025-09-19 09:46:31

On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub
Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, Ahmed E. Hassan
https://arxiv.org/abs/2509.14745

On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub
Large language models (LLMs) are increasingly being integrated into software development processes. The ability to generate code and submit pull requests with minimal human intervention, through the use of autonomous AI agents, is poised to become a standard practice. However, little is known about the practical usefulness of these pull requests and the extent to which their contributions are accepted in real-world projects. In this paper, we empirically study 567 GitHub pull requests (PRs) gen…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:16:30

VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
Jailing Lin, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang
https://arxiv.org/abs/2508.13792

VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networ…

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:42:40

HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks
Zhe Chen, Yusheng Liao, Shuyang Jiang, Zhiyuan Zhu, Haolin Li, Yanfeng Wang, Yu Wang
https://arxiv.org/abs/2508.12778

HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks
Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credib…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 08:18:59

ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang
https://arxiv.org/abs/2508.13514

ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a re…

@arXiv_csHC_bot@mastoxiv.page
2025-09-18 08:56:51

DuetUI: A Bidirectional Context Loop for Human-Agent Co-Generation of Task-Oriented Interfaces
Yuan Xu, Shaowen Xiang, Yizhi Song, Ruoting Sun, Xin Tong
https://arxiv.org/abs/2509.13444

DuetUI: A Bidirectional Context Loop for Human-Agent Co-Generation of Task-Oriented Interfaces
Large Language Models are reshaping task automation, yet remain limited in complex, multi-step real-world tasks that require aligning with vague user intent and enabling dynamic user override. From a formative study with 12 participants, we found that end-users actively seek to shape generative interfaces rather than relying on one-shot outputs. To address this, we introduce the human-agent co-generation paradigm, materialized in DuetUI. This LLM-empowered system unfolds alongside task progress…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:25:41

Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies
Luisa Torquato Niño, Hamza A. A. Gardi
https://arxiv.org/abs/2509.15045 https://

Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies
This paper addresses the synthetic-to-real domain gap in object detection, focusing on training a YOLOv11 model to detect a specific object (a soup can) using only synthetic data and domain randomization strategies. The methodology involves extensive experimentation with data augmentation, dataset composition, and model scaling. While synthetic validation metrics were consistently high, they proved to be poor predictors of real-world performance. Consequently, models were also evaluated qualita…

@arXiv_csRO_bot@mastoxiv.page
2025-09-17 10:41:40

Empowering Multi-Robot Cooperation via Sequential World Models
Zijie Zhao, Honglei Guo, Shengqian Chen, Kaixuan Xu, Bo Jiang, Yuanheng Zhu, Dongbin Zhao
https://arxiv.org/abs/2509.13095

Empowering Multi-Robot Cooperation via Sequential World Models
Model-based reinforcement learning (MBRL) has shown significant potential in robotics due to its high sample efficiency and planning capability. However, extending MBRL to multi-robot cooperation remains challenging due to the complexity of joint dynamics. To address this, we propose the Sequential World Model (SeqWM), a novel framework that integrates the sequential paradigm into model-based multi-agent reinforcement learning. SeqWM employs independent, sequentially structured agent-wise world…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 09:58:40

Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun
https://arxiv.org/abs/2508.13993

Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework t…

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:38:50

ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction
Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
https://arxiv.org/abs/2508.12685

ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructi…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 08:23:12

VLMgineer: Vision Language Models as Robotic Toolsmiths
George Jiayuan Gao, Tianyu Li, Junyao Shi, Yihan Li, Zizhe Zhang, Nadia Figueroa, Dinesh Jayaraman
https://arxiv.org/abs/2507.12644

VLMgineer: Vision Language Models as Robotic Toolsmiths
Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast …

@arXiv_csCV_bot@mastoxiv.page
2025-09-16 12:46:47

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He
https://arxiv.org/abs/2509.12201…

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to …

@arXiv_csSE_bot@mastoxiv.page
2025-09-17 08:49:59

When Large Language Models Meet UAVs: How Far Are We?
Yihua Chen, Xingle Que, Jiashuo Zhang, Ting Chen, Guangshun Li, Jiachi Chen
https://arxiv.org/abs/2509.12795 https://

When Large Language Models Meet UAVs: How Far Are We?
The integration of unmanned aerial vehicles (UAVs) and large language models (LLMs) has emerged as a research direction of growing interest, with the potential to address challenges in autonomous decision-making, human-UAV interaction, and real-time adaptability. However, existing studies have remained largely in preliminary exploration with a limited understanding of real-world practice, risking a misalignment between academic research and practical needs and hindering the translation of resul…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:37:01

TextMine: LLM-Powered Knowledge Extraction for Humanitarian Mine Action
Chenyue Zhou, Gürkan Solmaz, Flavio Cirillo, Kiril Gashteovski, Jonathan Fürst
https://arxiv.org/abs/2509.15098

TextMine: LLM-Powered Knowledge Extraction for Humanitarian Mine Action
Humanitarian Mine Action has generated extensive best-practice knowledge, but much remains locked in unstructured reports. We introduce TextMine, an ontology-guided pipeline that uses Large Language Models to extract knowledge triples from HMA texts. TextMine integrates document chunking, domain-aware prompting, triple extraction, and both reference-based and LLM-as-a-Judge evaluation. We also create the first HMA ontology and a curated dataset of real-world demining reports. Experiments show o…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:24:20

EvoPSF: Online Evolution of Autonomous Driving Models via Planning-State Feedback
Jiayue Jin, Lang Qian, Jingyu Zhang, Chuanyu Ju, Liang Song
https://arxiv.org/abs/2508.11453 ht…

EvoPSF: Online Evolution of Autonomous Driving Models via Planning-State Feedback
Recent years have witnessed remarkable progress in autonomous driving, with systems evolving from modular pipelines to end-to-end architectures. However, most existing methods are trained offline and lack mechanisms to adapt to new environments during deployment. As a result, their generalization ability diminishes when faced with unseen variations in real-world driving scenarios. In this paper, we break away from the conventional "train once, deploy forever" paradigm and propose EvoPSF, a nove…

@arXiv_csCV_bot@mastoxiv.page
2025-08-18 09:52:00

ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang
https://arxiv.org/abs/2508.11428

ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios ess…

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:22:32

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
https://arxiv.org/abs/2507.13348

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically proc…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:02:41

Agile in the Face of Delay: Asynchronous End-to-End Learning for Real-World Aerial Navigation
Yude Li, Zhexuan Zhou, Huizhe Li, Youmin Gong, Jie Mei
https://arxiv.org/abs/2509.13816

Agile in the Face of Delay: Asynchronous End-to-End Learning for Real-World Aerial Navigation
Robust autonomous navigation for Autonomous Aerial Vehicles (AAVs) in complex environments is a critical capability. However, modern end-to-end navigation faces a key challenge: the high-frequency control loop needed for agile flight conflicts with low-frequency perception streams, which are limited by sensor update rates and significant computational cost. This mismatch forces conventional synchronous models into undesirably low control rates. To resolve this, we propose an asynchronous reinfo…

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 07:39:39

Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning
Brennen Hill
https://arxiv.org/abs/2509.04731 https://arxiv.…

Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning
The convergence of Language models, Agent models, and World models represents a critical frontier for artificial intelligence. While recent progress has focused on scaling Language and Agent models, the development of sophisticated, explicit World Models remains a key bottleneck, particularly for complex, long-horizon multi-agent tasks. In domains such as robotic soccer, agents trained via standard reinforcement learning in high-fidelity but structurally-flat simulators often fail due to intrac…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:24:41

VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu
https://arxiv.org/abs/2509.14060 ht…

VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhanc…

@arXiv_csCL_bot@mastoxiv.page
2025-07-18 09:59:32

Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
Tyler Loakman, William Thorne, Chenghua Lin
https://arxiv.org/abs/2507.13335

Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across…

@arXiv_csCV_bot@mastoxiv.page
2025-08-15 10:23:22

From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
Tiancheng Han, Yunfei Gao, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao
https://arxiv.org/abs/2508.10770

From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
Spatio-physical reasoning, a foundation capability for understanding the real physics world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform ina…

@arXiv_csSE_bot@mastoxiv.page
2025-07-16 10:04:11

An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling
Anh Nguyen-Duc, Chien Vu Manh, Bao Anh Tran, Viet Phuong Ngo, Luan Le Chi, Anh Quang Nguyen
https://arxiv.org/abs/2507.11272

An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling
This paper presents MARAUS (Multi-Agent and Retrieval-Augmented University Admission System), a real-world deployment of a conversational AI platform for higher education admissions counseling in Vietnam. While large language models (LLMs) offer potential for automating advisory tasks, most existing solutions remain limited to prototypes or synthetic benchmarks. MARAUS addresses this gap by combining hybrid retrieval, multi-agent orchestration, and LLM-based generation into a system tailored fo…

@arXiv_csRO_bot@mastoxiv.page
2025-07-16 10:22:31

Acting and Planning with Hierarchical Operational Models on a Mobile Robot: A Study with RAE+UPOM
Oscar Lima, Marc Vinci, Sunandita Patra, Sebastian Stock, Joachim Hertzberg, Martin Atzmueller, Malik Ghallab, Dana Nau, Paolo Traverso
https://arxiv.org/abs/2507.11345

Acting and Planning with Hierarchical Operational Models on a Mobile Robot: A Study with RAE+UPOM
Robotic task execution faces challenges due to the inconsistency between symbolic planner models and the rich control structures actually running on the robot. In this paper, we present the first physical deployment of an integrated actor-planner system that shares hierarchical operational models for both acting and planning, interleaving the Reactive Acting Engine (RAE) with an anytime UCT-like Monte Carlo planner (UPOM). We implement RAE+UPOM on a mobile manipulator in a real-world deployment…

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 10:38:00

Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
https://arxiv.org/abs/2509.13244

Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with co…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:25:51

An Exploratory Study on Abstract Images and Visual Representations Learned from Them
Haotian Li, Jianbo Jiao
https://arxiv.org/abs/2509.14149 https://arxiv…

An Exploratory Study on Abstract Images and Visual Representations Learned from Them
Imagine living in a world composed solely of primitive shapes, could you still recognise familiar objects? Recent studies have shown that abstract images, constructed by primitive shapes, can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content…

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 08:49:31

Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
https://arxiv.org/abs/2509.13664 https://

Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model's pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding …

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 10:40:30

Towards General Agentic Intelligence via Environment Scaling
Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
https://arxiv.org/abs/2509.13311

Towards General Agentic Intelligence via Environment Scaling
Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic i…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:25:41

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaey…

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applicati…
