Tootfinder

@arXiv_physicsatomph_bot@mastoxiv.page
2025-12-05 12:41:50

Replaced article(s) found for physics.atom-ph. https://arxiv.org/list/physics.atom-ph/new
[1/1]:
- A Platform for Evanescently Trapping Rb-87 Using Silicon Nitride Strip Waveguides Buried in Silica
Sam J. Harding, Carrie Weidner
https://arxiv.org/abs/2512.01624 https://mastoxiv.page/@arXiv_physicsatomph_bot/115649278786929361
- Shell formation and two-dimensional nanofriction in three-dimensional ion Coulomb crystals
L.-A. Rüffert, T. E. Mehlstäubler
https://arxiv.org/abs/2512.03833 https://mastoxiv.page/@arXiv_physicsatomph_bot/115660464251377360
- Classical and Quantum Beam Dynamics Simulation of the RF Photoinjector Test Bench at JINR
Dyatlov, Afanasyev, Kobets, Levichev, Maksimov, Nikiforov, Nozdrin, Sibiryakova, Yunenko, Karlovets
https://arxiv.org/abs/2509.00732 https://mastoxiv.page/@arXiv_physicsaccph_bot/115139328955562484

@pre@boing.world
2025-12-06 12:43:08

You see a detective on the TV and he’s interviewing all the suspects asking them what they were doing on the night of the murder a month ago last Tuesday night.
And on the TV, the suspects all know. Right away.
If you asked me ten years ago though, I’d have had barely any clue. If you’re lucky it’d have been something planned in my calendar but mostly, dunno. Watching TV maybe? No idea what show. Was that a night I was in the pub?
As we all get older this problem increases I’m told. Eventually full on senility sets in.
But what if you have already built the habit to record what you’re doing? To be able to look back and revise and review how you spent your days? An external aid as a crutch to your own forgetful brain’s cortex?
So I started this Exocortex Log over a decade ago and now I can answer: Ten years ago on Tuesday I was having dinner with the guitarist from my band and his girlfriend and they burned the pudding.
The app has been half finished and barely able to even record let alone review for most of that time, but now it’s ready enough that someone else might use it too if they want.
Try it out: #lifeLog #app #memoryAid

@whitequark@mastodon.social
2025-12-02 21:48:53

> try to test my software
> it uses codeberg.org for a test checkout
> /info/refs?service=git-upload-pack": dial tcp 217.197.84.140:443: i/o timeout
> open codeberg status page
> it times out
What am i supposed to do now. Touch grass?

@arXiv_csLG_bot@mastoxiv.page
2025-12-22 13:54:55

Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[4/5]:
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
Gonçalo Faria, Noah A. Smith
https://arxiv.org/abs/2504.03790 https://mastoxiv.page/@arXiv_csCL_bot/114301112970577326
- A Survey on Archetypal Analysis
Aleix Alcacer, Irene Epifanio, Sebastian Mair, Morten Mørup
https://arxiv.org/abs/2504.12392 https://mastoxiv.page/@arXiv_statME_bot/114357826909813483
- The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations
Michael L. Wells, Kamel Lahouel, Bruno Jedynak
https://arxiv.org/abs/2505.11622 https://mastoxiv.page/@arXiv_statML_bot/114539065460187982
- BOLT: Block-Orthonormal Lanczos for Trace estimation of matrix functions
Kingsley Yeon, Promit Ghosal, Mihai Anitescu
https://arxiv.org/abs/2505.12289 https://mastoxiv.page/@arXiv_mathNA_bot/114539035462135281
- Clustering and Pruning in Causal Data Fusion
Otto Tabell, Santtu Tikka, Juha Karvanen
https://arxiv.org/abs/2505.15215 https://mastoxiv.page/@arXiv_statML_bot/114550346291754635
- On the performance of multi-fidelity and reduced-dimensional neural emulators for inference of ph...
Chloe H. Choi, Andrea Zanoni, Daniele E. Schiavazzi, Alison L. Marsden
https://arxiv.org/abs/2506.11683 https://mastoxiv.page/@arXiv_statML_bot/114692410563481289
- Beyond Force Metrics: Pre-Training MLFFs for Stable MD Simulations
Maheshwari, Tang, Ock, Kolluru, Farimani, Kitchin
https://arxiv.org/abs/2506.14850 https://mastoxiv.page/@arXiv_physicschemph_bot/114709402590755731
- Quantifying Uncertainty in the Presence of Distribution Shifts
Yuli Slavutsky, David M. Blei
https://arxiv.org/abs/2506.18283 https://mastoxiv.page/@arXiv_statML_bot/114738165218533987
- ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models
Mina Namazi, Alexander Nemecek, Erman Ayday
https://arxiv.org/abs/2506.20915 https://mastoxiv.page/@arXiv_csCR_bot/114754394485208892
- SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
Zhao, Huang, Xue, Kong, Liu, Tang, Beers, Ting, Luo
https://arxiv.org/abs/2507.01939 https://mastoxiv.page/@arXiv_astrophIM_bot/114788369702591337
- Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based I...
Ko Watanabe, Stanislav Frolov, Aya Hassan, David Dembinsky, Adriano Lucieri, Andreas Dengel
https://arxiv.org/abs/2507.17860 https://mastoxiv.page/@arXiv_csCV_bot/114912976717523345
- PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
https://arxiv.org/abs/2508.10501 https://mastoxiv.page/@arXiv_csAI_bot/115032101532614110
- Unified Acoustic Representations for Screening Neurological and Respiratory Pathologies from Voice
Ran Piao, Yuan Lu, Hareld Kemps, Tong Xia, Aaqib Saeed
https://arxiv.org/abs/2508.20717 https://mastoxiv.page/@arXiv_csSD_bot/115111255835875066
- Machine Learning-Driven Predictive Resource Management in Complex Science Workflows
Tasnuva Chowdhury, et al.
https://arxiv.org/abs/2509.11512 https://mastoxiv.page/@arXiv_csDC_bot/115213444524490263
- MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair
Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand, Joey Dodds, Daniel Kroening
https://arxiv.org/abs/2509.16187 https://mastoxiv.page/@arXiv_csSE_bot/115247172280557686
- Automated Machine Learning Pipeline: Large Language Models-Assisted Automated Dataset Generation ...
Adam Lahouari, Jutta Rogal, Mark E. Tuckerman
https://arxiv.org/abs/2509.21647 https://mastoxiv.page/@arXiv_condmatmtrlsci_bot/115286737423175311
- Quantifying the Impact of Structured Output Format on Large Language Models through Causal Inference
Han Yuan, Yue Zhao, Li Zhang, Wuqiong Luo, Zheng Ma
https://arxiv.org/abs/2509.21791 https://mastoxiv.page/@arXiv_csCL_bot/115287166674809413
- The Generation Phases of Flow Matching: a Denoising Perspective
Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
https://arxiv.org/abs/2510.24830 https://mastoxiv.page/@arXiv_csCV_bot/115462527449411627
- Data-driven uncertainty-aware seakeeping prediction of the Delft 372 catamaran using ensemble Han...
Giorgio Palma, Andrea Serani, Matteo Diez
https://arxiv.org/abs/2511.04461 https://mastoxiv.page/@arXiv_eessSY_bot/115507785247809767
- Generalized infinite dimensional Alpha-Procrustes based geometries
Salvish Goomanee, Andi Han, Pratik Jawanpuria, Bamdev Mishra
https://arxiv.org/abs/2511.09801 https://mastoxiv.page/@arXiv_statML_bot/115547135711272091

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:40:51

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer
https://arxiv.org/abs/2510.12516

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. T…
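As a rough illustration of the two kinds of method the abstract names, the sketch below contrasts Majority Voting with Best-of-N selection over a set of sampled outputs. The function names, the toy labels, and the scoring function are hypothetical and are not taken from the paper; how the authors adapt these methods to the LeWiDi-2025 tasks is not shown here.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(samples: Sequence[str]) -> str:
    """Return the most frequent sample; ties go to the answer seen first."""
    counts = Counter(samples)
    return counts.most_common(1)[0][0]


def best_of_n(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the sample ranked highest by a scoring function.

    `score` is a stand-in (an assumption) for whatever reward or
    verifier signal a real system would use.
    """
    return max(samples, key=score)


if __name__ == "__main__":
    sampled = ["agree", "disagree", "agree", "neutral"]
    print(majority_vote(sampled))
    # Toy scorer: string length, used only to make the example runnable.
    print(best_of_n(sampled, score=len))
```

Majority Voting needs only the samples themselves, while Best-of-N depends entirely on the quality of the scorer.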

@arXiv_csLG_bot@mastoxiv.page
2025-12-22 11:50:43

Crosslisted article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[3/3]:
- Fraud detection in credit card transactions using Quantum-Assisted Restricted Boltzmann Machines
João Marcos Cavalcanti de Albuquerque Neto, Gustavo Castro do Amaral, Guilherme Penello Temporão
https://arxiv.org/abs/2512.17660 https://mastoxiv.page/@arXiv_quantph_bot/115762703945731580
- Vidarc: Embodied Video Diffusion Model for Closed-loop Control
Feng, Xiang, Mao, Tan, Zhang, Huang, Zheng, Liu, Su, Zhu
https://arxiv.org/abs/2512.17661 https://mastoxiv.page/@arXiv_csRO_bot/115762650859932523
- Imputation Uncertainty in Interpretable Machine Learning Methods
Pegah Golchian, Marvin N. Wright
https://arxiv.org/abs/2512.17689 https://mastoxiv.page/@arXiv_statML_bot/115762577479255577
- Revisiting the Broken Symmetry Phase of Solid Hydrogen: A Neural Network Variational Monte Carlo ...
Shengdu Chai, Chen Lin, Xinyang Dong, Yuqiang Li, Wanli Ouyang, Lei Wang, X. C. Xie
https://arxiv.org/abs/2512.17703 https://mastoxiv.page/@arXiv_condmatstrel_bot/115762481116668454
- Breast Cancer Neoadjuvant Chemotherapy Treatment Response Prediction Using Aligned Longitudinal M...
Rahul Ravi, Ruizhe Li, Tarek Abdelfatah, Stephen Chan, Xin Chen
https://arxiv.org/abs/2512.17759 https://mastoxiv.page/@arXiv_eessIV_bot/115762481771898369
- MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Im...
Roy, Kirchhoff, Ulrich, Rokuss, Wald, Isensee, Maier-Hein
https://arxiv.org/abs/2512.17774 https://mastoxiv.page/@arXiv_eessIV_bot/115762492258209812
- Domain-Aware Quantum Circuit for QML
Gurinder Singh, Thaddeus Pellegrini, Kenneth M. Merz, Jr
https://arxiv.org/abs/2512.17800 https://mastoxiv.page/@arXiv_quantph_bot/115762723607200478
- Visually Prompted Benchmarks Are Surprisingly Fragile
Feng, Lian, Dunlap, Shu, Wang, Wang, Darrell, Suhr, Kanazawa
https://arxiv.org/abs/2512.17875 https://mastoxiv.page/@arXiv_csCV_bot/115762781936221554
- Learning vertical coordinates via automatic differentiation of a dynamical core
Tim Whittaker, Seth Taylor, Elsa Cardoso-Bihlo, Alejandro Di Luca, Alex Bihlo
https://arxiv.org/abs/2512.17877 https://mastoxiv.page/@arXiv_physicsaoph_bot/115762405092703069
- RadarGen: Automotive Radar Point Cloud Generation from Cameras
Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany
https://arxiv.org/abs/2512.17897 https://mastoxiv.page/@arXiv_csCV_bot/115762783246540528
- Distributionally Robust Imitation Learning: Layered Control Architecture for Certifiable Autonomy
Gahlawat, Aboudonia, Banik, Hovakimyan, Matni, Ames, Zardini, Speranzon
https://arxiv.org/abs/2512.17899 https://mastoxiv.page/@arXiv_eessSY_bot/115762532257741954
- Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
Ananta R. Bhattarai, Helge Rhodin
https://arxiv.org/abs/2512.17908 https://mastoxiv.page/@arXiv_csCV_bot/115762785868778349

@theodric@social.linux.pizza
2025-10-21 18:42:15

Very good overview of the effectiveness of Honeywell PTM7950 phase-change thermal compound compared to traditional thermal pastes https://www.igorslab.de/en/overhyped-honeywell-ptm7950-in-lab-test-and-as-game-changer-for-graphics-cards/4…

Not over-hyped: Honeywell PTM7950 in a lab test and a real game changer for graphics cards | Page 4 | igor´sLAB
Many manufacturers have been experimenting with the Honeywell PTM7950 thermal pad and graphene pads on graphics cards for some time now. With good success, as AMD board partners in particular (and AMD…

@arXiv_csCV_bot@mastoxiv.page
2025-10-13 10:35:30

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
https://arxiv.org/abs/2510.09473 https://…

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the …

@jredlund@social.linux.pizza
2025-10-22 18:59:02

Squirrels at War: First Strike
#writingcommunity #visualinspiration Earlier installments can be read from this page. *** Normally with a new ship we would go on a shakedown cruise for several weeks to test systems and fix bugs. We didn’t have the time. We had to hope t…

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 10:08:10

Titans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model
Gavriel Di Nepi, Federico Siciliano, Fabrizio Silvestri
https://arxiv.org/abs/2510.09551

Titans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model
By the end of 2024, Google researchers introduced Titans: Learning at Test Time, a neural memory model achieving strong empirical results across multiple tasks. However, the lack of publicly available code and ambiguities in the original description hinder reproducibility. In this work, we present a lightweight reimplementation of Titans and conduct a comprehensive evaluation on Masked Language Modeling, Time Series Forecasting, and Recommendation tasks. Our results reveal that Titans does not …

@arXiv_csSE_bot@mastoxiv.page
2025-10-13 09:20:10

Search-based Hyperparameter Tuning for Python Unit Test Generation
Stephan Lukasczyk, Gordon Fraser
https://arxiv.org/abs/2510.08716 https://arxiv.org/pdf/…

Search-based Hyperparameter Tuning for Python Unit Test Generation
Search-based test-generation algorithms have countless configuration options. Users rarely adjust these options and usually stick to the default values, which may not lead to the best possible results. Tuning an algorithm's hyperparameters is a method to find better hyperparameter values, but it typically comes with a high demand of resources. Meta-heuristic search algorithms -- that effectively solve the test-generation problem -- have been proposed as a solution to also efficiently tune param…

@arXiv_csHC_bot@mastoxiv.page
2025-10-15 10:02:51

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
Minjae Lee, Minsuk Kahng
https://arxiv.org/abs/2510.12728 https://arxiv.org/pdf/2510.1272…

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem. We operationalize this paradigm in an int…

@arXiv_statME_bot@mastoxiv.page
2025-10-15 08:55:31

A Martingale Kernel Two-Sample Test
Anirban Chatterjee, Aaditya Ramdas
https://arxiv.org/abs/2510.11853 https://arxiv.org/pdf/2510.11853

A Martingale Kernel Two-Sample Test
The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution typically requiring costly resampling or permutation approaches for calibration. In this work we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover we show that the test is…
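For context, the standard quadratic-time unbiased estimate of the squared MMD can be sketched as follows. This is the classical estimator, not the martingale statistic (mMMD) proposed in the paper, whose definition is not given in the truncated abstract. The Gaussian kernel, the `bandwidth` parameter, and all function names are assumptions chosen for illustration.

```python
import math
from typing import Sequence

Point = Sequence[float]


def gaussian_kernel(a: Point, b: Point, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two points of equal dimension."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs: Sequence[Point], ys: Sequence[Point],
                  bandwidth: float = 1.0) -> float:
    """Unbiased estimate of squared MMD; it can be slightly negative."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample terms exclude the diagonal (i == j).
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j)
    # Cross term uses every pair (x, y).
    k_xy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


if __name__ == "__main__":
    near = [[0.0], [0.1]]
    far = [[5.0], [5.1]]
    print(mmd2_unbiased(near, far))
```

Calibrating a test from this estimate is the step that usually requires permutation or resampling, which is the cost the abstract refers to.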

@arXiv_csCR_bot@mastoxiv.page
2025-10-14 11:48:48

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh
https://arxiv.org/abs/2510.10281 htt…

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlik…

@arXiv_csRO_bot@mastoxiv.page
2025-10-08 10:05:09

Verifier-free Test-Time Sampling for Vision Language Action Models
Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
https://arxiv.org/abs/2510.05681 https:/…

Verifier-free Test-Time Sampling for Vision Language Action Models
Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, they remain fundamentally limited in tasks that require high precision due to their single-inference paradigm. While test-time scaling approaches using external verifiers have shown promise, they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the…

@publicvoit@graz.social
2025-12-13 23:14:30

If you're using #lazyblorg as your static website generator: I've updated the project today.
It now uses "uv" for dependency management, script invocation and unit test execution. Furthermore, I adapted the code to match the #pandoc version of Debian 13 Trixie.
Although you ne…

lazyblorg
Tag page for tag lazyblorg

@arXiv_csCY_bot@mastoxiv.page
2025-10-13 09:09:10

Non-traditional data in pandemic preparedness and response: identifying and addressing first and last-mile challenges
Mattia Mazzoli, Irma Varela-Lasheras, Sonia Namorado, Constantino Pereira Caetano, Andreia Leite, Lisa Hermans, Niel Hens, Polen Türkmen, Kyriaki Kalimeri, Leo Ferres, Ciro Cattuto, Daniela Paolotti, Stefaan Verhulst
https://

Non-traditional data in pandemic preparedness and response: identifying and addressing first and last-mile challenges
The pandemic served as an important test case of complementing traditional public health data with non-traditional data (NTD) such as mobility traces, social media activity, and wearables data to inform decision-making. Drawing on an expert workshop and a targeted survey of European modelers, we assess the promise and persistent limitations of such data in pandemic preparedness and response. We distinguish between "first-mile" (accessing and harmonizing data) and "last-mile" challenges (transla…

@arXiv_statML_bot@mastoxiv.page
2025-10-10 09:37:19

PAC Learnability in the Presence of Performativity
Ivan Kirev, Lyuben Baltadzhiev, Nikola Konstantinov
https://arxiv.org/abs/2510.08335 https://arxiv.org/p…

PAC Learnability in the Presence of Performativity
Following the wide-spread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are…

@arXiv_astrophHE_bot@mastoxiv.page
2025-10-15 09:14:01

DIPLODOCUS II: Implementation of transport equations and test cases relevant to micro-scale physics of jetted astrophysical sources
Christopher N. Everett, Marc Klinger-Plaisier, Garret Cotter
https://arxiv.org/abs/2510.12505

DIPLODOCUS II: Implementation of transport equations and test cases relevant to micro-scale physics of jetted astrophysical sources
DIPLODOCUS (Distribution-In-PLateaux methODOlogy for the CompUtation of transport equationS) is a novel framework being developed for the general transport of particle distribution functions through the seven dimensions of phase space, including forcing terms and interactions between particles. Following Paper I, which details the background analytic framework, this second paper provides an overview of the numerical implementation in the form of the code package Diplodocus.jl, written in Julia,…

@arXiv_grqc_bot@mastoxiv.page
2025-10-10 09:47:09

Effects of magnetic fields on spinning test particles orbiting Kerr-Bertotti-Robinson black holes
Yu-Kun Zhang, Shao-Wen Wei
https://arxiv.org/abs/2510.07914 https://

Effects of magnetic fields on spinning test particles orbiting Kerr-Bertotti-Robinson black holes
In this paper, we study the kinematic effects of spinning test particles orbiting the Kerr-Bertotti-Robinson black hole. Employing the Mathisson-Papapetrou-Dixon equations, we explore the dynamics of precessing orbits and distinct orbital types, including circular orbits and innermost stable circular orbits. Our results reveal the substantial impact of the magnetic field on the trajectories of spinning particles, particularly in regions characterized by significant radial distances. More i…

@arXiv_csDB_bot@mastoxiv.page
2025-10-09 08:35:41

Automated Discovery of Test Oracles for Database Management Systems Using LLMs
Qiuyang Mang, Runyuan He, Suyang Zhong, Xiaoxuan Liu, Huanchen Zhang, Alvin Cheung
https://arxiv.org/abs/2510.06663

Automated Discovery of Test Oracles for Database Management Systems Using LLMs
Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely-used systems. A cornerstone of these techniques is test oracle, which typically implements a mechanism to generate equivalent query pairs, thereby identifying bugs by checking the consistency between their results. However, while applying these oracles can be automated, their design remains a fundamentally manual endeavor. This paper explores the use of large language mode…
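The idea of an oracle built on equivalent query pairs can be sketched with Python's built-in `sqlite3` module. The table `t`, the two queries, and the helper names are hypothetical and hand-written for illustration; they are not oracles produced by the approach in the paper.

```python
import sqlite3


def results_agree(conn: sqlite3.Connection, query_a: str, query_b: str) -> bool:
    """Run two queries that should be equivalent and compare their rows."""
    rows_a = sorted(conn.execute(query_a).fetchall(), key=repr)
    rows_b = sorted(conn.execute(query_b).fetchall(), key=repr)
    return rows_a == rows_b


def demo() -> bool:
    """Check one hand-written equivalent pair on a small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t (x) VALUES (?)",
                     [(0,), (1,), (2,), (3,), (None,)])
    # Count matching rows through a WHERE filter ...
    filtered = "SELECT COUNT(*) FROM t WHERE x > 1"
    # ... and again by evaluating the predicate on every row.
    per_row = ("SELECT COALESCE(SUM(CASE WHEN x > 1 THEN 1 ELSE 0 END), 0) "
               "FROM t")
    agree = results_agree(conn, filtered, per_row)
    conn.close()
    return agree


if __name__ == "__main__":
    # A disagreement here would point to a possible bug in the engine.
    print(demo())
```

Running the comparison is easy to automate; designing pairs that are truly equivalent (including for NULL values) is the manual part the abstract describes.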

@knurd42@social.linux.pizza
2025-10-12 09:45:59

PSA for users that regularly test #Fedora Beta as well as proposed updates once the new version was released:
Do not enable updates-testing[1] by modifying /etc/yum.repos.d/fedora-updates-testing.repo; instead do it like this:
$ sudo dnf config-manager setopt updates-testing.enabled=true
Otherwise updates-testing will be disabled shortly before the release of a new version (t…

Screenshot from the top of the linked page

@arXiv_quantph_bot@mastoxiv.page
2025-10-10 11:19:49

Guess your neighbor's input: Quantum advantage in Feige's game
Simon Schmidt, Sigurd A. L. Storgaard, Michael Walter, Yuming Zhao
https://arxiv.org/abs/2510.08484 https:…

Guess your neighbor's input: Quantum advantage in Feige's game
In this article, we study a nonlocal game with two questions and three answers per player, which was first considered by Feige in 1991, and show that there is quantum advantage in this game. We prove that the game is a robust self-test for the $3$-dimensional maximally entangled state. Furthermore, we show that the game can be seen as the "or" of two games that each do not have quantum advantage. Lastly, we investigate the behavior of the game with respect to parallel repetition in the classica…

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-14 09:56:18

Fast radio bursts shed light on direct gravity test on cosmological scales
Shuren Zhou, Pengjie Zhang
https://arxiv.org/abs/2510.11022 https://arxiv.org/pd…

Fast radio bursts shed light on direct gravity test on cosmological scales
A key measure of gravity is the relation between the Weyl potential $Ψ+Φ$ and the matter overdensity $δ_m$, capsulized as an effective gravitational constant $G_{\rm light}$ for light motion. Its value, together with the possible spatial and temporal variation, is essential in probing physics beyond Einstein gravity. However, the lack of an unbiased proxy of $δ_m$ prohibits direct measurement of $G_{\rm light}$. We point out that the equivalence principle ensures the dispersion measure (DM)…

@arXiv_mathAG_bot@mastoxiv.page
2025-10-08 08:35:19

Mirror symmetry for singular double cover Calabi--Yau varieties: quantum test
Tsung-Ju Lee, Bong H. Lian, Shing-Tung Yau
https://arxiv.org/abs/2510.05470 https://

Mirror symmetry for singular double cover Calabi--Yau varieties: quantum test
We continue our study on the pairs of singular Calabi--Yau varieties arising from double covers over semi-Fano toric manifolds. In this paper, we first investigate singular CY double covers of $\mathbb{P}^{3}$ branched along (1) a union of eight hyperplanes in general position, and (2) a union of four hyperplanes and a quartic in generation. Our previous construction produces hypothetical singular mirror partners. We prove that they are mirror pairs in the sense that the $B$-model of one (v…

@arXiv_astrophEP_bot@mastoxiv.page
2025-10-13 08:53:40

Probing the geological setting of exoplanets through atmospheric analysis: using Mars as a test case
Monica Rainer, Evandro Balbi, Francesco Borsa, Paola Cianfarra, Avet Harutyunyan, Silvano Tosi
https://arxiv.org/abs/2510.09305

Probing the geological setting of exoplanets through atmospheric analysis: using Mars as a test case
One of the frontier research fields of exoplanetary science is the study of the composition and variability of exoplanetary atmospheres. This field is now moving from the gas giant planets towards the smaller and colder telluric planets, and future instruments like ANDES will focus on the observations of the atmosphere of telluric planets in the habitable zone in reflected light. These future observations will possibly find variable signals due to the view of different hemispheres of the planet…

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:38:31

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi
https://arxiv.org/abs/2510.12463

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whethe…

@arXiv_mathNT_bot@mastoxiv.page
2025-10-10 08:22:18

The $n^{th}$ centered moments of a large orthogonal family of automorphic $L$-functions
Vorrapan Chandee, Yoonbok Lee, Xiannan Li
https://arxiv.org/abs/2510.07647 https://

The $n^{th}$ centered moments of a large orthogonal family of automorphic $L$-functions
We obtain the $n$th centered moments of one level densities of a large orthogonal family of $L$-functions associated with holomorphic Hecke newforms of level $q$, averaged over $q\sim Q$. We verify the Katz-Sarnak conjecture for these statistics, in the range where the sum of the supports of the Fourier transforms of test functions lies in $(-4, 4)$. In so doing, we need to understand certain phantom oversized terms, which allow us to extract the right off-diagonal contributions. We further nee…

@arXiv_csIR_bot@mastoxiv.page
2025-10-07 09:54:12

Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs
Lukas Gienapp, Martin Potthast, Harrisen Scells, Eugene Yang
https://arxiv.org/abs/2510.04633 https://

Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs
The unjudged document problem, where pooled test collections have incomplete relevance judgments for evaluating new retrieval systems, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard to deal with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, including the use of large language models (LLMs) as a relevance judge (LLM-as-a-judge). However, this has been criticized as circular, since …

@arXiv_astrophSR_bot@mastoxiv.page
2025-10-13 08:02:50

How precisely can we measure the ages of subgiant and giant stars?
Cheyanne Shariat, Kareem El-Badry, Soumyadeep Bhattacharjee
https://arxiv.org/abs/2510.08675 https://

How precisely can we measure the ages of subgiant and giant stars?
Precise stellar ages are fundamental to Galactic archaeology. However, obtaining reliable age estimates and uncertainties for field stars has been a long-standing challenge. We test the fidelity of ages from recent catalogs of giants and subgiants using wide binaries, whose components formed at the same time and thus should have consistent inferred ages. We find that subgiant ages based on spectroscopic metallicities from Xiang & Rix (2022) are generally consistent within their reported uncerta…

@arXiv_mathST_bot@mastoxiv.page
2025-10-07 09:23:32

Asymptotic distributions of four linear hypotheses test statistics under generalized spiked model
Zhijun Liu, Jiang Hu, Zhidong Bai, Zhihui Lv
https://arxiv.org/abs/2510.04185 h…

Asymptotic distributions of four linear hypotheses test statistics under generalized spiked model
In this paper, we establish the Central Limit Theorem (CLT) for linear spectral statistics (LSSs) of large-dimensional generalized spiked sample covariance matrices, where the spiked eigenvalues may be either bounded or diverge to infinity. Building upon this theorem, we derive the asymptotic distributions of linear hypothesis test statistics under the generalized spiked model, including Wilks' likelihood ratio test statistic U, the Lawley-Hotelling trace test statistic W, and the Bartlett-Nand…

@arXiv_csSE_bot@mastoxiv.page
2025-10-13 09:55:00

Constraint-Guided Unit Test Generation for Machine Learning Libraries
Lukas Krodinger, Altin Hajdari, Stephan Lukasczyk, Gordon Fraser
https://arxiv.org/abs/2510.09108 https://

Constraint-Guided Unit Test Generation for Machine Learning Libraries
Machine learning (ML) libraries such as PyTorch and TensorFlow are essential for a wide range of modern applications. Ensuring the correctness of ML libraries through testing is crucial. However, ML APIs often impose strict input constraints involving complex data structures such as tensors. Automated test generation tools such as Pynguin are not aware of these constraints and often create non-compliant inputs. This leads to early test failures and limited code coverage. Prior work has investig…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:28:19

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Zhiyuan Yu, Qipeng Guo, Xuanjing Huang, Xipeng Qiu
https://arxiv.org/abs/2510.06014

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
Test-time scaling has emerged as a transformative paradigm for enhancing the performance of large reasoning models, enabling dynamic allocation of computational resources during inference. However, as the landscape of reasoning models rapidly expands, a critical question remains: how can we systematically compare and evaluate the test-time scaling capabilities across different models? In this paper, we introduce ARISE (Adaptive Resolution-aware Scaling Evaluation), a novel metric specifically d…

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:47:01

GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang
https://arxiv.org/abs/2510.07217

GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods opera…

@arXiv_statME_bot@mastoxiv.page
2025-10-14 11:16:39

A Kolmogorov-Smirnov-Type Test for Dependently Double-Truncated Data
Anne-Marie Toparkus, Rafael Weissbach
https://arxiv.org/abs/2510.11517 https://arxiv.o…

A Kolmogorov-Smirnov-Type Test for Dependently Double-Truncated Data
With double-truncated lifespans, we test the hypothesis of a parametric distribution family for the lifespan. The typical finding from demography is a non-stationary behaviour of the life expectancy, and a copula models the resulting weak dependence of lifespan and the age at truncation. Our main example is the Farlie-Gumbel-Morgenstern copula. The test is based on Donsker-class arguments and the functional delta method for empirical processes. The assumptions also allow parametric inference, …

@arXiv_csHC_bot@mastoxiv.page
2025-10-15 09:56:52

Gauging the Competition: Understanding Social Comparison and Anxiety through Eye-tracking in Virtual Reality Group Interview
Shi-Ting Ni, Kairong Fang, Yuyang Wang, Pan Hui
https://arxiv.org/abs/2510.12590

Gauging the Competition: Understanding Social Comparison and Anxiety through Eye-tracking in Virtual Reality Group Interview
Virtual Reality (VR) is a promising tool for interview training, yet the psychological dynamics of group interviews, such as social comparison, remain underexplored. We investigate this phenomenon by developing an immersive VR group interview system and conducting an eye-tracking study with 73 participants. We manipulated peer performance using ambiguous behavioral cues (e.g., hand-raising) and objective information (public test scores) to measure their effect on participants' attention and sel…

@arXiv_csCR_bot@mastoxiv.page
2025-10-08 09:18:49

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
https://arxiv.org/abs/2510.05379

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Recent advancements in jailbreaking large language models (LLMs), such as AutoDAN-Turbo, have demonstrated the power of automated strategy discovery. AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt, which may not fully exploit the potential of the learned strategy library. In this paper, we propose to…

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:40:39

LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
Harshil Vejendla
https://arxiv.org/abs/2510.05530 https://arxiv.org/pdf…

LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
Test-time adaptation (TTA) aims to adapt a pretrained model to distribution shifts using only unlabeled test data. While promising, existing methods like Tent suffer from instability and can catastrophically forget the source knowledge, especially with small batch sizes or challenging corruptions. We argue that this arises from overly deterministic updates on a complex loss surface. In this paper, we introduce Langevin-Anchored Test-Time Adaptation (LATTA), a novel approach that regularizes ada…

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:43:40

Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Sondos Mahmoud Bsharat, Zhiqiang Shen
https://arxiv.org/abs/2510.09599 https://arxi…

Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually se…

@arXiv_physicsatomph_bot@mastoxiv.page
2025-11-21 12:06:55

Replaced article(s) found for physics.atom-ph. https://arxiv.org/list/physics.atom-ph/new
[1/1]:
- Experiment to test one of the incompleteness of quantum mechanics
Michel Gondran (AFBL), Alexandre Gondran (ENAC)

@arXiv_csRO_bot@mastoxiv.page
2025-10-13 10:04:10

Bridging Research and Practice in Simulation-based Testing of Industrial Robot Navigation Systems
Sajad Khatiri, Francisco Eli Vina Barrientos, Maximilian Wulf, Paolo Tonella, Sebastiano Panichella
https://arxiv.org/abs/2510.09396

Bridging Research and Practice in Simulation-based Testing of Industrial Robot Navigation Systems
Ensuring robust robotic navigation in dynamic environments is a key challenge, as traditional testing methods often struggle to cover the full spectrum of operational requirements. This paper presents the industrial adoption of Surrealist, a simulation-based test generation framework originally for UAVs, now applied to the ANYmal quadrupedal robot for industrial inspection. Our method uses a search-based algorithm to automatically generate challenging obstacle avoidance scenarios, uncovering fa…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 08:07:37

Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Muhammad Maaz, Liam DeVoe, Zac Hatfield-Dodds, Nicholas Carlini
https://arxiv.org/abs/2510.09907 https:/…

Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, e.g. by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cros…
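
The loop described above (generate inputs from a declared domain, call the property, stop at the first counterexample) can be sketched in a few lines. This is only an illustration under stated assumptions: `find_counterexample`, `gen_int_list`, and the two example properties are hypothetical names written for this sketch. They are not the API of any real PBT framework, nor the agent described in the paper.

```python
import random


def find_counterexample(prop, gen, trials=1000, seed=0):
    """Search for an input that falsifies `prop` (hypothetical helper).

    `gen` draws one input from the declared input domain using `rng`;
    `prop` returns True when the property holds for that input.
    Returns the first falsifying input, or None if none was found.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def gen_int_list(rng):
    """Input domain: lists of 0 to 10 integers in [-100, 100]."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]


def sort_is_idempotent(xs):
    """A property that should always hold."""
    return sorted(sorted(xs)) == sorted(xs)


def reverse_is_identity(xs):
    """A deliberately false property, to show a counterexample being found."""
    return list(reversed(xs)) == xs


print(find_counterexample(sort_is_idempotent, gen_int_list))   # no counterexample expected
print(find_counterexample(reverse_is_identity, gen_int_list))  # some non-palindromic list
```

Real frameworks add input shrinking and smarter generation on top of this basic search; the sketch leaves both out.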

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:34:39

Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, Junxian He
https://arxiv.org/abs/2510.06135 htt…

Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as \emph{asymmetric v…

@arXiv_statML_bot@mastoxiv.page
2025-10-10 08:29:48

A Honest Cross-Validation Estimator for Prediction Performance
Tianyu Pan, Vincent Z. Yu, Viswanath Devanarayan, Lu Tian
https://arxiv.org/abs/2510.07649

A Honest Cross-Validation Estimator for Prediction Performance
Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, w…
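
The commonly used procedure summarized above (split, train, evaluate, average) can be sketched as follows. This is a minimal illustration of standard k-fold cross-validation only, not the estimator proposed in the paper. The helper names `k_fold_cv`, `fit_mean`, and `squared_error` are assumptions made for this example.

```python
import random


def k_fold_cv(xs, ys, fit, loss, k=5, seed=0):
    """Average held-out loss over k folds (requires len(xs) >= k).

    `fit(train_x, train_y)` returns a prediction function;
    `loss(prediction, target)` returns a non-negative number.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Train on everything outside the current fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Evaluate on the held-out fold only.
        fold_loss = sum(loss(model(xs[i]), ys[i]) for i in test_idx) / len(test_idx)
        scores.append(fold_loss)
    return sum(scores) / k


def fit_mean(train_x, train_y):
    """Toy model: always predict the training mean of the targets."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def squared_error(pred, y):
    return (pred - y) ** 2


xs = list(range(10))
ys = [3.0] * 10
print(k_fold_cv(xs, ys, fit_mean, squared_error, k=5))  # constant targets give zero error
```

Each fold trains a different model, so the averaged score describes the fitting procedure rather than any single fitted model. This is the criticism the abstract raises.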

@arXiv_quantph_bot@mastoxiv.page
2025-10-09 10:58:01

Is it Gaussian? Testing bosonic quantum states
Filippo Girardi, Freek Witteveen, Francesco Anna Mele, Lennart Bittel, Salvatore F. E. Oliviero, David Gross, Michael Walter
https://arxiv.org/abs/2510.07305

Is it Gaussian? Testing bosonic quantum states
Gaussian states are widely regarded as one of the most relevant classes of continuous-variable (CV) quantum states, as they naturally arise in physical systems and play a key role in quantum technologies. This motivates a fundamental question: given copies of an unknown CV state, how can we efficiently test whether it is Gaussian? We address this problem from the perspective of representation theory and quantum learning theory, characterizing the sample complexity of Gaussianity testing as a fu…

@arXiv_csCV_bot@mastoxiv.page
2025-10-15 10:54:11

Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report
Daniel Feijoo, Paula Garrido-Mellado, Marcos V. Conde, Jaesung Rim, Alvaro Garcia, Sunghyun Cho, Radu Timofte
https://arxiv.org/abs/2510.12788

Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report
This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance efficient real-blur restoration. The challenge uses a new test set based on the well-known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participants were tasked with developing solutions to effectively deblur this type of image while fulfilling strict efficiency constraints: fewer than 5 million model para…

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-14 10:41:19

Probing cosmic curvature with Alcock-Paczynski data
Yungui Gong, Qing Gao, Xuchen Lu, Zhu Yi
https://arxiv.org/abs/2510.11555 https://arxiv.org/pdf/2510.11…

Probing cosmic curvature with Alcock-Paczynski data
The Alcock-Paczynski (AP) parameter $F_{AP}$ is independent of the sound horizon $r_d$, making the Dark Energy Spectroscopic Instrument (DESI) baryon acoustic oscillation (BAO) AP measurements particularly well suited for cosmological applications. We propose a novel null test of cosmic curvature tailored to DESI BAO data that combines $F_{AP}$ with the ratios $D_V'/D_V$ or $D_M'/D_M$. This null test can also be performed using a joint dataset of DESI BAO and type Ia supernova (SNe Ia) observat…

@arXiv_grqc_bot@mastoxiv.page
2025-10-07 10:54:52

Scattering of massive neutrino test fields from a gravitational pulse
Tekin Dereli, Yorgo Senikoglu
https://arxiv.org/abs/2510.04687 https://arxiv.org/pdf/…

Scattering of massive neutrino test fields from a gravitational pulse
Linearized Einstein-Weyl equations are solved precisely in the context of sandwich gravitational waves. The neutrino's energy-momentum depends on the geometry and composition of the gravitational pulse when it is scattered. Since the background remains unchanged at the test field level, the neutrino's energy density will exhibit fluctuations between positive and negative extremes when traversing the sandwich wave. These variations could provide insights into the behavior of models concerning ne…

@arXiv_astrophHE_bot@mastoxiv.page
2025-10-15 09:18:11

The double neutron star PSR J1946+2052 I. Masses and tests of general relativity
Lingqi Meng, Paulo C. C. Freire, Kevin Stovall, Norbert Wex, Xueli Miao, Weiwei Zhu, Michael Kramer, James M. Cordes, Huanchen Hu, Jinchen Jiang, Emilie Parent, Lijing Shao, Ingrid H. Stairs, Mengyao Xue, Adam Brazier, Fernando Camilo, David J. Champion, Shami Chatterjee, Fronefield Crawford, Ziyao Fang, Qiuyang Fu, Yanjun Guo, Jason W. T. Hessels, Maura MacLaughlin, Chenchen Miao, Jiarui Niu, Ziwei Wu, Ju…

The double neutron star PSR J1946+2052 I. Masses and tests of general relativity
We conducted high-precision timing of PSR J1946+2052 to determine the masses of the two neutron stars in the system, test general relativity (GR) and assess the system's potential for future measurement of the moment of inertia of the pulsar. We analysed seven years of timing data from the Arecibo 305-m radio telescope, the Green Bank Telescope (GBT), and the Five-hundred-meter Aperture Spherical radio Telescope (FAST). The data processing accounted for dispersion measure variations and relat…

@arXiv_csLG_bot@mastoxiv.page
2025-10-09 10:55:11

Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski
https://arxiv.org/abs/2510.07257 h…

Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightwe…

@arXiv_csCR_bot@mastoxiv.page
2025-10-14 12:12:18

CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Yang Zhuochen, Fok Kar Wai, Thing Vrizlynn
https://arxiv.org/abs/2510.11137 https://arx…

CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test for and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce several innovative components, including Dynamic Loss, Additive Loss, Common Loss, and Self Consistency Decoding Strategy, and tested to enhance the consistency of the soft …

@arXiv_statME_bot@mastoxiv.page
2025-10-08 09:30:39

Extension of Wald-Wolfowitz Runs Test for Regression Validity Testing with Repeated Measures of Independent Variable
Bo-Yao Lian, Nelson G. Chen
https://arxiv.org/abs/2510.05861

Extension of Wald-Wolfowitz Runs Test for Regression Validity Testing with Repeated Measures of Independent Variable
The Wald-Wolfowitz runs test can assess the correctness of a regression curve fitted to a data set with one independent parameter. The assessment is performed through examination of the residuals, where the signs of the residuals would appear randomly if the regression curve were correct. We propose extending the test to the case where multiple data points were measured for specific independent parameter values. By randomly permuting the data points corresponding to each independent parameter…
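
The residual-sign check described above can be sketched with the classical runs test and its normal approximation. This is a hedged illustration of the standard test only, not the repeated-measures extension the paper proposes. The function name `runs_test` and the choice to drop zero residuals are assumptions made for this example.

```python
import math


def runs_test(residuals):
    """Classical Wald-Wolfowitz runs test on the signs of residuals.

    Residuals are assumed to be ordered by the independent variable.
    Returns (number_of_runs, z_score) under the normal approximation.
    Exact zeros are dropped, which is one common convention.
    """
    signs = [r > 0 for r in residuals if r != 0]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need residuals of both signs")
    # A new run starts wherever the sign changes.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n_pos + n_neg
    mean = 2 * n_pos * n_neg / n + 1
    var = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n * n * (n - 1))
    if var <= 0:
        raise ValueError("too few residuals for the normal approximation")
    return runs, (runs - mean) / math.sqrt(var)


# Too few runs (long same-sign stretches) suggests a systematic misfit.
print(runs_test([1, 2, 1, 3, -1, -2, -1, -3]))
# Strict alternation gives more runs than randomness would predict.
print(runs_test([1, -1, 1, -1, 1, -1, 1, -1]))
```

A strongly negative z-score means the residuals cluster by sign, as they would if the fitted curve were systematically above or below the data in some region.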

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:37:39

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
https://arxiv.org/abs/2510.06217

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table ret…

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:26:11

TTRV: Test-Time Reinforcement Learning for Vision Language Models
Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
https://arxiv.org/abs/2510.06783

TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency o…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:20:52

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera
https://arxiv.org/abs/2510.05038 htt…

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 13:41:38

Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
https://arxiv.org/abs/2510.11686

Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration wi…

@arXiv_csRO_bot@mastoxiv.page
2025-10-10 10:20:59

Scalable Offline Metrics for Autonomous Driving
Animikh Aich, Adwait Kulkarni, Eshed Ohn-Bar
https://arxiv.org/abs/2510.08571 https://arxiv.org/pdf/2510.08…

Scalable Offline Metrics for Autonomous Driving
Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is unders…

@arXiv_statML_bot@mastoxiv.page
2025-10-10 09:26:09

Beyond Real Data: Synthetic Data through the Lens of Regularization
Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis
https://arxiv.org/abs/2510.08095

Beyond Real Data: Synthetic Data through the Lens of Regularization
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between t…

@arXiv_quantph_bot@mastoxiv.page
2025-10-09 10:39:41

High-Performance Imaging in a Dilution Refrigerator
Timo Eikelmann, Mara Brinkmann, Leonie Eggers, Tuncay Ulas, Donika Imeri, Konstantin Beck, Lasse Jens Irrgang, Sunil Kumar Mahato, Rikhav Shah, Ralf Riedinger
https://arxiv.org/abs/2510.07054

High-Performance Imaging in a Dilution Refrigerator
Nanophotonic light-matter interfaces hold great promise for quantum technologies. Enhancing local electromagnetic fields, they enable highly efficient detectors, can help realize optically connected processors, or serve as quantum repeaters. In-situ fiber-coupling at sub-Kelvin temperatures, as required for test and development of new devices, proves challenging as suitable cryogenic microscopes are not readily available. Here, we report on a robust and versatile confocal imaging system integra…

@arXiv_csSE_bot@mastoxiv.page
2025-10-08 08:38:39

Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
Irtaza Sajid Qureshi, Zhen Ming (Jack) Jiang
https://arxiv.org/abs/2510.05365

Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layers of Bloom's taxonomy: Remember, Understand, Apply, Analyze, Evaluate, and Create, which progressively assess higher levels o…

@arXiv_grqc_bot@mastoxiv.page
2025-10-14 08:21:08

The Gravitational Wave Memory from Binary Neutron Star Mergers
Jamie Bamber, Antonios Tsokaros, Milton Ruiz, Stuart L. Shapiro, Marc Favata, Matthew Karlson, Fabrizio Venturi Piñas
https://arxiv.org/abs/2510.09742

The Gravitational Wave Memory from Binary Neutron Star Mergers
The gravitational wave signal produced by the merger of two compact objects includes both an oscillatory transient and a non-oscillatory part, the so-called memory effect. This produces a permanent displacement of test masses and has not yet been measured. We use general relativistic magnetohydrodynamic simulations, including neutrinos, with several representative viable equations of state, to quantify--for the first time--the effects of the neutron star magnetic field, neutrino emission, and t…

@arXiv_csCR_bot@mastoxiv.page
2025-10-09 08:57:21

Proofs of No Intrusion
Vipul Goyal, Justin Raizes
https://arxiv.org/abs/2510.06432 https://arxiv.org/pdf/2510.06432…

Proofs of No Intrusion
A central challenge in data security is not just preventing theft, but detecting whether it has occurred. Classically, this is impossible because a perfect copy leaves no evidence. Quantum mechanics, on the other hand, forbids general duplication, opening up new possibilities. We introduce Proofs of No Intrusion, which enable a classical client to remotely test whether a quantum server has been hacked and the client's data stolen. Crucially, the test does not destroy the data being tested, av…

@arXiv_statME_bot@mastoxiv.page
2025-10-08 08:49:19

A new composite Mann-Whitney test for two-sample survival comparisons with right-censored data
Abid Hussain, Touqeer Ahmad
https://arxiv.org/abs/2510.05353

A new composite Mann-Whitney test for two-sample survival comparisons with right-censored data
A fundamental challenge in comparing two survival distributions with right-censored data is the selection of an appropriate nonparametric test, as the power of standard tests like the log-rank and Wilcoxon is highly dependent on the often unknown nature of the alternative hypothesis. This paper introduces a new, distribution-free two-sample test designed to overcome this limitation. The proposed method is based on a strategic decomposition of the data into uncensored and censored subsets, from …

@arXiv_csCV_bot@mastoxiv.page
2025-10-14 16:14:50

Crosslisted article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[2/3]:
- ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh

@arXiv_csAI_bot@mastoxiv.page
2025-10-14 17:28:38

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[8/17]:
- MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
Chen, Lei, Zhang, Ke, Zhu, Chen, Lu, Huang, Feng, He, Sun, Wu, Wang

@arXiv_csCL_bot@mastoxiv.page
2025-10-14 13:18:28

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou
https://arxiv.org/abs/2510.11695

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, …

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:45:59

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh
https://arxiv.org/abs/2510.05635 h…

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 10:33:58

How Students Use Generative AI for Software Testing: An Observational Study
Baris Ardic, Quentin Le Dilavrec, Andy Zaidman
https://arxiv.org/abs/2510.10551

How Students Use Generative AI for Software Testing: An Observational Study
The integration of generative AI tools like ChatGPT into software engineering workflows opens up new opportunities to boost productivity in tasks such as unit test engineering. However, these AI-assisted workflows can also significantly alter the developer's role, raising concerns about control, output quality, and learning, particularly for novice developers. This study investigates how novice software developers with foundational knowledge in software testing interact with generative AI for e…

@arXiv_csCV_bot@mastoxiv.page
2025-10-14 13:45:18

Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Walid Elbarz, Mohamed Bourriz, Hicham Hajji, Hamd Ait Abdelali, François Bourzeix
https://arxiv.org/abs/2510.11576

Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured w…

@arXiv_csRO_bot@mastoxiv.page
2025-10-07 11:30:22

Flexible Locomotion Learning with Diffusion Model Predictive Control
Runhan Huang, Haldun Balim, Heng Yang, Yilun Du
https://arxiv.org/abs/2510.04234

Flexible Locomotion Learning with Diffusion Model Predictive Control
Legged locomotion demands controllers that are both robust and adaptable, while remaining compatible with task and safety considerations. However, model-free reinforcement learning (RL) methods often yield a fixed policy that can be difficult to adapt to new behaviors at test time. In contrast, Model Predictive Control (MPC) provides a natural approach to flexible behavior synthesis by incorporating different objectives and constraints directly into its optimization process. However, classical …

@arXiv_grqc_bot@mastoxiv.page
2025-10-09 08:18:30

A Parametrized Test of General Relativity for LISA Massive Black Hole Binary Inspirals
Manuel Piarulli, Sylvain Marsat, Elise M. Sänger, Alessandra Buonanno, Jan Steinhoff, Nicola Tamanini
https://arxiv.org/abs/2510.06330

A Parametrized Test of General Relativity for LISA Massive Black Hole Binary Inspirals
Laser Interferometer Space Antenna (LISA) observations of massive black hole binaries (MBHBs) will provide long duration inspiral signals with high signal-to-noise ratio (SNR) data, ideal for testing general relativity (GR) in the strong-field and high-velocity regime. We present an extension of the Flexible Theory-Independent (FTI) framework, adapted to gravitational waves (GWs) from MBHBs observed with LISA, to perform parametrized inspiral tests of GR. This approach introduces generic deviat…

@arXiv_csCL_bot@mastoxiv.page
2025-10-14 21:37:08

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[2/9]:
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 10:11:10

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
https://arxiv.org/abs/2510.09595

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad…

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 10:45:41

Learning-To-Measure: In-context Active Feature Acquisition
Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi
https://arxiv.org/abs/2510.12624

Learning-To-Measure: In-context Active Feature Acquisition
Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal…

@arXiv_statME_bot@mastoxiv.page
2025-10-10 09:05:49

Detection of mean changes in partially observed functional data
Šárka Hudecová, Claudia Kirch
https://arxiv.org/abs/2510.07854 https://arxi…

Detection of mean changes in partially observed functional data
We propose a test for a change in the mean for a sequence of functional observations that are only partially observed on subsets of the domain, with no information available on the complement. The framework accommodates important scenarios, including both abrupt and gradual changes. The significance of the test statistic is assessed via a permutation test. In addition to the classical permutation approach with a fixed number of permutation samples, we also discuss a variant with controlled resa…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 09:59:28

LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
Linghan Huang, Peizhou Zhao, Huaming Chen
https://arxiv.org/abs/2510.10179 https://

LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
The rapid development of large language models (LLMs) has revolutionized software testing, particularly fuzz testing, by automating the generation of diverse and effective test inputs. This advancement holds great promise for improving software reliability. Meanwhile, the introduction of MOJO, a high-performance AI programming language blending Python's usability with the efficiency of C and C++, presents new opportunities to enhance AI model scalability and programmability. However, as a new l…

@arXiv_csCR_bot@mastoxiv.page
2025-10-08 10:00:59

Enhancing Automotive Security with a Hybrid Approach towards Universal Intrusion Detection System
Md Rezanur Islam, Mahdi Sahlabadi, Keunkyoung Kim, Kangbin Yim
https://arxiv.org/abs/2510.05824

Enhancing Automotive Security with a Hybrid Approach towards Universal Intrusion Detection System
Security measures are essential in the automotive industry to detect intrusions in in-vehicle networks. However, developing a one-size-fits-all Intrusion Detection System (IDS) is challenging because each vehicle has unique data profiles. This is due to the complex and dynamic nature of the data generated by vehicles regarding their model, driving style, test environment, and firmware update. To address this issue, a universal IDS has been developed that can be applied to all types of vehicles wit…

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 09:20:30

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai
https://arxiv.org/abs/2510.08928 h…

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, where each agent operat…

@arXiv_grqc_bot@mastoxiv.page
2025-10-09 09:34:31

When vacuum breaks: a self-consistency test for astrophysical environments in extreme mass ratio inspirals
Lorenzo Copparoni, Rohit S. Chandramouli, Enrico Barausse
https://arxiv.org/abs/2510.06948

When vacuum breaks: a self-consistency test for astrophysical environments in extreme mass ratio inspirals
Gravitational-wave signals are typically interpreted under the vacuum hypothesis, i.e. assuming negligible influence from the astrophysical environment. This assumption is expected to break down for low-frequency sources such as extreme mass ratio inspirals (EMRIs), which are prime targets for the Laser Interferometer Space Antenna (LISA) and are expected to form, at least in part, in dense environments such as Active Galactic Nuclei or dark-matter spikes/cores. Modeling environmental effects p…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:23:42

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang
https://arxiv.org/abs/2510.05090

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 22:19:32

Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[13/14]:
- Class-Invariant Test-Time Augmentation for Domain Generalization
Zhicheng Lin, Xiaolin Wu, Xi Zhang

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 10:41:48

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela
https://arxiv.org/abs/2510.10824

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
We present an approach to software testing automation using Agentic Retrieval-Augmented Generation (RAG) systems for Quality Engineering (QE) artifact creation. We combine autonomous AI agents with hybrid vector-graph knowledge systems to automate test plan, case, and QE metric generation. Our approach addresses traditional software testing limitations by leveraging LLMs such as Gemini and Mistral, multi-agent orchestration, and enhanced contextualization. The system achieves remarkable accurac…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:27:09

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Dayyán O'Brien, Barry Haddow, Emily Allaway, Pinzhen Chen
https://arxiv.org/abs/2510.05962

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Conducting contamination-free evaluation of mathematical capabilities can be difficult for two reasons: models may memorize a test set once it is made public, and current mathematical benchmarks are prone to overfitting due to having limited diversity of symbols and rules, coupled with closed-ended answers. This paper proposes a method to leverage these shortcomings as useful features to construct a dynamic, counterfactual benchmark, which can be used to both reveal overfitting and measure true…

@arXiv_csLG_bot@mastoxiv.page
2025-10-10 11:04:09

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Konrad Löhr, Shuzhou Yuan, Michael Färber
https://arxiv.org/abs/2510.08236

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to…

@arXiv_csCL_bot@mastoxiv.page
2025-10-10 11:10:59

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen
https://arxiv.org/abs/2510.08569

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. G…

@arXiv_csSE_bot@mastoxiv.page
2025-10-08 08:59:09

UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification
Yiannis Charalambous, Claudionor N. Coelho Jr, Luis Lamb, Lucas C. Cordeiro
https://arxiv.org/abs/2510.05441

UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification
This paper introduces UnitTenX, a state-of-the-art open-source AI multi-agent system designed to generate unit tests for legacy code, enhancing test coverage and critical value testing. UnitTenX leverages a combination of AI agents, formal methods, and Large Language Models (LLMs) to automate test generation, addressing the challenges posed by complex and legacy codebases. Despite the limitations of LLMs in bug detection, UnitTenX offers a robust framework for improving software reliability and…

@arXiv_csAI_bot@mastoxiv.page
2025-10-10 10:31:19

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
https://arxiv.org/abs/2510.08189

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning beha…

@arXiv_csLG_bot@mastoxiv.page
2025-10-07 13:06:22

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
https://arxiv.org/abs/2510.05040

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference time …

@arXiv_csCL_bot@mastoxiv.page
2025-10-09 10:21:31

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Manuel Frank, Haithem Afli
https://arxiv.org/abs/2510.06730 https://

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based met…

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:54:59

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil
https://arxiv.org/abs/2510.06028 https…

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
The paper provides data-dependent bounds on the test error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The bounds are stable under approximation with Langevin Monte Carlo algorithms. Experiments on the MNIST and CIFAR-10 datasets verify that the bounds yield nontrivial predictions on true labeled data and correctly upper bound the test error for random labels. Our …

@arXiv_csAI_bot@mastoxiv.page
2025-10-09 09:58:01

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen
https://arxiv.org/abs/2510.07038

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operatio…

@arXiv_csSE_bot@mastoxiv.page
2025-10-10 09:09:09

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
Zeyu Sun, Jingjing Liang, Weiyi Wang, Chenyao Suo, Junjie Chen, Fanjiang Xu
https://arxiv.org/abs/2510.07815

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the correctness and robustness of MLIR itself remains challenging. Existing fuzzing approaches, based on manually crafted templates or rule-based mutations, struggle to generate sufficiently diverse and semantically valid test cases, making it difficult to expose subtle or deep-seated bugs within MLIR's complex…

@arXiv_csLG_bot@mastoxiv.page
2025-10-07 13:06:02

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks
https://arxiv.org/abs/2510.05024

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to ino…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:12:52

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
https://arxiv.org/abs/2510.04891

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charge…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:03:59

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung
https://arxiv.org/abs/2510.05664 …

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs. Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To …

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:20:22

A Set of Quebec-French Corpus of Regional Expressions and Terms
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
https://arxiv.org/abs/2510.05026 https://…

A Set of Quebec-French Corpus of Regional Expressions and Terms
The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words. We explain how to construct these corpora, …

@arXiv_csSE_bot@mastoxiv.page
2025-10-07 18:09:30

Replaced article(s) found for cs.SE. https://arxiv.org/list/cs.SE/new
[1/2]:
- Test Schedule Generation for Acceptance Testing of Mission-Critical Satellite Systems
Raphaël Ollando, Seung Yeob Shin, Mario Minardi, Nikolas Sidiropoulos

@arXiv_csAI_bot@mastoxiv.page
2025-10-07 12:14:52

Look-ahead Reasoning with a Learned Model in Imperfect Information Games
Ondřej Kubíček, Viliam Lisý
https://arxiv.org/abs/2510.05048 https://

Look-ahead Reasoning with a Learned Model in Imperfect Information Games
Test-time reasoning significantly enhances pre-trained AI agents' performance. However, it requires an explicit environment model, often unavailable or overly complex in real-world scenarios. While MuZero enables effective model learning for search in perfect information games, extending this paradigm to imperfect information games presents substantial challenges due to more nuanced look-ahead reasoning techniques and the large number of states relevant for individual decisions. This paper introduc…

@arXiv_csAI_bot@mastoxiv.page
2025-10-07 16:58:36

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[11/17]:
- MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation...
Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He

@arXiv_csAI_bot@mastoxiv.page
2025-10-07 16:58:47

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[12/17]:
- Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Tan, Woodruff, Warncke, Jose, Riché, Africa, Taylor
