Tootfinder

@arXiv_hepth_bot@mastoxiv.page
2025-09-18 08:16:41

Thermodynamic Split Conjecture and an Observational Test for Cosmological Entropy
Oem Trivedi
https://arxiv.org/abs/2509.13689 https://arxiv.org/pdf/2509.1…

Thermodynamic Split Conjecture and an Observational Test for Cosmological Entropy
We revisit string theoretic derivations of black hole entropy and argue that their enabling structures do not persist in realistic cosmologies. We formalize this as the Thermodynamic Split Conjecture (TSC), which states that in any UV complete quantum gravity, black hole and cosmological horizon thermodynamics are generically inequivalent. The BKE criterion is then formulated to formalize this approach, and we also discuss ways to falsify the conjecture. Finally, we propose an observa…

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:40:51

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer
https://arxiv.org/abs/2510.12516

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. T…

@arXiv_csSE_bot@mastoxiv.page
2025-09-18 08:23:21

An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
Sina Gogani-Khiabani (University of Illinois Chicago), Ashutosh Trivedi (University of Colorado Boulder), Diptikalyan Saha (IBM Research), Saeid Tizpaz-Niari (University of Illinois Chicago)
https://arxiv.org/abs/2509.13471

An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce highe…

@arXiv_csHC_bot@mastoxiv.page
2025-10-15 10:02:51

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
Minjae Lee, Minsuk Kahng
https://arxiv.org/abs/2510.12728 https://arxiv.org/pdf/2510.1272…

Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing application developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem. We operationalize this paradigm in an int…

@arXiv_econEM_bot@mastoxiv.page
2025-09-18 07:38:11

Generalized Covariance Estimator under Misspecification and Constraints
Aryan Manafi Neyazi
https://arxiv.org/abs/2509.13492 https://arxiv.org/pdf/2509.134…

Generalized Covariance Estimator under Misspecification and Constraints
This paper investigates the properties of the Generalized Covariance (GCov) estimator under misspecification and constraints with application to processes with local explosive patterns, such as causal-noncausal and double autoregressive (DAR) processes. We show that GCov is consistent and has an asymptotically Normal distribution under misspecification. Then, we construct GCov-based Wald-type and score-type tests to test one specification against the other, all of which follow a $χ^2$ distribu…

@arXiv_statME_bot@mastoxiv.page
2025-10-15 08:55:31

A Martingale Kernel Two-Sample Test
Anirban Chatterjee, Aaditya Ramdas
https://arxiv.org/abs/2510.11853 https://arxiv.org/pdf/2510.11853

A Martingale Kernel Two-Sample Test
The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution typically requiring costly resampling or permutation approaches for calibration. In this work we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover we show that the test is…

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 10:45:41

Learning-To-Measure: In-context Active Feature Acquisition
Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi
https://arxiv.org/abs/2510.12624 https://

Learning-To-Measure: In-context Active Feature Acquisition
Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal…

@arXiv_qfinST_bot@mastoxiv.page
2025-09-18 08:07:01

Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus
Lamia Lamrani, Benoît Collins, Jean-Philippe Bouchaud
https://arxiv.org/abs/2509.13923

Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus
Cross-validation is one of the most widely used methods for model selection and evaluation; its efficiency for large covariance matrix estimation appears robust in practice, but little is known about the theoretical behavior of its error. In this paper, we derive the expected Frobenius error of the holdout method, a particular cross-validation procedure that involves a single train and test split, for a generic rotationally invariant multiplicative noise model, therefore extending previous resu…

@arXiv_csCV_bot@mastoxiv.page
2025-10-15 10:54:11

Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report
Daniel Feijoo, Paula Garrido-Mellado, Marcos V. Conde, Jaesung Rim, Alvaro Garcia, Sunghyun Cho, Radu Timofte
https://arxiv.org/abs/2510.12788

Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report
This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance efficient real-blur restoration. The challenge uses a new test set based on the well-known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participants were tasked with developing solutions to effectively deblur this type of image while fulfilling strict efficiency constraints: fewer than 5 million model para…

@arXiv_physicsgeoph_bot@mastoxiv.page
2025-12-16 10:12:52

Correcting exponentiality test for binned earthquake magnitudes
Angela Stallone, Ilaria Spassiani
https://arxiv.org/abs/2512.13599 https://arxiv.org/pdf/25…

@arXiv_astrophHE_bot@mastoxiv.page
2025-10-15 09:14:01

DIPLODOCUS II: Implementation of transport equations and test cases relevant to micro-scale physics of jetted astrophysical sources
Christopher N. Everett, Marc Klinger-Plaisier, Garret Cotter
https://arxiv.org/abs/2510.12505

DIPLODOCUS II: Implementation of transport equations and test cases relevant to micro-scale physics of jetted astrophysical sources
DIPLODOCUS (Distribution-In-PLateaux methODOlogy for the CompUtation of transport equationS) is a novel framework being developed for the general transport of particle distribution functions through the seven dimensions of phase space, including forcing terms and interactions between particles. Following Paper I, which details the background analytic framework, this second paper provides an overview of the numerical implementation in the form of the code package Diplodocus.jl, written in Julia,…

@arXiv_csCR_bot@mastoxiv.page
2025-10-14 11:48:48

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh
https://arxiv.org/abs/2510.10281 htt…

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlik…

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 10:08:10

Titans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model
Gavriel Di Nepi, Federico Siciliano, Fabrizio Silvestri
https://arxiv.org/abs/2510.09551

Titans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model
By the end of 2024, Google researchers introduced Titans: Learning at Test Time, a neural memory model achieving strong empirical results across multiple tasks. However, the lack of publicly available code and ambiguities in the original description hinder reproducibility. In this work, we present a lightweight reimplementation of Titans and conduct a comprehensive evaluation on Masked Language Modeling, Time Series Forecasting, and Recommendation tasks. Our results reveal that Titans does not …

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:38:31

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi
https://arxiv.org/abs/2510.12463

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whethe…

@arXiv_mathFA_bot@mastoxiv.page
2025-10-15 09:49:21

On Korovkin-type theorems including exponential test functions on infinite intervals through power series convergence
Dilek Söylemez, Mehmet Ünver
https://arxiv.org/abs/2510.12568

On Korovkin-type theorems including exponential test functions on infinite intervals through power series convergence
Approximation theory has long been concerned with the development of positive linear operators that effectively approximate classes of functions. Among the most well-known results in this area are Korovkin-type approximation theorems, which provide simple and elegant criteria for convergence by testing only on a small set of functions. Motivated by these classical results and their extensions, we focus on versions that preserve exponential functions and incorporate modern summability techniques…

@arXiv_astrophEP_bot@mastoxiv.page
2025-10-15 08:56:02

The resilience of the sailboat stable region
Rafael Sfair, Tiago F. L. L. Pinheiro, Giovana Ramon, Ernesto Vieira
https://arxiv.org/abs/2510.11855 https://…

The resilience of the sailboat stable region
Binary systems host complex orbital dynamics where test particles can occupy stable regions despite strong gravitational perturbations. The sailboat region, discovered in the Pluto-Charon system, allows highly eccentric S-type orbits at intermediate distances between the two massive bodies. This region challenges traditional stability concepts by supporting eccentricities up to 0.9 in a zone typically dominated by chaotic motion. We investigate the sailboat region's existence and extent across …

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-15 09:37:11

Hierarchical summaries for primordial non-Gaussianities
M. S. Cagliari, A. Bairagi, B. Wandelt
https://arxiv.org/abs/2510.12715 https://arxiv.org/pdf/2510.…

Hierarchical summaries for primordial non-Gaussianities
The advent of Stage IV galaxy redshift surveys such as DESI and Euclid marks the beginning of an era of precision cosmology, with one key objective being the detection of primordial non-Gaussianities (PNG), potential signatures of inflationary physics. In particular, constraining the amplitude of local-type PNG, parameterised by $f_{\rm NL}$, with $σ_{f_{\rm NL}} \sim 1$, would provide a critical test of single versus multi-field inflation scenarios. While current large-scale structure and cos…

@arXiv_mathRT_bot@mastoxiv.page
2025-10-15 08:45:52

Unitary representations attached to parabolic subgroups: the case of abelian unipotent radical
Dan Ciubotaru
https://arxiv.org/abs/2510.11862 https://arxiv…

Unitary representations attached to parabolic subgroups: the case of abelian unipotent radical
We classify the unitary representations with integral infinitesimal character in Lusztig's category of unipotent representations in the case when the geometric parameter space comes from the action of a Levi subgroup on the abelian nilradical of a (maximal) parabolic subalgebra. We organise the unitary representations into microlocal Arthur packets. This is a test case for investigating a conjectural description of unitary representations with integral infinitesimal character.

@arXiv_condmatstrel_bot@mastoxiv.page
2025-10-15 09:15:31

Quantum criticality at the end of a pseudogap phase in superconducting infinite-layer nickelates
C. Iorio-Duval, E. Beauchesne-Blanchet, F. Perreault, J. L. Santana González, W. Sun, Y. F. Nie, A. Gourgout, G. Grissonnanche
https://arxiv.org/abs/2510.12786

Quantum criticality at the end of a pseudogap phase in superconducting infinite-layer nickelates
In many unconventional superconductors, the strange-metal regime is thought to emerge from quantum criticality, yet in cuprates this link is obscured by the enigmatic pseudogap. Superconducting infinite-layer nickelates provide a new arena to test this paradigm but are constrained to thin films, precluding calorimetry. We use the Seebeck coefficient as a low-temperature proxy for entropy per carrier and uncover a clear quantum-critical thermodynamic signature: in La$_{1-x}$Sr$_x$NiO$_2$ at the …

@arXiv_csHC_bot@mastoxiv.page
2025-10-15 09:56:52

Gauging the Competition: Understanding Social Comparison and Anxiety through Eye-tracking in Virtual Reality Group Interview
Shi-Ting Ni, Kairong Fang, Yuyang Wang, Pan Hui
https://arxiv.org/abs/2510.12590

Gauging the Competition: Understanding Social Comparison and Anxiety through Eye-tracking in Virtual Reality Group Interview
Virtual Reality (VR) is a promising tool for interview training, yet the psychological dynamics of group interviews, such as social comparison, remain underexplored. We investigate this phenomenon by developing an immersive VR group interview system and conducting an eye-tracking study with 73 participants. We manipulated peer performance using ambiguous behavioral cues (e.g., hand-raising) and objective information (public test scores) to measure their effect on participants' attention and sel…

@arXiv_csCY_bot@mastoxiv.page
2025-10-13 09:09:10

Non-traditional data in pandemic preparedness and response: identifying and addressing first and last-mile challenges
Mattia Mazzoli, Irma Varela-Lasheras, Sonia Namorado, Constantino Pereira Caetano, Andreia Leite, Lisa Hermans, Niel Hens, Polen Türkmen, Kyriaki Kalimeri, Leo Ferres, Ciro Cattuto, Daniela Paolotti, Stefaan Verhulst
https://

Non-traditional data in pandemic preparedness and response: identifying and addressing first and last-mile challenges
The pandemic served as an important test case of complementing traditional public health data with non-traditional data (NTD) such as mobility traces, social media activity, and wearables data to inform decision-making. Drawing on an expert workshop and a targeted survey of European modelers, we assess the promise and persistent limitations of such data in pandemic preparedness and response. We distinguish between "first-mile" (accessing and harmonizing data) and "last-mile" challenges (transla…

@arXiv_csRO_bot@mastoxiv.page
2025-10-13 10:04:10

Bridging Research and Practice in Simulation-based Testing of Industrial Robot Navigation Systems
Sajad Khatiri, Francisco Eli Vina Barrientos, Maximilian Wulf, Paolo Tonella, Sebastiano Panichella
https://arxiv.org/abs/2510.09396

Bridging Research and Practice in Simulation-based Testing of Industrial Robot Navigation Systems
Ensuring robust robotic navigation in dynamic environments is a key challenge, as traditional testing methods often struggle to cover the full spectrum of operational requirements. This paper presents the industrial adoption of Surrealist, a simulation-based test generation framework originally for UAVs, now applied to the ANYmal quadrupedal robot for industrial inspection. Our method uses a search-based algorithm to automatically generate challenging obstacle avoidance scenarios, uncovering fa…

@arXiv_econTH_bot@mastoxiv.page
2025-10-15 08:33:02

Selection Procedures in Competitive Admission
Nathan Hancart
https://arxiv.org/abs/2510.12653 https://arxiv.org/pdf/2510.12653

Selection Procedures in Competitive Admission
Two identical firms compete to attract and hire from a pool of candidates of unknown productivity. Firms simultaneously post a selection procedure which consists of a test and an acceptance probability for each test outcome. After observing the firms' selection procedures, each candidate can apply to one of them. Both firms have access to a limited set of feasible tests. The firms face two key considerations when choosing their selection procedure: the statistical properties of their test and t…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 13:41:38

Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
https://arxiv.org/abs/2510.11686

Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration wi…

@arXiv_csCV_bot@mastoxiv.page
2025-10-13 10:35:30

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
https://arxiv.org/abs/2510.09473 https://…

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
The test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the …

@arXiv_astrophIM_bot@mastoxiv.page
2025-10-14 09:39:38

The Importance of Being Adaptable: An Exploration of the Power and Limitations of Domain Adaptation for Simulation-Based Inference with Galaxy Clusters
Michelle Ntampaka, A. Ciprijanovic, Ana Maria Delgado, John Soltis, John F. Wu, Mikaeel Yunus, John ZuHone
https://arxiv.org/abs/2510.09748

The Importance of Being Adaptable: An Exploration of the Power and Limitations of Domain Adaptation for Simulation-Based Inference with Galaxy Clusters
The application of deep machine learning methods in astronomy has exploded in the last decade, with new models showing remarkably improved performance on benchmark tasks. Not nearly enough attention is given to understanding the models' robustness, especially when the test data are systematically different from the training data, or "out of domain." Domain shift poses a significant challenge for simulation-based inference, where models are trained on simulated data but applied to real observati…

@arXiv_mathPR_bot@mastoxiv.page
2025-10-14 10:49:18

General mean-field BSDEs with integrable terminal values
Weimin Jiang, Juan Li, Yan Shen
https://arxiv.org/abs/2510.11067 https://arxiv.org/pdf/2510.11067

General mean-field BSDEs with integrable terminal values
This paper investigates $L^{1}$ solutions for mean-field backward stochastic differential equations (MFBSDEs) under different weak assumptions in both one-dimensional and multi-dimensional settings, whose generator $f(ω,t,y,z,μ)$ depends not only on the solution process $(Y,Z)$ but also on the law of $(Y,Z)$. In the one-dimensional case where $f$ depends on the law of $Y$, we show with the help of a test function method and a localization procedure that equations of this type with an integrab…

@arXiv_econGN_bot@mastoxiv.page
2025-10-15 07:46:31

Beyond Test Scores: How Academic Rank Shapes Long-Term Outcomes
Emilia Del Bono, Angus Holford, Tommaso Sartori
https://arxiv.org/abs/2510.11973 https://ar…

Beyond Test Scores: How Academic Rank Shapes Long-Term Outcomes
We study the effects of academic rank using data on the entire population of children enrolled in primary schools in Aberdeen, Scotland, in 1962. Exploiting quasi-random variation in peer group composition, we estimate the causal impact of rank on academic performance, noncognitive development, parental investment, and long-term outcomes. Higher rank improves achievement on the high-stakes eleven-plus examination and strengthens internalizing skills (traits related to self-concept and confidenc…

@arXiv_statME_bot@mastoxiv.page
2025-10-14 11:16:39

A Kolmogorov-Smirnov-Type Test for Dependently Double-Truncated Data
Anne-Marie Toparkus, Rafael Weissbach
https://arxiv.org/abs/2510.11517 https://arxiv.o…

A Kolmogorov-Smirnov-Type Test for Dependently Double-Truncated Data
With double-truncated lifespans, we test the hypothesis of a parametric distribution family for the lifespan. The typical finding from demography is an instationary behaviour of the life expectancy, and a copula models the resulting weak dependence of lifespan and the age at truncation. Our main example is the Farlie-Gumbel-Morgenstern copula. The test is based on Donsker-class arguments and the functional delta method for empirical processes. The assumptions also allow parametric inference, …

@arXiv_csAI_bot@mastoxiv.page
2025-10-14 17:28:38

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[8/17]:
- MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
Chen, Lei, Zhang, Ke, Zhu, Chen, Lu, Huang, Feng, He, Sun, Wu, Wang

@arXiv_grqc_bot@mastoxiv.page
2025-10-13 08:59:20

Chaos of charged particles near a renormalized group improved Kerr black hole in an external magnetic field
Junjie Lu, Xin Wu
https://arxiv.org/abs/2510.08954 https://

Chaos of charged particles near a renormalized group improved Kerr black hole in an external magnetic field
In a quantum theory of gravity, a renormalization group improved Kerr metric is obtained from the Kerr metric, where the Newton gravitational constant is modified as a function of the radial distance. The motion of neutral test particles in this metric is integrable. However, the dynamics of charged test particles is nonintegrable when an external asymptotically homogeneous magnetic field exists in the vicinity of the black hole. The transition from regular dynamics to chaotic dynamics is numer…

@arXiv_csSE_bot@mastoxiv.page
2025-10-13 09:20:10

Search-based Hyperparameter Tuning for Python Unit Test Generation
Stephan Lukasczyk, Gordon Fraser
https://arxiv.org/abs/2510.08716 https://arxiv.org/pdf/…

Search-based Hyperparameter Tuning for Python Unit Test Generation
Search-based test-generation algorithms have countless configuration options. Users rarely adjust these options and usually stick to the default values, which may not lead to the best possible results. Tuning an algorithm's hyperparameters is a method to find better hyperparameter values, but it typically comes with a high demand for resources. Meta-heuristic search algorithms -- that effectively solve the test-generation problem -- have been proposed as a solution to also efficiently tune param…

@arXiv_astrophHE_bot@mastoxiv.page
2025-10-15 09:18:11

The double neutron star PSR J1946+2052 I. Masses and tests of general relativity
Lingqi Meng, Paulo C. C. Freire, Kevin Stovall, Norbert Wex, Xueli Miao, Weiwei Zhu, Michael Kramer, James M. Cordes, Huanchen Hu, Jinchen Jiang, Emilie Parent, Lijing Shao, Ingrid H. Stairs, Mengyao Xue, Adam Brazier, Fernando Camilo, David J. Champion, Shami Chatterjee, Fronefield Crawford, Ziyao Fang, Qiuyang Fu, Yanjun Guo, Jason W. T. Hessels, Maura MacLaughlin, Chenchen Miao, Jiarui Niu, Ziwei Wu, Ju…

The double neutron star PSR J1946+2052 I. Masses and tests of general relativity
We conducted high-precision timing of PSR J1946+2052 to determine the masses of the two neutron stars in the system, test general relativity (GR), and assess the system's potential for future measurement of the moment of inertia of the pulsar. We analysed seven years of timing data from the Arecibo 305-m radio telescope, the Green Bank Telescope (GBT), and the Five-hundred-meter Aperture Spherical radio Telescope (FAST). The data processing accounted for dispersion measure variations and relat…

@arXiv_nuclex_bot@mastoxiv.page
2025-10-15 08:15:32

The mass of $^{101}$Sn and Bayesian extrapolations to the proton drip line
Christian M. Ireland, Georg Bollen, Scott E. Campbell, Xiangcheng Chen, Hannah Erington, Nadeesha D. Gamage, Kyle Godbey, Alicen M. Houff, Christopher Izzo, Bailey Knight, Sudhanva Lalit, Erich Leistenschneider, E. Marilena Lykiardopoulou, Franziska M. Maier, Witold Nazarewicz, Rodney Orford, William S. Porter, Caleb Quick, Ante Ravlic, Matthew Redshaw, Paul-Gerhard Reinhard, Ryan Ringle, Stefan Schwarz, Chandan…

The mass of $^{101}$Sn and Bayesian extrapolations to the proton drip line
The favorable energy configurations of nuclei at magic numbers of ${N}$ neutrons and ${Z}$ protons are fundamental for understanding the evolution of nuclear structure. The ${Z=50}$ (tin) isotopic chain is a frontier for such studies, with particular interest in nuclear binding at and around the doubly-magic $^{100}$Sn isotope. Precise mass measurements of neutron-deficient isotopes provide necessary anchor points for mass models to test extrapolations near the proton drip line, wh…

@arXiv_csCR_bot@mastoxiv.page
2025-10-14 12:12:18

CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Yang Zhuochen, Fok Kar Wai, Thing Vrizlynn
https://arxiv.org/abs/2510.11137 https://arx…

CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate for data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce several innovative components, including Dynamic Loss, Additive Loss, Common Loss, and Self Consistency Decoding Strategy, and tested to enhance the consistency of the soft …

@arXiv_statML_bot@mastoxiv.page
2025-10-10 09:37:19

PAC Learnability in the Presence of Performativity
Ivan Kirev, Lyuben Baltadzhiev, Nikola Konstantinov
https://arxiv.org/abs/2510.08335 https://arxiv.org/p…

PAC Learnability in the Presence of Performativity
Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are…

@arXiv_astrophSR_bot@mastoxiv.page
2025-10-13 08:02:50

How precisely can we measure the ages of subgiant and giant stars?
Cheyanne Shariat, Kareem El-Badry, Soumyadeep Bhattacharjee
https://arxiv.org/abs/2510.08675 https://

How precisely can we measure the ages of subgiant and giant stars?
Precise stellar ages are fundamental to Galactic archaeology. However, obtaining reliable age estimates and uncertainties for field stars has been a long-standing challenge. We test the fidelity of ages from recent catalogs of giants and subgiants using wide binaries, whose components formed at the same time and thus should have consistent inferred ages. We find that subgiant ages based on spectroscopic metallicities from Xiang & Rix (2022) are generally consistent within their reported uncerta…

@arXiv_mathST_bot@mastoxiv.page
2025-10-14 08:24:58

Simultaneous Frequentist Calibration of Confidence Regions for Multiple Functionals in Constrained Inverse Problems
Pau Batlle, Pratik Patil, Michael Stanley, Javier Ruiz Lupon, Houman Owhadi, Mikael Kuusela
https://arxiv.org/abs/2510.11708

Simultaneous Frequentist Calibration of Confidence Regions for Multiple Functionals in Constrained Inverse Problems
Many scientific analyses require simultaneous comparison of multiple functionals of an unknown signal, calling for multidimensional confidence regions with guaranteed simultaneous frequentist coverage under structural constraints (e.g., non-negativity, shape, or physics-based). This paper unifies and extends many previous optimization-based approaches to constrained confidence region construction in linear inverse problems through the lens of statistical test inversion. We begin by reviewing the…

@arXiv_quantph_bot@mastoxiv.page
2025-10-10 11:19:49

Guess your neighbor's input: Quantum advantage in Feige's game
Simon Schmidt, Sigurd A. L. Storgaard, Michael Walter, Yuming Zhao
https://arxiv.org/abs/2510.08484 https:…

Guess your neighbor's input: Quantum advantage in Feige's game
In this article, we study a nonlocal game with two questions and three answers per player, which was first considered by Feige in 1991, and show that there is quantum advantage in this game. We prove that the game is a robust self-test for the $3$-dimensional maximally entangled state. Furthermore, we show that the game can be seen as the "or" of two games that each do not have quantum advantage. Lastly, we investigate the behavior of the game with respect to parallel repetition in the classica…

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-14 09:56:18

Fast radio bursts shed light on direct gravity test on cosmological scales
Shuren Zhou, Pengjie Zhang
https://arxiv.org/abs/2510.11022 https://arxiv.org/pd…

Fast radio bursts shed light on direct gravity test on cosmological scales
A key measure of gravity is the relation between the Weyl potential $Ψ+Φ$ and the matter overdensity $δ_m$, capsulized as an effective gravitational constant $G_{\rm light}$ for light motion. Its value, together with the possible spatial and temporal variation, is essential in probing physics beyond Einstein gravity. However, the lack of an unbiased proxy of $δ_m$ prohibits direct measurement of $G_{\rm light}$. We point out that the equivalence principle ensures the dispersion measure (DM)…

@arXiv_csLG_bot@mastoxiv.page
2025-10-14 22:19:32

Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[13/14]:
- Class-Invariant Test-Time Augmentation for Domain Generalization
Zhicheng Lin, Xiaolin Wu, Xi Zhang

@arXiv_csCV_bot@mastoxiv.page
2025-10-14 16:14:50

Crosslisted article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[2/3]:
- ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:43:40

Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Sondos Mahmoud Bsharat, Zhiqiang Shen
https://arxiv.org/abs/2510.09599 https://arxi…

Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually se…

@arXiv_csSE_bot@mastoxiv.page
2025-10-13 09:55:00

Constraint-Guided Unit Test Generation for Machine Learning Libraries
Lukas Krodinger, Altin Hajdari, Stephan Lukasczyk, Gordon Fraser
https://arxiv.org/abs/2510.09108 https://

Constraint-Guided Unit Test Generation for Machine Learning Libraries
Machine learning (ML) libraries such as PyTorch and TensorFlow are essential for a wide range of modern applications. Ensuring the correctness of ML libraries through testing is crucial. However, ML APIs often impose strict input constraints involving complex data structures such as tensors. Automated test generation tools such as Pynguin are not aware of these constraints and often create non-compliant inputs. This leads to early test failures and limited code coverage. Prior work has investig…

@arXiv_astrophEP_bot@mastoxiv.page
2025-10-13 08:53:40

Probing the geological setting of exoplanets through atmospheric analysis: using Mars as a test case
Monica Rainer, Evandro Balbi, Francesco Borsa, Paola Cianfarra, Avet Harutyunyan, Silvano Tosi
https://arxiv.org/abs/2510.09305

Probing the geological setting of exoplanets through atmospheric analysis: using Mars as a test case
One of the frontier research fields of exoplanetary science is the study of the composition and variability of exoplanetary atmospheres. This field is now moving from the gas giant planets towards the smaller and colder telluric planets, and future instruments like ANDES will focus on the observations of the atmosphere of telluric planets in the habitable zone in reflected light. These future observations will possibly find variable signals due to the view of different hemispheres of the planet…

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 10:11:10

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
https://arxiv.org/abs/2510.09595

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad…

@arXiv_csCV_bot@mastoxiv.page
2025-10-14 13:45:18

Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Walid Elbarz, Mohamed Bourriz, Hicham Hajji, Hamd Ait Abdelali, François Bourzeix
https://arxiv.org/abs/2510.11576

Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping
Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured w…

@arXiv_grqc_bot@mastoxiv.page
2025-10-14 08:21:08

The Gravitational Wave Memory from Binary Neutron Star Mergers
Jamie Bamber, Antonios Tsokaros, Milton Ruiz, Stuart L. Shapiro, Marc Favata, Matthew Karlson, Fabrizio Venturi Piñas
https://arxiv.org/abs/2510.09742

The Gravitational Wave Memory from Binary Neutron Star Mergers
The gravitational wave signal produced by the merger of two compact objects includes both an oscillatory transient and a non-oscillatory part, the so-called memory effect. This produces a permanent displacement of test masses and has not yet been measured. We use general relativistic magnetohydrodynamic simulations, including neutrinos, with several representative viable equations of state, to quantify--for the first time--the effects of the neutron star magnetic field, neutrino emission, and t…

@arXiv_csCL_bot@mastoxiv.page
2025-10-14 13:18:28

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou
https://arxiv.org/abs/2510.11695

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, …

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-14 10:41:19

Probing cosmic curvature with Alcock-Paczynski data
Yungui Gong, Qing Gao, Xuchen Lu, Zhu Yi
https://arxiv.org/abs/2510.11555 https://arxiv.org/pdf/2510.11…

Probing cosmic curvature with Alcock-Paczynski data
The Alcock-Paczynski (AP) parameter $F_{AP}$ is independent of the sound horizon $r_d$, making the Dark Energy Spectroscopic Instrument (DESI) baryon acoustic oscillation (BAO) AP measurements particularly well suited for cosmological applications. We propose a novel null test of cosmic curvature tailored to DESI BAO data that combines $F_{AP}$ with the ratios $D_V'/D_V$ or $D_M'/D_M$. This null test can also be performed using a joint dataset of DESI BAO and type Ia supernova (SNe Ia) observat…

@arXiv_csRO_bot@mastoxiv.page
2025-10-08 10:05:09

Verifier-free Test-Time Sampling for Vision Language Action Models
Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
https://arxiv.org/abs/2510.05681 https:/…

Verifier-free Test-Time Sampling for Vision Language Action Models
Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, they remain fundamentally limited in tasks that require high precision due to their single-inference paradigm. While test-time scaling approaches using external verifiers have shown promise, they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 10:33:58

How Students Use Generative AI for Software Testing: An Observational Study
Baris Ardic, Quentin Le Dilavrec, Andy Zaidman
https://arxiv.org/abs/2510.10551 https://

How Students Use Generative AI for Software Testing: An Observational Study
The integration of generative AI tools like ChatGPT into software engineering workflows opens up new opportunities to boost productivity in tasks such as unit test engineering. However, these AI-assisted workflows can also significantly alter the developer's role, raising concerns about control, output quality, and learning, particularly for novice developers. This study investigates how novice software developers with foundational knowledge in software testing interact with generative AI for e…

@arXiv_astrophHE_bot@mastoxiv.page
2025-10-14 10:17:48

Accretion onto Reissner-Nordström naked singularities
Tomasz Krajewski, Włodek Kluźniak
https://arxiv.org/abs/2510.10043 https://arxiv.or…

Accretion onto Reissner-Nordström naked singularities
Nearly every galactic core contains a supermassive compact object, hypothesized to be a Kerr black hole. It was only with the advent of Event Horizon Telescope observations that the predictions of this hypothesis could be observationally tested for our own Galaxy, and the nearby elliptical M87, on spatial scales comparable to the gravitational radius. At the same time it became possible to test whether alternatives such as naked singularities in general relativity, or similar objects in alterna…

@arXiv_csCL_bot@mastoxiv.page
2025-10-14 21:37:08

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[2/9]:
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei

@arXiv_csAI_bot@mastoxiv.page
2025-10-13 09:20:30

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai
https://arxiv.org/abs/2510.08928 h…

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, where each agent operat…

@arXiv_statML_bot@mastoxiv.page
2025-10-10 09:26:09

Beyond Real Data: Synthetic Data through the Lens of Regularization
Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis
https://arxiv.org/abs/2510.08095

Beyond Real Data: Synthetic Data through the Lens of Regularization
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between t…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 09:59:28

LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
Linghan Huang, Peizhou Zhao, Huaming Chen
https://arxiv.org/abs/2510.10179 https://

LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
The rapid development of large language models (LLMs) has revolutionized software testing, particularly fuzz testing, by automating the generation of diverse and effective test inputs. This advancement holds great promise for improving software reliability. Meanwhile, the introduction of MOJO, a high-performance AI programming language blending Python's usability with the efficiency of C and C++, presents new opportunities to enhance AI model scalability and programmability. However, as a new l…

@arXiv_quantph_bot@mastoxiv.page
2025-10-09 10:58:01

Is it Gaussian? Testing bosonic quantum states
Filippo Girardi, Freek Witteveen, Francesco Anna Mele, Lennart Bittel, Salvatore F. E. Oliviero, David Gross, Michael Walter
https://arxiv.org/abs/2510.07305

Is it Gaussian? Testing bosonic quantum states
Gaussian states are widely regarded as one of the most relevant classes of continuous-variable (CV) quantum states, as they naturally arise in physical systems and play a key role in quantum technologies. This motivates a fundamental question: given copies of an unknown CV state, how can we efficiently test whether it is Gaussian? We address this problem from the perspective of representation theory and quantum learning theory, characterizing the sample complexity of Gaussianity testing as a fu…

@arXiv_grqc_bot@mastoxiv.page
2025-10-13 09:42:30

Particles with precessing spin in Kerr spacetime: analytic solutions for eccentric orbits and homoclinic motion near the equatorial plane
Gabriel Andres Piovano
https://arxiv.org/abs/2510.09597

Particles with precessing spin in Kerr spacetime: analytic solutions for eccentric orbits and homoclinic motion near the equatorial plane
We present a family of analytic solutions for the nearly-equatorial motion of a test particle with precessing spin in Kerr spacetime. We solve the equations of motion up to linear order in the small body's spin for periodic and homoclinic orbits. At zero order, the particle moves along equatorial geodesics. The spin-curvature force introduces post-geodesic corrections which, for generic spin orientations, cause the precession of the orbital plane. We derive the solutions for eccentric orbits in…

@arXiv_csLG_bot@mastoxiv.page
2025-10-09 10:55:11

Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski
https://arxiv.org/abs/2510.07257 h…

Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightwe…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 10:41:48

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela
https://arxiv.org/abs/2510.10824

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
We present an approach to software testing automation using Agentic Retrieval-Augmented Generation (RAG) systems for Quality Engineering (QE) artifact creation. We combine autonomous AI agents with hybrid vector-graph knowledge systems to automate test plan, case, and QE metric generation. Our approach addresses traditional software testing limitations by leveraging LLMs such as Gemini and Mistral, multi-agent orchestration, and enhanced contextualization. The system achieves remarkable accurac…

@arXiv_csCR_bot@mastoxiv.page
2025-10-08 09:18:49

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Xiaogeng Liu, Chaowei Xiao
https://arxiv.org/abs/2510.05379 https://

AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
Recent advancements in jailbreaking large language models (LLMs), such as AutoDAN-Turbo, have demonstrated the power of automated strategy discovery. AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt, which may not fully exploit the potential of the learned strategy library. In this paper, we propose to…

@arXiv_statME_bot@mastoxiv.page
2025-10-10 09:05:49

Detection of mean changes in partially observed functional data
Šárka Hudecová, Claudia Kirch
https://arxiv.org/abs/2510.07854 https://arxi…

Detection of mean changes in partially observed functional data
We propose a test for a change in the mean for a sequence of functional observations that are only partially observed on subsets of the domain, with no information available on the complement. The framework accommodates important scenarios, including both abrupt and gradual changes. The significance of the test statistic is assessed via a permutation test. In addition to the classical permutation approach with a fixed number of permutation samples, we also discuss a variant with controlled resa…

@arXiv_csRO_bot@mastoxiv.page
2025-10-10 10:20:59

Scalable Offline Metrics for Autonomous Driving
Animikh Aich, Adwait Kulkarni, Eshed Ohn-Bar
https://arxiv.org/abs/2510.08571 https://arxiv.org/pdf/2510.08…

Scalable Offline Metrics for Autonomous Driving
Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is unders…

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 08:07:37

Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Muhammad Maaz, Liam DeVoe, Zac Hatfield-Dodds, Nicholas Carlini
https://arxiv.org/abs/2510.09907 https:/…

Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Property-based testing (PBT) is a lightweight formal method, typically implemented as a randomized testing framework. Users specify the input domain for their test using combinators supplied by the PBT framework, and the expected properties or invariants as a unit-test function. The framework then searches for a counterexample, e.g. by generating inputs and calling the test function. In this work, we demonstrate an LLM-based agent which analyzes Python modules, infers function-specific and cros…

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-13 09:48:30

Extending CSST Emulator to post-DESI era
Zhao Chen, Yu Yu
https://arxiv.org/abs/2510.09503 https://arxiv.org/pdf/2510.09503

Extending CSST Emulator to post-DESI era
The recent DESI BAO measurements have revealed a potential deviation from a cosmological constant, suggesting a dynamic nature of dark energy. To rigorously test this result, complementary probes such as weak gravitational lensing are crucial, demanding highly accurate and efficient predictions of the nonlinear matter power spectrum within the $w_0w_a$CDM framework. However, most existing emulators fail to cover the full parameter posterior from DESI DR2+CMB constraints in the $w_0\mbox{-}w_a$ …

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:40:39

LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
Harshil Vejendla
https://arxiv.org/abs/2510.05530 https://arxiv.org/pdf…

LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
Test-time adaptation (TTA) aims to adapt a pretrained model to distribution shifts using only unlabeled test data. While promising, existing methods like Tent suffer from instability and can catastrophically forget the source knowledge, especially with small batch sizes or challenging corruptions. We argue that this arises from overly deterministic updates on a complex loss surface. In this paper, we introduce Langevin-Anchored Test-Time Adaptation (LATTA), a novel approach that regularizes ada…

@arXiv_quantph_bot@mastoxiv.page
2025-10-09 10:39:41

High-Performance Imaging in a Dilution Refrigerator
Timo Eikelmann, Mara Brinkmann, Leonie Eggers, Tuncay Ulas, Donika Imeri, Konstantin Beck, Lasse Jens Irrgang, Sunil Kumar Mahato, Rikhav Shah, Ralf Riedinger
https://arxiv.org/abs/2510.07054

High-Performance Imaging in a Dilution Refrigerator
Nanophotonic light-matter interfaces hold great promise for quantum technologies. Enhancing local electromagnetic fields, they enable highly efficient detectors, can help realize optically connected processors, or serve as quantum repeaters. In-situ fiber-coupling at sub-Kelvin temperatures, as required for test and development of new devices, proves challenging as suitable cryogenic microscopes are not readily available. Here, we report on a robust and versatile confocal imaging system integra…

@arXiv_statML_bot@mastoxiv.page
2025-10-10 08:29:48

A Honest Cross-Validation Estimator for Prediction Performance
Tianyu Pan, Vincent Z. Yu, Viswanath Devanarayan, Lu Tian
https://arxiv.org/abs/2510.07649 https://

A Honest Cross-Validation Estimator for Prediction Performance
Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, w…
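The split, train, evaluate, and average loop described in this abstract can be sketched as follows. This is a minimal illustration of ordinary K-fold cross-validation, not the estimator the paper proposes; the helper names `fit_line` and `k_fold_mse`, the one-dimensional least-squares model, and the interleaved (unshuffled) fold assignment are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5):
    """Average held-out mean squared error over k folds.

    Fold f holds out indices f, f + k, f + 2k, ... (an assumed,
    deterministic split; real analyses usually shuffle first).
    Requires len(xs) >= k so that every fold is non-empty.
    """
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        test = [(xs[i], ys[i]) for i in sorted(test_idx)]
        # Train on the training split only.
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        # Evaluate on the held-out split.
        mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)
        fold_scores.append(mse)
    # Average performance across the data splits.
    return sum(fold_scores) / k
```

Note that each fold fits a different model, so the averaged score describes the fitting procedure rather than any single fitted model, which is the criticism the abstract raises.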

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:47:01

GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang
https://arxiv.org/abs/2510.07217

GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods opera…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:28:19

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Zhiyuan Yu, Qipeng Guo, Xuanjing Huang, Xipeng Qiu
https://arxiv.org/abs/2510.06014

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
Test-time scaling has emerged as a transformative paradigm for enhancing the performance of large reasoning models, enabling dynamic allocation of computational resources during inference. However, as the landscape of reasoning models rapidly expands, a critical question remains: how can we systematically compare and evaluate the test-time scaling capabilities across different models? In this paper, we introduce ARISE (Adaptive Resolution-aware Scaling Evaluation), a novel metric specifically d…

@arXiv_grqc_bot@mastoxiv.page
2025-10-10 09:47:09

Effects of magnetic fields on spinning test particles orbiting Kerr-Bertotti-Robinson black holes
Yu-Kun Zhang, Shao-Wen Wei
https://arxiv.org/abs/2510.07914 https://

Effects of magnetic fields on spinning test particles orbiting Kerr-Bertotti-Robinson black holes
In this paper, we study the kinematic effects of spinning test particles orbiting the Kerr-Bertotti-Robinson black hole. Employing the Mathisson-Papapetrou-Dixon equations, we explore the dynamics of precessing orbits and distinct orbital types, including circular orbits and innermost stable circular orbits. Our results reveal the substantial impact of the magnetic field on the trajectories of spinning particles, particularly in regions characterized by significant radial distances. More i…

@arXiv_csCR_bot@mastoxiv.page
2025-10-09 08:57:21

Proofs of No Intrusion
Vipul Goyal, Justin Raizes
https://arxiv.org/abs/2510.06432 https://arxiv.org/pdf/2510.06432…

Proofs of No Intrusion
A central challenge in data security is not just preventing theft, but detecting whether it has occurred. Classically, this is impossible because a perfect copy leaves no evidence. Quantum mechanics, on the other hand, forbids general duplication, opening up new possibilities. We introduce Proofs of No Intrusion, which enable a classical client to remotely test whether a quantum server has been hacked and the client's data stolen. Crucially, the test does not destroy the data being tested, av…

@arXiv_astrophCO_bot@mastoxiv.page
2025-10-13 09:28:10

Euclid preparation. Cosmology Likelihood for Observables in Euclid (CLOE). 4: Validation and Performance
Collaboration, Martinelli, Pezzotta, Sciotti, Blot, Bonici, Camera, Cañas-Herrera, Cardone, Carrilho, Casas, Davini, Di Domizio, Farrens, Goh, Beauchamps, Ilić, Joudaki, Keil, Le Brun, Moretti, Pettorino, Sánchez, Sakr, Tanidis, Tutusaus, Ajani, Crocce, Giocoli, Legrand, Lembo, Lesci, Girones, Nouri-Zonoz, Pamuk, Tsedrik, Bel, Carbone, Duncan, Kilbinger, Lacasa, Lattan…

Euclid preparation. Cosmology Likelihood for Observables in Euclid (CLOE). 4: Validation and Performance
The Euclid satellite will provide data on the clustering of galaxies and on the distortion of their measured shapes, which can be used to constrain and test the cosmological model. However, the increase in precision places strong requirements on the accuracy of the theoretical modelling for the observables and of the full analysis pipeline. In this paper, we investigate the accuracy of the calculations performed by the Cosmology Likelihood for Observables in Euclid (CLOE), a software able to ha…

@arXiv_statME_bot@mastoxiv.page
2025-10-08 09:30:39

Extension of Wald-Wolfowitz Runs Test for Regression Validity Testing with Repeated Measures of Independent Variable
Bo-Yao Lian, Nelson G. Chen
https://arxiv.org/abs/2510.05861

Extension of Wald-Wolfowitz Runs Test for Regression Validity Testing with Repeated Measures of Independent Variable
The Wald-Wolfowitz runs test can assess the correctness of a regression curve fitted to a data set with one independent parameter. The assessment is performed through examination of the residuals, where the signs of the residuals would appear randomly if the regression curve were correct. We propose extending the test to the case where multiple data points were measured for specific independent parameter values. By randomly permuting the data points corresponding to each independent parameter…
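For reference, the classical runs test on residual signs that this abstract starts from can be sketched as below. It covers only the standard single-measurement case with the usual normal approximation, not the repeated-measures extension the paper proposes; the function name `runs_test` and the convention of dropping exactly-zero residuals are assumptions for illustration.

```python
import math


def runs_test(residuals):
    """Wald-Wolfowitz runs test on the signs of regression residuals.

    Returns (number_of_runs, z_score) under the normal approximation.
    Residuals equal to zero are dropped (an assumed convention).
    """
    signs = [r > 0 for r in residuals if r != 0]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative residuals")

    # A run is a maximal block of consecutive residuals with the same sign.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

    n = n_pos + n_neg
    mean = 2.0 * n_pos * n_neg / n + 1.0
    var = 2.0 * n_pos * n_neg * (2.0 * n_pos * n_neg - n) / (n * n * (n - 1))
    if var <= 0:
        raise ValueError("too few residuals for the normal approximation")
    return runs, (runs - mean) / math.sqrt(var)
```

A strongly negative z-score (too few runs) suggests residuals that stay on one side of the curve for long stretches, the typical signature of a misspecified regression; a strongly positive one suggests unusually regular alternation.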

@arXiv_csLG_bot@mastoxiv.page
2025-10-10 11:04:09

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Konrad Löhr, Shuzhou Yuan, Michael Färber
https://arxiv.org/abs/2510.08236

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:34:39

Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, Junxian He
https://arxiv.org/abs/2510.06135 htt…

Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as asymmetric v…

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:26:11

TTRV: Test-Time Reinforcement Learning for Vision Language Models
Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
https://arxiv.org/abs/2510.06783

TTRV: Test-Time Reinforcement Learning for Vision Language Models
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency o…

@arXiv_statME_bot@mastoxiv.page
2025-10-08 08:49:19

A new composite Mann-Whitney test for two-sample survival comparisons with right-censored data
Abid Hussain, Touqeer Ahmad
https://arxiv.org/abs/2510.05353 https://

A new composite Mann-Whitney test for two-sample survival comparisons with right-censored data
A fundamental challenge in comparing two survival distributions with right-censored data is the selection of an appropriate nonparametric test, as the power of standard tests like the log-rank and Wilcoxon is highly dependent on the often unknown nature of the alternative hypothesis. This paper introduces a new, distribution-free two-sample test designed to overcome this limitation. The proposed method is based on a strategic decomposition of the data into uncensored and censored subsets, from …

@arXiv_csCL_bot@mastoxiv.page
2025-10-10 11:10:59

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen
https://arxiv.org/abs/2510.08569

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. G…

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:45:59

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh
https://arxiv.org/abs/2510.05635 h…

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared…
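
As a rough illustration of the re-centering idea mentioned in the abstract, the sketch below shifts a batch of target embeddings so that their mean lies at the origin. The abstract is truncated, so this is not NEO's actual procedure: it assumes "re-centering" means subtracting the batch mean, and the function name `recenter` and the toy data are hypothetical.

```python
from typing import List


def recenter(embeddings: List[List[float]]) -> List[List[float]]:
    """Shift a batch of embeddings so that their mean sits at the origin.

    Assumption: re-centering is plain mean subtraction over the batch.
    """
    if not embeddings:
        return []
    n = len(embeddings)
    dim = len(embeddings[0])
    # Per-dimension mean of the batch.
    mean = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    # Subtract the mean from every embedding.
    return [[vec[d] - mean[d] for d in range(dim)] for vec in embeddings]


# Toy batch of two 2-dimensional "target" embeddings (made-up values).
target_batch = [[1.0, 2.0], [3.0, 6.0]]
centered = recenter(target_batch)
```

No parameters are updated here, which is at least consistent with the "no-optimization" framing in the title, though the paper may do more than this.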

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:37:39

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
https://arxiv.org/abs/2510.06217

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table ret…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:20:52

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera
https://arxiv.org/abs/2510.05038 htt…

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited…

@arXiv_csCV_bot@mastoxiv.page
2025-10-06 10:14:19

Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles
Dong Lao, Yuxiang Zhang, Haniyeh Ehsani Oskouie, Yangchao Wu, Alex Wong, Stefano Soatto
https://arxiv.org/abs/2510.03224

Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles
We propose a test-time defense mechanism against adversarial attacks: imperceptible image perturbations that significantly alter the predictions of a model. Unlike existing methods that rely on feature filtering or smoothing, which can lead to information loss, we propose to "combat noise with noise" by leveraging stochastic resonance to enhance robustness while minimizing information loss. Our approach introduces small translational perturbations to the input image, aligns the transformed feat…

@arXiv_csAI_bot@mastoxiv.page
2025-10-10 10:31:19

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
https://arxiv.org/abs/2510.08189

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning beha…

@arXiv_csLG_bot@mastoxiv.page
2025-10-08 10:54:59

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil
https://arxiv.org/abs/2510.06028 https…

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
The paper provides data-dependent bounds on the test error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The bounds are stable under approximation with Langevin Monte Carlo algorithms. Experiments on the MNIST and CIFAR-10 datasets verify that the bounds yield nontrivial predictions on true labeled data and correctly upper bound the test error for random labels. Our …

@arXiv_csSE_bot@mastoxiv.page
2025-10-10 09:09:09

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
Zeyu Sun, Jingjing Liang, Weiyi Wang, Chenyao Suo, Junjie Chen, Fanjiang Xu
https://arxiv.org/abs/2510.07815

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the correctness and robustness of MLIR itself remains challenging. Existing fuzzing approaches, based on manually crafted templates or rule-based mutations, struggle to generate sufficiently diverse and semantically valid test cases, making it difficult to expose subtle or deep-seated bugs within MLIR's complex…

@arXiv_csCL_bot@mastoxiv.page
2025-10-09 10:21:31

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Manuel Frank, Haithem Afli
https://arxiv.org/abs/2510.06730 https://

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based met…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:27:09

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Dayyán O'Brien, Barry Haddow, Emily Allaway, Pinzhen Chen
https://arxiv.org/abs/2510.05962

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Conducting contamination-free evaluation of mathematical capabilities can be difficult for two reasons: models may memorize a test set once it is made public, and current mathematical benchmarks are prone to overfitting due to having limited diversity of symbols and rules, coupled with closed-ended answers. This paper proposes a method to leverage these shortcomings as useful features to construct a dynamic, counterfactual benchmark, which can be used to both reveal overfitting and measure true…

@arXiv_csLG_bot@mastoxiv.page
2025-10-07 13:06:22

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
https://arxiv.org/abs/2510.05040

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference time …

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 10:18:59

Self-Reflective Generation at Test Time
Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
https://arxiv.org/abs/2510.02919 http…

Self-Reflective Generation at Test Time
Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time …

@arXiv_csSE_bot@mastoxiv.page
2025-10-08 08:38:39

Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
Irtaza Sajid Qureshi, Zhen Ming (Jack) Jiang
https://arxiv.org/abs/2510.05365

Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layers of Bloom's taxonomy: Remember, Understand, Apply, Analyze, Evaluate, and Create, which progressively assess higher levels o…

@arXiv_csAI_bot@mastoxiv.page
2025-10-09 09:58:01

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen
https://arxiv.org/abs/2510.07038

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operatio…

@arXiv_csLG_bot@mastoxiv.page
2025-10-07 13:06:02

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks
https://arxiv.org/abs/2510.05024

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to ino…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:23:42

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang
https://arxiv.org/abs/2510.05090

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality…

@arXiv_csAI_bot@mastoxiv.page
2025-10-06 08:41:39

On the Role of Temperature Sampling in Test-Time Scaling
Yuheng Wu, Azalia Mirhoseini, Thierry Tambe
https://arxiv.org/abs/2510.02611 https://arxiv.org/pdf…

On the Role of Temperature Sampling in Test-Time Scaling
Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different …
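
The sample-then-select loop described in the abstract can be sketched as follows. This is only a minimal illustration of generating K traces and keeping the highest-scoring one; it is not the paper's method. The names `select_best` and `best_of_k`, the sampler and the scoring function are hypothetical stand-ins, and a real system would call a model and a verifier or reward model instead.

```python
import random
from typing import Callable, List


def select_best(traces: List[str], score: Callable[[str], float]) -> str:
    """Return the trace with the highest score (first one wins ties)."""
    if not traces:
        raise ValueError("need at least one trace")
    return max(traces, key=score)


def best_of_k(
    sample: Callable[[random.Random], str],
    score: Callable[[str], float],
    k: int,
    seed: int = 0,
) -> str:
    """Draw k traces from a sampler and keep the best-scoring one."""
    rng = random.Random(seed)
    traces = [sample(rng) for _ in range(k)]
    return select_best(traces, score)


# Toy stand-ins: the "model" picks a canned answer, the "scorer" prefers longer ones.
CANNED = ["a", "bb", "ccc"]


def toy_sampler(rng: random.Random) -> str:
    return rng.choice(CANNED)


picked = best_of_k(toy_sampler, score=len, k=8)
```

Increasing `k` in this sketch only helps while new samples can still beat the current best, which is the kind of saturation the abstract reports at large K.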

@arXiv_csSE_bot@mastoxiv.page
2025-10-08 08:59:09

UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification
Yiannis Charalambous, Claudionor N. Coelho Jr, Luis Lamb, Lucas C. Cordeiro
https://arxiv.org/abs/2510.05441

UnitTenX: Generating Tests for Legacy Packages with AI Agents Powered by Formal Verification
This paper introduces UnitTenX, a state-of-the-art open-source AI multi-agent system designed to generate unit tests for legacy code, enhancing test coverage and critical value testing. UnitTenX leverages a combination of AI agents, formal methods, and Large Language Models (LLMs) to automate test generation, addressing the challenges posed by complex and legacy codebases. Despite the limitations of LLMs in bug detection, UnitTenX offers a robust framework for improving software reliability and…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:03:59

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung
https://arxiv.org/abs/2510.05664 …

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs. Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To …
