Finetuning AI Foundation Models to Develop Subgrid-Scale Parameterizations: A Case Study on Atmospheric Gravity Waves
Aman Gupta, Aditi Sheshadri, Sujit Roy, Johannes Schmude, Vishal Gaur, Wei Ji Leong, Manil Maskey, Rahul Ramachandran
https://arxiv.org/abs/2509.03816
FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
Xiaoyang Liu, Zhengyan Zhou, Zihang Xu, Jiezhang Cao, Zheng Chen, Yulun Zhang
https://arxiv.org/abs/2510.01641
LobRA: Multi-tenant Fine-tuning over Heterogeneous Data
Sheng Lin, Fangcheng Fu, Haoyang Li, Hao Ge, Xuanyu Wang, Jiawen Niu, Yaofeng Tu, Bin Cui
https://arxiv.org/abs/2509.01193
Speaker-Conditioned Phrase Break Prediction for Text-to-Speech with Phoneme-Level Pre-trained Language Model
Dong Yang, Yuki Saito, Takaaki Saeki, Tomoki Koriyama, Wataru Nakata, Detai Xin, Hiroshi Saruwatari
https://arxiv.org/abs/2509.00675
Smart Contract Intent Detection with Pre-trained Programming Language Model
Youwei Huang, Jianwen Li, Sen Fang, Yao Li, Peng Yang, Bin Hu, Tao Zhang
https://arxiv.org/abs/2508.20086
Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough
https://arxiv.org/abs/2510.01185
Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields in Doped Materials
Yi Cao, Paulette Clancy
https://arxiv.org/abs/2509.00090
Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search
Cyrus Neary, Omar G. Younis, Artur Kuramshin, Ozgur Aslan, Glen Berseth
https://arxiv.org/abs/2508.12211
LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng
https://arxiv.org/abs/2509.25670
NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution
Xiangtao Kong, Rongyuan Wu, Shuaizheng Liu, Lingchen Sun, Lei Zhang
https://arxiv.org/abs/2510.00820
MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model
Joonyong Park, Daisuke Saito, Nobuaki Minematsu
https://arxiv.org/abs/2509.01391
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
https://arxiv.org/abs/2509.22459
MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents
Pan Tang, Shixiang Tang, Huanqi Pu, Zhiqing Miao, Zhixing Wang
https://arxiv.org/abs/2509.15635
Pre-trained Transformer-models using chronic invasive electrophysiology for symptom decoding without patient-individual training
Timon Merk, Saeed Salehi, Richard M. Koehler, Qiming Cui, Maria Olaru, Amelia Hahn, Nicole R. Provenza, Simon Little, Reza Abbasi-Asl, Phil A. Starr, Wolf-Julian Neumann
https://arxiv.org/abs/2508.10160
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct
Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin
https://arxiv.org/abs/2509.25035
A Deep Transfer Learning-Based Low-overhead Beam Prediction in Vehicle Communications
Zhiqiang Xiao, Yuwen Cao, Mondher Bouazizi, Tomoaki Ohtsuki, Shahid Mumtaz
https://arxiv.org/abs/2509.20659
Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul
https://arxiv.org/abs/2509.20634
A Sentinel-3 foundation model for ocean colour
Geoffrey Dawson, Remy Vandaele, Andrew Taylor, David Moffat, Helen Tamura-Wicks, Sarah Jackson, Rosie Lickorish, Paolo Fraccaro, Hywel Williams, Chunbo Luo, Anne Jones
https://arxiv.org/abs/2509.21273
U-SWIFT: A Unified Surface Wave Inversion Framework with Transformer via Normalization of Dispersion Curves
Tianjian Cheng, Hongrui Xu, Jiayu Feng, Xiongyu Hu, Chaofan Yao
https://arxiv.org/abs/2509.24872
LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow
Kaiyan Chang, Wenlong Zhu, Shengwen Liang, Huawei Li, Ying Wang
https://arxiv.org/abs/2508.17826
Radio Galaxy Zoo: Morphological classification by Fanaroff-Riley designation using self-supervised pre-training
Nutthawara Buatthaisong, Inigo Val Slijepcevic, Anna M. M. Scaife, Micah Bowles, Andrew Hopkins, Devina Mohan, Stanislav S Shabala, O. Ivy Wong
https://arxiv.org/abs/2509.11988
FusionMAE: large-scale pretrained model to optimize and simplify diagnostic and control of fusion plasma
Zongyu Yang, Zhenghao Yang, Wenjing Tian, Jiyuan Li, Xiang Sun, Guohui Zheng, Songfen Liu, Niannian Wu, Rongpeng Li, Zhaohe Xu, Bo Li, Zhongbing Shi, Zhe Gao, Wei Chen, Xiaoquan Ji, Min Xu, Wulyu Zhong
https://arxiv.org/abs/2509.12945
Knowledge-Driven Hallucination in Large Language Models: An Empirical Study on Process Modeling
Humam Kourani, Anton Antonov, Alessandro Berti, Wil M. P. van der Aalst
https://arxiv.org/abs/2509.15336
Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
Bing Han, Anbai Jiang, Xinhu Zheng, Wei-Qiang Zhang, Jia Liu, Pingyi Fan, Yanmin Qian
https://arxiv.org/abs/2508.12230
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai
https://arxiv.org/abs/2508.15884
FlowVLA: Thinking in Motion with a Visual Chain of Thought
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, Haoang Li
https://arxiv.org/abs/2508.18269
Classical Neural Networks on Quantum Devices via Tensor Network Disentanglers: A Case Study in Image Classification
Borja Aizpurua, Sukhbinder Singh, Román Orús
https://arxiv.org/abs/2509.06653
Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
Tainyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li
https://arxiv.org/abs/2508.16557
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
https://arxiv.org/abs/2509.25182
SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation
Yizhou Zhang, Yuan Gao, Wangjin Zhou, Zicheng Yuan, Keisuke Imoto, Tatsuya Kawahara
https://arxiv.org/abs/2509.15703
In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality
Chenrui Liu, Falong Tan, Chuanlong Xie, Yicheng Zeng, Lixing Zhu
https://arxiv.org/abs/2508.08673
PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos
Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang, Jia-Bin Huang
https://arxiv.org/abs/2509.25183
Replaced article(s) found for physics.geo-ph. https://arxiv.org/list/physics.geo-ph/new
[1/1]:
- PRIME-DP: Pre-trained Integrated Model for Earthquake Data Processing
Ziye Yu, Yuqi Cai, Weitao Wang, Yanru An, Lu Li, Yueyang Xia, Yunpeng Zhang
LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing
Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, Yanxun Xu
https://arxiv.org/abs/2508.05672
Composition and Alignment of Diffusion Models using Constrained Learning
Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro
https://arxiv.org/abs/2508.19104
IP-Augmented Multi-Modal Malicious URL Detection Via Token-Contrastive Representation Enhancement and Multi-Granularity Fusion
Ye Tian, Yanqiu Yu, Liangliang Song, Zhiquan Liu, Yanbin Wang, Jianguo Sun
https://arxiv.org/abs/2510.12395
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
https://arxiv.org/abs/2508.16188
FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation
Bernardo Forni, Gabriele Lombardi, Federico Pozzi, Mirco Planamente
https://arxiv.org/abs/2509.12105
Replaced article(s) found for cs.SE. https://arxiv.org/list/cs.SE/new
[1/1]:
- "I see models being a whole other thing": An Empirical Study of Pre-Trained Model Naming Conventi...
Wenxin Jiang, Mingyu Kim, Chingwo Cheung, Heesoo Kim, George K. Thiruvathukal, James C. Davis
UNICON: UNIfied CONtinual Learning for Medical Foundational Models
Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed
https://arxiv.org/abs/2508.14024
Scalable Evaluation for Audio Identification via Synthetic Latent Fingerprint Generation
Aditya Bhattacharjee, Marco Pasini, Emmanouil Benetos
https://arxiv.org/abs/2509.18620
Personalized Product Search Ranking: A Multi-Task Learning Approach with Tabular and Non-Tabular Data
Lalitesh Morishetti, Abhay Kumar, Jonathan Scott, Kaushiki Nag, Gunjan Sharma, Shanu Vashishtha, Rahul Sridhar, Rohit Chatter, Kannan Achan
https://arxiv.org/abs/2508.09636
Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
César Ali Ojeda Marin, Wilhelm Huisinga, Purity Kavwele, Niklas Hartung
https://arxiv.org/abs/2508.15659
MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
https://arxiv.org/abs/2509.12591
Benchmarking CHGNet Universal Machine Learning Interatomic Potential Against DFT and EXAFS: Case of Layered WS2 and MoS2
Pjotrs Žguns, Inga Pudza, Alexei Kuzmin
https://arxiv.org/abs/2509.08498
Can maiBERT Speak for Maithili?
Sumit Yadav, Raju Kumar Yadav, Utsav Maskey, Gautam Siddharth Kashyap, Md Azizul Hoque, Ganesh Gautam
https://arxiv.org/abs/2509.15048
Replaced article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[2/6]:
- SCoT: Straight Consistent Trajectory for Pre-Trained Diffusion Model Distillations
Zhangkai Wu, Xuhui Fan, Hongyu Wu, Longbing Cao
Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding
Nikita Pavlichenko, Iurii Nazarov, Ivan Dolgov, Ekaterina Garanina, Dmitry Ustalov, Ivan Bondyrev, Kseniia Lysaniuk, Evgeniia Vu, Kirill Chekmenev, Joseph Shtok, Yaroslav Golubev, Anton Semenkin, Uladzislau Sazanovich
https://arxiv.org/abs/2510…
Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance
Francisco Messina, Francesca Ronchini, Luca Comanducci, Paolo Bestagini, Fabio Antonacci
https://arxiv.org/abs/2509.14934
Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nick Karpov, Jagadeesh Balam, Boris Ginsburg
https://arxiv.org/abs/2509.14128
SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
https://arxiv.org/abs/2509.16098
CDE: Concept-Driven Exploration for Reinforcement Learning
Le Mao, Andrew H. Liu, Renos Zabounidis, Zachary Kingston, Joseph Campbell
https://arxiv.org/abs/2510.08851
BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration
Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach
https://arxiv.org/abs/2509.06904
SpeechOp: Inference-Time Task Composition for Generative Speech Processing
Justin Lovelace, Rithesh Kumar, Jiaqi Su, Ke Chen, Kilian Q Weinberger, Zeyu Jin
https://arxiv.org/abs/2509.14298
Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
https://arxiv.org/abs/2510.11686
Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten
Wei Qian, Chenxu Zhao, Yangyi Li, Wenqian Ye, Mengdi Huai
https://arxiv.org/abs/2508.07458
CoRA: Covariate-Aware Adaptation of Time Series Foundation Models
Guo Qin, Zhi Chen, Yong Liu, Zhiyuan Shi, Haixuan Liu, Xiangdong Huang, Jianmin Wang, Mingsheng Long
https://arxiv.org/abs/2510.12681
Two-Stage Swarm Intelligence Ensemble Deep Transfer Learning (SI-EDTL) for Vehicle Detection Using Unmanned Aerial Vehicles
Zeinab Ghasemi Darehnaei, Mohammad Shokouhifar, Hossein Yazdanjouei, S. M. J. Rastegar Fatemi
https://arxiv.org/abs/2509.08026

Two-Stage Swarm Intelligence Ensemble Deep Transfer Learning (SI-EDTL) for Vehicle Detection Using Unmanned Aerial Vehicles
This paper introduces SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model for detecting multiple vehicles in UAV images. It combines three pre-trained Faster R-CNN feature extractor models (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes), resulting in 15 different base learners. These are aggregated via weighted averaging to classify regions as Car, Van, Truck, Bus, or background. Hyperparameters are optimized with t…
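A minimal sketch of the aggregation step described above, assuming hypothetical interfaces: the predict_proba stub below stands in for a real (Faster R-CNN backbone, transfer classifier) pair, and the per-learner weights would come from the paper's swarm-intelligence hyperparameter search rather than being set by hand. This is an illustration of weighted-average ensembling over the 15 base learners, not the authors' code.

import numpy as np

CLASSES = ["Car", "Van", "Truck", "Bus", "background"]
EXTRACTORS = ["InceptionV3", "ResNet50", "GoogLeNet"]        # Faster R-CNN feature extractors
CLASSIFIERS = ["KNN", "SVM", "MLP", "C4.5", "NaiveBayes"]    # transfer classifiers

def predict_proba(extractor, classifier, region):
    """Placeholder for one base learner's class-probability output on a region proposal."""
    rng = np.random.default_rng(abs(hash((extractor, classifier))) % (2**32))
    p = rng.random(len(CLASSES))
    return p / p.sum()

def ensemble_predict(region, weights):
    """Weighted average of all 15 base learners' probabilities; argmax gives the label."""
    total = np.zeros(len(CLASSES))
    norm = 0.0
    for ext in EXTRACTORS:
        for clf in CLASSIFIERS:
            w = weights.get((ext, clf), 1.0)
            total += w * predict_proba(ext, clf, region)
            norm += w
    return CLASSES[int(np.argmax(total / norm))]

if __name__ == "__main__":
    region = np.zeros((224, 224, 3))  # dummy region proposal
    weights = {(e, c): 1.0 for e in EXTRACTORS for c in CLASSIFIERS}  # in the paper, tuned by swarm search
    print(ensemble_predict(region, weights))

With uniform weights this reduces to plain probability averaging; the swarm-intelligence stage exists precisely to find non-uniform weights that favor the stronger (backbone, classifier) pairs.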
Replaced article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[3/5]:
- VisionTS : Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu