Tootfinder

@arXiv_csLG_bot@mastoxiv.page
2025-07-31 09:25:31

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
Oleksiy Ostapenko, Charles Guille-Escuret, Luke Kumar, Max Tian, Denis Kocetkov, Gopeshh Subbaraj, Raymond Li, Joel Lamy-Poirier, Sebastien Paquet, Torsten Scholak
https://arxiv.org/abs/2507.22250

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web data, etc.) in order to make optimal decisions about resource allocation for data sourcing from these sources for the stage two pre-training phase, aka annealing, with the goal of specializing a generalist pre-trained model to specific domains. Our approach exten…

@arXiv_csCL_bot@mastoxiv.page
2025-06-30 10:22:00

HyperCLOVA X THINK Technical Report
NAVER Cloud HyperCLOVA X Team
https://arxiv.org/abs/2506.22403 https://arxiv.org/pdf/2506.22403…

HyperCLOVA X THINK Technical Report
We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly $6$ trillion high-quality Korean, and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with $μ$P, pre-trained through a three-stage curriculum that expands the context window to $128$K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifi…

@arXiv_csCV_bot@mastoxiv.page
2025-07-30 10:40:51

Staining and locking computer vision models without retraining
Oliver J. Sutton, Qinghua Zhou, George Leete, Alexander N. Gorban, Ivan Y. Tyukin
https://arxiv.org/abs/2507.22000

Staining and locking computer vision models without retraining
We introduce new methods of staining and locking computer vision models, to protect their owners' intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, comput…

@arXiv_csDC_bot@mastoxiv.page
2025-05-30 07:17:04

Speeding up Model Loading with fastsafetensors
Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman
https://arxiv.org/abs/2505.23072

Speeding up Model Loading with fastsafetensors
The rapid increases in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilized storage throughput and significantly slowed down loading large models with a widely-used model file formats, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserializati…

@arXiv_csMM_bot@mastoxiv.page
2025-05-30 09:54:06

This https://arxiv.org/abs/2411.17690 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csMM_…

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
The rapid progress of foundation models and large language models (LLMs) has fueled significantly improvement in the capabilities of machine learning systems that benefit from mutlimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment o…

@arXiv_csCL_bot@mastoxiv.page
2025-07-30 10:18:51

Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
https://arxiv.org/abs/2507.21750

Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose …

@arXiv_csSD_bot@mastoxiv.page
2025-05-30 09:56:27

This https://arxiv.org/abs/2505.20745 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSD_…

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation mod…

@arXiv_csAI_bot@mastoxiv.page
2025-06-24 11:48:20

How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
Feng He, Zhenyang Liu, Marco Valentino, Zhixue Zhao
https://arxiv.org/abs/2506.18428

How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
Model editing offers a low-cost technique to inject or correct a particular behavior in a pre-trained model without extensive retraining, supporting applications such as factual correction and bias mitigation. Despite this common practice, it remains unknown whether edits persist after fine-tuning or whether they are inadvertently reversed. This question has fundamental practical implications. For example, if fine-tuning removes prior edits, it could serve as a defence mechanism against hidden …

@arXiv_csDB_bot@mastoxiv.page
2025-05-30 07:16:53

TailorSQL: An NL2SQL System Tailored to Your Query Workload
Kapil Vaidya, Jialin Ding, Sebastian Kosak, David Kernert, Chuan Lei, Xiao Qin, Abhinav Tripathy, Ramesh Balan, Balakrishnan Narayanaswamy, Tim Kraska
https://arxiv.org/abs/2505.23039

TailorSQL: An NL2SQL System Tailored to Your Query Workload
NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existin…

@arXiv_csCL_bot@mastoxiv.page
2025-07-29 08:31:31

HITSZ's End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang
https://arxiv.org/abs/2507.19616

HITSZ's End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
This paper presents HITSZ's submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of …

@arXiv_csCV_bot@mastoxiv.page
2025-07-28 10:15:31

Back to the Features: DINO as a Foundation for Video World Models
Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski
https://arxiv.org/abs/2507.19468

Back to the Features: DINO as a Foundation for Video World Models
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, a…

@arXiv_eessIV_bot@mastoxiv.page
2025-06-27 09:17:49

GANet-Seg: Adversarial Learning for Brain Tumor Segmentation with Hybrid Generative Models
Qifei Cui, Xinyu Lu
https://arxiv.org/abs/2506.21245 https://

GANet-Seg: Adversarial Learning for Brain Tumor Segmentation with Hybrid Generative Models
This work introduces a novel framework for brain tumor segmentation leveraging pre-trained GANs and Unet architectures. By combining a global anomaly detection module with a refined mask generation network, the proposed model accurately identifies tumor-sensitive regions and iteratively enhances segmentation precision using adversarial loss constraints. Multi-modal MRI data and synthetic image augmentation are employed to improve robustness and address the challenge of limited annotated dataset…

@arXiv_astrophIM_bot@mastoxiv.page
2025-07-29 09:23:41

Finetuning Stellar Spectra Foundation Models with LoRA
Xiaosheng Zhao, Yuan-Sen Ting, Alexander S. Szalay, Yang Huang
https://arxiv.org/abs/2507.20972 https://

Finetuning Stellar Spectra Foundation Models with LoRA
Foundation models are beginning to impact stellar spectroscopy, where spectra encode rich physical information in a structured, language-like form. A key challenge is adapting these models across heterogeneous surveys with differing resolution and coverage. We apply Low-Rank Adaptation (LoRA) to fine-tune SpecCLIP--a contrastively pre-trained model on LAMOST and Gaia XP spectra--for downstream tasks on DESI Early Data Release (EDR) spectra. We show that LoRA enables few-shot learning on DESI, w…

@arXiv_csSE_bot@mastoxiv.page
2025-07-22 10:01:10

On the Effect of Token Merging on Pre-trained Models for Code
Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan
https://arxiv.org/abs/2507.14423 https://…

On the Effect of Token Merging on Pre-trained Models for Code
Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from classification to generation. However, the output of these tokenizers is often longer than that traditionally used in compilers and interpreters. This could result in undesirable effects, such as increased computational overhead. In this work, we investigate t…

@arXiv_qbioGN_bot@mastoxiv.page
2025-06-25 08:15:39

eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis
Zhenke Liu, Jien Li, Ziqi Zhang
https://arxiv.org/abs/2506.18940 https://

eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis
Extrachromosomal circular DNA (eccDNA) plays key regulatory roles and contributes to oncogene overexpression in cancer through high-copy amplification and long-range interactions. Despite advances in modeling, no pre-trained models currently support full-length circular eccDNA for downstream analysis. Existing genomic models are either limited to single-nucleotide resolution or hindered by the inefficiency of the quadratic attention mechanism. Here, we introduce eccDNAMamba, the first bidirecti…

@arXiv_csSC_bot@mastoxiv.page
2025-05-29 07:21:09

Symbolic Foundation Regressor on Complex Networks
Weiting Liu, Jiaxu Cui, Jiao Hu, En Wang, Bo Yang
https://arxiv.org/abs/2505.21879 https://

Symbolic Foundation Regressor on Complex Networks
In science, we are interested not only in forecasting but also in understanding how predictions are made, specifically what the interpretable underlying model looks like. Data-driven machine learning technology can significantly streamline the complex and time-consuming traditional manual process of discovering scientific laws, helping us gain insights into fundamental issues in modern science. In this work, we introduce a pre-trained symbolic foundation regressor that can effectively compress …

@arXiv_csCR_bot@mastoxiv.page
2025-07-25 09:19:22

LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models
Delong Ran, Xinlei He, Tianshuo Cong, Anyu Wang, Qi Li, Xiaoyun Wang
https://arxiv.org/abs/2507.18302

LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models
Language Models (LMs) typically adhere to a "pre-training and fine-tuning" paradigm, where a universal pre-trained model can be fine-tuned to cater to various specialized domains. Low-Rank Adaptation (LoRA) has gained the most widespread use in LM fine-tuning due to its lightweight computational cost and remarkable performance. Because the proportion of parameters tuned by LoRA is relatively small, there might be a misleading impression that the LoRA fine-tuning data is invulnerable to Membersh…

@arXiv_csSD_bot@mastoxiv.page
2025-07-29 07:51:51

Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
Hei Shing Cheung, Boya Zhang
https://arxiv.org/abs/2507.19991 https://

Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi- scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220…

@arXiv_qfinPM_bot@mastoxiv.page
2025-07-29 09:09:51

Your AI, Not Your View: The Bias of LLMs in Investment Analysis
Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee
https://arxiv.org/abs/2507.20957

Your AI, Not Your View: The Bias of LLMs in Investment Analysis
In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model's embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an exp…

@arXiv_qbiobm_bot@mastoxiv.page
2025-06-23 08:34:20

Aptamer-protein interaction prediction model based on transformer
Zhichao Yan, Yue Kang, Buyong Ma
https://arxiv.org/abs/2506.16084 https://

Aptamer-protein interaction prediction model based on transformer
Aptamers are single-stranded DNA/RNAs or short peptides with unique tertiary structures that selectively bind to specific targets. They have great potential in the detection and medical fields. Here, we present SelfTrans-Ensemble, a deep learning model that integrates sequence information models and structural information models to extract multi-scale features for predicting aptamer-protein interactions (APIs). The model employs two pre-trained models, ProtBert and RNA-FM, to encode protein and…

@arXiv_csGR_bot@mastoxiv.page
2025-06-10 09:05:12

Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor
Rishit Dagli, Yushi Guan, Sankeerth Durvasula, Mohammadreza Mofayezi, Nandita Vijaykumar
https://arxiv.org/abs/2506.07932

Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor
We propose Squeeze3D, a novel framework that leverages implicit prior knowledge learnt by existing pre-trained 3D generative models to compress 3D data at extremely high compression ratios. Our approach bridges the latent spaces between a pre-trained encoder and a pre-trained generation model through trainable mapping networks. Any 3D model represented as a mesh, point cloud, or a radiance field is first encoded by the pre-trained encoder and then transformed (i.e. compressed) into a highly com…

@arXiv_csLG_bot@mastoxiv.page
2025-07-24 09:54:39

Computer Vision for Real-Time Monkeypox Diagnosis on Embedded Systems
Jacob M. Delgado-L\'opez, Ricardo A. Morell-Rodriguez, Sebasti\'an O. Espinosa-Del Rosario, Wilfredo E. Lugo-Beauchamp
https://arxiv.org/abs/2507.17123

Computer Vision for Real-Time Monkeypox Diagnosis on Embedded Systems
The rapid diagnosis of infectious diseases, such as monkeypox, is crucial for effective containment and treatment, particularly in resource-constrained environments. This study presents an AI-driven diagnostic tool developed for deployment on the NVIDIA Jetson Orin Nano, leveraging the pre-trained MobileNetV2 architecture for binary classification. The model was trained on the open-source Monkeypox Skin Lesion Dataset, achieving a 93.07% F1-Score, which reflects a well-balanced performance in p…

@arXiv_csSD_bot@mastoxiv.page
2025-06-23 10:26:40

Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
Jianyuan Feng, Guangzheng Li, Yangfei Xu
https://arxiv.org/abs/2506.16833

Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
Language-queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretra…

@arXiv_qbioQM_bot@mastoxiv.page
2025-06-24 08:45:40

BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
Moein Khajehnejad, Forough Habibollahi, Adeel Razi
https://arxiv.org/abs/2506.18314

BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
Existing foundation models for neuroimaging are often prohibitively large and data-intensive. We introduce BrainSymphony, a lightweight, parameter-efficient foundation model that achieves state-of-the-art performance while being pre-trained on significantly smaller public datasets. BrainSymphony's strong multimodal architecture processes functional MRI data through parallel spatial and temporal transformer streams, which are then efficiently distilled into a unified representation by a Perceive…

@arXiv_csRO_bot@mastoxiv.page
2025-06-16 07:49:19

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull
https://arxiv.org/abs/2506.11234

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Poutine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained …

@arXiv_qfinST_bot@mastoxiv.page
2025-06-10 09:46:13

DELPHYNE: A Pre-Trained Model for General and Financial Time Series
Xueying Ding, Aakriti Mittal, Achintya Gopal
https://arxiv.org/abs/2506.06288 https://

DELPHYNE: A Pre-Trained Model for General and Financial Time Series
Time-series data is a vital modality within data science communities. This is particularly valuable in financial applications, where it helps in detecting patterns, understanding market behavior, and making informed decisions based on historical data. Recent advances in language modeling have led to the rise of time-series pre-trained models that are trained on vast collections of datasets and applied to diverse tasks across financial domains. However, across financial applications, existing ti…

@arXiv_eessIV_bot@mastoxiv.page
2025-06-23 10:01:40

Fast Training-free Perceptual Image Compression
Ziran Zhu, Tongda Xu, Minye Huang, Dailan He, Xingtong Ge, Xinjie Zhang, Ling Li, Yan Wang
https://arxiv.org/abs/2506.16102

Fast Training-free Perceptual Image Compression
Training-free perceptual image codec adopt pre-trained unconditional generative model during decoding to avoid training new conditional generative model. However, they heavily rely on diffusion inversion or sample communication, which take 1 min to intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with theoretical guarantee. We further propose different implementations for optimal p…

@arXiv_qbioNC_bot@mastoxiv.page
2025-06-04 07:49:14

A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning for Any Atlas and Disorder
Xinxu Wei, Kanhao Zhao, Yong Jiao, Lifang He, Yu Zhang
https://arxiv.org/abs/2506.02044

A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning for Any Atlas and Disorder
As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or region-of-interest (ROI) features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that …

@arXiv_csSD_bot@mastoxiv.page
2025-07-25 07:39:21

Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
Xiaoxu Zhu, Junhua Li
https://arxiv.org/abs/2507.17851 https://arxiv.org/pdf/…

Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
Speech pretrained models contain task-specific information across different layers, but decoupling content and timbre information remains challenging as removing speaker-specific information often causes content loss. Current research lacks direct metrics to quantify timbre residual in model encodings, relying on indirect evaluation through downstream tasks. This paper addresses these challenges through interpretability-based speaker disentanglement in speech pretraining models. We quantitative…

@arXiv_eessSP_bot@mastoxiv.page
2025-06-12 08:36:51

Foundation Model-Aided Deep Reinforcement Learning for RIS-Assisted Wireless Communication
Mohammad Ghassemi, Sara Farrag Mobarak, Han Zhang, Ali Afana, Akram Bin Sediq, Melike Erol-Kantarci
https://arxiv.org/abs/2506.09855

Foundation Model-Aided Deep Reinforcement Learning for RIS-Assisted Wireless Communication
Reconfigurable intelligent surfaces (RIS) have emerged as a promising technology for enhancing wireless communication by dynamically controlling signal propagation in the environment. However, their efficient deployment relies on accurate channel state information (CSI), which leads to high channel estimation overhead due to their passive nature and the large number of reflective elements. In this work, we solve this challenge by proposing a novel framework that leverages a pre-trained open-sou…

@arXiv_csGR_bot@mastoxiv.page
2025-06-26 08:19:00

EditP23: 3D Editing via Propagation of Image Prompts to Multi-View
Roi Bar-On, Dana Cohen-Bar, Daniel Cohen-Or
https://arxiv.org/abs/2506.20652 https://

EditP23: 3D Editing via Propagation of Image Prompts to Multi-View
We present EditP23, a method for mask-free 3D editing that propagates 2D image edits to multi-view representations in a 3D-consistent manner. In contrast to traditional approaches that rely on text-based prompting or explicit spatial masks, EditP23 enables intuitive edits by conditioning on a pair of images: an original view and its user-edited counterpart. These image prompts are used to guide an edit-aware flow in the latent space of a pre-trained multi-view diffusion model, allowing the edit…

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:58:42

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
https://arxiv.org/abs/2507.08606

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechan…

@arXiv_csCV_bot@mastoxiv.page
2025-06-17 09:58:49

CLIP-HandID: Vision-Language Model for Hand-Based Person Identification
Nathanael L. Baisa, Babu Pallam, Amudhavel Jayavel
https://arxiv.org/abs/2506.12447

CLIP-HandID: Vision-Language Model for Hand-Based Person Identification
This paper introduces a new approach to person identification based on hand images, designed specifically for criminal investigations. The method is particularly valuable in serious crimes like sexual abuse, where hand images are often the sole identifiable evidence available. Our proposed method, CLIP-HandID, leverages pre-trained foundational vision-language model, particularly CLIP, to efficiently learn discriminative deep feature representations from hand images given as input to the image …

@arXiv_csSD_bot@mastoxiv.page
2025-07-29 09:19:01

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
Taesoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, Gyeong-Moon Park
https://arxiv.org/abs/2507.20140

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research to selectively remove the knowledge to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the fir…

@arXiv_csCR_bot@mastoxiv.page
2025-06-09 08:05:12

Stealix: Model Stealing via Prompt Evolution
Zhixiong Zhuang, Hui-Po Wang, Maria-Irina Nicolae, Mario Fritz
https://arxiv.org/abs/2506.05867 https://

Stealix: Model Stealing via Prompt Evolution
Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed …

@arXiv_eessAS_bot@mastoxiv.page
2025-06-03 07:44:25

Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
https://arxiv.org/abs/2506.01157

Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations
In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features--such as pitch, tone, rhythm, and intonation--into their synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), the use of representations from SPTM pre-trained for paralinguistic speech processing, which excel in paralinguistic tasks like…

@arXiv_csDC_bot@mastoxiv.page
2025-07-22 08:53:20

ACME: Adaptive Customization of Large Models via Distributed Systems
Ziming Dai, Chao Qiu, Fei Gao, Yunfeng Zhao, Xiaofei Wang
https://arxiv.org/abs/2507.14802

ACME: Adaptive Customization of Large Models via Distributed Systems
Pre-trained Transformer-based large models have revolutionized personal virtual assistants, but their deployment in cloud environments faces challenges related to data privacy and response latency. Deploying large models closer to the data and users has become a key research area to address these issues. However, applying these models directly often entails significant difficulties, such as model mismatching, resource constraints, and energy inefficiency. Automated design of customized models i…

@arXiv_csSE_bot@mastoxiv.page
2025-06-04 07:44:26

How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face
Alexandra Gonz\'alez, Xavier Franch, David Lo, Silverio Mart\'inez-Fern\'andez
https://arxiv.org/abs/2506.03013

How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face
Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Our repository mining study began with a systematically gathered database of PTMs from the HF API, considering their model card de…

@arXiv_csIR_bot@mastoxiv.page
2025-06-10 16:44:49

This https://arxiv.org/abs/2506.02916 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csIR_…

MMM4Rec: An Transfer-Efficient Framework for Multi-modal Sequential Recommendation
Sequential Recommendation (SR) systems model user preferences by analyzing interaction histories. Although transferable multi-modal SR architectures demonstrate superior performance compared to traditional ID-based approaches, current methods incur substantial fine-tuning costs when adapting to new domains due to complex optimization requirements and negative transfer effects - a significant deployment bottleneck that hinders engineers from efficiently repurposing pre-trained models for novel a…

@arXiv_csGR_bot@mastoxiv.page
2025-06-24 09:23:49

Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models
Chao Li, Jiawei Fan, Anbang Yao
https://arxiv.org/abs/2506.18251 https://

Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models
In this paper, we present Morse, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creat…

@arXiv_csCV_bot@mastoxiv.page
2025-07-16 10:34:41

Implementing Adaptations for Vision AutoRegressive Model
Kaif Shaikh, Antoni Kowalczuk, Franziska Boenisch, Adam Dziedzic
https://arxiv.org/abs/2507.11441 …

Implementing Adaptations for Vision AutoRegressive Model
Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While for DMs there exist many techniques, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations-ones that aim to preserve privacy of the adaptation data-have been extensively st…

@arXiv_csLG_bot@mastoxiv.page
2025-06-10 19:21:11

This https://arxiv.org/abs/2505.23868 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Noise-Robustness Through Noise: Asymmetric LoRA Adaption with Poisoning Expert
Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-process paradigms, we propose a noise-robust adaptation method via asymmetric LoRA poisoning experts (LoPE), a novel framework that enhances model robustne…

@arXiv_csRO_bot@mastoxiv.page
2025-06-02 10:28:24

This https://arxiv.org/abs/2505.21906 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csRO_…

Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge
Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from VLM, i.e., recognize a…

@arXiv_csCR_bot@mastoxiv.page
2025-06-02 10:00:25

This https://arxiv.org/abs/2411.16746 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

LoBAM: LoRA-Based Backdoor Attack on Model Merging
Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. This scheme, in the meantime, may open up backdoor attack opportunities where one single malicious model can jeopardize the integrity of the merged model. Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained m…

@arXiv_csDB_bot@mastoxiv.page
2025-06-09 07:27:22

Training-Free Query Optimization via LLM-Based Plan Similarity
Nikita Vasilenko, Alexander Demin, Vladimir Boorlakov
https://arxiv.org/abs/2506.05853 https…

Training-Free Query Optimization via LLM-Based Plan Similarity
Large language model (LLM) embeddings offer a promising new avenue for database query optimization. In this paper, we explore how pre-trained execution plan embeddings can guide SQL query execution without the need for additional model training. We introduce LLM-PM (LLM-based Plan Mapping), a framework that embeds the default execution plan of a query, finds its k nearest neighbors among previously executed plans, and recommends database hintsets based on neighborhood voting. A lightweight cons…

@arXiv_csGR_bot@mastoxiv.page
2025-07-25 09:26:12

Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
Zhen Han, Mattias Teye, Derek Yadgaroff, Judith B\"utepage
https://arxiv.org/abs/2507.18352

Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively l…

@arXiv_qbioGN_bot@mastoxiv.page
2025-06-16 09:30:09

Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data
Azim Dehghani Amirabad, Yanfei Zhang, Artem Moskalev, Sowmya Rajesh, Tommaso Mansi, Shuwei Li, Mangal Prakash, Rui Liao
https://arxiv.org/abs/2506.11182

Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data
Predicting guide RNA (gRNA) activity is critical for effective CRISPR-Cas12 genome editing but remains challenging due to limited data, variation across protospacer adjacent motifs (PAMs-short sequence requirements for Cas binding), and reliance on large-scale training. We investigate whether pre-trained biological foundation model originally trained on transcriptomic data can improve gRNA activity estimation even without domain-specific pre-training. Using embeddings from existing RNA foundati…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-17 11:23:45

Stereo sound event localization and detection based on PSELDnet pretraining and BiMamba sequence modeling
Wenmiao Gao, Yang Xiao
https://arxiv.org/abs/2506.13455

Stereo sound event localization and detection based on PSELDnet pretraining and BiMamba sequence modeling
Pre-training methods have achieved significant performance improvements in sound event localization and detection (SELD) tasks, but existing Transformer-based models suffer from high computational complexity. In this work, we propose a stereo sound event localization and detection system based on pre-trained PSELDnet and bidirectional Mamba sequence modeling. We replace the Conformer module with a BiMamba module and introduce asymmetric convolutions to more effectively model the spatiotemporal …

@arXiv_physicsfludyn_bot@mastoxiv.page
2025-07-02 09:44:00

Guided Unconditional and Conditional Generative Models for Super-Resolution and Inference of Quasi-Geostrophic Turbulence
Anantha Narayanan Suresh Babu, Akhil Sadam, Pierre F. J. Lermusiaux
https://arxiv.org/abs/2507.00719

Guided Unconditional and Conditional Generative Models for Super-Resolution and Inference of Quasi-Geostrophic Turbulence
Typically, numerical simulations of the ocean, weather, and climate are coarse, and observations are sparse and gappy. In this work, we apply four generative diffusion modeling approaches to super-resolution and inference of forced two-dimensional quasi-geostrophic turbulence on the beta-plane from coarse, sparse, and gappy observations. Two guided approaches minimally adapt a pre-trained unconditional model: SDEdit modifies the initial condition, and Diffusion Posterior Sampling (DPS) modifies…

@arXiv_csDC_bot@mastoxiv.page
2025-07-01 09:56:13

QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference
Xiangchen Li, Saeid Ghafouri, Bo Ji, Hans Vandierendonck, Deepu John, Dimitrios S. Nikolopoulos
https://arxiv.org/abs/2506.23934

QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference
As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device's computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust to diverse scenari…

@arXiv_csLG_bot@mastoxiv.page
2025-06-05 11:00:19

This https://arxiv.org/abs/2506.00486 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during train…

@arXiv_eessSP_bot@mastoxiv.page
2025-06-12 08:23:31

AI-Driven SEEG Channel Ranking for Epileptogenic Zone Localization
Saeed Hashemi, Genchang Peng, Mehrdad Nourani, Omar Nofal, Jay Harvey
https://arxiv.org/abs/2506.09255

AI-Driven SEEG Channel Ranking for Epileptogenic Zone Localization
Stereo-electroencephalography (SEEG) is an invasive technique to implant depth electrodes and collect data for pre-surgery evaluation. Visual inspection of signals recorded from hundreds of channels is time consuming and inefficient. We propose a machine learning approach to rank the impactful channels by incorporating clinician's selection and computational finding. A classification model using XGBoost is trained to learn the discriminative features of each channel during ictal periods. Then, …

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:19:55

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Pierre-Carl Langlais, Carlos Rosas Hinostroza, Mattia Nee, Catherine Arnett, Pavel Chizhov, Eliot Krzystof Jones, Ir\`ene Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov
https://arxiv.org/abs/2506.01732

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assem…

@arXiv_csCV_bot@mastoxiv.page
2025-06-04 14:59:51

This https://arxiv.org/abs/2505.21920 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…

InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective
The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distil…

@arXiv_eessIV_bot@mastoxiv.page
2025-07-11 09:03:21

Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation
Heet Nitinkumar Dalsania
https://arxiv.org/abs/2507.07254 https://a…

Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation
Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and…

@arXiv_csCR_bot@mastoxiv.page
2025-07-17 08:12:10

Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification
Haiwei Lin, Shoko Imaizumi, Hitoshi Kiya
https://arxiv.org/abs/2507.11943

Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification
We propose a low-rank adaptation method for training privacy-preserving vision transformer (ViT) models that efficiently freezes pre-trained ViT model weights. In the proposed method, trainable rank decomposition matrices are injected into each layer of the ViT architecture, and moreover, the patch embedding layer is not frozen, unlike in the case of the conventional low-rank adaptation methods. The proposed method allows us not only to reduce the number of trainable parameters but to also main…

@arXiv_csGR_bot@mastoxiv.page
2025-06-24 08:11:39

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo
https://arxiv.org/abs/2506.17450

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to …

@arXiv_csIR_bot@mastoxiv.page
2025-06-05 09:42:02

This https://arxiv.org/abs/2506.02916 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csIR_…

@arXiv_physicsmedph_bot@mastoxiv.page
2025-06-05 07:35:05

Personalized MR-Informed Diffusion Models for 3D PET Image Reconstruction
George Webber, Alexander Hammers, Andrew P. King, Andrew J. Reader
https://arxiv.org/abs/2506.03804

Personalized MR-Informed Diffusion Models for 3D PET Image Reconstruction
Recent work has shown improved lesion detectability and flexibility to reconstruction hyperparameters (e.g. scanner geometry or dose level) when PET images are reconstructed by leveraging pre-trained diffusion models. Such methods train a diffusion model (without sinogram data) on high-quality, but still noisy, PET images. In this work, we propose a simple method for generating subject-specific PET images from a dataset of multi-subject PET-MR scans, synthesizing "pseudo-PET" images by transfor…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-19 08:41:22

Factorized RVQ-GAN For Disentangled Speech Tokenization
Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, Francois G. Germain, Gordon Wichern, Jonathan Le Roux
https://arxiv.org/abs/2506.15…

Factorized RVQ-GAN For Disentangled Speech Tokenization
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels-acoustic, phonetic, and lexical-within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one…

@arXiv_csSD_bot@mastoxiv.page
2025-06-06 07:21:12

Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto
https://arxiv.org/abs/2506.04527

Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining p…

@arXiv_csCL_bot@mastoxiv.page
2025-07-17 08:05:40

Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation
Farideh Majidi, Ziaeddin Beheshtifard
https://arxiv.org/abs/2507.11634

Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation
This research examines cross-lingual sentiment analysis using few-shot learning and incremental learning methods in Persian. The main objective is to develop a model capable of performing sentiment analysis in Persian using limited data, while getting prior knowledge from high-resource languages. To achieve this, three pre-trained multilingual models (XLM-RoBERTa, mDeBERTa, and DistilBERT) were employed, which were fine-tuned using few-shot and incremental learning approaches on small samples o…

@arXiv_astrophIM_bot@mastoxiv.page
2025-06-03 07:41:29

Applying Vision Transformers on Spectral Analysis of Astronomical Objects
Luis Felipe Strano Moraes, Ignacio Becker, Pavlos Protopapas, Guillermo Cabrera-Vives
https://arxiv.org/abs/2506.00294

Applying Vision Transformers on Spectral Analysis of Astronomical Objects
We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key task…

@arXiv_csLG_bot@mastoxiv.page
2025-06-10 19:21:44

This https://arxiv.org/abs/2506.01790 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

$IF-GUIDE$: Influence Function-Guided Detoxification of LLMs
We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts $reactive$ approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a $proactive$ approach$-$IF-Guide$-$which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standa…

@arXiv_csIR_bot@mastoxiv.page
2025-06-04 07:26:13

MMM4Rec: An Transfer-Efficient Framework for Multi-modal Sequential Recommendation
Hao Fan, Yanrong Hu, Kai Fang, Qingyang Liu, Hongjiu Liu
https://arxiv.org/abs/2506.02916

@arXiv_csCV_bot@mastoxiv.page
2025-07-03 10:31:50

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
Yukang Cao, Chenyang Si, Jinghao Wang, Ziwei Liu
https://arxiv.org/abs/2507.01953 ht…

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-lin…

@arXiv_csSD_bot@mastoxiv.page
2025-07-08 11:09:50

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
Jun-You Wang, Li Su
https://arxiv.org/abs/2507.04776 https:/…

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-deriv…

@arXiv_csLG_bot@mastoxiv.page
2025-06-03 21:30:25

This https://arxiv.org/abs/2504.19583 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Graph-Based Spectral Decomposition for Parameter Coordination in Language Model Fine-Tuning
This paper proposes a parameter collaborative optimization algorithm for large language models, enhanced with graph spectral analysis. The goal is to improve both fine-tuning efficiency and structural awareness during training. In the proposed method, the parameters of a pre-trained language model are treated as nodes in a graph. A weighted graph is constructed, and Laplacian spectral decomposition is applied to enable frequency-domain modeling and structural representation of the parameter spa…

@arXiv_eessAS_bot@mastoxiv.page
2025-07-15 09:17:51

Enhancing Stereo Sound Event Detection with BiMamba and Pretrained PSELDnet
Wenmiao Gao, Han Yin
https://arxiv.org/abs/2507.09570 https://

Enhancing Stereo Sound Event Detection with BiMamba and Pretrained PSELDnet
Pre-training methods have greatly improved the performance of sound event localization and detection (SELD). However, existing Transformer-based models still face high computational cost. To solve this problem, we present a stereo SELD system using a pre-trained PSELDnet and a bidirectional Mamba sequence model. Specifically, we replace the Conformer module with a BiMamba module. We also use asymmetric convolutions to better capture the time and frequency relationships in the audio signal. Test…

@arXiv_csLG_bot@mastoxiv.page
2025-07-11 10:23:51

Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs
Ziyue Li, Yang Li, Tianyi Zhou
https://arxiv.org/abs/2507.07996 https://arxiv.org/pdf/2507.07996 https://arxiv.org/html/2507.07996
arXiv:2507.07996v1 Announce Type: new
Abstract: Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We found that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times as recurrent neural networks (RNN), and stacked with others in arbitrary orders, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing works on looped/recurrent pretrained modules, layer pruning, or early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of a fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combining both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) For >75% of samples with correct predictions by the original LLM, we can find shorter CoLa, suggesting a large space for improving inference efficiency; (2) For >60% of samples with originally incorrect predictions, we can identify CoLa achieving correct predictions, suggesting a large space of performance enhancement. Our results highlight the shortcomings of using a fixed architecture of pre-trained LLMs for inference on different samples and pave the way to unlock the generalization power of test-time depth adaptation.
toXiv_bot_toot

@arXiv_csDC_bot@mastoxiv.page
2025-06-04 07:45:04

Memory-Efficient Split Federated Learning for LLM Fine-Tuning on Heterogeneous Mobile Devices
Xiaopei Chen, Liang Li, Fei Ji, Wen Wu
https://arxiv.org/abs/2506.02940

Memory-Efficient Split Federated Learning for LLM Fine-Tuning on Heterogeneous Mobile Devices
In this paper, we propose an edge-assisted split federated learning framework to facilitate large language model (LLM) fine-tuning on heterogeneous mobile devices while alleviating memory pressures on both mobile devices and the edge server. Specifically, mobile devices perform low-rank adaptation (LoRA) fine-tuning on only a subset of lower layers of the pre-trained LLM, tailored to their individual capacities. On the server, a full LLM is maintained, and the corresponding LoRA modules are sel…

@arXiv_csSD_bot@mastoxiv.page
2025-07-14 08:35:02

Audio Inpanting using Discrete Diffusion Model
Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
https://arxiv.org/abs/2507.08333

Audio Inpanting using Discrete Diffusion Model
Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches-including waveform and spectrogram-based diffusion models-have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach mo…

@arXiv_csIR_bot@mastoxiv.page
2025-06-02 09:58:21

This https://arxiv.org/abs/2410.13230 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csIR_…

Starbucks-v2: Improved Training for 2D Matryoshka Embeddings
2D Matryoshka training enables a single embedding model to generate sub-network representations across different layers and embedding dimensions, offering adaptability to diverse computational and task constraints. However, its effectiveness remains well below that of individually trained models of equivalent sizes. To address this, we propose Starbucks, a new training strategy for Matryoshka-style embedding models that combines structured fine-tuning with masked autoencoder (MAE) pre-training.…

@arXiv_csSD_bot@mastoxiv.page
2025-06-16 07:56:49

LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
Tom Baker, Javier Nistal
https://arxiv.org/abs/2506.11476 https://

LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably …

@arXiv_csCL_bot@mastoxiv.page
2025-07-03 09:56:10

Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings
Rifki Afina Putri
https://arxiv.org/abs/2507.01645

Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings
In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categori…

@arXiv_eessAS_bot@mastoxiv.page
2025-07-16 08:30:01

Physics-Informed Transfer Learning for Data-Driven Sound Source Reconstruction in Near-Field Acoustic Holography
Xinmeng Luan, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti
https://arxiv.org/abs/2507.11070

Physics-Informed Transfer Learning for Data-Driven Sound Source Reconstruction in Near-Field Acoustic Holography
We propose a transfer learning framework for sound source reconstruction in Near-field Acoustic Holography (NAH), which adapts a well-trained data-driven model from one type of sound source to another using a physics-informed procedure. The framework comprises two stages: (1) supervised pre-training of a complex-valued convolutional neural network (CV-CNN) on a large dataset, and (2) purely physics-informed fine-tuning on a single data sample based on the Kirchhoff-Helmholtz integral. This meth…

@arXiv_csLG_bot@mastoxiv.page
2025-06-03 08:21:21

$IF-GUIDE$: Influence Function-Guided Detoxification of LLMs
Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong
https://arxiv.org/abs/2506.01790 h…

@arXiv_csSD_bot@mastoxiv.page
2025-06-16 08:05:59

Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling
Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen
https://arxiv.org/abs/2506.11862

Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling
Voiced Electromyography (EMG)-to-Speech (V-ETS) models reconstruct speech from muscle activity signals, facilitating applications such as neurolaryngologic diagnostics. Despite its potential, the advancement of V-ETS is hindered by a scarcity of paired EMG-speech data. To address this, we propose a novel Confidence-based Multi-Speaker Self-training (CoM2S) approach, along with a newly curated Libri-EMG dataset. This approach leverages synthetic EMG data generated by a pre-trained model, followe…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-09 08:06:52

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes
Adriana Stan, David Combei, Dan Oneata, Hora Cucu
https://arxiv.org/abs/2506.05802

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes
Deepfake detection has gained significant attention across audio, text, and image modalities, with high accuracy in distinguishing real from fake. However, identifying the exact source--such as the system or model behind a deepfake--remains a less studied problem. In this paper, we take a significant step forward in audio deepfake model attribution or source tracing by proposing a training-free, green AI approach based entirely on k-Nearest Neighbors (kNN). Leveraging a pre-trained self-supervi…

@arXiv_csSD_bot@mastoxiv.page
2025-07-14 08:06:12

Distilling Spectrograms into Tokens: Fast and Lightweight Bioacoustic Classification for BirdCLEF 2025
Anthony Miyaguchi, Murilo Gustineli, Adrian Cheung
https://arxiv.org/abs/2507.08236

Distilling Spectrograms into Tokens: Fast and Lightweight Bioacoustic Classification for BirdCLEF+ 2025
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-13 08:02:30

Joint ASR and Speaker Role Tagging with Serialized Output Training
Anfeng Xu, Tiantian Feng, Shrikanth Narayanan
https://arxiv.org/abs/2506.10349 https://

Joint ASR and Speaker Role Tagging with Serialized Output Training
Automatic Speech Recognition systems have made significant progress with large-scale pre-trained models. However, most current systems focus solely on transcribing the speech without identifying speaker roles, a function that is critical for conversational AI. In this work, we investigate the use of serialized output training (SOT) for joint ASR and speaker role tagging. By augmenting Whisper with role-specific tokens and fine-tuning it with SOT, we enable the model to generate role-aware trans…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-12 08:44:21

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
https://arxiv.org/abs/2506.09707

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements -- identifying their start and stop times -- directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation …

@arXiv_eessAS_bot@mastoxiv.page
2025-07-10 08:25:51

Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou
https://arxiv.org/abs/2507.06249

Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new metho…

@arXiv_csSD_bot@mastoxiv.page
2025-06-03 07:27:19

Improving Code Switching with Supervised Fine Tuning and GELU Adapters
Linh Pham
https://arxiv.org/abs/2506.00291 https://arxiv.org/p…

Improving Code Switching with Supervised Fine Tuning and GELU Adapters
There are few code switching datasets, labeled or unlabled, that exist today. As a result, ASR requires new methods to utilize the vast monolingual data and models that exist. This paper uses OpenAI's open source ASR model, Whisper, which has been pre-trained on 680K hours of audio to perform monolingual ASR tasks. In Part 1, this paper examines how exploiting Whisper's monolingual ability to individually tokenize training text, called "Switching Tokenizers Method", improves transcription accur…

@arXiv_csSD_bot@mastoxiv.page
2025-06-03 07:53:53

Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik
https://arxiv.org/abs/2506.01365

Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that …

@arXiv_eessAS_bot@mastoxiv.page
2025-07-03 09:34:00

Generalizable Detection of Audio Deepfakes
Jose A. Lopez, Georg Stemmer, H\'ector Cordourier Maruri
https://arxiv.org/abs/2507.01750 https://

Generalizable Detection of Audio Deepfakes
In this paper, we present our comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets, including those from the ASVspoof challenges and additional sources. Our experiments focus on the effects of different data augmentation strategies and loss functions on model performance. The results of our research demons…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-02 10:03:28

This https://arxiv.org/abs/2505.14449 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_ees…

Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach
While subgroup disparities and performance bias are increasingly studied in computational research, fairness in categorical Speech Emotion Recognition (SER) remains underexplored. Existing methods often rely on explicit demographic labels, which are difficult to obtain due to privacy concerns. To address this limitation, we introduce an Implicit Demography Inference (IDI) module that leverages pseudo-labeling from a pre-trained model and unsupervised learning using k-means clustering to mitigat…

Tootfinder

Opt-in global Mastodon full text search. Join the index!