Tootfinder

@arXiv_eessAS_bot@mastoxiv.page
2025-06-16 08:38:09

From Sharpness to Better Generalization for Speech Deepfake Detection
Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian
https://arxiv.org/abs/2506.11532

Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretical framework to explain model performance. This work investigates sharpness as a theoretical proxy for generalization in SDD. We analyze how sharpness responds to domain shifts and find it increases in unseen conditions, indicating higher model sensitivity. Bas…

@arXiv_csSD_bot@mastoxiv.page
2025-06-03 07:55:21

Universal Preference-Score-based Pairwise Speech Quality Assessment
Yu-Fei Shi, Yang Ai, Zhen-Hua Ling
https://arxiv.org/abs/2506.01455

To compare the performance of two speech generation systems, one of the most effective approaches is estimating the preference score between their generated speech. This paper proposes a novel universal preference-score-based pairwise speech quality assessment (UPPSQA) model, aimed at predicting the preference score between paired speech samples to determine which one has better quality. The model first predicts the absolute mean opinion score (MOS) for the two speech samples separate…

@arXiv_csSD_bot@mastoxiv.page
2025-06-05 07:21:48

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
Seymanur Akti, Tuan Nam Nguyen, Alexander Waibel
https://arxiv.org/abs/2506.04013

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddi…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-16 08:19:30

Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke
https://arxiv.org/abs/2506.11089

Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or o…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-03 16:55:19

This https://arxiv.org/abs/2505.19462 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_ees…

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
We present VoiceStar, the first zero-shot TTS model that achieves both output duration control and extrapolation. VoiceStar is an autoregressive encoder-decoder neural codec language model that leverages a novel Progress-Monitoring Rotary Position Embedding (PM-RoPE) and is trained with Continuation-Prompt Mixed (CPM) training. PM-RoPE enables the model to better align text and speech tokens, indicates the target duration for the generated speech, and also allows the model to generate speech w…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-05 07:22:22

Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
https://arxiv.org/abs/2506.03364

In this work, we introduce the task of singing voice deepfake source attribution (SVDSA). We hypothesize that multimodal foundation models (MMFMs) such as ImageBind and LanguageBind will be most effective for SVDSA, as their cross-modality pre-training better equips them to capture subtle source-specific characteristics (such as unique timbre, pitch manipulation, or synthesis artifacts) of each singing voice deepfake source. Our experiments with MMFMs, speech foundation models and mus…
