Ok, this is cool. Don't know if this is a studio/publisher thing or if Steam is now enforcing this:
>AI Generated Content Disclosure
> We are utilising ElevenLabs' text-to-speech tool to generate voice-over elements within Metro Rivals. All scripts and content are written by Dovetail Games staff, and the voices you hear in-game, which have used ElevenLabs' software, have been licensed by voice actors.
either way, cool!
@…
Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
https://arxiv.org/abs/2510.12116
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
https://arxiv.org/abs/2510.10774
I used a speech to text program today to type “Shirley” and it entered “Surely”
#airplane
BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu
https://arxiv.org/abs/2510.11646
Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[7/7]:
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey, Karun Kumar, Raphael Tang
https://arxiv.org/abs/2509.10452 …
OKAY IT ONLY JUST OCCURRED TO ME I CAN DO SPEECH TO TEXT FOR ALT TEXT
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
https://arxiv.org/abs/2510.12210
Here's a great use of AI text-to-speech generation: preparing for a live pitch.
Instead of reading your pitch a hundred times so you can edit it for time, use ElevenLabs.
1. Create an account
2. Find a voice that matches your own cadence
3. Paste your script and have it generate the pitch
You'll immediately see how long the audio file is and can adjust your script for length.
Then you can spend your time **rehearsing** instead of editing.
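If you want to sanity-check the timing programmatically instead of eyeballing the player, here's a minimal Python sketch. It assumes you've exported the generated audio as pitch.mp3 and kept your script in pitch.txt (both names are illustrative, as is the three-minute target), and it uses the mutagen library to read the MP3 duration:

```python
# Minimal sketch: compare the generated pitch audio against a target length.
# pitch.mp3 / pitch.txt / the 3-minute target are illustrative assumptions.
from mutagen.mp3 import MP3  # pip install mutagen

AUDIO_PATH = "pitch.mp3"    # audio exported from the TTS tool
SCRIPT_PATH = "pitch.txt"   # the script you pasted in
TARGET_SECONDS = 180        # e.g. a three-minute pitch slot

audio_seconds = MP3(AUDIO_PATH).info.length
with open(SCRIPT_PATH, encoding="utf-8") as f:
    word_count = len(f.read().split())

# Words per second at the generated pace, used to estimate how much to cut.
pace = word_count / audio_seconds
overrun = audio_seconds - TARGET_SECONDS

print(f"{audio_seconds:.0f}s of audio for {word_count} words (~{pace * 60:.0f} wpm)")
if overrun > 0:
    print(f"Over by {overrun:.0f}s: cut roughly {overrun * pace:.0f} words")
else:
    print(f"Under target with {-overrun:.0f}s to spare")
```

The word-count math is only a rough guide; the point is that the generated pace tells you whether the edited script actually fits the slot.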
Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker
Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang, Longbiao Wang, Jianwu Dang
https://arxiv.org/abs/2510.11124
I've been meaning to share this for a while, but for any Android users out there who want to use a text-to-speech engine other than Google's, I recommend Sherpa TTS: https://github.com/woheller69/ttsEngine
It's open source, offline, multilingual, and available on F-Droid.
I use text-t…
O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang
https://arxiv.org/abs/2510.09061
The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
https://arxiv.org/abs/2510.09424
Unsupervised lexicon learning from speech is limited by representations rather than clustering
Danel Adendorff, Simon Malan, Herman Kamper
https://arxiv.org/abs/2510.09225 https…
DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao
https://arxiv.org/abs/2509.09748
YouTube has some new "feature" to transcribe speech to text (or whatever this is).
Watching something about Star Trek and the Cardassians, it produced this picture of Kim Kardashian.
I think we shouldn't be afraid of "AI" in any way.😂
Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando
https://arxiv.org/abs/2510.03093
M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition
Jiajun He, Xiaohan Shi, Cheng-Hung Hu, Jinyi Mi, Xingfeng Li, Tomoki Toda
https://arxiv.org/abs/2509.18706
Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
https://arxiv.org/abs/2510.03115
What speech-to-text thinks I'm saying when I say @…
- "are open sigh"
- "our open size"
- "art open side"
Futzing around with Narrator (because of its new Braille viewer, which I am working on testing) and was reminded of the new speech log (“Speech Recap”):
Narrator Key Alt X
Wanted to confirm this is also true with Narrator:
https://adrianroselli.com/2020/08/speech…
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu
https://arxiv.org/abs/2510.08878
Neuralink plans a US clinical trial in October to test a brain implant that translates thoughts into text, hoping to put its device in a healthy person by 2030 (Ike Swetlitz/Bloomberg)
https://www.bloomberg.com/news/articles/202…
@… I've used Talon, but before I retired I used Dragon Professional on Windows, which I greatly preferred. Now that I'm retired, I don't need speech to text that much. I've only used the free version of Talon, not the "beta", which costs something like $25/month. I found Talon very hard to use, perhaps because I was used to the Dragon way of …
KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI
So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang
https://arxiv.org/abs/2510.02327
Towards Responsible Evaluation for Text-to-Speech
Yifan Yang, Hui Wang, Bing Han, Shujie Liu, Jinyu Li, Yong Qin, Xie Chen
https://arxiv.org/abs/2510.06927 https://
Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech
Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen
https://arxiv.org/abs/2510.02848
IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation
Wei Wang, Rong Cao, Yi Guo, Zhengyang Chen, Kuan Chen, Yuanyuan Huo
https://arxiv.org/abs/2510.07979 …
A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation
Ananya Raghu, Anisha Raghu, Nithika Vivek, Sofie Budman, Omar Mansour
https://arxiv.org/abs/2510.03986
Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, Sanat Sharma, Shinji Watanabe, Anuj Kumar, Ahmed Aly, Yue Liu, Florian Metze, Zhaojiang Lin
https://arxiv.…
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
https://arxiv.org/abs/2510.01722
TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Renzeng Duojie, Yuqing Cai, Yongbin Yu, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
https://arxiv.org/abs/2509.18060
The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, Luisa Bentivogli
https://arxiv.org/abs/2509.26543 …
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie
https://arxiv.org/abs/2510.08373

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems face limitations due to the scarcity of dual-track data and difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn co…
Crosslisted article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[3/3]:
- VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
https://arxiv.org/abs/2509.26514…
Evaluating Self-Supervised Speech Models via Text-Based LLMs
Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe
https://arxiv.org/abs/2510.04463 https…
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao, Yichen Han, Ruonan Zhang, Xiaoyang Hao, Hongxiang Li, Shuaijiang Zhao, Yue Liu, Xiao-Ping Zhang
https://arxiv.org/abs/2509.22062
Audio Forensics Evaluation (SAFE) Challenge
Kirill Trapeznikov, Paul Cummer, Pranay Pherwani, Jai Aslam, Michael S. Davinroy, Peter Bautista, Laura Cassani, Matthew Stamm, Jill Crisman
https://arxiv.org/abs/2510.03387
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang
https://arxiv.org/abs/2509.24629
Cross-Attention is Half Explanation in Speech-to-Text Models
Sara Papi, Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli
https://arxiv.org/abs/2509.18010 https://
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
https://arxiv.org/abs/2509.20072
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
Xinlei Niu, Jianbo Ma, Dylan Harper-Harris, Xiangyu Zhang, Charles Patrick Martin, Jing Zhang
https://arxiv.org/abs/2509.15492
Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba
Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis
https://arxiv.org/abs/2510.04738
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen
https://arxiv.org/abs/2510.04593
Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
Wenhuan Lu, Xinyue Song, Wenjun Ke, Zhizhi Yu, Wenhao Yang, Jianguo Wei
https://arxiv.org/abs/2509.16670 https:…
Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing
Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari
https://arxiv.org/abs/2509.17052
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
https://arxiv.org/abs/2509.24773
Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
https://arxiv.org/abs/2509.12171
Group Relative Policy Optimization for Text-to-Speech with Large Language Models
Chang Liu, Ya-Jun Hu, Ying-Ying Gao, Shi-Lei Zhang, Zhen-Hua Ling
https://arxiv.org/abs/2509.18798
DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation
Ziqi Chen, Gongyu Chen, Yihua Wang, Chaofan Ding, Zihao Chen, Wei-Qiang Zhang
https://arxiv.org/abs/2509.22727
Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance
Shiming Wang, Zhihao Du, Yang Xiang, Tianyu Zhao, Han Zhao, Qian Chen, Xiangang Li, Hanjie Guo, Zhen-Hua Ling
https://arxiv.org/abs/2509.19852
SpeechOp: Inference-Time Task Composition for Generative Speech Processing
Justin Lovelace, Rithesh Kumar, Jiaqi Su, Ke Chen, Kilian Q Weinberger, Zeyu Jin
https://arxiv.org/abs/2509.14298
Direct Simultaneous Translation Activation for Large Audio-Language Models
Pei Zhang, Yiming Wang, Jialong Tang, Baosong Yang, Rui Wang, Derek F. Wong, Fei Huang
https://arxiv.org/abs/2509.15692
Direct Preference Optimization for Speech Autoregressive Diffusion Models
Zhijun Liu, Dongya Jia, Xiaoqiang Wang, Chenpeng Du, Shuai Wang, Zhuo Chen, Haizhou Li
https://arxiv.org/abs/2509.18928
From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives
Eden Mama, Liel Sheri, Yehudit Aperstein, Alexander Apartsin
https://arxiv.org/abs/2509.11803
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
https://arxiv.org/abs/2509.14684
Crosslisted article(s) found for cs.SD. https://arxiv.org/list/cs.SD/new
[1/1]:
- Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Rikuto Kotoge, Yuichi Sasaki
SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève
https://arxiv.org/abs/2509.12093
Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić
https://arxiv.org/abs/2509.20060
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, Carlos Carvalho, Karen Rosero
Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen
https://arxiv.org/abs/2509.22167
Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nick Karpov, Jagadeesh Balam, Boris Ginsburg
https://arxiv.org/abs/2509.14128
Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data
Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, Hwayeon Kim
https://arxiv.org/abs/2509.15389
TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models
Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson
https://arxiv.org/abs/2509.13395
MBCodec: Thorough Disentangle for High-Fidelity Audio Compression
Ruonan Zhang, Xiaoyang Hao, Yichen Han, Junjie Cao, Yue Liu, Kai Zhang
https://arxiv.org/abs/2509.17006 https://…
Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
Ziqi Dai, Yiting Chen, Jiacheng Xu, Liufei Xie, Yuchen Wang, Zhenchuan Yang, Bingsong Bai, Yangsheng Gao, Wenjiang Zhou, Weifeng Zhao, Ruohua Zhou
https://arxiv.org/abs/2509.15845
Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen
https://arxiv.org/abs/2509.14579
DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
Jessica Ojo, Zina Kamel, David Ifeoluwa Adelani
https://arxiv.org/abs/2509.17768 https:/…
Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee
https://arxiv.org/abs/2509.13989
Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
Zirui Li, Jens Edlund, Yicheng Gu, Nhan Phan, Lauri Juvela, Mikko Kurimo
https://arxiv.org/abs/2509.17988 h…
Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook
Min Liu, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, Jianhao Ye
https://arxiv.org/abs/2509.17516
A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis
Javeria Amir, Farwa Attaria, Mah Jabeen, Umara Noor, Zahid Rashid
https://arxiv.org/abs/2509.12831
Crosslisted article(s) found for cs.SD. https://arxiv.org/list/cs.SD/new
[1/1]:
- Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization
Jiacheng Shi, Hongfei Du, Yangfan He, Y. Alicia Hong, Ye Gao