Tootfinder

Opt-in global Mastodon full text search. Join the index!

@tezoatlipoca@mas.to
2025-10-15 17:09:28

Ok this is cool. Don't know if this is a studio/publisher thing or Steam is now enforcing this:
>AI Generated Content Disclosure
> We are utilising ElevenLabs' text-to-speech tool to generate voice-over elements within Metro Rivals. All scripts and content are written by Dovetail Games staff, and the voices you hear in-game, which have used ElevenLabs' software, have been licensed by voice actors.
either way, cool!
@…

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:24:51

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
arxiv.org/abs/2510.12116

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:10:48

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
arxiv.org/abs/2510.10774

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:35:31

Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation
Greta Damo, Elena Cabrio, Serena Villata
arxiv.org/abs/2510.12316

@markrsmith@smithtodon.org
2025-11-14 14:44:46

I used a speech to text program today to type “Shirley” and it entered “Surely”
#airplane

[GIF: Leslie Nielsen, "I am serious"]

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:40:38

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu
arxiv.org/abs/2510.11646

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 14:37:21

Replaced article(s) found for cs.LG. arxiv.org/list/cs.LG/new
[7/7]:
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 10:00:11

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey, Karun Kumar, Raphael Tang
arxiv.org/abs/2509.10452

@yaya@jorts.horse
2025-10-12 22:58:36

OKAY IT ONLY JUST OCCURRED TO ME I CAN DO SPEECH TO TEXT FOR ALT TEXT

@arXiv_eessAS_bot@mastoxiv.page
2025-10-15 08:49:32

DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
arxiv.org/abs/2510.12210

@gadgetboy@gadgetboy.social
2025-11-13 13:29:03

Here's a great use of AI text-to-speech generation: preparing for a live pitch.
Instead of reading your pitch a hundred times so you can edit it for time, use ElevenLabs.
1. Create an account
2. Find a voice that matches your own cadence
3. Paste your script and have it generate the pitch
You'll immediately see how long the audio file is and can adjust your script for length.
Then you can spend your time **rehearsing** instead of editing.

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:33:48

Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker
Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang, Longbiao Wang, Jianwu Dang
arxiv.org/abs/2510.11124

@phpmacher@sueden.social
2025-12-07 14:01:16

Are you looking for a working, free, local-only #Diktier (dictation) solution on the #Mac?
#diktat

@candide@vis.social
2025-11-06 23:16:15

I've been meaning to share this for a while, but for any Android users out there who want to use a text-to-speech engine other than Google's, I recommend Sherpa TTS: github.com/woheller69/ttsEngine
It's open source, offline, multilingual, and available on F-Droid.
I use text-t…

@arXiv_csSD_bot@mastoxiv.page
2025-10-13 09:20:00

O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang
arxiv.org/abs/2510.09061

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:35:20

The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
arxiv.org/abs/2510.09424

@aardrian@toot.cafe
2025-11-24 22:31:36

I finally got around to adding the Narrator speech recap feature to my 2020 post “Speech Viewer Logs of Lies”:
adrianroselli.com/2020/08/spee
Video! We like video! 323kb of video! Big fat video!

@arXiv_eessAS_bot@mastoxiv.page
2025-10-13 09:01:50

Unsupervised lexicon learning from speech is limited by representations rather than clustering
Danel Adendorff, Simon Malan, Herman Kamper
arxiv.org/abs/2510.09225

@ErikUden@mastodon.de
2025-09-29 23:46:41

This charge is so baseless, the comparison to genocide, the wholesale slaughter of populations. Did the Nazis ask the Jews to leave, kindly leave, go out? Did others? Do you want me to name all the genocidal leaders of history. Just go one by one. Did anyone do this? Did they say “go out so we can come in”?
— Benjamin Netanyahu, 26.09.2025 [

@arXiv_csSD_bot@mastoxiv.page
2025-09-15 08:21:21

DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao
arxiv.org/abs/2509.09748

@arXiv_csCV_bot@mastoxiv.page
2025-10-07 12:47:52

Paper2Video: Automatic Video Generation from Scientific Papers
Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
arxiv.org/abs/2510.05096 arxiv…

@FandaSin@social.linux.pizza
2025-11-25 16:06:31

YouTube has a new "feature" to transcribe speech to text (or whatever this is).
Watching something about Star Trek and Kardashian, it produced this picture of Kim Kardashian.
I think we shouldn't be afraid of "AI" in any way.😂

Speech to text by YouTube (picture from YouTube "rewrite" feature):

Drago 4 features an unusually large temperate zone.
However, it is within three light years of Kardashian space.

[Explanation under Kardashian word]:
Kim Kardashian
Kim Kardashian is a well-known American model and businesswoman... [for more pres]

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 10:20:09

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando
arxiv.org/abs/2510.03093

@arXiv_csHC_bot@mastoxiv.page
2025-09-24 09:33:14

M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition
Jiajun He, Xiaohan Shi, Cheng-Hung Hu, Jinyi Mi, Xingfeng Li, Tomoki Toda
arxiv.org/abs/2509.18706

@arXiv_eessAS_bot@mastoxiv.page
2025-10-13 08:37:20

BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
Yaya Sy, Christophe Cerisara, Irina Illina
arxiv.org/abs/2510.08599 arxiv.…

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 10:20:29

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
arxiv.org/abs/2510.03115

@ethanwhite@hachyderm.io
2025-09-29 14:16:52

What speech-to-text thinks I'm saying when I say @…
- "are open sigh"
- "our open size"
- "art open side"

@aardrian@toot.cafe
2025-10-24 23:49:36

Futzing around with Narrator (because of its new Braille viewer, which I am working on testing) and reminded of the new speech log (“Speech Recap”):
Narrator key + Alt + X
Wanted to confirm this is also true with Narrator:
adrianroselli.com/2020/08/spee

@arXiv_csSD_bot@mastoxiv.page
2025-10-13 08:30:30

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu
arxiv.org/abs/2510.08878

@arXiv_csSE_bot@mastoxiv.page
2025-10-01 10:19:47

Protocode: Prototype-Driven Interpretability for Code Generation in LLMs
Krishna Vamshi Bodla, Haizhao Yang
arxiv.org/abs/2509.25247 arxiv.…

@Techmeme@techhub.social
2025-09-19 17:10:56

Neuralink plans a US clinical trial in October to test a brain implant that translates thoughts into text, hoping to put its device in a healthy person by 2030 (Ike Swetlitz/Bloomberg)
bloomberg.com/news/articles/20

@cjust@infosec.exchange
2025-10-02 18:19:24

#TheOatmeal #comics #mentalhealth

The image is a comic strip divided into three panels, each with a different colored background. The style is simple and cartoonish.

Panel 1: The background is a light yellow. Text at the top reads "What you THINK your brain is supposed to do". A large, white, egg-shaped character with a simple face (two black eyes and a small mouth) is walking towards a pink brain with a happy expression. The brain is in the shape of a human brain with a small mouth. The egg-shaped character has a speech bubbl…

@izzychambers@vivaldi.net
2025-11-21 22:30:23

@… I've used Talon, but before I retired I used Dragon Professional on Windows, which I greatly preferred. Now that I'm retired, I don't need speech to text that much. I've only used the free version of Talon, not the "beta", which costs something like $25/month. I found Talon very hard to use, perhaps because I was used to the Dragon way of …

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 08:33:39

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI
So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang
arxiv.org/abs/2510.02327

@arXiv_eessAS_bot@mastoxiv.page
2025-10-09 08:06:51

Towards Responsible Evaluation for Text-to-Speech
Yifan Yang, Hui Wang, Bing Han, Shujie Liu, Jinyu Li, Yong Qin, Xie Chen
arxiv.org/abs/2510.06927

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 09:11:01

Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
arxiv.org/abs/2509.14627

@arXiv_csSD_bot@mastoxiv.page
2025-10-06 08:21:49

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech
Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen
arxiv.org/abs/2510.02848

@arXiv_csSD_bot@mastoxiv.page
2025-10-10 08:31:59

IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation
Wei Wang, Rong Cao, Yi Guo, Zhengyang Chen, Kuan Chen, Yuanyuan Huo
arxiv.org/abs/2510.07979

@arXiv_eessAS_bot@mastoxiv.page
2025-10-07 09:32:02

A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation
Ananya Raghu, Anisha Raghu, Nithika Vivek, Sofie Budman, Omar Mansour
arxiv.org/abs/2510.03986

@arXiv_csCL_bot@mastoxiv.page
2025-10-03 10:51:21

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, Sanat Sharma, Shinji Watanabe, Anuj Kumar, Ahmed Aly, Yue Liu, Florian Metze, Zhaojiang Lin
arxiv.…

@arXiv_csSD_bot@mastoxiv.page
2025-10-03 08:03:21

Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
arxiv.org/abs/2510.01722

@arXiv_eessAS_bot@mastoxiv.page
2025-10-08 09:42:39

TokenChain: A Discrete Speech Chain via Semantic Token Modeling
Mingxuan Wang, Satoshi Nakamura
arxiv.org/abs/2510.06201 arxiv.org/pdf/2510…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:58:41

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Renzeng Duojie, Yuqing Cai, Yongbin Yu, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
arxiv.org/abs/2509.18060

@arXiv_csCL_bot@mastoxiv.page
2025-10-01 11:34:37

The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, Luisa Bentivogli
arxiv.org/abs/2509.26543

@arXiv_eessAS_bot@mastoxiv.page
2025-10-10 09:19:59

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie
arxiv.org/abs/2510.08373

@arXiv_csLG_bot@mastoxiv.page
2025-09-22 11:50:53

Crosslisted article(s) found for cs.LG. arxiv.org/list/cs.LG/new
[3/3]:
- VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

@arXiv_csCL_bot@mastoxiv.page
2025-10-01 11:32:17

BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
arxiv.org/abs/2509.26514

@arXiv_csSD_bot@mastoxiv.page
2025-10-07 09:20:42

Evaluating Self-Supervised Speech Models via Text-Based LLMS
Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe
arxiv.org/abs/2510.04463

@arXiv_csCL_bot@mastoxiv.page
2025-10-01 11:15:37

Optimizing Speech Language Models for Acoustic Consistency
Morteza Rohanian, Michael Krauthammer
arxiv.org/abs/2509.26276 arxiv.org/pdf/250…

@arXiv_csSD_bot@mastoxiv.page
2025-09-29 09:40:57

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao, Yichen Han, Ruonan Zhang, Xiaoyang Hao, Hongxiang Li, Shuaijiang Zhao, Yue Liu, Xiao-Ping Zhng
arxiv.org/abs/2509.22062

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:05:22

A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance
Peshala Perera, Deshan Sumanathilaka
arxiv.org/abs/2510.04750 arx…

@arXiv_csSD_bot@mastoxiv.page
2025-10-07 08:23:12

Synthetic Audio Forensics Evaluation (SAFE) Challenge
Kirill Trapeznikov, Paul Cummer, Pranay Pherwani, Jai Aslam, Michael S. Davinroy, Peter Bautista, Laura Cassani, Matthew Stamm, Jill Crisman
arxiv.org/abs/2510.03387

@arXiv_eessAS_bot@mastoxiv.page
2025-09-30 10:56:51

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang
arxiv.org/abs/2509.24629

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:57:41

Cross-Attention is Half Explanation in Speech-to-Text Models
Sara Papi, Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli
arxiv.org/abs/2509.18010

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:42:42

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
arxiv.org/abs/2509.20072

@arXiv_eessAS_bot@mastoxiv.page
2025-09-25 07:48:42

Selective Classifier-free Guidance for Zero-shot Text-to-speech
John Zheng, Farhad Maleki
arxiv.org/abs/2509.19668 arxiv.org/pdf/2509.19668…

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 08:57:31

Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
Xinlei Niu, Jianbo Ma, Dylan Harper-Harris, Xiangyu Zhang, Charles Patrick Martin, Jing Zhang
arxiv.org/abs/2509.15492

@arXiv_csSD_bot@mastoxiv.page
2025-10-07 10:00:32

Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba
Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis
arxiv.org/abs/2510.04738

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:34:11

Cross-Modal Knowledge Distillation for Speech Large Language Models
Enzhi Wang, Qicheng Li, Zhiyuan Tang, Yuhang Jia
arxiv.org/abs/2509.14930

@arXiv_eessAS_bot@mastoxiv.page
2025-10-07 10:15:32

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen
arxiv.org/abs/2510.04593

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 08:26:00

Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
Wenhuan Lu, Xinyue Song, Wenjun Ke, Zhizhi Yu, Wenhao Yang, Jianguo Wei
arxiv.org/abs/2509.16670

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 10:12:10

Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing
Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari
arxiv.org/abs/2509.17052

@arXiv_eessAS_bot@mastoxiv.page
2025-09-30 11:27:21

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
arxiv.org/abs/2509.24773

@arXiv_csSD_bot@mastoxiv.page
2025-09-26 09:07:01

i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents
Anupam Purwar, Aditya Choudhary
arxiv.org/abs/2509.20971 arxiv.org/pd…

@arXiv_csCL_bot@mastoxiv.page
2025-09-16 12:27:27

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski
arxiv.org/abs/2509.12171

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 10:26:40

STAR: Speech-to-Audio Generation via Representation Learning
Zeyu Xie, Xuenan Xu, Yixuan Li, Mengyue Wu, Yuexian Zou
arxiv.org/abs/2509.17164

@arXiv_eessAS_bot@mastoxiv.page
2025-09-24 09:13:34

Group Relative Policy Optimization for Text-to-Speech with Large Language Models
Chang Liu, Ya-Jun Hu, Ying-Ying Gao, Shi-Lei Zhang, Zhen-Hua Ling
arxiv.org/abs/2509.18798

@arXiv_csSD_bot@mastoxiv.page
2025-09-30 09:07:21

DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation
Ziqi Chen, Gongyu Chen, Yihua Wang, Chaofan Ding, Zihao Chen, Wei-Qiang Zhang
arxiv.org/abs/2509.22727

@arXiv_csSD_bot@mastoxiv.page
2025-09-25 08:47:12

Eliminating stability hallucinations in LLM-based TTS models via attention guidance
ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling
arxiv.org/abs/2509.19852

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 08:45:01

SpeechOp: Inference-Time Task Composition for Generative Speech Processing
Justin Lovelace, Rithesh Kumar, Jiaqi Su, Ke Chen, Kilian Q Weinberger, Zeyu Jin
arxiv.org/abs/2509.14298

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 09:51:31

Direct Simultaneous Translation Activation for Large Audio-Language Models
Pei Zhang, Yiming Wang, Jialong Tang, Baosong Yang, Rui Wang, Derek F. Wong, Fei Huang
arxiv.org/abs/2509.15692

@arXiv_eessAS_bot@mastoxiv.page
2025-09-24 09:43:44

Direct Preference Optimization for Speech Autoregressive Diffusion Models
Zhijun Liu, Dongya Jia, Xiaoqiang Wang, Chenpeng Du, Shuai Wang, Zhuo Chen, Haizhou Li
arxiv.org/abs/2509.18928

@arXiv_csCL_bot@mastoxiv.page
2025-09-16 12:15:57

From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives
Eden Mama, Liel Sheri, Yehudit Aperstein, Alexander Apartsin
arxiv.org/abs/2509.11803

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:33:21

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
arxiv.org/abs/2509.14684

@arXiv_csSD_bot@mastoxiv.page
2025-10-08 11:16:51

Crosslisted article(s) found for cs.SD. arxiv.org/list/cs.SD/new
[1/1]:
- Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Rikuto Kotoge, Yuichi Sasaki

@arXiv_eessAS_bot@mastoxiv.page
2025-09-22 09:19:11

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
arxiv.org/abs/2509.15969

@arXiv_csCL_bot@mastoxiv.page
2025-09-16 12:21:17

SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève
arxiv.org/abs/2509.12093

@arXiv_eessAS_bot@mastoxiv.page
2025-09-25 09:19:32

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić
arxiv.org/abs/2509.20060

@arXiv_eessAS_bot@mastoxiv.page
2025-09-30 08:41:21

BFA: Real-time Multilingual Text-to-speech Forced Alignment
Abdul Rehman, Jingyao Cai, Jian-Jun Zhang, Xiaosong Yang
arxiv.org/abs/2509.23147

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:17:41

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, Carlos Carvalho, Karen Rosero

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 10:03:31

Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation
Qi Wang, Shituo Ma, Guoxin Yu, Hanyang Peng, Yue Yu
arxiv.org/abs/2509.16010

@arXiv_eessAS_bot@mastoxiv.page
2025-09-29 10:01:27

Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen
arxiv.org/abs/2509.22167

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:16:51

Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nick Karpov, Jagadeesh Balam, Boris Ginsburg
arxiv.org/abs/2509.14128

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 08:16:11

Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data
Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, Hwayeon Kim
arxiv.org/abs/2509.15389

@arXiv_eessAS_bot@mastoxiv.page
2025-09-18 07:52:31

TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models
Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson
arxiv.org/abs/2509.13395

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:43:02

OLaPh: Optimal Language Phonemizer
Johannes Wirth
arxiv.org/abs/2509.20086 arxiv.org/pdf/2509.20086

@arXiv_eessAS_bot@mastoxiv.page
2025-09-16 09:13:06

Length-Aware Rotary Position Embedding for Text-Speech Alignment
Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton
arxiv.org/abs/2509.11084

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 10:04:20

MBCodec: Thorough disentangle for high-fidelity audio compression
Ruonan Zhang, Xiaoyang Hao, Yichen Han, Junjie Cao, Yue Liu, Kai Zhang
arxiv.org/abs/2509.17006

@arXiv_eessAS_bot@mastoxiv.page
2025-09-22 09:01:21

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
Ziqi Dai, Yiting Chen, Jiacheng Xu, Liufei Xie, Yuchen Wang, Zhenchuan Yang, Bingsong Bai, Yangsheng Gao, Wenjiang Zhou, Weifeng Zhao, Ruohua Zhou
arxiv.org/abs/2509.15845

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 09:33:01

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen
arxiv.org/abs/2509.14579

@arXiv_csSD_bot@mastoxiv.page
2025-10-06 08:55:29

AudioToolAgent: An Agentic Framework for Audio-Language Models
Gijs Wijngaard, Elia Formisano, Michel Dumontier
arxiv.org/abs/2510.02995 ar…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:45:20

DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
Jessica Ojo, Zina Kamel, David Ifeoluwa Adelani
arxiv.org/abs/2509.17768

@arXiv_eessAS_bot@mastoxiv.page
2025-09-18 09:34:01

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee
arxiv.org/abs/2509.13989

@arXiv_csSD_bot@mastoxiv.page
2025-09-23 10:09:00

Bridging the gap between training and inference in LM-based TTS models
Ruonan Zhang, Lingzhou Mu, Xixin Wu, Kai Zhang
arxiv.org/abs/2509.17021

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 07:59:41

Emotion-Aware Speech Generation with Character-Specific Voices for Comics
Zhiwen Qian, Jinhua Liang, Huan Zhang
arxiv.org/abs/2509.15253 ar…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-23 11:22:20

Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
Zirui Li, Jens Edlund, Yicheng Gu, Nhan Phan, Lauri Juvela, Mikko Kurimo
arxiv.org/abs/2509.17988

@arXiv_eessAS_bot@mastoxiv.page
2025-09-23 10:57:41

Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook
Min Liu, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, Jianhao Ye
arxiv.org/abs/2509.17516

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 09:41:00

A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis
Javeria Amir, Farwa Attaria, Mah Jabeen, Umara Noor, Zahid Rashid
arxiv.org/abs/2509.12831

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 08:53:51

A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication
Ryan Collette, Ross Greenwood, Serena Nicoll
arxiv.org/abs/2509.15462

@arXiv_csSD_bot@mastoxiv.page
2025-10-01 12:49:32

Crosslisted article(s) found for cs.SD. arxiv.org/list/cs.SD/new
[1/1]:
- Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization
Jiacheng Shi, Hongfei Du, Yangfan He, Y. Alicia Hong, Ye Gao