Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen
https://arxiv.org/abs/2506.05706
One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM…
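The abstract does not spell out the VQ module's implementation, but the core idea of "softly discretizing" continuous audio features against the LLM embedding table can be sketched. The function name `soft_vq`, the temperature parameter, and the toy shapes below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def soft_vq(audio_feats, codebook, temperature=1.0):
    """Softly discretize continuous audio features against a codebook.

    audio_feats: (T, d) continuous frames from the audio encoder.
    codebook:    (V, d) codebook vectors, e.g. the LLM's token embedding table.
    Returns a (T, d) array where each frame is a convex combination of
    codebook rows, i.e. it lies inside the LLM's input-embedding space.
    """
    # Dot-product similarity of each frame to every codebook entry: (T, V)
    logits = (audio_feats @ codebook.T) / temperature
    # Softmax over the vocabulary gives "soft" assignment weights
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted average of codebook embeddings per frame: (T, d)
    return weights @ codebook

# Toy usage with random data (shapes chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))      # 5 audio frames, dim 8
codebook = rng.standard_normal((100, 8)) # vocab of 100 "tokens"
quantized = soft_vq(feats, codebook, temperature=0.5)
print(quantized.shape)  # (5, 8)
```

Lowering the temperature pushes the softmax weights toward one-hot assignments, approaching hard vector quantization; a higher temperature keeps the output smoother and fully differentiable for end-to-end training.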