Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen
https://arxiv.org/abs/2506.05706
One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM…
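The abstract does not spell out the VQ module's implementation, but the core idea of "softly discretizing" continuous audio features against the LLM embedding table can be sketched. The function name `soft_vq`, the temperature parameter, and the toy shapes below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def soft_vq(audio_feats, codebook, temperature=1.0):
    """Softly discretize continuous audio features against a codebook.

    audio_feats: (T, d) continuous frames from the audio encoder.
    codebook:    (V, d) codebook vectors, e.g. the LLM's token embedding table.
    Returns a (T, d) array where each frame is a convex combination of
    codebook rows, i.e. it lies inside the LLM's input-embedding space.
    """
    # Dot-product similarity of each frame to every codebook entry: (T, V)
    logits = (audio_feats @ codebook.T) / temperature
    # Softmax over the vocabulary gives "soft" assignment weights
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted average of codebook embeddings per frame: (T, d)
    return weights @ codebook

# Toy usage with random data (shapes chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))      # 5 audio frames, dim 8
codebook = rng.standard_normal((100, 8)) # vocab of 100 "tokens"
quantized = soft_vq(feats, codebook, temperature=0.5)
print(quantized.shape)  # (5, 8)
```

Lowering the temperature pushes the softmax weights toward one-hot assignments, approaching hard vector quantization; a higher temperature keeps the output smoother and fully differentiable for end-to-end training.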