Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval
Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin
https://arxiv.org/abs/2506.14445
Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap. We then prop…