Tootfinder

@arXiv_csLG_bot@mastoxiv.page
2025-12-22 10:34:10

Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
Liam Collins, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Donald Loveland, Leonardo Neves, Neil Shah
https://arxiv.org/abs/2512.17820 https://arxiv.org/pdf/2512.17820 https://arxiv.org/html/2512.17820
arXiv:2512.17820v1 Announce Type: new
Abstract: Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method's simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.
toXiv_bot_toot
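
The abstract above describes training an ID-based and a text-based model independently and then combining them with a simple ensembling strategy, but it does not say which one. As a minimal sketch only, assuming the ensemble is a weighted average of each model's softmax-normalized item scores (the names `ensemble_scores`, `top_k`, and `weight` are hypothetical and not taken from the paper):

```python
from math import exp


def softmax(scores):
    """Turn raw item scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(id_scores, text_scores, weight=0.5):
    """Weighted average of two independently trained models' item probabilities.

    Assumption: both lists score the same candidate items in the same order.
    `weight` is the share given to the ID model; the paper's actual
    combination rule may differ.
    """
    if len(id_scores) != len(text_scores):
        raise ValueError("both models must score the same candidate items")
    id_probs = softmax(id_scores)
    text_probs = softmax(text_scores)
    return [weight * p + (1.0 - weight) * q for p, q in zip(id_probs, text_probs)]


def top_k(scores, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical scores for three candidate items.
id_model = [2.0, 1.9, 0.0]    # ID model slightly prefers item 0
text_model = [0.0, 3.0, 0.0]  # text model strongly prefers item 1
combined = ensemble_scores(id_model, text_model)
best = top_k(combined, 1)
```

Averaging normalized probabilities rather than raw scores keeps one model's score scale from dominating the other; in this toy case the text model's strong preference outweighs the ID model's narrow one.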

@arXiv_csAI_bot@mastoxiv.page
2025-10-15 10:15:31

Artificial Intelligence Virtual Cells: From Measurements to Decisions across Modality, Scale, Dynamics, and Evaluation
Chengpeng Hu, Calvin Yu-Chian Chen
https://arxiv.org/abs/2510.12498

Artificial Intelligence Virtual Cells (AIVCs) aim to learn executable, decision-relevant models of cell state from multimodal, multiscale measurements. Recent studies have introduced single-cell and spatial foundation models, improved cross-modality alignment, scaled perturbation atlases, and explored pathway-level readouts. Nevertheless, although held-out validation is standard practice, evaluations remain predominantly within single datasets and settings; evidence indicates that transport acr…

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:24:51

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
https://arxiv.org/abs/2510.12116

End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we …

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:34:48

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
https://arxiv.org/abs/2510.11330

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP…

@seeingwithsound@mas.to
2025-12-13 14:33:38

(PDF, PhD thesis 2018) Improving visual-to-auditory cross-modality information conversions https://eprints.nottingham.ac.uk/55721/ by Shern Shiou Tan, on visual-to-auditory sensory substitution (VASS) devices.
"The integration of visual recognition in parallel with the soundscape will be the…

@arXiv_csIR_bot@mastoxiv.page
2025-10-14 10:00:58

Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation
Donglin Zhou, Weike Pan, Zhong Ming
https://arxiv.org/abs/2510.10556 htt…

Sequential recommendation (SR) models often capture user preferences based on the historically interacted item IDs, which usually obtain sub-optimal performance when the interaction history is limited. Content-based sequential recommendation has recently emerged as a promising direction that exploits items' textual and visual features to enhance preference learning. However, there are still three key challenges: (i) how to reduce the semantic gap between different content modality representatio…

@arXiv_csGR_bot@mastoxiv.page
2025-10-13 07:52:00

A 3D Generation Framework from Cross Modality to Parameterized Primitive
Yiming Liang, Huan Yu, Zili Wang, Shuyou Zhang, Guodong Yi, Jin Wang, Jianrong Tan
https://arxiv.org/abs/2510.08656

Recent advancements in AI-driven 3D model generation have leveraged cross modality, yet generating models with smooth surfaces and minimizing storage overhead remain challenges. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. In the framework, a model generation algorithm based on parameterized primitives is proposed, which can identify the shape features of the model constituent elements, …

@arXiv_csCV_bot@mastoxiv.page
2025-12-12 10:44:10

Mull-Tokens: Modality-Agnostic Latent Thinking
Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu
https://arxiv.org/abs/2512.10941

@arXiv_csMM_bot@mastoxiv.page
2025-10-15 12:37:23

Replaced article(s) found for cs.MM. https://arxiv.org/list/cs.MM/new
[1/1]:
- Towards Robust and Realible Multimodal Misinformation Recognition with Incomplete Modality
Hengyang Zhou, Yiwei Wei, Jian Yang, Zhenyu Zhang

@arXiv_physicsinsdet_bot@mastoxiv.page
2025-10-14 10:38:38

Towards polarization-enhanced PET: Study of random background in polarization-correlated Compton events
Ana Marija Kožuljević, Tomislav Bokulić, Darko Grošev, Siddharth Parashari, Luka Pavelić, Marinko Rade, Marijan Žuvić, Mihael Makek
https://arxiv.org/abs/2510.11504

Positron Emission Tomography (PET) is a medical imaging modality that utilizes positron-emitting isotopes, such as Ga-68 and F-18, for many diagnostic purposes. The positron annihilates with an electron from the surrounding area, creating two photons of 511 keV energy and opposite momenta, entangled in their orthogonal polarizations. When each photon undergoes a Compton scattering process, the difference of their azimuthal scattering angles reflects the initial orthogonality of their polarizati…
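
To make the Compton scattering step in the excerpt above concrete, the scattered photon energy follows from the standard Compton formula E' = E / (1 + (E / m_e c^2)(1 - cos θ)). The sketch below is illustrative only and is not taken from the paper; the function name is hypothetical, and the electron rest energy is rounded to 511 keV to match the photon energy quoted in the excerpt.

```python
from math import cos, radians

ELECTRON_REST_ENERGY_KEV = 511.0  # rounded value of m_e * c^2


def compton_scattered_energy_kev(incident_kev, angle_deg):
    """Photon energy after Compton scattering through polar angle `angle_deg`."""
    ratio = incident_kev / ELECTRON_REST_ENERGY_KEV
    return incident_kev / (1.0 + ratio * (1.0 - cos(radians(angle_deg))))


# For a 511 keV annihilation photon the formula reduces to 511 / (2 - cos(theta)).
forward = compton_scattered_energy_kev(511.0, 0.0)           # no energy loss
right_angle = compton_scattered_energy_kev(511.0, 90.0)      # about half the energy
backscatter = compton_scattered_energy_kev(511.0, 180.0)     # about one third
```

This covers only the energy and polar-angle relation; the azimuthal-angle correlation between the two photons that the excerpt refers to is not modeled here.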

@arXiv_csCV_bot@mastoxiv.page
2025-10-13 10:38:20

TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control
Minkyoung Cho, Ruben Ohana, Christian Jacobsen, Adityan Jothi, Min-Hung Chen, Z. Morley Mao, Ethem Can
https://arxiv.org/abs/2510.09561

Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model's ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by c…

@arXiv_eessAS_bot@mastoxiv.page
2025-10-15 12:46:19

Replaced article(s) found for eess.AS. https://arxiv.org/list/eess.AS/new
[1/1]:
- Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech E...
Huang-Cheng Chou, Haibin Wu, Hung-yi Lee, Chi-Chun Lee

@arXiv_csIR_bot@mastoxiv.page
2025-10-14 11:39:48

Characterizing Web Search in The Age of Generative AI
Elisabeth Kirsten, Jost Grosse Perdekamp, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar
https://arxiv.org/abs/2510.11560

The advent of LLMs has given rise to a new type of web search: Generative search, where LLMs retrieve web pages related to a query and generate a single, coherent text as a response. This output modality stands in stark contrast to traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: Along what dimensions do generative search outputs differ from traditional web search? We compare Google, a traditional web search engine, with four g…

@arXiv_physicsmedph_bot@mastoxiv.page
2025-10-14 09:18:48

Stochastic numerical head phantoms to enable virtual imaging studies of transcranial photoacoustic computed tomography
Hsuan-Kai Huang, Joseph Kuo, Seonyeong Park, Umberto Villa, Lihong V. Wang, Mark A. Anastasio
https://arxiv.org/abs/2510.09758

Transcranial photoacoustic computed tomography (PACT) is an emerging neuroimaging modality, but skull-induced aberrations can result in severe image artifacts if not compensated for during image reconstruction. The development of advanced image reconstruction methods for transcranial PACT is hindered by the lack of well-characterized, clinically relevant evaluation frameworks. Virtual imaging studies offer a solution, but require realistic numerical phantoms. To address this need, this study in…

@arXiv_csSD_bot@mastoxiv.page
2025-10-13 08:08:00

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
https://arxiv.org/abs/2510.08580

Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) relia…

@arXiv_csCV_bot@mastoxiv.page
2025-10-13 10:35:30

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
https://arxiv.org/abs/2510.09473 https://…

The test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the …

@arXiv_csIR_bot@mastoxiv.page
2025-10-15 09:56:41

SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
https://arxiv.org/abs/2510.12709

Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding …

@arXiv_csIR_bot@mastoxiv.page
2025-10-14 11:05:58

Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction
Alin Fan, Hanqing Li, Sihan Lu, Jingsong Yuan, Jiandong Zhang
https://arxiv.org/abs/2510.11066

Modern industrial recommendation systems improve recommendation performance by integrating multimodal representations from pre-trained models into ID-based Click-Through Rate (CTR) prediction frameworks. However, existing approaches typically adopt modality-centric modeling strategies that process ID-based and multimodal embeddings independently, failing to capture fine-grained interactions between content semantics and behavioral signals. In this paper, we propose Decoupled Multimodal Fusion (…
