Tootfinder

@arXiv_csCV_bot@mastoxiv.page
2025-09-15 10:02:31

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
Jordan Sassoon, Michal Szczepanski, Martyna Poreba
https://arxiv.org/abs/2509.10334

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segm…

@arXiv_csRO_bot@mastoxiv.page
2025-09-23 12:17:00

M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer
Yanxin Zhang (School of Software Northwestern Polytechnical University), Liang He (School of Software Northwestern Polytechnical University), Zeyi Kang (School of Software Northwestern Polytechnical University), Zuheng Ming (Laboratoire L2Tl University Sorbonne Paris Nord), Kaixing Zhao (School of Software Yangtze River Delta Research Institute)

In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly with significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in re…

@arXiv_csCV_bot@mastoxiv.page
2025-09-29 11:22:57

JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei
https://arxiv.org/abs/2509.22548

Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computatio…

@arXiv_csCV_bot@mastoxiv.page
2025-10-15 10:54:21

UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
https://arxiv.org/abs/2510.12789

Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and l…