ClustViT: Clustering-based Token Merging for Semantic Segmentation
Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis
https://arxiv.org/abs/2510.01948
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmen…
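
To make the cost argument concrete, the following is a minimal sketch of why merging tokens reduces attention cost. It is not the ClustViT method: the helper names `attention_pairs` and `merge_tokens`, the 224x224 input with 16x16 patches, the target of 64 tokens, and the hand-written cluster labels are all assumptions chosen for illustration. The merge step simply averages tokens that share a cluster label.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return num_tokens * num_tokens


def merge_tokens(tokens, labels):
    """Average all tokens that share a cluster label into one token.

    tokens: list of equal-length feature vectors (lists of floats).
    labels: one cluster id per token (hypothetical; assigned by hand here).
    Returns the merged tokens, ordered by cluster id.
    """
    sums, counts = {}, {}
    for vec, lab in zip(tokens, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return [[s / counts[lab] for s in sums[lab]] for lab in sorted(sums)]


# Assumed setup: a 224x224 image with 16x16 patches gives 14 * 14 = 196 tokens.
full = attention_pairs(196)
# Assumed outcome: merging leaves 64 tokens.
reduced = attention_pairs(64)

# Toy example: the first two tokens share cluster 0, the third is cluster 1.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
merged = merge_tokens(tokens, labels)
```

Under these assumptions the number of attention pairs drops from 196² to 64², roughly a ninefold reduction, while the merged sequence no longer has one token per patch. That loss of per-patch correspondence is what makes plain merging awkward for dense prediction tasks such as segmentation.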