Tootfinder

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:08:10

Precise Action-to-Video Generation Through Visual Action Prompts
Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
https://arxiv.org/abs/2508.13104

Precise Action-to-Video Generation Through Visual Action Prompts
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dyna…

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 09:31:41

UMind: A Unified Multitask Network for Zero-Shot M/EEG Visual Decoding
Chengjian Xu, Yonghao Song, Zelin Liao, Haochuan Zhang, Qiong Wang, Qingqing Zheng
https://arxiv.org/abs/2509.14772

UMind: A Unified Multitask Network for Zero-Shot M/EEG Visual Decoding
Decoding visual information from time-resolved brain recordings, such as EEG and MEG, plays a pivotal role in real-time brain-computer interfaces. However, existing approaches primarily focus on direct brain-image feature alignment and are limited to single-task frameworks or task-specific models. In this paper, we propose a Unified MultItask Network for zero-shot M/EEG visual Decoding (referred to as UMind), including visual stimulus retrieval, classification, and reconstruction, where multiple t…

@seeingwithsound@mas.to
2025-09-20 18:42:27

(2024) Visual neuroprostheses for impaired human nervous system: State-of-the-art and future outlook #BCI

Visual pathway of the human visual system.

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:28:51

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang, Junjie Hu, Ming Jiang
https://arxiv.org/abs/2509.14837

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpr…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 10:09:31

From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition
Rishabh Jain, Naomi Harte
https://arxiv.org/abs/2509.14880 https://

From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition
Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches integrating these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data …

@arXiv_qbioNC_bot@mastoxiv.page
2025-09-19 09:00:51

Mouse vs. AI: A Neuroethological Benchmark for Visual Robustness and Neural Alignment
Marius Schneider, Joe Canzano, Jing Peng, Yuchen Hou, Spencer LaVere Smith, Michael Beyeler
https://arxiv.org/abs/2509.14446

Mouse vs. AI: A Neuroethological Benchmark for Visual Robustness and Neural Alignment
Visual robustness under real-world conditions remains a critical bottleneck for modern reinforcement learning agents. In contrast, biological systems such as mice show remarkable resilience to environmental changes, maintaining stable performance even under degraded visual input with minimal exposure. Inspired by this gap, we propose the Mouse vs. AI: Robust Foraging Competition, a novel bioinspired visual robustness benchmark to test generalization in reinforcement learning (RL) agents trained…

@arXiv_csRO_bot@mastoxiv.page
2025-09-19 09:48:21

BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots
Yufei Wei, Wangtao Lu, Sha Lu, Chenxiao Hu, Fuzhang Han, Rong Xiong, Yue Wang
https://arxiv.org/abs/2509.14636

BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots
Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework addressing both limitations without additional annotations. Our approach introduces: (1) dens…

@arXiv_eessIV_bot@mastoxiv.page
2025-08-20 09:13:40

Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device
Leander Melroy Maben, Keerthana Prasad, Shyamala Guruvare, Vidya Kudva, P C Siddalingaswamy
https://arxiv.org/abs/2508.13253

Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device
Cervical cancer is among the most commonly occurring cancers among women and claims a huge number of lives in low- and middle-income countries despite being relatively easy to treat. Several studies have shown that public screening programs can bring down cervical cancer incidence and mortality rates significantly. While several screening tests are available, visual inspection with acetic acid (VIA) presents itself as the most viable option for low-resource settings due to the affordability and s…

@arXiv_csCR_bot@mastoxiv.page
2025-08-19 11:38:10

Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Jaeung Lee, Suhyeon Yu, Yurim Jang, Simon S. Woo, Jaemin Jo
https://arxiv.org/abs/2508.12730

Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on ag…

@seeingwithsound@mas.to
2025-08-19 07:49:42

(LinkedIn) Revision Implant is, despite its name, already quickly looking for markets beyond visual prostheses https://www.linkedin.com/posts/revision-implant-nv_elmedix-medicalinnovation-oncology-activity-7363137424329252864-bp…

The expertise that we have developed in extreme miniaturization of medical devices is also finding its way to use cases beyond visual prostheses: in that regard, we are proud to announce that we have… | ReVision Implant
The expertise that we have developed in extreme miniaturization of medical devices is also finding its way to use cases beyond visual prostheses: in that regard, we are proud to announce that we have signed a long-term supply contract with Elmedix, to deliver a critical component to be used in their therapy. Elmedix is a clinical-stage medtech company that is developing a revolutionary heat-based therapy for metastasized cancers. We wish them all the best in their coming second round of clinic…

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 08:31:51

Sensing the Shape of Data: Non-Visual Exploration of Statistical Concepts in Histograms with Blind and Low-Vision Learners
Sanchita S. Kamath, Omar Khan, Aziz N Zeidieh, JooYoung Seo
https://arxiv.org/abs/2509.14452

Sensing the Shape of Data: Non-Visual Exploration of Statistical Concepts in Histograms with Blind and Low-Vision Learners
Statistical concepts often rely heavily on visual cues for comprehension, presenting challenges for individuals who face difficulties using visual information, such as the blind and low-vision (BLV) community. While prior work has explored making data visualizations accessible, limited research examines how BLV individuals conceptualize and learn the underlying statistical concepts these visualizations represent. To better understand BLV individuals' learning strategies for potentially unfamili…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:29:21

Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
Haobo Yang, Minghao Guo, Dequan Yang, Wenyu Wang
https://arxiv.org/abs/2509.15156 https://…

Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
Contemporary deep learning models have achieved impressive performance in image classification by primarily leveraging statistical regularities within large datasets, but they rarely incorporate structured insights drawn directly from perceptual psychology. To explore the potential of perceptually motivated inductive biases, we propose integrating classic geometric visual illusions, well-studied phenomena from human perception, into standard image-classification training pipelines. Specifically, …

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 09:34:40

AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin
https://arxiv.org/abs/2508.13606 ht…

AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answ…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 08:47:41

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
Yochai Yemini, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya
https://arxiv.org/abs/2509.14379

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the spee…

@arXiv_eessSY_bot@mastoxiv.page
2025-08-20 09:03:40

Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations
Jan Krejčí, Oliver Kost, Yuxuan Xia, Lennart Svensson, Ondřej Straka
https://arxiv.org/abs/2508.13647

Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations
This paper uses multi-object tracking methods known from the radar tracking community to address the problem of pedestrian tracking using 2D bounding box detections. The standard point-object (SPO) model is adopted, and the posterior density is computed using the Poisson multi-Bernoulli mixture (PMBM) filter. The selection of the model parameters rooted in continuous time is discussed, including the birth and survival probabilities. Some parameters are selected from the first principles, while …

@fanf@mendeddrum.org
2025-08-20 20:42:03

from my link log —
Game math: precise control over numeric springing.
https://allenchou.net/2015/04/game-math-precise-control-over-numeric-springing/
saved 2025-05-21

Game Math: Precise Control over Numeric Springing | Ming-Lun "Allen" Chou | 周明倫
This post is part of my Game Math Series. Source files are on GitHub. Check out this post if you want to see more visual examples of numeric springing. Numeric springing is a very powerful tool for procedural animation. You specify the initial value, initial velocity, target value, and some spring-related parameters; the result
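
As a rough illustration of the idea in this preview, here is a minimal sketch of one damped-spring update using an implicit Euler step. The function name `spring_step` and the parameter names `zeta` (damping ratio), `omega` (angular frequency), and `h` (time step) are assumptions made for this example; the linked post may use a different formulation.

```python
import math


def spring_step(x, v, target, zeta, omega, h):
    """Advance a damped spring by one time step h (implicit Euler).

    Solves x' = v and v' = -2*zeta*omega*v - omega^2*(x - target)
    for the next value and velocity. All names are illustrative.
    """
    f = 1.0 + 2.0 * h * zeta * omega
    hoo = h * omega * omega          # h * omega^2
    hhoo = h * hoo                   # h^2 * omega^2
    det_inv = 1.0 / (f + hhoo)
    new_x = (f * x + h * v + hhoo * target) * det_inv
    new_v = (v + hoo * (target - x)) * det_inv
    return new_x, new_v


if __name__ == "__main__":
    # Spring a value from 0 toward 10 at 60 steps per second.
    x, v = 0.0, 0.0
    for _ in range(120):
        x, v = spring_step(x, v, 10.0, 1.0, 2.0 * math.pi, 1.0 / 60.0)
    print(x, v)
```

The implicit step stays stable for any positive `h`, which is why it is a common choice for this kind of per-frame update; a damping ratio of 1.0 corresponds to critical damping.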

@arXiv_csSD_bot@mastoxiv.page
2025-08-20 07:44:19

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao
https://arxiv.org/abs/2508.13624

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal …

@arXiv_condmatsoft_bot@mastoxiv.page
2025-08-18 08:57:10

Large-scale dynamics in visual quorum sensing chiral suspensions
Yuxin Zhou, Qingqing Yin, Shubhadip Nayak, Poulami Bag, Pulak K. Ghosh, Yunyun Li, Fabio Marchesoni
https://arxiv.org/abs/2508.11254

Large-scale dynamics in visual quorum sensing chiral suspensions
Motility induced phase separation is an efficient aggregation mechanism of active matter, yet biological systems exhibit richer organization through communication among constituents. We investigate suspensions of active particles that change chirality when neighbor density within their visual cone exceeds a threshold, a communication based non-reciprocal interaction akin to quorum sensing. Tuning the visual cone triggers programmable transitions: from disorder to phase separation to hyper-unifo…

@blakes7bot@mas.torpidity.net
2025-08-20 12:18:34

Series C, Episode 07 - Children of Auron
C.A. ONE: For what reason?
FRANTON: Conquest.
C.A. ONE: That's ridiculous.
FRANTON: Surely it would be safer to wait for the Liberator. At least we can trust them, and we know they're coming, Zelda heard.
https://blake.torpidity.net/m/307/223

Claude 3.7 describes the image as: "The image appears to be from a classic British television production, likely from the late 1970s or early 1980s based on the visual quality and aesthetic.

The scene shows an elderly person with white hair wearing a dark jacket with teal/green elements and decorative trim. The individual has a serious, contemplative expression and appears to be in a conversation within what looks like an interior setting with light-colored walls or panels visible in the back…

@arXiv_csSE_bot@mastoxiv.page
2025-09-16 11:03:17

VisDocSketcher: Towards Scalable Visual Documentation with Agentic Systems
Luís F. Gomes, Xin Zhou, David Lo, Rui Abreu
https://arxiv.org/abs/2509.11942 https://

VisDocSketcher: Towards Scalable Visual Documentation with Agentic Systems
Visual documentation is an effective tool for reducing the cognitive barrier developers face when understanding unfamiliar code, enabling more intuitive comprehension. Compared to textual documentation, it provides a higher-level understanding of the system structure and data flow. Developers usually prefer visual representations over lengthy textual descriptions for large software systems. Visual documentation is both difficult to produce and challenging to evaluate. Manually creating it is ti…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 08:41:12

ASC-SW: Atrous strip convolution network with sliding windows for visual-assisted map navigation
Cheng Liu, Fan Zhu, Yaoyu Zhuang, Zhinan Chen, Jiefeng Tang
https://arxiv.org/abs/2507.12744

ASC-SW: Atrous strip convolution network with sliding windows for visual-assisted map navigation
With the rapid development of lightweight visual neural network architectures, traditional high-performance vision models have undergone significant compression, greatly improving their computational efficiency and energy consumption ratio. This makes them feasible for deployment on resource-constrained edge computing devices. We propose a visual-assisted navigation framework called Atrous Strip Convolution-Sliding Window (ASC-SW), which leverages a depth camera and a lightweight visual neural …

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:06:00

Omni Survey for Multimodality Analysis in Visual Object Tracking
Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Hui Li, Shaochuan Zhao, Tao Zhou, Chunyang Cheng, Xiaojun Wu, Josef Kittler
https://arxiv.org/abs/2508.13000

Omni Survey for Multimodality Analysis in Visual Object Tracking
The development of smart cities has led to the generation of massive amounts of multi-modal data in the context of a range of tasks that enable a comprehensive monitoring of the smart city infrastructure and services. This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT), from the perspective of multimodality analysis. Generally, MMVOT differs from single-modal tracking in four key aspects: data collection, modality alignment and annotation, model designi…

@seeingwithsound@mas.to
2025-08-17 21:23:36

Neuralink for visual prosthesis #Neuralink

Visual Prosthesis | Neuralink
Learn more about our future visual prosthesis clinical trials.

@arXiv_csIR_bot@mastoxiv.page
2025-07-17 08:36:10

Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Rachna Saxena, Abhijeet Kumar, Suresh Shanmugam
https://arxiv.org/abs/2507.12378

Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Traditional information extraction systems face challenges with text-only language models, as these do not consider infographics (visual elements of information) such as tables, charts, and images, which are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face the needle-in-a-haystack problem, i.e., either a longer context length or a substantial number of documents as the search space. Late interaction mechanism over visual language models has shown state of the art p…

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:46:20

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha
https://arxiv.org/abs/2508.12687

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, a first benchmark to evaluate MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open and closed-ended questions designed to trigger…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:16:30

VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
Jailing Lin, Shu Jiang, Qingyuan Zeng, Zhenzhong Wang, Min Jiang
https://arxiv.org/abs/2508.13792

VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networ…

@arXiv_csHC_bot@mastoxiv.page
2025-09-18 09:23:21

Py maidr: Bridging Visual and Non-Visual Data Experiences Through a Unified Python Framework
JooYoung Seo, Saairam Venkatesh, Daksh Pokar, Sanchita Kamath, Krishna Anandan Ganesan
https://arxiv.org/abs/2509.13532

Py maidr: Bridging Visual and Non-Visual Data Experiences Through a Unified Python Framework
Although recent efforts have developed accessible data visualization tools for blind and low-vision (BLV) users, most follow a "design for them" approach that creates an unintentional divide between sighted creators and BLV consumers. This unidirectional paradigm perpetuates a power dynamic where sighted creators produce non-visual content boundaries for BLV consumers to access. This paper proposes a bidirectional approach, "design for us," where both sighted and BLV collaborators can employ th…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:46:50

Cross-Modal Knowledge Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
Qing Wang, Ya Jiang, Hang Chen, Sabato Marco Siniscalchi, Jun Du, Jianqing Gao
https://arxiv.org/abs/2508.12334

Cross-Modal Knowledge Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
This work presents a cross-modal knowledge distillation (CMKD) framework combined with multi-level data augmentation for low-resource audio-visual (AV) sound event localization and detection (SELD). An audio-only SELD model acts as the teacher, transferring knowledge to an AV student model through both output responses and intermediate feature representations. To enhance learning, data augmentation is applied by mixing features randomly selected from multiple network layers and associated loss …

@arXiv_csCL_bot@mastoxiv.page
2025-08-18 09:48:00

Dataset Creation for Visual Entailment using Generative AI
Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds
https://arxiv.org/abs/2508.11605 https://arxiv.o…

Dataset Creation for Visual Entailment using Generative AI
In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our datas…

@seeingwithsound@mas.to
2025-07-19 19:52:16

Optimizing electrical stimulation parameters to enhance visual cortex activation in retina degeneration rats #BCI

Optimizing electrical stimulation parameters to enhance visual cortex activation in retina degeneration rats - Scientific Reports
In patients with degenerative retinal diseases such as retinitis pigmentosa and age-related macular degeneration, retinal prostheses offer a promising approach to restoring partial vision. Among these, epiretinal prostheses have shown encouraging preliminary clinical efficacy; however, optimizing stimulation parameters remains essential for improving efficiency and reducing power consumption. In this study, we investigated the effects of key electrical stimulation parameters: phase duration, …

@arXiv_csRO_bot@mastoxiv.page
2025-09-19 09:09:21

Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring
Abhiroop Ajith, Constantinos Chamzas
https://arxiv.org/abs/2509.14460 https://…

Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring
Learning abstractions directly from data is a core challenge in robotics. Humans naturally operate at an abstract level, reasoning over high-level subgoals while delegating execution to low-level motor skills -- an ability that enables efficient problem solving in complex environments. In robotics, abstractions and hierarchical reasoning have long been central to planning, yet they are typically hand-engineered, demanding significant human effort and limiting scalability. Automating the discove…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:17:10

Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Diaa Addeen Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte, Carlo Ratti
https://arxiv.org/abs/2508.13814

Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery …
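
For readers unfamiliar with the two diversity measures named in this abstract, the sketch below computes them from a list of labels. It assumes the natural-log Shannon index and the Gini-Simpson form (1 minus the sum of squared proportions); the abstract does not say which Simpson variant the paper uses, and the function name `diversity_indices` is hypothetical.

```python
import math
from collections import Counter


def diversity_indices(labels):
    """Return (Shannon index, Gini-Simpson index) for a list of labels.

    Labels can be species names or cluster ids; both are illustrative inputs.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = [c / total for c in counts.values()]
    # Shannon: H = -sum(p * ln p); Gini-Simpson: 1 - sum(p^2).
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = 1.0 - sum(p * p for p in proportions)
    return shannon, simpson


if __name__ == "__main__":
    print(diversity_indices(["oak", "oak", "maple", "linden"]))
```

Both indices are zero when every label is identical and grow as the labels become more evenly spread across categories.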

@arXiv_qbioNC_bot@mastoxiv.page
2025-08-19 09:37:20

Synchronization and semantization in deep spiking networks
Jonas Oberste-Frielinghaus, Anno C. Kurth, Julian Göltz, Laura Kriener, Junji Ito, Mihai A. Petrovici, Sonja Grün
https://arxiv.org/abs/2508.12975

Synchronization and semantization in deep spiking networks
Recent studies have shown how spiking networks can learn complex functionality through error-correcting plasticity, but the resulting structures and dynamics remain poorly studied. To elucidate how these models may link to observed dynamics in vivo and thus how they may ultimately explain cortical computation, we need a better understanding of their emerging patterns. We train a multi-layer spiking network, as a conceptual analog of the bottom-up visual hierarchy, for visual input classificatio…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 10:09:51

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic event Classification
Yuanjian Chen, Yang Xiao, Jinjie Huang
https://arxiv.org/abs/2509.14893 https://…

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic event Classification
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat audio and visual streams separately, fusing features later with contrastive or mutual information objectives. Recent advances explore multimodal graph learning, but most fail to distinguish between intra- and inter-modal t…

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 11:23:47

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[4/6]:
- End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 11:29:30

Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Yuying Zhang, Joni Pajarinen
https://arxiv.org/abs/2508.13151 https://

Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Mobile manipulation in dynamic environments is challenging due to movable obstacles blocking the robot's path. Traditional methods, which treat navigation and manipulation as separate tasks, often fail in such 'manipulate-to-navigate' scenarios, as obstacles must be removed before navigation. In these cases, active interaction with the environment is required to clear obstacles while ensuring sufficient space for movement. To address the manipulate-to-navigate problem, we propose a reinforcemen…

@arXiv_csHC_bot@mastoxiv.page
2025-08-19 09:48:30

fCrit: A Visual Explanation System for Furniture Design Creative Support
Vuong Nguyen, Gabriel Vigliensoni
https://arxiv.org/abs/2508.12416 https://arxiv.o…

fCrit: A Visual Explanation System for Furniture Design Creative Support
We introduce fCrit, a dialogue-based AI system designed to critique furniture design with a focus on explainability. Grounded in reflective learning and formal analysis, fCrit employs a multi-agent architecture informed by a structured design knowledge base. We argue that explainability in the arts should not only make AI reasoning transparent but also adapt to the ways users think and talk about their designs. We demonstrate how fCrit supports this process by tailoring explanations to users' d…

@seeingwithsound@mas.to
2025-09-18 09:12:54

Visual image reconstruction from brain activity via latent representation https://www.annualreviews.org/content/journals/10.1146/annurev-vision-110423-023616 by @…

Psychological measurement of subjective visual experiences through image reconstruction. (a) Mapping of brain, stimulus, and mind. Dots represent instances of visual experience (e.g., an image, perception, and corresponding brain activity). Veridical perception assumes that the mind accurately represents stimuli. The brain–mind mapping is considered fixed, while the brain–stimulus relationship is empirically identified. (b) Nonveridical perception (e.g., mental imagery, attentional modulation, …

@arXiv_csCL_bot@mastoxiv.page
2025-07-18 09:48:32

Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis
Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao
https://arxiv.org/abs/2507.13285

Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis
Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrat…

@arXiv_csHC_bot@mastoxiv.page
2025-09-19 09:03:51

VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
https://arxiv.org/abs/2509.14571

VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remai…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:36:10

Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks
Jakub Łucki, Jonathan Becktor, Georgios Georgakis, Robert Royce, Shehryar Khattak
https://arxiv.org/abs/2508.11584

Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks
Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backb…

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:20:02

Leveraging Pre-Trained Visual Models for AI-Generated Video Detection
Keerthi Veeramachaneni, Praveen Tirupattur, Amrit Singh Bedi, Mubarak Shah
https://arxiv.org/abs/2507.13224

Leveraging Pre-Trained Visual Models for AI-Generated Video Detection
Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deep…

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 09:54:40

V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task
Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu
https://arxiv.org/abs/2508.13634

V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task
Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring processing background regions causes attention drift from the desired area, and (2) uniform labeling fails to distinguish between center and edges of the target UI element, leadin…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:25:01

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo, Zhenbo Li, Wenwu Wang
https://arxiv.org/abs/2509.14097

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via a…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 09:10:52

FFI-VTR: Lightweight and Robust Visual Teach and Repeat Navigation based on Feature Flow Indicator and Probabilistic Motion Planning
Jikai Wang, Yunqi Cheng, Zonghai Chen
https://arxiv.org/abs/2507.12800

FFI-VTR: Lightweight and Robust Visual Teach and Repeat Navigation based on Feature Flow Indicator and Probabilistic Motion Planning
Although visual teach and repeat navigation is a convenient solution for mobile robot self-navigation, balancing efficiency and robustness in the task environment remains a challenge. In this paper, we propose a novel visual teach and repeat robotic autonomous navigation method that requires no accurate localization or dense reconstruction modules, which makes our system lightweight and robust. Firstly, feature flow is introduced and we develop a qualitative mapping between …

@seeingwithsound@mas.to
2025-09-17 15:05:02

How tree shrews see the world - A compressed hierarchy for visual form processing in the tree shrew #neuroscience

A compressed hierarchy for visual form processing in the tree shrew - Nature
Tree shrews show a primate-like hierarchical organization in their visual pathway and object decoding accuracy, along with strongly face-selective cells, demonstrating how core computational principles of visual form processing found in primates are conserved yet compressed.

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:07:40

Checkmate: interpretable and explainable RSVQA is the endgame
Lucrezia Tosato, Christel Tartini Chappuis, Syrielle Montariol, Flora Weissgerber, Sylvain Lobry, Devis Tuia
https://arxiv.org/abs/2508.13086

Checkmate: interpretable and explainable RSVQA is the endgame
Remote Sensing Visual Question Answering (RSVQA) presents unique challenges in ensuring that model decisions are both understandable and grounded in visual content. Current models often suffer from a lack of interpretability and explainability, as well as from biases in dataset distributions that lead to shortcut learning. In this work, we tackle these issues by introducing a novel RSVQA dataset, Chessboard, designed to minimize biases through 3'123'253 questions and a balanced answer distribut…

@arXiv_csHC_bot@mastoxiv.page
2025-08-20 08:17:50

Visuo-Tactile Feedback with Hand Outline Styles for Modulating Affective Roughness Perception
Minju Baeck, Yoonseok Shin, Dooyoung Kim, Hyunjin Lee, Sang Ho Yoon, Woontack Woo
https://arxiv.org/abs/2508.13504

Visuo-Tactile Feedback with Hand Outline Styles for Modulating Affective Roughness Perception
We propose a visuo-tactile feedback method that combines virtual hand visualization and fingertip vibrations to modulate affective roughness perception in VR. While prior work has focused on object-based textures and vibrotactile feedback, the role of visual feedback on virtual hands remains underexplored. Our approach introduces affective visual cues including line shape, motion, and color applied to hand outlines, and examines their influence on both affective responses (arousal, valence) and…

@arXiv_csCV_bot@mastoxiv.page
2025-08-18 09:53:10

OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary
https://arxiv.org/abs/2508.11482

OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, anno…

@seeingwithsound@mas.to
2025-09-20 16:23:45

Neurons and Pixels https://www.neurotechreports.com/pages/publishersletterJul24.html by James Cavuoto on Neuralink Blindsight and the graveyard of commercial failures: Optobionics, Retina Implant, Second Sight, Pixium Vision and others.

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 08:01:30

Robust Online Calibration for UWB-Aided Visual-Inertial Navigation with Bias Correction
Yizhi Zhou, Jie Xu, Jiawei Xia, Zechen Hu, Weizi Li, Xuan Wang
https://arxiv.org/abs/2508.10999

Robust Online Calibration for UWB-Aided Visual-Inertial Navigation with Bias Correction
This paper presents a novel robust online calibration framework for Ultra-Wideband (UWB) anchors in UWB-aided Visual-Inertial Navigation Systems (VINS). Accurate anchor positioning, a process known as calibration, is crucial for integrating UWB ranging measurements into state estimation. While several prior works have demonstrated satisfactory results by using robot-aided systems to autonomously calibrate UWB systems, there are still some limitations: 1) these approaches assume accurate robot l…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:31:41

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou
https://arxiv.org/abs/2509.15185

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder th…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:20:41

UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets
Pengyu Wang, Shaojun Zhou, Chenkun Tan, Xinghao Wang, Wei Huang, Zhen Ye, Zhaowei Li, Botian Jiang, Dong Zhang, Xipeng Qiu
https://arxiv.org/abs/2509.14738

UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance…

@arXiv_csRO_bot@mastoxiv.page
2025-09-18 10:11:51

BIM Informed Visual SLAM for Construction Monitoring
Asier Bikandi, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez
https://arxiv.org/abs/2509.13972

BIM Informed Visual SLAM for Construction Monitoring
Simultaneous Localization and Mapping (SLAM) is a key tool for monitoring construction sites, where aligning the evolving as-built state with the as-planned design enables early error detection and reduces costly rework. LiDAR-based SLAM achieves high geometric precision, but its sensors are typically large and power-demanding, limiting their use on portable platforms. Visual SLAM offers a practical alternative with lightweight cameras already embedded in most mobile devices. However, visually …

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:26:21

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Bo-Wen Yin, Jiao-Long Cao, Xuying Zhang, Yuming Chen, Ming-Ming Cheng, Qibin Hou
https://arxiv.org/abs/2509.15096

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We …

@arXiv_csHC_bot@mastoxiv.page
2025-07-17 09:44:10

Deconstructing Implicit Beliefs in Visual Data Journalism: Unstable Meanings Behind Data as Truth & Design for Insight
Ke Er Amy Zhang, Jodie Jenkinson, Laura Garrison
https://arxiv.org/abs/2507.12377

Deconstructing Implicit Beliefs in Visual Data Journalism: Unstable Meanings Behind Data as Truth & Design for Insight
We conduct a deconstructive reading of a qualitative interview study with 17 visual data journalists from newsrooms across the globe. We borrow a deconstruction approach from literary critique to explore the instability of meaning in language and reveal implicit beliefs in words and ideas. Through our analysis we surface two sets of opposing implicit beliefs in visual data journalism: objectivity/subjectivity and humanism/mechanism. We contextualize these beliefs through a genealogical analysis…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:16:40

A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
Enobong Adahada, Isabel Sassoon, Kate Hone, Yongmin Li
https://arxiv.org/abs/2508.13796 …

A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cro…

@arXiv_csRO_bot@mastoxiv.page
2025-07-16 10:16:51

Comparison of Localization Algorithms between Reduced-Scale and Real-Sized Vehicles Using Visual and Inertial Sensors
Tobias Kern, Leon Tolksdorf, Christian Birkner
https://arxiv.org/abs/2507.11241

Comparison of Localization Algorithms between Reduced-Scale and Real-Sized Vehicles Using Visual and Inertial Sensors
Physically reduced-scale vehicles are emerging to accelerate the development of advanced automated driving functions. In this paper, we investigate the effects of scaling on self-localization accuracy with visual and visual-inertial algorithms using cameras and an inertial measurement unit (IMU). For this purpose, ROS2-compatible visual and visual-inertial algorithms are selected, and datasets are chosen as a baseline for real-sized vehicles. A test drive is conducted to record data of reduced-…

@arXiv_csHC_bot@mastoxiv.page
2025-09-17 10:22:00

More than Meets the Eye: Understanding the Effect of Individual Objects on Perceived Visual Privacy
Mete Harun Akcay, Siddharth Prakash Rao, Alexandros Bakas, Buse Gul Atli
https://arxiv.org/abs/2509.13051

More than Meets the Eye: Understanding the Effect of Individual Objects on Perceived Visual Privacy
User-generated content, such as photos, comprises the majority of online media content and drives engagement due to the human ability to process visual information quickly. Consequently, many online platforms are designed for sharing visual content, with billions of photos posted daily. However, photos often reveal more than they intended through visible and contextual cues, leading to privacy risks. Previous studies typically treat privacy as a property of the entire image, overlooking individ…

@arXiv_csCV_bot@mastoxiv.page
2025-09-15 10:03:01

Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos, Eda B. Özyiğit
https://arxiv.org/abs/2509.10345 https://

Towards Understanding Visual Grounding in Visual Language Models
Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, caption visual context by explicitly referring to entities, as well as low and high-level control in simulated and real …

@arXiv_csCV_bot@mastoxiv.page
2025-09-17 10:53:10

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
https://arxiv.org/abs/2509.13067

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover t…

@arXiv_csRO_bot@mastoxiv.page
2025-07-17 09:56:40

Assessing the Value of Visual Input: A Benchmark of Multimodal Large Language Models for Robotic Path Planning
Jacinto Colan, Ana Davila, Yasuhisa Hasegawa
https://arxiv.org/abs/2507.12391

Assessing the Value of Visual Input: A Benchmark of Multimodal Large Language Models for Robotic Path Planning
Large Language Models (LLMs) show potential for enhancing robotic path planning. This paper assesses visual input's utility for multimodal LLMs in such tasks via a comprehensive benchmark. We evaluated 15 multimodal LLMs on generating valid and optimal paths in 2D grid environments, simulating simplified robotic planning, comparing text-only versus text-plus-visual inputs across varying model sizes and grid complexities. Our results indicate moderate success rates on simpler small grids, where …

@seeingwithsound@mas.to
2025-09-16 12:22:42

Encoding visual stimuli by striatal neurons (in mice) https://www.biorxiv.org/content/10.1101/2025.09.15.676378v1 "Although visual object encoding is considered a cortical attribute, subcortical areas also contain visual processing circuits."

@arXiv_csRO_bot@mastoxiv.page
2025-09-17 10:35:50

DVDP: An End-to-End Policy for Mobile Robot Visual Docking with RGB-D Perception
Haohan Min, Zhoujian Li, Yu Yang, Jinyu Chen, Shenghai Yuan
https://arxiv.org/abs/2509.13024 htt…

DVDP: An End-to-End Policy for Mobile Robot Visual Docking with RGB-D Perception
Automatic docking has long been a significant challenge in the field of mobile robotics. Compared to other automatic docking methods, visual docking methods offer higher precision and lower deployment costs, making them an efficient and promising choice for this task. However, visual docking methods impose strict requirements on the robot's initial position at the start of the docking process. To overcome the limitations of current vision-based methods, we propose an innovative end-to-end visua…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:22:31

Distractor-Aware Memory-Based Visual Object Tracking
Jovana Videnovic, Matej Kristan, Alan Lukezic
https://arxiv.org/abs/2509.13864 https://arxiv.org/pdf/2…

Distractor-Aware Memory-Based Visual Object Tracking
Recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design …

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:22:02

$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
https://arxiv.org/abs/2507.13347

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning
We introduce $π^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $π^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps …

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:25:51

An Exploratory Study on Abstract Images and Visual Representations Learned from Them
Haotian Li, Jianbo Jiao
https://arxiv.org/abs/2509.14149 https://arxiv…

An Exploratory Study on Abstract Images and Visual Representations Learned from Them
Imagine living in a world composed solely of primitive shapes: could you still recognise familiar objects? Recent studies have shown that abstract images, constructed by primitive shapes, can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content…

@arXiv_csCV_bot@mastoxiv.page
2025-08-18 09:53:50

Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition
Feiyue Zhao, Zhichao Zhang
https://arxiv.org/abs/2508.11497 https://

Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition
Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limitation, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based reasoning into CNNs to enhance both structural awareness and feature representation. HG…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:24:41

VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu
https://arxiv.org/abs/2509.14060 ht…

VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhanc…

@arXiv_csRO_bot@mastoxiv.page
2025-07-18 09:23:22

LaViPlan : Language-Guided Visual Path Planning with RLVR
Hayeon Oh
https://arxiv.org/abs/2507.12911 https://arxiv.org/pdf/2507.12911…

LaViPlan : Language-Guided Visual Path Planning with RLVR
Out-of-distribution (OOD) scenarios in autonomous driving refer to situations that deviate from the training domain, often leading to unexpected and potentially hazardous behavior from planners that lack prior exposure to such cases. Recently, Vision-Language Models (VLMs) have been introduced into autonomous driving research for their promising generalization capabilities in OOD settings. Early studies demonstrated that VLMs could recognize OOD scenarios and generate user-level decisions such …

@arXiv_csCV_bot@mastoxiv.page
2025-08-18 09:52:30

Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation
Daniel Airinei, Elena Burceanu, Marius Leordeanu
https://arxiv.org/abs/2508.11446

Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation
Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target fro…

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 08:35:20

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Kelin Yu, Sheng Zhang, Harshit Soora, Furong Huang, Heng Huang, Pratap Tokekar, Ruohan Gao
https://arxiv.org/abs/2508.11049

GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for tra…

@arXiv_csCV_bot@mastoxiv.page
2025-09-16 12:43:47

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang
https://arxiv.org/abs/2509.12132 …

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: Effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention …

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:21:31

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
Weihang Wang, Xinhao Li, Ziyue Wang, Yan Pang, Jielei Zhang, Peiyi Li, Qiang Zhang, Longwen Gao
https://arxiv.org/abs/2509.13836

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection a…

@arXiv_csCV_bot@mastoxiv.page
2025-07-17 10:27:10

Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Jianhua Xing, Xingjian Li, Tianyang Wang, Ulas Bagci, Min Xu
https://arxiv.org/abs/2507.12441

Describe Anything Model for Visual Question Answering on Text-rich Images
Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense tex…

@arXiv_csCV_bot@mastoxiv.page
2025-09-17 10:52:50

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
https://arxiv.org/abs/2509.13031

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual in…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:33:31

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
https://arxiv.org/abs/2509.15224

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). O…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:22:31

PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Arda Kabadayi, Senem Velipasalar, Jiajing Chen
https://arxiv.org/abs/2509.14985 https://ar…

PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are co…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:23:41

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
Gia Khanh Nguyen, Yifeng Huang, Minh Hoai
https://arxiv.org/abs/2509.13939 htt…

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-reso…

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:20:22

VITA: Vision-to-Action Flow Matching Policy
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
https://arxiv.org/abs/2507.13231

VITA: Vision-to-Action Flow Matching Policy
We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learnin…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:20:40

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
https://arxiv.org/abs/2508.13968 ht…

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite t…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:15:30

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe
https://arxiv.org/abs/2508.13744

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage du…

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:22:32

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
https://arxiv.org/abs/2507.13348

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically proc…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:14:10

HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
https://arxiv.org/abs/2508.13692

HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing m…

@arXiv_csCV_bot@mastoxiv.page
2025-09-19 10:22:11

SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan
https://arxiv.org/abs/2509.14981

SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often face challenges in balancing visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To addre…

@arXiv_csCV_bot@mastoxiv.page
2025-08-19 12:07:20

Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
Tanjim Islam Riju, Shuchismita Anwar, Saman Sarker Joy, Farig Sadeque, Swakkhar Shatabda
https://arxiv.org/abs/2508.13068

Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and cen…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:15:10

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Yiming Cao, Yanjie Li, Kaisheng Liang, Yuni Lai, Bin Xiao
https://arxiv.org/abs/2508.13739

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely ov…

@arXiv_csCV_bot@mastoxiv.page
2025-08-18 09:55:50

Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal
https://arxiv.org/abs/2508.11616

Controlling Multimodal LLMs via Reward-guided Decoding
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Co…
