
2025-06-18 08:45:05
Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments
Xuan Wang, Siyuan Liang, Zhe Liu, Yi Yu, Yuliang Lu, Xiaochun Cao, Ee-Chien Chang
https://arxiv.org/abs/2506.13205
FEWSim: A Visual Analytic Framework for Exploring the Nexus of Food-Energy-Water Simulations
Fan Lei, David A. Sampson, Jiayi Hong, Yuxin Ma, Giuseppe Mascaro, Dave White, Rimjhim Agarwal, Ross Maciejewski
https://arxiv.org/abs/2506.14056
#Windows11's new #StickyNotes app (a thinly disguised #OneNote) is fucking annoying. Since it is so-called "smart", and it attempts to provide context around the "source" of your note, its window is…
Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments
Amirreza Payandeh, Anuj Pokhrel, Daeun Song, Marcos Zampieri, Xuesu Xiao
https://arxiv.org/abs/2506.14233
ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal
https://arxiv.org/abs/2506.12198
XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
https://arxiv.org/abs/2506.13782
An Empirical Study of Bugs in Data Visualization Libraries
Weiqi Lu, Yongqiang Tian, Xiaohan Zhong, Haoyang Ma, Zhenyang Xu, Shing-Chi Cheung, Chengnian Sun
https://arxiv.org/abs/2506.15084
A post from the archive 📫:
Find the address of an object in Visual Studio
https://www.poppastring.com/blog/find-the-address-of-an-object-in-visual-studio
Oh hey, it's #ScreenshotSaturday, here's the latest mockup from the video game adaptation of Battle of Tarot I'm working on with some lovely people.
Everything is still a work-in-progress and not final at all, but trying to get the visual style nailed down.
#GameDev
Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos
Riku Takahashi, Ryugo Morita, Jinjia Zhou
https://arxiv.org/abs/2506.13419
Navigating High-Dimensional Backstage: A Guide for Exploring Literature for the Reliable Use of Dimensionality Reduction
Hyeon Jeon, Hyunwook Lee, Yun-Hsin Kuo, Taehyun Yang, Daniel Archambault, Sungahn Ko, Takanori Fujiwara, Kwan-Liu Ma, Jinwook Seo
https://arxiv.org/abs/2506.14820
Apple says Visual Intelligence will now be able to search on-screen content in addition to analyzing real world objects, expanding on camera search (Tim Hardwick/MacRumors)
https://www.macrumors.com/2025/06/09/ios-26-visual-intellige…
GHAR: GeoPose-based Handheld Augmented Reality for Architectural Positioning, Manipulation and Visual Exploration
Sabahat Israr, Dawar Khan, Zhanglin Cheng, Mukhtaj Khan, Kiyoshi Kiyokawa
https://arxiv.org/abs/2506.14414
Series B, Episode 09 - Countdown
VETNOR: All right, come with me, we're searching the next level.
PROVINE: Right, sir.
[Teleport section. Avon and Grant are suited up]
AVON: You adjust the temperature with this [Points to knob on suit] You all set?
https://blake.torpidity.net/m/209/356
What's in the Box? Reasoning about Unseen Objects from Multimodal Cues
Lance Ying, Daniel Xu, Alicia Zhang, Katherine M. Collins, Max H. Siegel, Joshua B. Tenenbaum
https://arxiv.org/abs/2506.14212
GRaD-Nav : Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
Qianzhong Chen, Naixiang Gao, Suning Huang, JunEn Low, Timothy Chen, Jiankai Sun, Mac Schwager
https://arxiv.org/abs/2506.14009
Feature Complementation Architecture for Visual Place Recognition
Weiwei Wang, Meijia Wang, Haoyi Wang, Wenqiang Guo, Jiapan Guo, Changming Sun, Lingkun Ma, Weichuan Zhang
https://arxiv.org/abs/2506.12401
Sigh, Visual Studio Installer has the feature to "rollback" to a previous version with just one click. But only for one step back. (For me, it was from 17.14.5 to 17.14.2.) When you have done that, it doesn't offer any further rollbacks. Irritating. #VisualStudio
Edit: But oh well, I figured out another way to work around my problem.
Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models
Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li
https://arxiv.org/abs/2506.11521
Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse
Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, Jongse Park
https://arxiv.org/abs/2506.14107
The neurological basis for non-visual illustration https://intellectdiscover.com/content/journals/10.1386/jill_00117_7 "this article poses the question of whether there is a theoretical precedent for creating visual imagery in the minds of blind peopl…
Replaced article(s) found for astro-ph.HE. https://arxiv.org/list/astro-ph.HE/new
[1/1]:
- Exploring blazars through sonification. Visual and auditory insights into multifrequency variability
Gustavo Magallanes-Guijón, Sergio Mendoza
ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning
Yunchu Zhang, Shubham Mittal, Zhengyu Zhang, Liyiming Ke, Siddhartha Srinivasa, Abhishek Gupta
https://arxiv.org/abs/2506.13867
Learning From the Past with Cascading Eligibility Traces
Tokiniaina Raharison Ralambomihanta, Ivan Anokhin, Roman Pogodin, Samira Ebrahimi Kahou, Jonathan Cornford, Blake Aaron Richards
https://arxiv.org/abs/2506.14598
Series D, Episode 01 - Rescue
TARRANT: He will if Orac's working. Now come on. We're wasting time. [starts to climb]
[Dayna does not follow, but continues to explore the small room. A hatch slides open in the floor.]
DAYNA: I knew it. Tarrant.
https://blake.torpidity.net/m/401/422
Omnidirectional Video Super-Resolution using Deep Learning
Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter W. Eklund, Sunil Aryal
https://arxiv.org/abs/2506.14803
Visual metrics on boundaries of hyperbolic spaces
Emily Stark
https://arxiv.org/abs/2506.10108 https://arxiv.org/pdf/2506.10108
Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading
Gérôme Meyer, Philip Breuer
https://arxiv.org/abs/2506.12066
Just published 🚀: A Kind of Blue
#visualstudio
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction
Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li
https://arxiv.org/abs/2506.09792
AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg
https://arxiv.org/abs/2506.14198
Breaking the Multi-Enhancement Bottleneck: Domain-Consistent Quality Enhancement for Compressed Images
Qunliang Xing, Mai Xu, Jing Yang, Shengxi Li
https://arxiv.org/abs/2506.14152
VGR: Visual Grounded Reasoning
Jiacong Wang, Zijiang Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
https://arxiv.org/abs/2506.11991
A Study on Speech Assessment with Visual Cues
Shafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain, Hsin-Min Wang, Yu Tsao
https://arxiv.org/abs/2506.09549
However, if I towel dry in the shower before getting out, and I am drying my face and hair, I instantly lose visual reference since the towel covers my eyes. So now I only have two points in space unless I use one hand on the towel, and one hand on a grab bar. (3)
SkinCells: Sparse Skinning using Voronoi Cells
Egor Larionov, Igor Santesteban, Hsiao-yu Chen, Gene Lin, Philipp Herholz, Ryan Goldade, Ladislav Kavan, Doug Roble, Tuur Stuyck
https://arxiv.org/abs/2506.14714
A post from the archive 📫:
Using Visual Studio to search objects in a memory dump
https://www.poppastring.com/blog/using-visual-studio-to-search-objects-in-a-memory-dump
Insights Informed Generative AI for Design: Incorporating Real-world Data for Text-to-Image Output
Richa Gupta, Alexander Htet Kyaw
https://arxiv.org/abs/2506.15008
Visual mental imagery and #aphantasia lesions map onto a convergent brain network https://www.medrxiv.org/content/10.1101/2025.05.23.25328072v1 by @…
Wow. Total surprise for me... this game looks amazing... & free to play for Xbox Game Pass!
✅ Clockwork Revolution: inXile on Time Travel, Visual Reactivity, that Foulmouthed Doll, and More — Exclusive Interview - Xbox Wire
https://news.xbox.com/en-us/2025/06/0…
Binary Mixtures of Intelligent Active Brownian Particles with Visual Perception
Rajendra Singh Negi, Roland G. Winkler, Gerhard Gompper
https://arxiv.org/abs/2506.09698
• 🔧 Extensible architecture via #PHP and #JavaScript plugins with lazy loading capabilities for performance
• 🎨 Visual #CMS functionality enabling live website content editing and drag & dr…
3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu
https://arxiv.org/abs/2506.14642
Characterization of the Visual Binary TOI-6883AB and its dynamical implications for the planetary companion TOI-6883Ab
G. Conzo, F. Campos, F. Conti, I. Sharp
https://arxiv.org/abs/2506.08798
ABC: Adaptive BayesNet Structure Learning for Computational Scalable Multi-task Image Compression
Yufeng Zhang, Wenrui Dai, Hang Yu, Shizhan Liu, Junhui Hou, Jianguo Li, Weiyao Lin
https://arxiv.org/abs/2506.15228
This https://arxiv.org/abs/2505.16933 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…
DesignCoder: Hierarchy-Aware and Self-Correcting UI Code Generation with Large Language Models
Yunnong Chen, Shixian Ding, YingYing Zhang, Wenkai Chen, Jinzhou Du, Lingyun Sun, Liuqing Chen
https://arxiv.org/abs/2506.13663
How Vox built a huge YouTube presence, becoming an incubator that invented its own visual language, as ex-staffers like Johnny Harris grow their own channels (Simon Owens/The Long Story with Simon Owens)
https://thelongstory.substack.com/p/why-the-best-jour…
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
Yujun Wang, Jinhe Bi, Yunpu Ma, Soeren Pirk
https://arxiv.org/abs/2506.14766
#Blakes7 Series B, Episode 02 - Shadow
ZEN: Information. Main visual is available. [displays Space City on screen.]
VILA: So?
ZEN: You expressed a desire to see what it is like.
https://blake.torpidity.net/m/202…
Foundation of Affective Computing and Interaction
Changzeng Fu
https://arxiv.org/abs/2506.15497 https://arxiv.org/pdf/2506.15497
Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics
Yi-Chun Chen
https://arxiv.org/abs/2506.10008 https://…
ViSAGe: Video-to-Spatial Audio Generation
Jaeyeon Kim, Heeseung Yun, Gunhee Kim
https://arxiv.org/abs/2506.12199 https://arxiv.org/pd…
In-Hand Object Pose Estimation via Visual-Tactile Fusion
Felix Nonnengießer, Alap Kshirsagar, Boris Belousov, Jan Peters
https://arxiv.org/abs/2506.10787
People keep making the same mistake, again and again and again and again forever, of thinking that it is syntax that makes software development hard.
Oh honey.
Re this from @mathaetaes:
https://infosec.exchange/@mathaetaes/114656764053846137
(P.S. Visual coding is actually really cool, and IMO an underexplored PL design space — but is very much coding, and very much tricky for the same reasons as any other kind of coding.)
ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma
https://arxiv.org/abs/2506.14315
VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning
Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Muhammad Ziaur Rahman, Shahanur Rahman Bappy, Raiyan Rahman, Swakkhar Shatabda
https://arxiv.org/abs/2506.14629
See What I Mean? CUE: A Cognitive Model of Understanding Explanations
Tobias Labarta, Nhi Hoang, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
https://arxiv.org/abs/2506.14775
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
Yitong Zhang, Jia Li, Liyi Cai, Ge Li
https://arxiv.org/abs/2506.09353
GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation
Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Liangjun Xing, Hongwen Zhang, Yebin Liu
https://arxiv.org/abs/2506.14135
To ChatGPT: Brain implants in visual cortex such as Neuralink Blindsight cannot directly convey visual textures, shading and smooth surfaces, because simultaneously activating many electrodes above phosphene threshold would cause seizures. Any solutions? https://chatgpt.com/share/68469685-71c
A novel visual data-based diagnostic approach for estimation of regime transition in pool boiling
Pranay Nirapure, Ayushman Singh, Srikanth Rangarajan, Bahgat Sammakia
https://arxiv.org/abs/2506.10832
We all keep our balance, our awareness of position in space, by analysing feedback we get from receivers in our skin and joints (proprioceptors) and from our visual assessment of our position in space. I am a below knee #amputee, so I have lost the position sensors in my foot, ankle, and leg. I gauge the location of my leg in space through my knee and secondarily, through my hip. 🧵
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models
Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei
https://arxiv.org/abs/2506.14674
UniDet-D: A Unified Dynamic Spectral Attention Model for Object Detection under Adverse Weathers
Yuantao Wang, Haowei Yang, Wei Zhang, Shijian Lu
https://arxiv.org/abs/2506.12324 …
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, Hao-Wen Dong
https://arxiv.org/abs/2506.12573
Sparse Autoencoders Bridge The Deep Learning Model and The Brain
Ziming Mao, Jia Xu, Zeqi Zheng, Haofang Zheng, Dabing Sheng, Yaochu Jin, Guoyuan Yang
https://arxiv.org/abs/2506.11123
This https://arxiv.org/abs/2310.17451 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csAI_…
Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
https://arxiv.org/abs/2506.06242
Control Architecture and Design for a Multi-robotic Visual Servoing System in Automated Manufacturing Environment
Rongfei Li
https://arxiv.org/abs/2506.11387
Can Sound Replace Vision in LLaVA With Token Substitution?
Ali Vosoughi, Jing Bi, Pinxin Liu, Yunlong Tang, Chenliang Xu
https://arxiv.org/abs/2506.10416 h…
Our body always wants to have three reference points since we live in 3D space. This can be a combination of two position sensors, and vision. Here is an interesting situation. When I shower, and am standing up, I place my normal leg and foot on the floor of the shower, and kneel on a shower stool. Combine that with my visual reference of my position in space, and I am golden. (2)
A post from the archive 📫:
Debug managed Linux core dumps with Visual Studio
https://www.poppastring.com/blog/debug-managed-linux-core-dumps-with-visual-studio
Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid
https://arxiv.org/abs/2506.11261
Series B, Episode 03 - Weapon
COSER: Now, even if Security trace us to this planet, they'll assume the ship crashed and we died in the explosion. Did you hear what I said?
RASHEL: Yes.
COSER: Well?
RASHEL: It's a very clever plan, sir.
https://blake.torpidity.net/m/203/2
Replaced article(s) found for cs.MM. https://arxiv.org/list/cs.MM/new
[1/1]:
Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual ...
TermSight: Making Service Contracts Approachable
Ziheng Huang, Tal August, Hari Sundaram
https://arxiv.org/abs/2506.12332 https://arx…
To ChatGPT: Write a pitch against the use of visual-to-auditory sensory substitution for the blind. https://chatgpt.com/share/684c2ab3-2c68-8004-bf53-c3b401c10064 "Let's stop romanticizing sensory substitution and start prioritizing solutions that actu…
Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge
Zhaoyang Li, Haodong Zhou, Longjie Luo, Xiaoxiao Li, Yongxin Chen, Lin Li, Qingyang Hong
https://arxiv.org/abs/2506.02621
Innovative Adaptive Imaged Based Visual Servoing Control of 6 DoFs Industrial Robot Manipulators
Rongfei Li, Francis Assadian
https://arxiv.org/abs/2506.10240
Stop Misusing t-SNE and UMAP for Visual Analytics
Hyeon Jeon, Jeongin Park, Sungbok Shin, Jinwook Seo
https://arxiv.org/abs/2506.08725
Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung
https://arxiv.org/abs/2506.12481
A Novel Feedforward Youla Parameterization Method for Avoiding Local Minima in Stereo Image Based Visual Servoing Control
Rongfei Li, Francis Assadian
https://arxiv.org/abs/2506.10252
Object knowledge representation in the human visual cortex requires a connection with the language system https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003161 "Our experiments reveal the contribution of the vision-la…
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo
https://arxiv.org/abs/2506.00358
This https://arxiv.org/abs/2505.18675 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
Hotel stays of individuals with a visual impairment: a qualitative study with a focus on sensory substitution https://www.tandfonline.com/doi/full/10.1080/17483107.2025.2511982 "Sensory substitution devices (SSDs) hold the potential to aid individuals …
This https://arxiv.org/abs/2505.18700 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
This https://arxiv.org/abs/2506.03589 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
This https://arxiv.org/abs/2505.03448 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csRO_…
SEMNAV: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation
Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre
https://arxiv.org/abs/2506.01418
This https://arxiv.org/abs/2505.19028 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
GenIR: Generative Visual Feedback for Mental Image Retrieval
Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
https://arxiv.org/abs/2506.06220
This https://arxiv.org/abs/2505.21036 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
This https://arxiv.org/abs/2505.18668 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
This https://arxiv.org/abs/2505.17132 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
CoMemo: LVLMs Need Image Context with Image Memory
Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
https://arxiv.org/abs/2506.06279 https://…