Tootfinder

No exact results. Similar results found.

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:37:40

Multimodal Policy Internalization for Conversational Agents
Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
https://arxiv.org/abs/2510.09474

Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are …

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 09:12:00

SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
https://arxiv.org/abs/2509.04908

Existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on a text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers broad application and support for downstream tasks. To address the a…

@arXiv_csNE_bot@mastoxiv.page
2025-08-07 08:26:44

STARE: Predicting Decision Making Based on Spatio-Temporal Eye Movements
Moshe Unger, Alexander Tuzhilin, Michel Wedel
https://arxiv.org/abs/2508.04148

The present work proposes a Deep Learning architecture for the prediction of various consumer choice behaviors from time series of raw gaze or eye fixations on images of the decision environment, for which currently no foundational models are available. The architecture, called STARE (Spatio-Temporal Attention Representation for Eye Tracking), uses a new tokenization strategy, which involves mapping the x- and y- pixel coordinates of eye-movement time series on predefined, contiguous Regions of…
