Tootfinder

@arXiv_csIR_bot@mastoxiv.page
2025-09-16 09:24:57

ReFineG: Synergizing Small Supervised Models and LLMs for Low-Resource Grounded Multimodal NER
Jielong Tang, Shuang Wang, Zhenxing Wang, Jianxing Yu, Jian Yin
https://arxiv.org/abs/2509.10975

Grounded Multimodal Named Entity Recognition (GMNER) extends traditional NER by jointly detecting textual mentions and grounding them to visual regions. While existing supervised methods achieve strong performance, they rely on costly multimodal annotations and often underperform in low-resource domains. Multimodal Large Language Models (MLLMs) show strong generalization but suffer from Domain Knowledge Conflict, producing redundant or incorrect mentions for domain-specific entities. To address…

@arXiv_csCV_bot@mastoxiv.page
2025-10-06 10:04:49

Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu
https://arxiv.org/abs/2510.02912

Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semanticall…

@arXiv_csCY_bot@mastoxiv.page
2025-09-09 10:30:02

Stack Overflow Is Not Dead Yet: Crowd Answers Still Matter
Denis Helic, Tiago Santos
https://arxiv.org/abs/2509.05879 https://arxiv.org/pdf/2509.05879

Millions of users visit Stack Overflow regularly to ask the community for answers to their programming questions. However, like many other platforms, Stack Overflow consistently struggles with low user retention and declining levels of user contributions to the platform. With the introduction of ChatGPT in November 2022, these ongoing difficulties on Stack Overflow were further magnified, as many users moved toward ChatGPT for programming help. In this paper, we build upon recent research on this p…

@arXiv_csCV_bot@mastoxiv.page
2025-08-11 10:15:29

Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen, Jiahui Liu, Ruifan Di, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W. T. Fok, Xiaojuan Qi, Yik-Chung Wu
https://arxiv.org/abs/2508.06350

Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and g…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:32:21

A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation
Ye Shen, Junying Wang, Farong Wen, Yijin Guo, Qi Jia, Zicheng Zhang, Guangtao Zhai
https://arxiv.org/abs/2509.14886

The rapid progress of Multi-Modal Large Language Models (MLLMs) has spurred the creation of numerous benchmarks. However, conventional full-coverage Question-Answering evaluations suffer from high redundancy and low efficiency. Inspired by human interview processes, we propose a multi-to-one interview paradigm for efficient MLLM evaluation. Our framework consists of (i) a two-stage interview strategy with pre-interview and formal interview phases, (ii) dynamic adjustment of interviewer weights …

@arXiv_csCV_bot@mastoxiv.page
2025-08-06 10:32:10

Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong (all The Hong Kong University of Science and Technology)

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redunda…
