Tootfinder

Opt-in global Mastodon full text search. Join the index!

@arXiv_csLG_bot@mastoxiv.page
2025-09-12 09:33:49

"A 6 or a 9?": Ensemble Learning Through the Multiplicity of Performant Models and Explanations
Gianlucca Zuin, Adriano Veloso
arxiv.org/abs/2509.09073

@arXiv_csAI_bot@mastoxiv.page
2025-09-12 09:40:09

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
arxiv.org/abs/2509.09677

@arXiv_csCV_bot@mastoxiv.page
2025-08-12 12:47:43

OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei
arxiv.org/abs/2508.08227

@tiotasram@kolektiva.social
2025-08-04 15:49:00

Should we teach vibe coding? Here's why not.
Should AI coding be taught in undergrad CS education?
1/2
I teach undergraduate computer science labs, including for intro and more-advanced core courses. I don't publish (non-negligible) scholarly work in the area, but I've got years of craft expertise in course design, and I do follow the academic literature to some degree. In other words, I'm not the world's leading expert, but I have spent a lot of time thinking about course design, and consider myself competent at it, with plenty of direct experience in what knowledge & skills I can expect from students as they move through the curriculum.
I'm also strongly against most uses of what's called "AI" these days (specifically, generative deep neural networks as supplied by our current cadre of techbros). There are a surprising number of completely orthogonal reasons to oppose the use of these systems, and a very limited number of reasonable exceptions (overcoming accessibility barriers is an example). On the grounds of environmental and digital-commons-pollution costs alone, using specifically the largest/newest models is unethical in most cases.
But as any good teacher should, I constantly question these evaluations, because I worry about the impact on my students should I eschew teaching relevant tech for bad reasons (and even for good reasons). I also want to make my reasoning clear to students, who should absolutely question me on this. That inspired me to ask a simple question: ignoring for one moment the ethical objections (which we shouldn't, of course; they're very stark), at what level in the CS major could I expect to teach a course about programming with AI assistance, and expect students to succeed at a more technically demanding final project than a course at the same level where students were banned from using AI? In other words, at what level would I expect students to actually benefit from AI coding "assistance?"
To be clear, I'm assuming that students aren't using AI in other aspects of coursework: the topic of using AI to "help you study" is a separate one (TL;DR its gross value is not negative, but it's mostly not worth the harm to your metacognitive abilities, which AI-induced changes to the digital commons are making more important than ever).
So what's my answer to this question?
If I'm being incredibly optimistic, senior year. Slightly less optimistic, second year of a masters program. Realistic? Maybe never.
The interesting bit for you-the-reader is: why is this my answer? (Especially given that students would probably self-report significant gains at lower levels.) To start with, this paper, in which experienced developers thought that AI assistance sped up their work on real tasks when it had in fact slowed it down (arxiv.org/abs/2507.09089), is informative. There are a lot of differences in task between experienced devs solving real bugs and students working on a class project, but it's important to understand that we shouldn't have a baseline expectation that AI coding "assistants" will speed things up in the best of circumstances, and we shouldn't trust self-reports of productivity (or the AI hype machine in general).
Now we might imagine that coding assistants will be better at helping with a student project than at fixing bugs in open-source software, since it's a much easier task. For many programming assignments that have a fixed answer, we know that many AI assistants can just spit out a solution when prompted with the problem description (there's another elephant in the room here to do with learning outcomes regardless of project success, but we'll ignore that one too; my focus here is on how complex a project students can reach, not on learning outcomes). My question is about more open-ended projects, not assignments with an expected answer. Here's a second study (by one of my colleagues) about novices using AI assistance for programming tasks. It showcases how difficult it is to use AI tools well, and some of the stumbling blocks that novices in particular face.
But what about intermediate students? Might there be some level where the AI is helpful because the task is still relatively simple and the students are good enough to handle it? The problem is that as task complexity increases, so does the likelihood of the AI generating (or copying) code that uses more complex constructs which a student doesn't understand. Let's say I have second-year students writing interactive websites with JavaScript. Without a lot of careful prompting, which those students don't know how to do, the AI is likely to suggest code that depends on several different frameworks, from React to jQuery, without actually setting up or including those frameworks, and of course these students would be way out of their depth trying to do that themselves. This is a general problem: each programming class carefully limits the specific code constructs and frameworks it expects students to know based on the material it covers. There is no feasible way to limit an AI assistant to a fixed set of constructs or frameworks using current designs. There are alternate designs where this would be possible (like AI search through adaptation of a controlled library of snippets), but those would be entirely different tools.
So what happens on a sizeable class project where the AI has dropped in buggy code, especially if it uses constructs the students don't understand? Best case, they recognize that they don't understand it and re-prompt, or quickly ask an instructor or TA for help getting rid of the stuff they don't understand and re-prompting, or manually adding stuff they do understand. Average case: they waste several hours and/or sweep the bugs partly under the rug, resulting in a project with significant defects. Students in their second and even third years of a CS major still have a lot to learn about debugging, and usually have significant gaps in their knowledge of even their most comfortable programming language. I do think that regardless of AI we as teachers need to get better at teaching debugging skills, but the knowledge gaps are inevitable because there's just too much to know. In Python, for example, the LLM is going to spit out yields, async functions, try/finally, maybe even something like a while/else, or, with recent training data, the walrus operator. I can't expect even a fraction of 3rd-year students who have worked with Python since their first year to know all of these things, and based on how students approach projects where they have studied all the relevant constructs but have forgotten some, I'm not optimistic that seeing these things will magically become learning opportunities. Student projects are better off working within a limited subset of a full programming language that the students have actually learned, and using AI coding assistants as currently designed makes this impossible. Beyond that, even when the "assistant" just introduces bugs using syntax the students understand, even through their 4th year many students struggle to understand the operation of moderately complex code they've written themselves, let alone code written by someone else. Having access to an AI that will confidently offer incorrect explanations for bugs will make this worse.
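To make that concrete, here's the kind of code an assistant might hand a second- or third-year student for a simple word-counting task. This is a fabricated illustration, not output from any particular model, and the file name is made up; it works, but it stacks async generators, the walrus operator, and try/finally onto a task that needs none of them:

import asyncio  # async machinery a core course may never have covered

async def read_lines(path):
    # an async generator: two unfamiliar features stacked on one def
    with open(path) as f:
        for line in f:
            yield line.strip()
            await asyncio.sleep(0)  # pointless here, but plausible assistant output

async def count_words(path):
    counts = {}
    async for line in read_lines(path):
        # walrus operator nested inside the condition
        if (words := line.split()):
            for w in words:
                counts[w] = counts.get(w, 0) + 1
    return counts

if __name__ == "__main__":
    try:
        print(asyncio.run(count_words("app.log")))  # "app.log" is hypothetical
    finally:
        print("done")  # try/finally where a plain call would do

A student who has only seen loops, lists, and plain functions has four or five new language features to untangle here before they can even start debugging.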
To be sure, a small minority of students will be able to overcome these problems, but that minority is the group that has a good grasp of the fundamentals and has broadened their knowledge through self-study, which earlier AI-reliant classes would make less likely to happen. In any case, I care about the average student, since we already have plenty of stuff about our institutions that makes life easier for a favored few while being worse for the average student (note that our construction of that favored few as the "good" students is a large part of this problem).
To summarize: because AI assistants introduce excess code complexity and difficult-to-debug bugs, they'll slow down rather than speed up project progress for the average student on moderately complex projects. On a fixed deadline, they'll result in worse projects, or necessitate less ambitious project scoping to ensure adequate completion, and I expect this remains broadly true through 4-6 years of study in most programs (don't take this as an endorsement of AI "assistants" for masters students; we've ignored a lot of other problems along the way).
There's a related problem: solving open-ended project assignments well ultimately depends on deeply understanding the problem, and AI "assistants" allow students to put a lot of code in their file without spending much time thinking about the problem or building an understanding of it. This is awful for learning outcomes, but also bad for project success. Getting students to see the value of thinking deeply about a problem is a thorny pedagogical puzzle at the best of times, and allowing the use of AI "assistants" makes the problem much, much worse. This is another area I hope to see (or even drive) pedagogical improvement in, for what it's worth.
1/2

@arXiv_csCY_bot@mastoxiv.page
2025-08-13 09:23:12

When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital
Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
arxiv.org/abs/2508.085…

@UP8@mastodon.social
2025-09-11 16:31:52

🪛 Tricking LLM-Based NPCs into Spilling Secrets
#llm #ai

Image showing a graph of components; on the left a player with arrows pointing to and from a box labeled "game world"; the top link says "Prompt Injection" and has a picture with a bomb next to a UI box and "Revealed or Not?", which has chat bubbles with ... and a lock, one with a red X on it and the other without. An NPC character is in the game world together with an identical chat bubble, with the other two, that says NPC's secret; below that is an icon for a command line prompt that says "S…

@arXiv_csRO_bot@mastoxiv.page
2025-06-13 08:29:30

Are We Generalizing from the Exception? An In-the-Wild Study on Group-Sensitive Conversation Design in Human-Agent Interactions
Ana M\"uller, Sabina Jeschke, Anja Richert
arxiv.org/abs/2506.10462

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:40:39

Reading Between the Lines: Classifying Resume Seniority with Large Language Models
Matan Cohen, Shira Shani, Eden Menahem, Yehudit Aperstein, Alexander Apartsin
arxiv.org/abs/2509.09229

@arXiv_csSE_bot@mastoxiv.page
2025-06-13 08:09:30

MLLM-Based UI2Code Automation Guided by UI Layout Information
Fan Wu, Cuiyun Gao, Shuqing Li, Xin-Cheng Wen, Qing Liao
arxiv.org/abs/2506.10376

@arXiv_csSI_bot@mastoxiv.page
2025-09-12 08:39:19

The Role of Community Detection Methods in Performance Variations of Graph Mining Tasks
Shrabani Ghosh, Erik Saule
arxiv.org/abs/2509.09045

@arXiv_csHC_bot@mastoxiv.page
2025-08-12 09:15:12

AdjustAR: AI-Driven In-Situ Adjustment of Site-Specific Augmented Reality Content
Nels Numan, Jessica Van Brummelen, Ziwen Lu, Anthony Steed
arxiv.org/abs/2508.06826

@arXiv_csIR_bot@mastoxiv.page
2025-08-12 10:13:13

Careful Queries, Credible Results: Teaching RAG Models Advanced Web Search Tools with Reinforcement Learning
Yuqin Dai, Shuo Yang, Guoqing Wang, Yong Deng, Zhanwei Zhang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Changhua Meng, Can Yi, Yuchen Zhou, Weiqiang Wang, Shuai Lu
arxiv.org/abs/2508.07956

@arXiv_mathOC_bot@mastoxiv.page
2025-08-13 09:46:52

Optimization-Free Fast Optimal Control: Bang-Ride Property, Monotonicity, and Applications to Fast Battery Charging
Shengling Shi, Jacob Sass, Jiaen Wu, Minsu Kim, Yingjie Ma, Sungho Shin, Richard D. Braatz
arxiv.org/abs/2508.09010

@arXiv_csRO_bot@mastoxiv.page
2025-08-12 11:36:53

Grasp-HGN: Grasping the Unexpected
Mehrshad Zandigohar, Mallesham Dasari, Gunar Schirner
arxiv.org/abs/2508.07648 arxiv.org/pdf/2508.07648

@arXiv_csCV_bot@mastoxiv.page
2025-09-12 10:15:19

Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
arxiv.org/abs/2509.09658

@arXiv_csCL_bot@mastoxiv.page
2025-08-13 10:15:32

MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
arxiv.org/abs/2508.09057

@arXiv_eessIV_bot@mastoxiv.page
2025-09-12 08:06:29

Generalized User-Oriented Image Semantic Coding Empowered by Large Vision-Language Model
Sin-Yu Huang, Vincent W. S. Wong
arxiv.org/abs/2509.08913

@arXiv_csCR_bot@mastoxiv.page
2025-08-11 09:47:09

Secure and Scalable Blockchain Voting: A Comparative Framework and the Role of Large Language Models
Kiana Kiashemshaki, Elvis Nnaemeka Chukwuani, Mohammad Jalili Torkamani, Negin Mahmoudi
arxiv.org/abs/2508.05865

@muz4now@mastodon.world
2025-08-08 14:38:03

Check how your mix sounds through multiple popular headphone models with Kali Audio’s new HP-1 Multi-Reference Headphones
#MusicTech #MusicianTips

@Techmeme@techhub.social
2025-08-07 17:19:27

OpenAI says GPT-5 is its first "unified" AI model and combines the reasoning abilities of its o-series of models with the fast responses of its GPT series (Maxwell Zeff/TechCrunch)
techcrunch.com/2025/08/07/open

@arXiv_csGT_bot@mastoxiv.page
2025-09-10 07:34:51

Persuading Agents in Opinion Formation Games
Martin Hoefer, Tim Koglin, Tolga Tel
arxiv.org/abs/2509.07520 arxiv.org/pdf/2509.07520

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:41:09

Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi, Mohit Gupta, Rahul Bouri
arxiv.org/abs/2509.09234 arxiv.org/pdf/2509.09…

@arXiv_csCV_bot@mastoxiv.page
2025-09-12 10:13:19

Improving Human Motion Plausibility with Body Momentum
Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao
arxiv.org/abs/2509.09496 arxiv.org/pdf/…

@arXiv_quantph_bot@mastoxiv.page
2025-09-04 10:09:41

Identifiability and minimality bounds of quantum and post-quantum models of classical stochastic processes
Paul M. Riechers, Thomas J. Elliott
arxiv.org/abs/2509.03004

@arXiv_csHC_bot@mastoxiv.page
2025-08-13 08:38:12

AirSignatureDB: Exploring In-Air Signature Biometrics in the Wild and its Privacy Concerns
Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia, Andres Huergo, Julian Fierrez
arxiv.org/abs/2508.08502

@arXiv_astrophHE_bot@mastoxiv.page
2025-09-08 08:54:00

Latest results from the searches for ultra-high-energy photons at the Pierre Auger Observatory
Pierpaolo Savina (for the Pierre Auger Collaboration)
arxiv.org/abs/2509.05113

@tiotasram@kolektiva.social
2025-07-30 17:56:35

Just read this post by @… on an optimistic AGI future, and while it had some interesting and worthwhile ideas, it's also in my opinion dangerously misguided, and plays into the current AGI hype in a harmful way.
social.coop/@eloquence/1149406
My criticisms include:
- Current LLM technology has many layers, but the biggest, most capable models are all tied to corporate datacenters and require inordinate amounts of energy and water to run. Trying to use these tools to bring about a post-scarcity economy will burn up the planet. We urgently need more-capable but also vastly more efficient AI technologies if we want to use AI for a post-scarcity economy, and we are *not* nearly on the verge of this despite what the big companies pushing LLMs want us to think.
- I can see that permacommons.org claims that a small level of expenditure on AI equates to low climate impact. However, given the deep subsidies currently in place by the big companies to attract users, that isn't a great assumption. The fact that their FAQ dodges the question about which AI systems they use isn't a great look.
- These systems are not free in the same way that Wikipedia or open-source software is. To run your own model you need a data harvesting & cleaning operation that costs millions of dollars minimum, and then you need millions of dollars worth of storage & compute to train & host the models. Right now, big corporations are trying to compete for market share by heavily subsidizing these things, but if you go along with that, you become dependent on them, and you'll be screwed when they jack up the price to a profitable level later. I'd love to see open dataset initiatives and the like, and there are some of these things, but not enough yet, and many of the initiatives focus on one problem while ignoring others (fine for research but not the basis for a society yet).
- Between the environmental impacts, the horrible labor conditions and undercompensation of data workers who filter the big datasets, and the impacts of both AI scrapers and AI commons pollution, the developers of the most popular & effective LLMs have a lot to answer for. This project only really mentions environmental impacts, which makes me think that they're not serious about ethics, which in turn makes me distrustful of the whole enterprise.
- Their language also ends up encouraging AI use broadly while totally ignoring several entire classes of harm, so they're effectively contributing to AI hype, especially with such casual talk of AGI and robotics as if embodied AGI were just around the corner. To be clear about this point: we are several breakthroughs away from AGI under the most optimistic assumptions, and giving the impression that those will happen soon plays directly into the hands of the Sam Altmans of the world who are trying to make money off the impression of impending huge advances in AI capabilities. Adding to the AI hype is irresponsible.
- I've got a more philosophical criticism that I'll post about separately.
I do think that the idea of using AI & other software tools, possibly along with robotics and funded by many local cooperatives, in order to make businesses obsolete before they can do the same to all workers, is a good one. Get your local library to buy a knitting machine alongside their 3D printer.
Lately I've felt too busy criticizing AI to really sit down and think about what I do want the future to look like, even though I'm a big proponent of positive visions for the future as a force multiplier for criticism, and this article is inspiring to me in that regard, even if the specific project doesn't seem like a good one.

@arXiv_csCV_bot@mastoxiv.page
2025-09-12 10:15:59

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
arxiv.org/abs/2509.09676

@primonatura@mstdn.social
2025-09-06 11:00:45

"Five forecasts early climate models got right—the evidence is all around you"
#Climate #ClimateChange

@arXiv_csCY_bot@mastoxiv.page
2025-09-08 07:46:39

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
Danielle Ensign, Henry Sleight, Kyle Fish
arxiv.org/abs/2509.04781

@arXiv_csRO_bot@mastoxiv.page
2025-08-11 09:37:49

Bounding Distributional Shifts in World Modeling through Novelty Detection
Eric Jing, Abdeslam Boularias
arxiv.org/abs/2508.06096 arxiv.org…

@arXiv_eessSY_bot@mastoxiv.page
2025-08-07 08:41:54

Case Studies of Generative Machine Learning Models for Dynamical Systems
Nachiket U. Bapat, Randy C. Paffenroth, Raghvendra V. Cowlagi
arxiv.org/abs/2508.04459

@arXiv_csIR_bot@mastoxiv.page
2025-08-11 09:27:29

A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, Weinan Zhang
arxiv.org/abs/2508.05668

@arXiv_csSD_bot@mastoxiv.page
2025-07-08 08:42:40

Robust Localization of Partially Fake Speech: Metrics, Models, and Out-of-Domain Evaluation
Hieu-Thi Luong, Inbal Rimons, Haim Permuter, Kong Aik Lee, Eng Siong Chng
arxiv.org/abs/2507.03468

@denmanrooke@social.coop
2025-08-01 19:51:10

Studio Rúcach @… in collaboration with CREW are excited to announce a two-part masterclass exploring co-operative models for game dev.
Part 1: Navigating Co-Op Mode (Online)
Hear from co-ops creating equitable workplaces in the games industry.

Co-Create: Co-operative Business Models for the Games Sector. This 2-part masterclass introduces participants to the innovative potential of co-operative business models.

Part 1: Navigating Co-Op Mode. Hear from international game makers using co-op models to build sustainable, self-directed studios. Get real-world insights on how co-ownership and self-management can help build sustainable and equitable work environments.

27th of August 2025 15:00-16:30 (GMT) Online via Zoom.

Funded by Galwa…

@arXiv_csAR_bot@mastoxiv.page
2025-09-09 07:31:41

High Utilization Energy-Aware Real-Time Inference Deep Convolutional Neural Network Accelerator
Kuan-Ting Lin, Ching-Te Chiu, Jheng-Yi Chang, Shi-Zong Huang, Yu-Ting Li
arxiv.org/abs/2509.05688

@arXiv_csLG_bot@mastoxiv.page
2025-09-10 10:42:51

One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li
arxiv.org/abs/2509.07945

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 07:39:39

Language-Driven Hierarchical Task Structures as Explicit World Models for Multi-Agent Learning
Brennen Hill
arxiv.org/abs/2509.04731 arxiv.…

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:50:19

GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
Zhaohan Zhang, Ziquan Liu, Ioannis Patras
arxiv.org/abs/2509.09438

@arXiv_physicssocph_bot@mastoxiv.page
2025-09-10 09:01:41

From lines to networks
Marc Barthelemy
arxiv.org/abs/2509.07951 arxiv.org/pdf/2509.07951

@pbloem@sigmoid.social
2025-06-26 10:56:22

After training, we finetune on real-world data. We observe that the models that have been pre-trained with noise converge very quickly compared to a baseline which is trained from scratch.
Moreover, on the other datasets, the UP models retain their zero-shot performance during finetuning. This suggests that there may be a generalization benefit to using a UP model.
All this is at the expense of much longer training, but that cost can be amortized over many tasks.
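For readers who want the shape of the comparison, here's a minimal sketch (a toy stand-in with made-up data and a made-up model, not the actual UP training code): finetune two identically initialized models on the same data, one fresh and one that has already been through a prior training pass, and compare the loss curves.

import torch
import torch.nn as nn

def make_model():
    torch.manual_seed(0)  # identical initialization for a fair comparison
    return nn.Sequential(nn.Embedding(256, 64), nn.Flatten(),
                         nn.Linear(64 * 32, 256))

def train(model, data, targets, steps=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(data), targets)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# made-up byte sequences standing in for the real finetuning set
data = torch.randint(0, 256, (128, 32))
targets = torch.randint(0, 256, (128,))

scratch = make_model()
pretrained = make_model()
# stand-in for pre-training on noise: any prior training pass on other data
train(pretrained, torch.randint(0, 256, (128, 32)), torch.randint(0, 256, (128,)))

print("scratch:   ", train(scratch, data, targets)[:5])
print("pretrained:", train(pretrained, data, targets)[:5])  # faster initial drop expected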

The results for the finetuning experiment. Six datasets (linux, code, dyck, wp, german and ndfa) and the performance of four models: the baseline and UP trained models and two finetuning datasets. 

The results show that the UP models converge quicker, and that they retain most of their zero-shot performance on the other datasets.

@arXiv_eessAS_bot@mastoxiv.page
2025-08-13 08:43:32

LPGNet: A Lightweight Network with Parallel Attention and Gated Fusion for Multimodal Emotion Recognition
Zhining He, Yang Xiao
arxiv.org/abs/2508.08925

@arXiv_csGR_bot@mastoxiv.page
2025-09-09 09:44:02

Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
Nithin Gopalakrishnan Nair, Srinivas Kaza, Xuan Luo, Vishal M. Patel, Stephen Lombardi, Jungyeon Park
arxiv.org/abs/2509.06950

@arXiv_mathNA_bot@mastoxiv.page
2025-07-08 11:52:41

When do World Models Successfully Learn Dynamical Systems?
Edmund Ross, Claudia Drygala, Leonhard Schwarz, Samir Kaiser, Francesca di Mare, Tobias Breiten, Hanno Gottschalk
arxiv.org/abs/2507.04898

@ErikJonker@mastodon.social
2025-07-05 07:35:51

More time should be devoted to the (near-)future business models of AI and how they collect data/content. Just trying to prevent AI models from scraping data will be futile.
blog.cloudflare.com/content-in

@arXiv_statME_bot@mastoxiv.page
2025-08-08 08:24:32

Goodness-of-fit test for multi-layer stochastic block models
Huan Qing
arxiv.org/abs/2508.04957 arxiv.org/pdf/2508.04957

@arXiv_csCR_bot@mastoxiv.page
2025-07-10 08:53:21

Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World
Vinu Sankar Sadasivan, Soheil Feizi, Rajiv Mathews, Lun Wang
arxiv.org/abs/2507.06256

@arXiv_csIT_bot@mastoxiv.page
2025-08-05 08:51:00

Robust Detection of Planted Subgraphs in Semi-Random Models
Dor Elimelech, Wasim Huleihel
arxiv.org/abs/2508.02158 arxiv.org/pdf/2508.02158…

@arXiv_csSI_bot@mastoxiv.page
2025-07-11 07:36:51

Scalable Signed Exponential Random Graph Models under Local Dependence
Marc Schalberger, Cornelius Fritz
arxiv.org/abs/2507.07660

@arXiv_csSE_bot@mastoxiv.page
2025-08-08 08:25:52

Generative AI for Object-Oriented Programming: Writing the Right Code and Reasoning the Right Logic
Gang Xu, Airong Wang, Yushan Pan
arxiv.org/abs/2508.05005

@arXiv_csRO_bot@mastoxiv.page
2025-08-12 12:01:03

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen
arxiv.org/abs/2508.08240

@arXiv_csCV_bot@mastoxiv.page
2025-07-11 10:19:21

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
arxiv.org/abs/2507.07982

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 09:07:01

LLM Analysis of 150 years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Aida Kostikova, Ole P\"utz, Steffen Eger, Olga Sabelfeld, Benjamin Paassen
arxiv.org/abs/2509.07274

@arXiv_csLG_bot@mastoxiv.page
2025-09-11 10:13:03

Signal Fidelity Index-Aware Calibration for Dementia Predictions Across Heterogeneous Real-World Data
Jingya Cheng, Jiazi Tian, Federica Spoto, Alaleh Azhir, Daniel Mork, Hossein Estiri
arxiv.org/abs/2509.08679

@arXiv_csAI_bot@mastoxiv.page
2025-09-11 07:39:43

TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making
Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, Xiu Li
arxiv.org/abs/2509.08500

@arXiv_csCV_bot@mastoxiv.page
2025-09-12 10:11:29

Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra
arxiv.org/abs/2509.09397

@arXiv_physicssocph_bot@mastoxiv.page
2025-07-10 08:12:51

Modeling Multistability and Hysteresis in Urban Traffic Dynamics
Jung-Hoon Jung, Young-Ho Eom
arxiv.org/abs/2507.06659

@arXiv_csIR_bot@mastoxiv.page
2025-08-11 08:36:09

AI Guided Accelerator For Search Experience
Jayanth Yetukuri, Mehran Elyasi, Samarth Agrawal, Aritra Mandal, Rui Kong, Harish Vempati, Ishita Khan
arxiv.org/abs/2508.05649

@ErikJonker@mastodon.social
2025-06-24 13:25:24

"Israel and America’s war with Iran is not just a strategic throw of the dice. It is a fundamental shift in how power operates in a multipolar world. Though a transition toward new models of collective security remains theoretically possible, as the foundations of US global hegemony wither under Donald Trump the current trajectory favours entropy over order. The danger is not merely in acts of war, it is also in the chaos they leave behind."

@arXiv_csGR_bot@mastoxiv.page
2025-07-10 07:55:31

3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber
arxiv.org/abs/2507.…

@arXiv_csSE_bot@mastoxiv.page
2025-08-06 09:21:50

MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation
Haiyang Li
arxiv.org/abs/2508.02998 ar…

@arXiv_csCV_bot@mastoxiv.page
2025-07-11 10:19:11

Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, Yao Zhao, Marco Pavone, Yunchao Wei
arxiv.org/abs/2507.07978

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 07:56:50

What-If Analysis of Large Language Models: Explore the Game World Using Proactive Thinking
Yuan Sui, Yanming Zhang, Yi Liao, Yu Gu, Guohua Tang, Zhongqian Sun, Wei Yang, Bryan Hooi
arxiv.org/abs/2509.04791

@arXiv_csRO_bot@mastoxiv.page
2025-07-09 07:36:02

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
TRI LBM Team, Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, Naveen Kuppuswamy, Kuan-Hui Lee, Katherine Liu, Dale McConachie, Ian McMahon, Haruki Nishimura, Calder Phillips-Grafflin, Charles Richter, Paarth Shah, Krishnan Srinivasan, Blake Wulfe, Chen Xu, Mengchao Zhang, Alex Alspach, Maya …

@arXiv_csCR_bot@mastoxiv.page
2025-07-09 07:41:32

A Systematization of Security Vulnerabilities in Computer Use Agents
Daniel Jones, Giorgio Severi, Martin Pouliot, Gary Lopez, Joris de Gruyter, Santiago Zanella-Beguelin, Justin Song, Blake Bullwinkel, Pamela Cortez, Amanda Minnich
arxiv.org/abs/2507.05445

@arXiv_csCV_bot@mastoxiv.page
2025-09-11 10:07:33

BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei
arxiv.org/abs/2509.08715

@arXiv_csLG_bot@mastoxiv.page
2025-07-09 10:03:32

Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search
Sanaz Kazemi Abharian, Sai Manoj Pudukotai Dinakarrao
arxiv.org/abs/2507.05531

@arXiv_csCL_bot@mastoxiv.page
2025-07-11 10:03:21

DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher
arxiv.org/abs/2507.07870

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 09:39:00

LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Hao Jiang, Kang Chen, Shuang Qiu
arxiv.org/abs/2509.05263

@arXiv_csCY_bot@mastoxiv.page
2025-07-08 09:31:20

MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu
arxiv.org/abs/2507.03162

@arXiv_csGR_bot@mastoxiv.page
2025-08-08 08:28:22

A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding
Mahmoud Chick Zaouali, Todd Charter, Yehor Karpichev, Brandon Haworth, Homayoun Najjaran
arxiv.org/abs/2508.05064

@arXiv_csCV_bot@mastoxiv.page
2025-09-11 08:06:23

Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change
Lata Pangtey, Omkar Kabde, Shahid Shafi Dar, Nagendra Kumar
arxiv.org/abs/2509.08024

@arXiv_csSE_bot@mastoxiv.page
2025-09-09 11:08:52

Empirical Study of Code Large Language Models for Binary Security Patch Detection
Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, Zongjie Li
arxiv.org/abs/2509.06052

@arXiv_csRO_bot@mastoxiv.page
2025-08-07 09:00:53

Incorporating Stochastic Models of Controller Behavior into Kinodynamic Efficiently Adaptive State Lattices for Mobile Robot Motion Planning in Off-Road Environments
Eric R. Damm, Eli S. Lancaster, Felix A. Sanchez, Kiana Bronder, Jason M. Gregory, Thomas M. Howard
arxiv.org/abs/2508.04384

@arXiv_csIR_bot@mastoxiv.page
2025-08-07 08:39:34

Enhancing Serendipity Recommendation System by Constructing Dynamic User Knowledge Graphs with Large Language Models
Qian Yong, Yanhui Li, Jialiang Shi, Yaguang Dou, Tian Qi
arxiv.org/abs/2508.04032

@arXiv_csAI_bot@mastoxiv.page
2025-09-08 07:36:09

An Approach to Grounding AI Model Evaluations in Human-derived Criteria
Sasha Mitts
arxiv.org/abs/2509.04676 arxiv.org/pdf/2509.04676

@arXiv_csIR_bot@mastoxiv.page
2025-07-08 09:06:50

Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations
Andrea Forster, Simone Kopeinik, Denic Helic, Stefan Thalmann, Dominik Kowald
arxiv.org/abs/2507.03503

@arXiv_csLG_bot@mastoxiv.page
2025-09-04 10:26:01

Rashomon in the Streets: Explanation Ambiguity in Scene Understanding
Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb, Nadjib Lazaar, Nassim Belmecheri
arxiv.org/abs/2509.03169

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:19:41

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
arxiv.org/abs/2509.07553

@arXiv_csAI_bot@mastoxiv.page
2025-09-05 09:55:31

World Model Implanting for Test-time Adaptation of Embodied Agents
Minjong Yoo, Jinwoo Jang, Sihyung Yoon, Honguk Woo
arxiv.org/abs/2509.03956

@arXiv_csCV_bot@mastoxiv.page
2025-09-10 10:45:41

One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation
Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, Hao Zhao
arxiv.org/abs/2509.07978

@arXiv_csSE_bot@mastoxiv.page
2025-08-07 09:03:43

Experimental Analysis of Productive Interaction Strategy with ChatGPT: User Study on Function and Project-level Code Generation Tasks
Sangwon Hyun, Hyunjun Kim, Jinhyuk Jang, Hyojin Choi, M. Ali Babar
arxiv.org/abs/2508.04125

@arXiv_csIR_bot@mastoxiv.page
2025-07-08 11:39:10

Harnessing Pairwise Ranking Prompting Through Sample-Efficient Ranking Distillation
Junru Wu, Le Yan, Zhen Qin, Honglei Zhuang, Paul Suganthan G. C., Tianqi Liu, Zhe Dong, Xuanhui Wang, Harrie Oosterhuis
arxiv.org/abs/2507.04820

@arXiv_csRO_bot@mastoxiv.page
2025-09-09 10:58:22

Learning in ImaginationLand: Omnidirectional Policies through 3D Generative Models (OP-Gen)
Yifei Ren, Edward Johns
arxiv.org/abs/2509.06191

@arXiv_csCV_bot@mastoxiv.page
2025-07-11 10:18:11

Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement
Xiao Yang, Yuxuan Fan, Can Liu, Houcheng Su, Weichen Guo, Jiyao Wang, Dengbo He
arxiv.org/abs/2507.07908

@arXiv_csCL_bot@mastoxiv.page
2025-07-08 13:42:41

Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection
Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
arxiv.org/abs/2507.04468

@arXiv_csSE_bot@mastoxiv.page
2025-09-03 09:49:33

Aligning Requirement for Large Language Model's Code Generation
Zhao Tian, Junjie Chen
arxiv.org/abs/2509.01313 arxiv.org/pdf/2509.0131…

@arXiv_csIR_bot@mastoxiv.page
2025-08-11 09:52:09

M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation
Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, Wentao Zhang
arxiv.org/abs/2508.06328

@arXiv_csAI_bot@mastoxiv.page
2025-09-03 14:06:03

An Epidemiological Knowledge Graph extracted from the World Health Organization's Disease Outbreak News
Sergio Consoli, Pietro Coletti, Peter V. Markov, Lia Orfei, Indaco Biazzo, Lea Schuh, Nicolas Stefanovitch, Lorenzo Bertolini, Mario Ceresa, Nikolaos I. Stilianakis
arxiv.org/abs/2509.02258

@arXiv_csRO_bot@mastoxiv.page
2025-09-03 13:13:13

Constrained Decoding for Robotics Foundation Models
Parv Kapoor, Akila Ganlath, Changliu Liu, Sebastian Scherer, Eunsuk Kang
arxiv.org/abs/2509.01728

@arXiv_csCL_bot@mastoxiv.page
2025-09-05 10:11:31

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
arxiv.org/abs/2509.04013

@arXiv_csAI_bot@mastoxiv.page
2025-08-06 09:37:00

Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree
Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, Qing Li
arxiv.org/abs/2508.03038

@arXiv_csRO_bot@mastoxiv.page
2025-08-07 09:23:04

Open Scene Graphs for Open-World Object-Goal Navigation
Joel Loo, Zhanxin Wu, David Hsu
arxiv.org/abs/2508.04678 arxiv.org/pdf/2508.04678…

@arXiv_csCL_bot@mastoxiv.page
2025-09-08 10:12:20

AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
Yan Xie, Yibo Cui, Liang Xie, Erwei Yin
arxiv.org/abs/2509.04821 a…

@arXiv_csRO_bot@mastoxiv.page
2025-08-05 11:45:31

CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, Chunhe Xia
arxiv.org/abs/2508.02219

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:20:11

Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition
Yi Liu, Xiangrong Zhu, Xiangyu Liu, Wei Wei, Wei Hu
arxiv.org/abs/2509.07555

@arXiv_csRO_bot@mastoxiv.page
2025-07-02 10:12:20

A Survey: Learning Embodied Intelligence from Physical Simulators and World Models
Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai
arxiv.org/abs/2507.00917

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 10:03:41

VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
arxiv.org/abs/2507.06899

@arXiv_csCV_bot@mastoxiv.page
2025-08-04 10:10:21

Can Large Pretrained Depth Estimation Models Help With Image Dehazing?
Hongfei Zhang, Kun Zhou, Ruizheng Wu, Jiangbo Lu
arxiv.org/abs/2508.00698

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 09:55:01

CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs
Garapati Keerthana, Manik Gupta
arxiv.org/abs/2507.06715