OpenAI releases GPT-5 pro, a version with extended reasoning exclusive to ChatGPT Pro subscribers, saying it scored 88.4% without tools on the GPQA benchmark (Maximilian Schreiner/The Decoder)
https://the-decoder.com/openai-claims-
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
https://arxiv.org/abs/2508.05614
Driver Assistant: Persuading Drivers to Adjust Secondary Tasks Using Large Language Models
Wei Xiang, Muchen Li, Jie Yan, Manling Zheng, Hanfei Zhu, Mengyun Jiang, Lingyun Sun
https://arxiv.org/abs/2508.05238
Experimental Analysis of Productive Interaction Strategy with ChatGPT: User Study on Function and Project-level Code Generation Tasks
Sangwon Hyun, Hyunjun Kim, Jinhyuk Jang, Hyojin Choi, M. Ali Babar
https://arxiv.org/abs/2508.04125
MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu
https://arxiv.org/abs/2507.03162
BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani
https://arxiv.org/abs/2510.02418
Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
https://arxiv.org/abs/2508.05606
Today's tasks:
Migrating my personal matrix-server
Hopefully migrate my writefreely blogs
Do something stupid or sleep early
FurtherAI, which uses AI to automate insurance tasks such as claims processing, raised a $25M Series A led by a16z, bringing its total funding to $30M (Chris Metinko/Axios)
https://www.axios.com/pro/enterprise-software-deals/2025/10…
windsurfers: Windsurfers network (1986)
A network of interpersonal contacts among windsurfers in southern California during the fall of 1986. The edge weights indicate the perceived social affiliations, measured by a task in which each individual was asked to sort cards bearing the other surfers’ names in order of closeness.
This network has 43 nodes and 336 edges.
Tags: Social, Offline, Weighted
SAFERad: A Framework to Enable Radar Data for Safety-Relevant Perception Tasks
Tim Brühl, Jenny Glönkler, Robin Schwager, Tin Stribor Sohn, Tim Dieter Eberhardt, Sören Hohmann
https://arxiv.org/abs/2507.03959
Latency Minimization for Multi-AAV-Enabled ISCC Systems with Movable Antenna
Yiyang Chen, Wenchao Liu, Chunjie Wang, Yinyu Wu, Xuhui Zhang, Yanyan Shen
https://arxiv.org/abs/2508.05574
When Should Users Check? A Decision-Theoretic Model of Confirmation Frequency in Multi-Step AI Agent Tasks
Jieyu Zhou, Aryan Roy, Sneh Gupta, Daniel Weitekamp, Christopher J. MacLellan
https://arxiv.org/abs/2510.05307
UMaine has published a profile of my lab's efforts to harness new media for environmental causes, from my colleague Joline Blais's efforts to keep Maine's lakes healthy to the What Uses More tool for comparing AI's eco footprint to other activities. https://u…
The only way in which I’ve wished Siri was “smarter” the past few years has nothing to do with LLMs:
I store my shopping list in Reminders, and my tasks in @….
Routinely I’ll say something like “Add milk to my shopping list” or “In OmniFocus, remind me to mow the lawn” and *most* of the time, it works flawlessly.
10% of the time, I try…
Google launches its asynchronous coding agent Jules out of beta, with a free plan capped at 15 daily tasks and higher limits for Google AI Pro and Ultra users (Jagmeet Singh/TechCrunch)
https://techcrunch.com/2025/08/06/googles-ai-coding-agent-jules-is-now…
Raster scanning can improve task performance in simulated prosthetic vision #BCI
AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials
Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, Tong Xie
https://arxiv.org/abs/2510.04704
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
https://arxiv.org/abs/2510.04996
Attention-Guided Multi-Scale Local Reconstruction for Point Clouds via Masked Autoencoder Self-Supervised Learning
Xin Cao, Haoyu Wang, Yuzhu Mao, Xinda Liu, Linzhi Su, Kang Li
https://arxiv.org/abs/2507.04084
Real-time prediction of plasma instabilities with sparse-grid-accelerated optimized dynamic mode decomposition
Kevin Gill, Ionut-Gabriel Farcas, Silke Glas, Benjamin J. Faber
https://arxiv.org/abs/2507.03245
COMMET: A System for Human-Induced Conflicts in Mobile Manipulation of Everyday Tasks
Dongping Li, Shaoting Peng, John Pohovey, Katherine Rose Driggs-Campbell
https://arxiv.org/abs/2509.04836
Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks
Jessica Y. Bo, Majeed Kazemitabaar, Mengqing Deng, Michael Inzlicht, Ashton Anderson
https://arxiv.org/abs/2510.03667
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
https://arxiv.org/abs/2510.03232
When Names Disappear: Revealing What LLMs Actually Understand About Code
Cuong Chi Le, Minh V. T. Pham, Cuong Duc Van, Hoang N. Phan, Huy N. Phan, Tien N. Nguyen
https://arxiv.org/abs/2510.03178
BEDTime: A Unified Benchmark for Automatically Describing Time Series
Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
https://arxiv.org/abs/2509.05215
SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks
Xinyu Xiong, Zihuang Wu, Lei Zhang, Lei Lu, Ming Li, Guanbin Li
https://arxiv.org/abs/2508.03566
Adversarial Reinforcement Learning for Large Language Model Agent Safety
Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
https://arxiv.org/abs/2510.05442
Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework
Jie Chen, Jinhao Jiang, Yingqian Min, Zican Dong, Shijie Wang, Wayne Xin Zhao, Ji-Rong Wen
https://arxiv.org/abs/2509.05007
Rillet, which is building AI ledger software to automate accounting tasks, raised a $70M Series B co-led by a16z and Iconiq, at a ~$500M valuation according to a source (Aditya Soni/Reuters)
https://www.reuters.com/technology/ai-acco
How are CS students using resources and AI tools for coding tasks?
Natalia Echeverry, Arun Lekshmi Narayanan
https://arxiv.org/abs/2508.04667 https://arxiv…
A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu
https://arxiv.org/abs/2509.04753
Google releases the Gemini 2.5 Computer Use model, built on Gemini 2.5 Pro's capabilities to power agents that can interact with UIs, in preview via the API (The Keyword)
https://blog.google/technology/google-deepmind/gemini-computer-use-model/
CLAd-VR: Cognitive Load-based Adaptive Training for Machining Tasks in Virtual Reality
Bhavya Matam, Adamay Mann, Kachina Studer, Christian Gabbianelli, Sonia Castelo, John Liu, Claudio Silva, Dishita Turakhia
https://arxiv.org/abs/2510.05249
Google says it's working on a fix for Gemini's self-loathing comments, which have included "I am a failure. I am a disgrace to my profession." (Lauren Edmonds/Business Insider)
https://www.businessinsider.com/gemini-self-loathi…
FilBench: Can LLMs Understand and Generate Filipino?
Lester James V. Miranda, Elyanah Aco, Conner Manuel, Jan Christian Blaise Cruz, Joseph Marvin Imperial
https://arxiv.org/abs/2508.03523
Discrepancy-Aware Contrastive Adaptation in Medical Time Series Analysis
Yifan Wang, Hongfeng Ai, Ruiqi Li, Maowei Jiang, Ruiyuan Kang, Jiahua Dong, Cheng Jiang, Chenzhong Li
https://arxiv.org/abs/2508.05572
Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition
Koen Vellenga, H. Joe Steinhauer, Jonas Andersson, Anders Sjögren
https://arxiv.org/abs/2510.05006
RapidGNN: Energy and Communication-Efficient Distributed Training on Large-Scale Graph Neural Networks
Arefin Niam, Tevfik Kosar, M S Q Zulkar Nine
https://arxiv.org/abs/2509.05207
Observing Without Doing: Pseudo-Apprenticeship Patterns in Student LLM Use
Jade Hak, Nathaniel Lam Johnson, Matin Amoozadeh, Amin Alipour, Souti Chattopadhyay
https://arxiv.org/abs/2510.04986
Orchestrating Human-AI Teams: The Manager Agent as a Unifying Research Challenge
Charlie Masters, Advaith Vellanki, Jiangbo Shangguan, Bart Kultys, Jonathan Gilmore, Alastair Moore, Stefano V. Albrecht
https://arxiv.org/abs/2510.02557
FDC-Net: Rethinking the association between EEG artifact removal and multi-dimensional affective computing
Wenjia Dong, Xueyuan Xu, Tianze Yu, Junming Zhang, Li Zhuo
https://arxiv.org/abs/2508.05231
Visual Representations inside the Language Model
Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna
https://arxiv.org/abs/2510.04819
Efficient Agents: Building Effective Agents While Reducing Cost
Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
https://arxiv.org/abs/2508.02694
Real-Time Nonlinear Model Predictive Control of Heavy-Duty Skid-Steered Mobile Platform for Trajectory Tracking Tasks
Alvaro Paz, Pauli Mustalahti, Mohammad Dastranj, Jouni Mattila
https://arxiv.org/abs/2510.02976
Anthropic's Mike Krieger: Opus 4.1 is better at coding, agentic tasks, and more, and Anthropic was previously too focused on only shipping "really big upgrades" (Shirin Ghaffary/Bloomberg)
CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus
Hannah Schieber, Dominik Frischmann, Simon Boche, Victor Schaack, Angela Schoellig, Stefan Leutenegger, Daniel Roth
https://arxiv.org/abs/2509.04859
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
https://arxiv.org/abs/2510.05095
ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin
https://arxiv.org/abs/2510.04246
REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis
Alec K. Peltekian, Halil Ertugrul Aktas, Gorkem Durak, Kevin Grudzinski, Bradford C. Bemiss, Carrie Richardson, Jane E. Dematte, G. R. Scott Budinger, Anthony J. Esposito, Alexander Misharin, Alok Choudhary, Ankit Agrawal, Ulas Bagci
https://arxiv.org/abs…
Learning to Reason for Factuality
Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
https://arxiv.org/abs/2508.05618
Cloning a Conversational Voice AI Agent from Call Recording Datasets for Telesales
Krittanon Kaewtawee, Wachiravit Modecrua, Krittin Pachtrachai, Touchapon Kraisingkorn
https://arxiv.org/abs/2509.04871
DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation
Tien Pham, Xinyun Chi, Khang Nguyen, Manfred Huber, Angelo Cangelosi
https://arxiv.org/abs/2509.04970
A Semantics-Aware Hierarchical Self-Supervised Approach to Classification of Remote Sensing Images
Giulio Weikmann, Gianmarco Perantoni, Lorenzo Bruzzone
https://arxiv.org/abs/2510.04916
Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment
Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang
https://arxiv.org/abs/2510.04919
SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
https://arxiv.org/abs/2509.04908
Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
https://arxiv.org/abs/2508.05613
Replaced article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[1/6]:
- Hulk: A Universal Knowledge Translator for Human-Centric Tasks
Wang, Wu, He, Guo, Zhu, Bai, Zhao, Wu, He, Ouyang, Tang
Replaced article(s) found for cs.CV. https://arxiv.org/list/cs.CV/new
[6/6]:
- IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Enhancing Diversity in Large Language Models via Determinantal Point Processes
Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Ioannis Ch. Paschalidis, Aldo Pacchiano
https://arxiv.org/abs/2509.04784
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
https://arxiv.org/abs/2508.03404
Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?
Boxiang Ma, Ru Li, Yuanlong Wang, Hongye Tan, Xiaoli Li
https://arxiv.org/abs/2509.04866
Training-Free Out-Of-Distribution Segmentation With Foundation Models
Laith Nayal, Hadi Salloum, Ahmad Taha, Yaroslav Kholodov, Alexander Gasnikov
https://arxiv.org/abs/2510.02909
SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models
Rui Qi, Zhibo Man, Yufeng Chen, Fengran Mo, Jinan Xu, Kaiyu Huang
https://arxiv.org/abs/2510.02648
LUIVITON: Learned Universal Interoperable VIrtual Try-ON
Cong Cao, Xianhang Cheng, Jingyuan Liu, Yujian Zheng, Zhenhui Lin, Meriem Chkir, Hao Li
https://arxiv.org/abs/2509.05030
Slm-mux: Orchestrating small language models for reasoning
Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie, Tushar Krishna, Vijay Janapa Reddi, Yilun Du
https://arxiv.org/abs/2510.05077
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe
https://arxiv.org/abs/2509.05249
Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes
Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl
https://arxiv.org/abs/2508.03292
Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[4/4]:
- Mind the Gap: The Divergence Between Human and LLM-Generated Tasks
Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
https://arxiv.org/abs/2510.04800
SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee
https://arxiv.org/abs/2510.02329
GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay
Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
https://arxiv.org/abs/2508.04676