Alibaba Technical Report: Qwen3-VL beats GPT-5 and Gemini 2.5 Pro on visual tasks and has 100% accuracy on "needle-in-a-haystack" tests for 30-minute videos (Jonathan Kemper/The Decoder)
https://the-decoder.com/qwen3-vl-can-scan-two-hour-…
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
https://arxiv.org/abs/2509.26490
Personalized Auto-Grading and Feedback System for Constructive Geometry Tasks Using Large Language Models on an Online Math Platform
Yong Oh Lee, Byeonghun Bang, Joohyun Lee, Sejun Oh
https://arxiv.org/abs/2509.25529
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
https://arxiv.org/abs/2509.26625
windsurfers: Windsurfers network (1986)
A network of interpersonal contacts among windsurfers in southern California during the Fall of 1986. The edge weights indicate perceived social affiliation, measured by a task in which each individual was asked to sort cards bearing the other surfers' names in order of closeness.
This network has 43 nodes and 336 edges.
Tags: Social, Offline, Weighted
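For a sense of scale, the node and edge counts above are enough to estimate how dense the network is. A minimal sketch, assuming the graph is undirected with no self-loops or parallel edges (the description does not say so); the function name is hypothetical:

```python
def simple_graph_stats(nodes: int, edges: int) -> dict:
    """Density and mean degree, assuming an undirected simple graph."""
    max_edges = nodes * (nodes - 1) // 2  # every possible unordered pair
    return {
        "max_edges": max_edges,
        "density": edges / max_edges,      # fraction of possible ties present
        "mean_degree": 2 * edges / nodes,  # each edge touches two nodes
    }

stats = simple_graph_stats(nodes=43, edges=336)
print(stats)
```

Under that assumption, roughly 37% of all possible pairs are connected, and each windsurfer has about 15.6 contacts on average.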
"As explained in chapter 11 of Meyer’s book, assertions are meant to check the correctness of a piece of software; that is, its ability to perform the tasks defined in their specification.
Because, you do have a specification, right? Right?"
https://deprogrammaticaipsum.com/asser…
Caught a bug over the holidays so I’m mostly resting, feeling sorry for myself, and taking the time to at least carry out some mindless housekeeping tasks (updating dependencies, etc.) on some of my Node modules.
Released updates to the following packages yesterday:
Tape-based Node.js testing:
• Tap monkey (https://
🎯 Real-world validation through extended #CCBench testing with human evaluators completing multi-turn tasks in isolated #Docker containers across frontend development, tool building, data analysis, testing & algorithms
🔧 Near parity with
It isn't all that hard for my reminders. I use the default stuff for now, but there are #opensource apps that can remind you to do things.
TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu, Minghao Qin, Yan Shu, Zhengyang Liang, Yang Tian, Chen Jason Zhang, Bo Zhao, Zheng Liu
https://arxiv.org/abs/2509.26360 …
Where to watch Texans vs. Broncos: TV channel, live stream, prediction, pick, odds, spread
https://www.cbssports.com/nfl/news/where-to-watch-texan…
LLMs are useful for some tasks. I'm currently tidying up, proofreading and editing a document written by many international co-authors.
The LLM I'm using can very quickly correct grammar and spelling mistakes, and it turns occasionally tortured paragraphs into text that is much easier to understand. It still needs an expert's eye (mine!) to check that no mistakes have been introduced or complexities over-simplified.
This is actually the first time I've used an LLM for this task. It's making it much faster and less painful.
It also explains a lot about academic publishing lately..
SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
https://arxiv.org/abs/2509.26345
Pretrain-Test Task Alignment Governs Generalization in In-Context Learning
Mary I. Letey, Jacob A. Zavatone-Veth, Yue M. Lu, Cengiz Pehlevan
https://arxiv.org/abs/2509.26551 htt…
"I'm an average user, so I don't need all the options and apps the programme has to offer. But, to be honest, #Microsoft is making it increasingly attractive to switch. Now that the company is putting #AI in everything, everything is becoming more annoying to use."
Can Dutch
The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows
Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, Emerson Murphy-Hill
https://arxiv.org/abs/2509.26557
Enabling Time-Aware Priority Traffic Management over Distributed FPGA Nodes
Alberto Scionti, Paolo Savio, Francesco Lubrano, Federico Stirano, Antonino Nespola, Olivier Terzo, Corrado De Sio, Luca Sterpone
https://arxiv.org/abs/2509.26043
Evaluating the Impact of Radiographic Noise on Chest X-ray Semantic Segmentation and Disease Classification Using a Scalable Noise Injection Framework
Derek Jiu, Kiran Nijjer, Nishant Chinta, Ryan Bui, Ben Liu, Kevin Zhu
https://arxiv.org/abs/2509.25265
Are you afraid of our new GenAI overlords taking over our jobs soon? According to a new benchmark, the Remote Labor Index by Scale AI and the Center for AI Safety (CAIS), there's no need to be. The best current models can solve only around 2% of the tasks in the index: #AIResearch #GenAI
Replaced article(s) found for math.AC. https://arxiv.org/list/math.AC/new
[1/1]:
- Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets
Yuandong Tian
Commutative algebra neural network reveals genetic origins of diseases
JunJie Wee, Faisal Suwayyid, Mushal Zia, Hongsong Feng, Yuta Hozumi, Guo-Wei Wei
https://arxiv.org/abs/2509.26566
A Chaotic Dynamics Framework Inspired by Dorsal Stream for Event Signal Processing
Yu Chen, Jing Lian, Zhaofei Yu, Jizhao Liu, Jisheng Dang, Gang Wang
https://arxiv.org/abs/2509.26085
Introducing Large Language Models in the Design Flow of Time Sensitive Networking
Rubi Debnath, Luxi Zhao, Mohammadreza Barzegaran, Sebastian Steinhorst
https://arxiv.org/abs/2509.26368
TrackCore-F: Deploying Transformer-Based Subatomic Particle Tracking on FPGAs
Arjan Blankestijn, Uraz Odyurt, Amirreza Yousefzadeh
https://arxiv.org/abs/2509.26335 https://
🔮 The Future: Z1 Scale-Up Coming next: Z1 chips with 250K interconnected Pits per chip. Powers energy-based models 10,000x more efficient than #GPUs for generative tasks. Check the new paper on Denoising Thermodynamic #Models!
📺 Watch the full video:
When you slept even less than usual because, right before bed, you figured out how to implement multiplication in https://nandgame.com.
Of course, once you had lain down, you figured out how to do it faster, and with a 32-bit result. Still, the old method is good too, nicely pedagogical.
STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models
Shaoxiong Guo, Tianyi Du, Lijun Li, Yuyao Wu, Jie Li, Jing Shao
https://arxiv.org/abs/2509.26473
Fast-dLLM v2: Efficient Block-Diffusion LLM
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
https://arxiv.org/abs/2509.26328
fev-bench: A Realistic Benchmark for Time Series Forecasting
Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang
https://arxiv.org/abs/2509.26468
Since the start of the semester I have felt hunted by an angry pack of looming deadlines: everything seems to cluster around October 31 and mid-November.
But! I have now finished all work-related tasks due October 31, and it's only the 28th!
I so hope this blows up in his face spectacularly.
Note that I do not believe that any LLM can become as skilled as the least skilled #InfoSec professional using conventional tools, so I’m quite confident this will not reach any useful goal.
I’m saying he needs to learn a lesson, and it would be great if it were useful for others as well. @…
Auto-ARGUE: LLM-Based Report Generation Evaluation
William Walden, Marc Mason, Orion Weller, Laura Dietz, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, James Mayfield, Eugene Yang
https://arxiv.org/abs/2509.26184
Today I'm heading to #Berlin again with tiny sh0rky 🦈🤏 and some cleanup tasks left for the ride.
The intel_fw library turns out to be very rich and powerful already.
I will start my new job on Monday, and I'll be using #Arch btw.
scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Ran Zhang, Gaoyang Li, Hanyu Xie, Jiajia Wang, Yuanchun Zhou, Pengfei Wang
https://arxiv.org/abs/2509.25884
Welcome to the world of the field, engineering.
For a long time, we've hired very few into sales, mktg, support or consulting that don't already have gobs of experience elsewhere.
✅ How #Microsoft’s developers are using #AI - The Verge
DEPTHOR : Robust Depth Enhancement from a Real-World Lightweight dToF and RGB Guidance
Jijun Xiang, Longliang Liu, Xuan Zhu, Xianqi Wang, Min Lin, Xin Yang
https://arxiv.org/abs/2509.26498
From NL2SQL to NL2GeoSQL: GeoSQL-Eval for automated evaluation of LLMs on PostGIS queries
Shuyang Hou, Haoyue Jiao, Ziqi Liu, Lutong Xie, Guanyu Chen, Shaowen Wu, Xuefeng Guan, Huayi Wu
https://arxiv.org/abs/2509.25264
OpenAI releases gpt-oss-safeguard, its open-weight reasoning models for safety classification tasks, available in 120B and 20B parameters, under Apache 2.0 (OpenAI)
https://openai.com/index/introducing-gpt-oss-safeguard/
Diversity-Incentivized Exploration for Versatile Reasoning
Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang
https://arxiv.org/abs/2509.26209
Crosslisted article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[3/7]:
- Personalized Auto-Grading and Feedback System for Constructive Geometry Tasks Using Large Languag...
Yong Oh Lee, Byeonghun Bang, Joohyun Lee, Sejun Oh
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Seiji Maekawa, Jackson Hassell, Pouya Pezeshkpour, Tom Mitchell, Estevam Hruschka
https://arxiv.org/abs/2509.26553
RE: https://hachyderm.io/@thomasfuchs/115601979925351548
hear me out, how about NOT CRAMMING EVERYTHING INTO ONE DEVICE that just works mid for everything, but instead, you know, do some actual innovation here and there
for example, make devices specifically tailored for certain tasks
like if you're Apple why in the fuck don't you make devices with e-paper screens for people who don't want to be terminally online
Another day another study showing that "AI assistants" do not work reliably for cognitive tasks.
"Largest study of its kind shows AI assistants misrepresent news content 45% of the time – regardless of language or territory"
https://www.bbc.co.uk/mediacentre/202…
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu,…
Like all the rest of the nerds, I did a bit of tech support on family computers.
They're all popping up windows from scam virus scanners lying that subscriptions need to be renewed or machines are unprotected. People don't know how to remove these things. Luckily they also don't really know how to pay the subscription.
Their phones are updating on them. Changing where buttons used to be. Removing options. Forcing people to register to use the things they have been using for years.
They don't know how to register.
Things pop up asking for passwords and they have no idea who is asking or which password to use.
I tell them that I don't really understand why they keep using Windows now that it is so shitty and awful. They say they don't know how to use anything else. The fact they don't really know how to use Windows either doesn't seem to register.
The tech corporations have given up completely on being user friendly. They are all deliberately user hostile and exploitative now.
Corporate tech is terrible. The industry is failing its users, abusing them. People don't even know there is any other way. They are just giving up on achieving their tasks until someone can fix the pop-ups and subscription boxes and passwords and 2FA for them.
Tech sucks now. Sucks hard.
#tech #christmasTechSupport
Been using a number of AI models over the past week or so as work has slowed down, giving me time to explore things more deeply.
Been using Claude Code with musistudio/claude-code-router which is great as I can switch between different models on similar tasks.
Experience so far has been that Gemini 3 Flash is very good for thinking and coding tasks but the code does tend to be fragile so rewrites are needed. For tough problems where the errors are not straightforward it falls d…
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
https://arxiv.org/abs/2509.25339
Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[4/5]:
- Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets
Yuandong Tian
'Cost-Effective Machine Learning for Automatically Processing Bibliographic Metadata' is a very readable account of using DistilBERT for specific DH tasks https://www.euppublishing.com/doi/full/10.3366/ijhac.2025.0353
Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation
Xunzhu Tang, Iyiola Emmanuel Olatunji, Tiezhu Sun, Jacques Klein, Tegawende F. Bissyande
https://arxiv.org/abs/2509.25243
Todoist FTW.
Several years ago we had smoky pies and rolls for Thanksgiving because we put off cleaning the ovens until it was too late.
I created a recurring Todoist task to clean the ovens on the 3rd Sunday of November. Ever since then we’ve had pristine ovens and smoke-free cooking every Thanksgiving.
Collette appreciates that, for a holiday where most of the tasks are hers, she doesn’t have to worry about this one: the ovens are ready to go.
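The "3rd Sunday of November" rule is easy to reproduce outside a task manager. A minimal sketch using only Python's standard library; it is not how Todoist computes recurrences, and the function name is hypothetical:

```python
import calendar
from datetime import date

def third_sunday_of_november(year: int) -> date:
    """Return the date of the third Sunday in November of the given year."""
    # Weeks start on Monday here, so index calendar.SUNDAY (6) is the last slot.
    weeks = calendar.Calendar(firstweekday=calendar.MONDAY).monthdayscalendar(year, 11)
    # Days that fall outside November are reported as 0 and skipped.
    sundays = [week[calendar.SUNDAY] for week in weeks if week[calendar.SUNDAY] != 0]
    return date(year, 11, sundays[2])

print(third_sunday_of_november(2025))
```

Because US Thanksgiving is the fourth Thursday of November, the third Sunday always falls before it, which leaves time for the ovens to be clean on the day.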
When you develop a hippier mindset, even the smallest daily tasks start taking hours longer, because in the end, none of it really matters anyway, I mean, why rush?!
#MentalHealth #Hippie
SCUBA: Salesforce Computer Use Benchmark
Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
https://arxiv.org/abs/2509.26506
We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and tro…
I am trying to alternate between Workflowy and Emacs Org for tasks. The tool is chosen randomly each week.
(The intention is to understand which features I find absolutely essential.)
#Emacs #orgmode #workflowy
You know that you’re teaching in #Switzerland when you catch the majority of students running the live broadcast of today’s #ski competition in a separate window while doing learning tasks
#AcademicChatter
Deconstructing Self-Bias in LLM-generated Translation Benchmarks
Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch
https://arxiv.org/abs/2509.26600 htt…
A Multi-Language Object-Oriented Programming Benchmark for Large Language Models
Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin
https://arxiv.org/abs/2509.26111
QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization
Mohamed Imed Eddine Ghebriout (Universite de Lorraine, CNRS, Inria, LORIA, Nancy, France), Gaël Guibon (Universite de Lorraine, CNRS, Inria, LORIA, Nancy, France), Ivan Lerner (Inserm, Centre de Recherche des Cordeliers, Universite Paris Cite, Sorbonne Universite, Paris, France), Emmanuel Vincent (Universite de Lorraine, CNRS, Inria, LORIA, Nancy, France)
Some people now argue that LLMs are useless. I disagree; they can be very useful if you take them as what they are: models of language that generate text on the basis of some given text. As such, they can be useful for a wide range of text-related tasks, including assisting with writing. And the more formulaic the genre, the better they work, obviously. This is part of the reason why they are so popular with students, and in academia more generally.
⇢
Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development
Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R. Lyu
https://arxiv.org/abs/2509.25297
Uber plans to launch data labelling tasks in the US for some drivers to earn extra money, appearing under "digital tasks" in the driver app, later this fall (Natalie Lung/Bloomberg)
https://www.bloomberg.com/news/articles/20
LLM Agents for Knowledge Discovery in Atomic Layer Processing
Andreas Werbrouck, Marshall B. Lindsay, Matthew Maschmann, Matthias J. Young
https://arxiv.org/abs/2509.26201 https…
Sunday Robotics unveils Memo, a fully autonomous home robot capable of tasks like making espresso and loading dishwashers, set to launch in beta in 2026 (Will Knight/Wired)
https://www.wired.com/story/memo-sunday-robotics-home-robot/
🚀 Real results: Geoffrey Huntley ran a 3-month loop building a complete programming language. YC hackathon teams shipped 6 repos overnight for $297 in API costs
✅ Best for: Large refactors, batch operations, test coverage, documentation generation - tasks with clear completion criteria
⚠️ Not for: Ambiguous requirements, architectural decisions, security-sensitive code, or exploration work
Creation, Critique, and Consumption: Exploring Generative AI Descriptions for Supporting Blind and Low Vision Professionals with Visual Tasks
Lucy Jiang, Lotus Zhang, Leah Findlater
https://arxiv.org/abs/2510.08991
Arbiter, which is using AI to automate healthcare administrative tasks, emerges from stealth with a $52M seed from multiple family offices at a $400M valuation (Rebecca Torrence/Business Insider)
https://www.businessinsider.com/health-sta
Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
João Palmeiro, Diogo Duarte, Rita Costa, Pedro Bizarro
https://arxiv.org/abs/2510.06071
Beijing-based DP Technology, which develops AI tools used by researchers for tasks like computer-aided drug design and battery design, raised a ~$114M Series C (Eunice Xu/South China Morning Post)
https://www.scmp.com/business/companies/ar
Anthropic finds that LLMs trained to "reward hack" by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research (Anthropic)
https://www.anthropic.com/research/emergent-misalignment-reward-hacking
China's MiniMax releases M2.1, an upgrade to its open-source M2 model that it says has "significantly enhanced" coding capabilities in Rust, Java, and others (MiniMax)
https://www.minimax.io/news/minimax-m21
Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models
Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, Weidong Shi
https://arxiv.org/abs/2510.12080
Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens
Yunlong Deng, Boyang Sun, Yan Li, Lingjing Kong, Zeyu Tang, Kun Zhang, Guangyi Chen
https://arxiv.org/abs/2510.08222
ChatGPT Atlas hands-on: generally able to interpret instructions and navigate simple menus, but "technical constraints on session length" are a limiting factor (Kyle Orland/Ars Technica)
https://arstechnica.com/features/2025/
Hyro, whose AI agents let US health care organizations automate tasks like scheduling and prescription renewals, raised $45M, taking its total funding to $95M (Sophie Shulman/CTech)
https://www.calcalistech.com/ctechnews/article/rylcu6valg
The Pentagon partners with xAI to embed the company's frontier AI systems, based on the Grok family of models, directly into GenAI.mil as soon as early 2026 (Bonny Chu/Fox News)
https://www.foxnews.com/politics/pentagon-
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang
https://arxiv.org/abs/2510.12635
METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year (@metr_evals)
https://x.com/metr_evals/status/2002203627377574113