Google says Gemini 3 Pro scores 1,501 on LMArena, above 2.5 Pro, and demonstrates PhD-level reasoning with top scores on Humanity's Last Exam and GPQA Diamond (Abner Li/9to5Google)
https://9to5google.com/2025/11/18/gemini-3-launch/
Is #AI really just dumb statistics? "Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles," says the (abstract of the) paper https://arxiv.org/abs/2511.10515: "The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods." Oops ...?
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
https://arxiv.org/abs/2510.01719
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
Minju Gwak, Guijin Son, Jaehyung Kim
https://arxiv.org/abs/2510.06953 https://
We've updated the What Uses More app to reflect last week's finding by Luccioni and Gamazaychikov that "reasoning" mode increases energy and water usage by 30x. The study casts doubt on the improved efficiency AI companies are claiming for newer models.
https://www.
Comparative Explanations via Counterfactual Reasoning in Recommendations
Yi Yu, Zhenxing Hu
https://arxiv.org/abs/2510.10920 https://arxiv.org/pdf/2510.109…
Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning
Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald
https://arxiv.org/abs/2510.04970 https…
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
https://arxiv…
94.1% accuracy is definitely the exception to the rule for me, but the moves looked clear and obvious. I had wondered whether patience against the pinned queen was accurate, but reasoned it had to be.
Opponent allowing the pin on the queen was their undoing, obviously, but they still played with 82.5% accuracy. In most of my games, I'd be delighted to score that high.
#chess
The illusion of readiness: Stress testing large frontier models on multimodal medical benchmarks #AI