MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
https://arxiv.org/abs/2506.22434
HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval
Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, Dawn Lawrie
https://arxiv.org/abs/2506.22356 …
Toroidal graph manifolds with small homology are not SU(2)-abelian
Giacomo Bascape
https://arxiv.org/abs/2506.21729
"I think it is a huge mistake for people to assume that they can trust AI when they do not trust each other. The safest way to develop superintelligence is to first strengthen trust between humans, and then cooperate with each other to develop superintelligence in a safe manner. But what we are doing now is exactly the opposite. Instead, all efforts are being directed toward developing a superintelligence."
#AGI #AI
https://www.wired.com/story/questions-answered-by-yuval-noah-harari-for-wired-ai-artificial-intelligence-singularity/
Action Language BC
Joseph Babb, Joohyung Lee
https://arxiv.org/abs/2506.18044 https://arxiv.org/pdf/2506.18044
Detecting Atmospheric CO2 Trends as Population-Level Signatures for Long-Term Stable Water Oceans and Biotic Activity on Temperate Terrestrial Exoplanets
Janina Hansen, Daniel Angerhausen, Sascha P. Quanz, Derek Vance, Björn S. Konrad, Emily O. Garvin, Eleonora Alei, Jens Kammerer, Felix A. Dannert
https://arxiv.org/abs/2…
Potemkin Understanding in Large Language Models
Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
https://arxiv.org/abs/2506.21521 https://arxiv.org/pdf/2506.21521 https://arxiv.org/html/2506.21521
arXiv:2506.21521v1 Announce Type: new
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
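The lower-bound procedure described in the abstract can be illustrated with a minimal sketch (an assumption of mine, not the paper's actual implementation): a "potemkin" is counted whenever a model states a concept's definition correctly but then fails to apply that same concept, and the mismatch rate among correctly-defined concepts lower-bounds the prevalence.

```python
def potemkin_rate(results):
    """Estimate a lower bound on potemkin prevalence.

    results: list of (defined_correctly, applied_correctly) pairs,
    one per concept probe. Only probes where the model answered the
    definition question correctly are eligible; among those, a
    correct definition paired with a failed application counts as
    a potemkin.
    """
    eligible = [r for r in results if r[0]]
    if not eligible:
        return 0.0
    potemkins = sum(1 for defined, applied in eligible if not applied)
    return potemkins / len(eligible)

# Hypothetical probe outcomes for one model across five concepts:
probes = [
    (True, True),    # defines and applies correctly
    (True, False),   # potemkin: definition right, application wrong
    (False, False),  # definition wrong -> not eligible
    (True, False),   # potemkin
    (True, True),
]
print(potemkin_rate(probes))  # 2 potemkins / 4 eligible = 0.5
```

The probe data here is invented for illustration; the paper's actual benchmark spans three domains and uses a more general elicitation procedure.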
The Pythagoras number of fields of transcendence degree $1$ over $\mathbb{Q}$
Olivier Benoist
https://arxiv.org/abs/2506.21380
Connections between hyperlinearity, stability and character rigidity for higher rank lattices
Alon Dogon, Itamar Vigdorovich
https://arxiv.org/abs/2506.20843