Tootfinder

Opt-in global Mastodon full text search. Join the index!

No exact results. Similar results found.
@heiseonline@social.heise.de
2025-06-26 10:48:00

Xbench: Chinese AI benchmark tests models for everyday usability
A new benchmark from China tests AI models on their ability to solve real-world tasks. It is intended to help companies make decisions about investing in AI.

@seeingwithsound@mas.to
2025-05-27 21:57:36

A Project Moohan benchmark has been spotted and may have revealed the Android XR headset's key spec techradar.com/computing/virtua

@arXiv_csCL_bot@mastoxiv.page
2025-06-27 09:38:19

Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
arxiv.org/abs/2506.21182

@arXiv_csCV_bot@mastoxiv.page
2025-06-27 10:21:39

SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
arxiv.org/abs/2506.21549

@Techmeme@techhub.social
2025-06-28 03:51:27

Sources: Applied Compute, a pre-launch reinforcement learning startup founded by three former OpenAI staffers, raised $20M at a $100M valuation led by Benchmark (Alex Konrad/Upstarts Media)
upstartsmedia.com/p/ex-openai-

@escap@azapft.is
2025-06-28 15:43:21

That reminds me: #bahn #db

@arXiv_csDC_bot@mastoxiv.page
2025-06-27 09:08:49

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele
arxiv.org/abs/2506.20938

@arXiv_eessSY_bot@mastoxiv.page
2025-06-27 09:23:39

DPLib: A Standard Benchmark Library for Distributed Power System Analysis and Optimization
Milad Hasanzadeh, Amin Kargarian
arxiv.org/abs/2506.20819

@arXiv_csCL_bot@mastoxiv.page
2025-06-27 09:59:49

Potemkin Understanding in Large Language Models
Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
arxiv.org/abs/2506.21521 arxiv.org/pdf/2506.21521 arxiv.org/html/2506.21521
arXiv:2506.21521v1 Announce Type: new
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
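To make the abstract's "lower bound" procedure concrete, here is a minimal hedged sketch (not the authors' code): a concept counts as a potemkin when the model answers the keystone definition question correctly but then fails to apply that same concept. `query_model`, the grading helper, and the task fields are hypothetical stand-ins.

```python
# Hedged sketch of a potemkin-rate estimate: concepts the model can define
# but cannot apply. All helpers below are placeholders, not the paper's code.
from dataclasses import dataclass


@dataclass
class ConceptTask:
    concept: str
    definition_prompt: str          # keystone question, e.g. "Define an ABAB rhyme scheme."
    application_prompts: list[str]  # e.g. "Does this quatrain follow ABAB? ..."
    application_answers: list[str]  # gold labels for the application prompts


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; the model and API are unspecified here."""
    raise NotImplementedError


def definition_is_correct(response: str, concept: str) -> bool:
    """Placeholder grader; the paper relies on rubric/human-style grading."""
    raise NotImplementedError


def potemkin_rate(tasks: list[ConceptTask]) -> float:
    """Fraction of correctly defined concepts that the model fails to apply."""
    defined, potemkins = 0, 0
    for task in tasks:
        if not definition_is_correct(query_model(task.definition_prompt), task.concept):
            continue  # only concepts the model can define count toward the rate
        defined += 1
        applied_ok = all(
            query_model(p).strip().lower() == a.strip().lower()
            for p, a in zip(task.application_prompts, task.application_answers)
        )
        if not applied_ok:
            potemkins += 1
    return potemkins / defined if defined else 0.0
```

This only lower-bounds prevalence because exact string matching and a small set of application prompts will miss some incoherent behavior.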

@arXiv_csCL_bot@mastoxiv.page
2025-06-27 09:59:39

skLEP: A Slovak General Language Understanding Benchmark
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
arxiv.org/abs/2506.21508 arxiv.org/pdf/2506.21508 arxiv.org/html/2506.21508
arXiv:2506.21508v1 Announce Type: new
Abstract: In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at github.com/slovak-nlp/sklep, in the hope of fostering reproducibility and driving future research in Slovak NLU.
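For context, a hedged sketch of the fine-tune-and-evaluate loop such a benchmark implies, using plain Hugging Face transformers/datasets rather than the skLEP toolkit (whose API is not shown in this post); the dataset path, config name, column names, and splits below are assumptions for illustration only.

```python
# Hedged sketch: fine-tune a multilingual encoder on a hypothetical skLEP-style
# sentence-pair task. "slovak-nlp/sklep", the "nli" config, and the
# premise/hypothesis/label columns are assumed, not confirmed by the paper.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


def finetune_sentence_pair_task(model_name: str = "xlm-roberta-base"):
    dataset = load_dataset("slovak-nlp/sklep", "nli")  # hypothetical dataset path/config
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        # Assumed column names for a sentence-pair task.
        return tokenizer(batch["premise"], batch["hypothesis"],
                         truncation=True, max_length=128)

    dataset = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sklep-nli",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=dataset["train"],        # assumed split names
        eval_dataset=dataset["validation"],
        tokenizer=tokenizer,                   # enables dynamic padding via the default collator
    )
    trainer.train()
    return trainer.evaluate()
```

The released toolkit and leaderboard at github.com/slovak-nlp/sklep would be the authoritative way to run the nine tasks; this sketch only shows the general shape of such an evaluation.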