A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
https://arxiv.org/abs/2507.08491
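A minimal sketch of the dialogue-game idea behind this kind of benchmark (this is not clembench's actual API): a Game Master runs a scripted word-guessing game, feeds clues to a player model, and turns the episode into a score. The `player` callable and the scoring rule are placeholders for an LLM call and a real game spec.

```python
from typing import Callable

def play_guessing_game(player: Callable[[str], str],
                       target: str,
                       clues: list[str],
                       max_turns: int = 3) -> dict:
    """Run one word-guessing episode and return a scored record."""
    transcript = []
    for turn, clue in enumerate(clues[:max_turns], start=1):
        prompt = f"Clue {turn}: {clue}. What is the word?"
        guess = player(prompt).strip().lower()
        transcript.append((prompt, guess))
        if guess == target:
            # Earlier success gives a higher episode score.
            return {"success": True,
                    "score": (max_turns - turn + 1) / max_turns,
                    "transcript": transcript}
    return {"success": False, "score": 0.0, "transcript": transcript}

# Stub player; a real harness would query an LLM here.
result = play_guessing_game(lambda p: "Squirrel",
                            target="squirrel",
                            clues=["a small tree-dwelling rodent"])
print(result)
```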
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
https://arxiv.org/abs/2507.08342
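As a point of reference for the n-gram metrics the paper critiques, here is a self-contained ROUGE-1-style unigram F1 (a generic sketch, not the paper's proposed metric); exact surface overlap like this is what breaks down for paraphrase-heavy or morphologically rich languages.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace-tokenized unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Same meaning, different surface forms -> only partial credit.
print(round(unigram_f1("the cat sat on the mat",
                       "a cat was sitting on the mat"), 3))
```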
Evolutionists Flock To Darwin-Shaped Wall Stain
https://theonion.com/evolutionists-flock-to-darwin-shaped-wall-stain-1819570078/
L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training
Li Li, Yingzhe Peng, Xu Yang, Ruoxi Cheng, Haiyang Xu, Ming Yan, Fei Huang
https://arxiv.org/abs/2507.08710
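For context, the original CLIPScore (Hessel et al., 2021) is a reference-free embedding metric: w · max(cos(image, caption), 0) with w = 2.5. The sketch below implements that formulation on toy vectors; L-CLIPScore's lightweight encoder and its use during training are described in the paper itself.

```python
import numpy as np

def clipscore(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore = w * max(cos(image, caption), 0)."""
    cos = float(np.dot(image_emb, caption_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    return w * max(cos, 0.0)

# Toy embeddings; in practice these come from a (lightweight) CLIP-style encoder.
img = np.array([0.2, 0.9, 0.1])
cap = np.array([0.25, 0.85, 0.05])
print(round(clipscore(img, cap), 3))
```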
The Siberian flying squirrel is a key species for the future of taiga forests. A recent genetic study reveals surprising features of the flying squirrel's evolution, as well as serious concerns for the species' conservation. A distinct subspecies may live in the Far East. https://www.helsinki.fi/fi/uutiset/evoluut
DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval
Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov
https://arxiv.org/abs/2507.08360
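A generic two-stage retrieval sketch of the kind named in the title, not necessarily the DS@GT pipeline: a cheap lexical first stage (BM25) produces candidates and a cross-encoder re-ranks them. It assumes the rank_bm25 and sentence-transformers packages are installed; the checkpoint name is a common public cross-encoder used only for illustration.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

docs = [
    "Temporal drift degrades web search relevance over time.",
    "BM25 is a classic lexical ranking function.",
    "Cross-encoders score query-document pairs jointly.",
]
query = "how does relevance change over time in web search"

# Stage 1: BM25 over whitespace-tokenized documents, keep a small shortlist.
bm25 = BM25Okapi([d.lower().split() for d in docs])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank the shortlist with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, docs[i]) for i in candidates])
ranked = [docs[i] for _, i in sorted(zip(pair_scores, candidates), reverse=True)]
print(ranked[0])
```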
LLMs are now part of our daily work, making coding easier. Join Ivan Dolgov at this year's Berlin Buzzwords to learn how JetBrains built an in-house LLM for AI code completion in its products, covering design choices, data preparation, training, and model evaluation.
Learn more: https://
Anthropic details how it built its multi-agent Claude Research system, claiming significant improvements in internal evaluations over single-agent systems (Anthropic)
https://www.anthropic.com/engineering/built-multi-agent-research-system
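The post describes an orchestrator-worker design: a lead agent decomposes a research question into subtasks, parallel subagents pursue them, and the lead synthesizes a report. The sketch below mimics that shape with stub functions; it is not Anthropic's code, and a real system would back each function with an LLM plus tool access.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_subqueries(question: str) -> list[str]:
    # Stub planner: a lead agent would generate these with an LLM.
    return [f"{question} -- background", f"{question} -- recent results"]

def research_subagent(subquery: str) -> str:
    # Stub worker: would run searches and summarize findings.
    return f"findings for: {subquery}"

def run_research(question: str) -> str:
    subqueries = plan_subqueries(question)
    # Fan out subqueries to parallel workers.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        findings = list(pool.map(research_subagent, subqueries))
    # The lead agent would synthesize these into a final report.
    return "\n".join(findings)

print(run_research("How are LLM agents evaluated?"))
```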
Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework
Zishan Xu, Shuyi Xie, Qingsong Lv, Shupei Xiao, Linlin Song, Sui Wenjuan, Fan Lin
https://arxiv.org/abs/2507.08459
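One way to picture error attribution inside an evaluation harness (a hypothetical sketch, not the paper's framework): a judge tags each failed answer with an error category, and the harness reports per-category counts alongside accuracy. ERROR_CATEGORIES and judge are illustrative placeholders here.

```python
from collections import Counter

# Illustrative taxonomy; the paper defines its own error categories.
ERROR_CATEGORIES = ["factual error", "reasoning error",
                    "instruction violation", "formatting error"]

def judge(question: str, answer: str, gold: str) -> str | None:
    """Stub judge: None if correct, else one of ERROR_CATEGORIES.
    A real framework would use human annotators or an LLM judge."""
    return None if answer.strip() == gold.strip() else "factual error"

def evaluate(items: list[dict]) -> dict:
    errors = Counter()
    correct = 0
    for item in items:
        category = judge(item["question"], item["answer"], item["gold"])
        if category is None:
            correct += 1
        else:
            errors[category] += 1
    return {"accuracy": correct / len(items),
            "error_breakdown": dict(errors)}

items = [
    {"question": "2+2?", "answer": "4", "gold": "4"},
    {"question": "Capital of France?", "answer": "Lyon", "gold": "Paris"},
]
print(evaluate(items))  # accuracy 0.5, one attributed error
```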