A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
https://arxiv.org/abs/2507.08491
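A minimal sketch of the dialogue-game idea behind this kind of benchmark (this is not clembench's actual API): a Game Master runs a scripted word-guessing game, feeds clues to a player model, and turns the episode into a score. The `player` callable and the scoring rule are placeholders for an LLM call and a real game spec.

```python
from typing import Callable

def play_guessing_game(player: Callable[[str], str],
                       target: str,
                       clues: list[str],
                       max_turns: int = 3) -> dict:
    """Run one word-guessing episode and return a scored record."""
    transcript = []
    for turn, clue in enumerate(clues[:max_turns], start=1):
        prompt = f"Clue {turn}: {clue}. What is the word?"
        guess = player(prompt).strip().lower()
        transcript.append((prompt, guess))
        if guess == target:
            # Earlier success gives a higher episode score.
            return {"success": True,
                    "score": (max_turns - turn + 1) / max_turns,
                    "transcript": transcript}
    return {"success": False, "score": 0.0, "transcript": transcript}

# Stub player; a real harness would query an LLM here.
result = play_guessing_game(lambda p: "Squirrel",
                            target="squirrel",
                            clues=["a small tree-dwelling rodent"])
print(result)
```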
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
https://arxiv.org/abs/2507.08342
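As a point of reference for the n-gram metrics the paper critiques, here is a self-contained ROUGE-1-style unigram F1 (a generic sketch, not the paper's proposed metric); exact surface overlap like this is what breaks down for paraphrase-heavy or morphologically rich languages.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace-tokenized unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Same meaning, different surface forms -> only partial credit.
print(round(unigram_f1("the cat sat on the mat",
                       "a cat was sitting on the mat"), 3))
```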
Evolutionists Flock To Darwin-Shaped Wall Stain
https://theonion.com/evolutionists-flock-to-darwin-shaped-wall-stain-1819570078/
L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training
Li Li, Yingzhe Peng, Xu Yang, Ruoxi Cheng, Haiyang Xu, Ming Yan, Fei Huang
https://arxiv.org/abs/2507.08710
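For context, the original CLIPScore (Hessel et al., 2021) is a reference-free embedding metric: w · max(cos(image, caption), 0) with w = 2.5. The sketch below implements that formulation on toy vectors; L-CLIPScore's lightweight encoder and its use during training are described in the paper itself.

```python
import numpy as np

def clipscore(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore = w * max(cos(image, caption), 0)."""
    cos = float(np.dot(image_emb, caption_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    return w * max(cos, 0.0)

# Toy embeddings; in practice these come from a (lightweight) CLIP-style encoder.
img = np.array([0.2, 0.9, 0.1])
cap = np.array([0.25, 0.85, 0.05])
print(round(clipscore(img, cap), 3))
```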
The Siberian flying squirrel is a key species for the future of taiga forests. A recent genetic study reveals surprising features of the flying squirrel's evolution, as well as serious concerns for the species' conservation. A distinct subspecies may live in the Far East. https://www.helsinki.fi/fi/uutiset/evoluut
DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval
Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov
https://arxiv.org/abs/2507.08360
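A generic two-stage retrieval sketch of the kind named in the title, not necessarily the DS@GT pipeline: a cheap lexical first stage (BM25) produces candidates and a cross-encoder re-ranks them. It assumes the rank_bm25 and sentence-transformers packages are installed; the checkpoint name is a common public cross-encoder used only for illustration.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

docs = [
    "Temporal drift degrades web search relevance over time.",
    "BM25 is a classic lexical ranking function.",
    "Cross-encoders score query-document pairs jointly.",
]
query = "how does relevance change over time in web search"

# Stage 1: BM25 over whitespace-tokenized documents, keep a small shortlist.
bm25 = BM25Okapi([d.lower().split() for d in docs])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank the shortlist with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, docs[i]) for i in candidates])
ranked = [docs[i] for _, i in sorted(zip(pair_scores, candidates), reverse=True)]
print(ranked[0])
```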
LLMs are now part of our daily work, making coding easier. Join Ivan Dolgov at this year's Berlin Buzzwords to learn how JetBrains built an in-house LLM for AI code completion in its products, covering design choices, data preparation, training, and model evaluation.
Learn more: https://
Anthropic details how it built its multi-agent Claude Research system, claiming significant improvements in internal evaluations over single-agent systems (Anthropic)
https://www.anthropic.com/engineering/built-multi-agent-research-system
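The post describes an orchestrator-worker design: a lead agent decomposes a research question into subtasks, parallel subagents pursue them, and the lead synthesizes a report. The sketch below mimics that shape with stub functions; it is not Anthropic's code, and a real system would back each function with an LLM plus tool access.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_subqueries(question: str) -> list[str]:
    # Stub planner: a lead agent would generate these with an LLM.
    return [f"{question} -- background", f"{question} -- recent results"]

def research_subagent(subquery: str) -> str:
    # Stub worker: would run searches and summarize findings.
    return f"findings for: {subquery}"

def run_research(question: str) -> str:
    subqueries = plan_subqueries(question)
    # Fan out subqueries to parallel workers.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        findings = list(pool.map(research_subagent, subqueries))
    # The lead agent would synthesize these into a final report.
    return "\n".join(findings)

print(run_research("How are LLM agents evaluated?"))
```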
Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework
Zishan Xu, Shuyi Xie, Qingsong Lv, Shupei Xiao, Linlin Song, Sui Wenjuan, Fan Lin
https://arxiv.org/abs/2507.08459
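One way to picture error attribution inside an evaluation harness (a hypothetical sketch, not the paper's framework): a judge tags each failed answer with an error category, and the harness reports per-category counts alongside accuracy. ERROR_CATEGORIES and judge are illustrative placeholders here.

```python
from collections import Counter

# Illustrative taxonomy; the paper defines its own error categories.
ERROR_CATEGORIES = ["factual error", "reasoning error",
                    "instruction violation", "formatting error"]

def judge(question: str, answer: str, gold: str) -> str | None:
    """Stub judge: None if correct, else one of ERROR_CATEGORIES.
    A real framework would use human annotators or an LLM judge."""
    return None if answer.strip() == gold.strip() else "factual error"

def evaluate(items: list[dict]) -> dict:
    errors = Counter()
    correct = 0
    for item in items:
        category = judge(item["question"], item["answer"], item["gold"])
        if category is None:
            correct += 1
        else:
            errors[category] += 1
    return {"accuracy": correct / len(items),
            "error_breakdown": dict(errors)}

items = [
    {"question": "2+2?", "answer": "4", "gold": "4"},
    {"question": "Capital of France?", "answer": "Lyon", "gold": "Paris"},
]
print(evaluate(items))  # accuracy 0.5, one attributed error
```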