Tootfinder

No exact results. Similar results found.

@tagesschau_eil@mastodon.social
2025-10-28 16:28:04

📢 Netanjahu ordnet neue Angriffe im Gazastreifen an
Der israelische Ministerpräsident Benjamin Netanjahu hat die Armee angewiesen, sofort "intensive Angriffe" im Gazastreifen auszuführen. Dies teilte das Büro Netanjahus mit. Seit rund zwei Wochen gilt im Gazastreifen eine Waffenruhe.
➡️

Netanjahu ordnet neue Angriffe im Gazastreifen an
Der israelische Ministerpräsident Benjamin Netanjahu hat die Armee angewiesen, sofort "intensive Angriffe" im Gazastreifen auszuführen. Dies teilte das Büro Netanjahus mit. Seit rund zwei Wochen gilt im Gazastreifen eine Waffenruhe.

@kurtsh@mastodon.social
2025-12-24 18:08:26

Got a Surface Laptop Studio 2? Or just need a 100W USB-Power Delivery charger?
My new favorite charger is the "Anker Prime Charger, 160W 3-Port (140W on a single port). Perfect for a power-hungry laptop smartphone.
https://www.amazon.com/Anker-Charger-Compa

Amazon.com: Anker Prime Charger, 160W 3-Port Compact USB C GaN Charger Block, Smart Display and Touch Control, 140W Max Charging via Any Single Port, for MacBook, Laptop, iPhone 17/16/15 Series, iPad(Non-Battery) : Cell Phones & Accessories
Buy Anker Prime Charger, 160W 3-Port Compact USB C GaN Charger Block, Smart Display and Touch Control, 140W Max Charging via Any Single Port, for MacBook, Laptop, iPhone 17/16/15 Series, iPad(Non-Battery): Wall Chargers - Amazon.com ✓ FREE DELIVERY possible on eligible purchases

@pavelasamsonov@mastodon.social
2025-12-26 20:18:55

2025 has only intensified the struggles of past years. The challenge for us is not "how do we pull through?" but "how do we make a 'new normal' work for us?"
The answer, of course, is not to simply try harder. To prevent 2026 from going like 2025, we need to deliberately re-evaluate our own goals, and our relationships with our jobs.
Learn more in the final issue of the Product Picnic for the year — and if you like it, please subscribe!

Bouncing back from burnout
The causes of burnout are systemic. You can't get out of it just by trying harder.

@holger_moller@bildung.social
2025-12-28 07:56:23

Demokratien können sich wehren:
"Demokratische Institutionen haben [im Fall eines Angriffs] zwei Optionen. Entweder sie nehmen sich zurück, um nicht als parteiisch oder spaltend wahrgenommen zu werden. Oder sie gehen in den Gegenangriff über – trotz der Gefahr, ihre Kompetenzen zu überschreiten und ihren Ruf als unabhängige Institutionen zu verlieren."
Benjamin Hindrichs
Was davon geschieht gerade bei uns in

Demokratien können sich wehren! Wie genau, zeigt Brasilien
Mordpläne und ein Putschversuch: In Brasilien versuchte Ex-Präsident Bolsonaro, die Macht an sich zu reißen. Doch der Staat wehrt sich. Drei Lektionen, die Europa jetzt braucht.

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:31:50

ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
https://arxiv.org/abs/2510.09351

ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a…

@arXiv_csCV_bot@mastoxiv.page
2025-10-15 10:50:41

VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
A. Alfarano (University of Zurich, Max Planck Society), L. Venturoli (University of Zurich, Max Planck Society), D. Negueruela del Castillo (University of Zurich, Max Planck Society)
https://arxiv.org/abs/2510.12750

VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statisti…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:27:09

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Dayy\'an O'Brien, Barry Haddow, Emily Allaway, Pinzhen Chen
https://arxiv.org/abs/2510.05962

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
Conducting contamination-free evaluation of mathematical capabilities can be difficult for two reasons: models may memorize a test set once it is made public, and current mathematical benchmarks are prone to overfitting due to having limited diversity of symbols and rules, coupled with closed-ended answers. This paper proposes a method to leverage these shortcomings as useful features to a construct dynamic, counterfactual benchmark, which can be used to both reveal overfitting and measure true…

@arXiv_csAI_bot@mastoxiv.page
2025-10-06 09:20:39

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park
https://arxiv.org/abs/2510.02837

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for ev…

@arXiv_csCV_bot@mastoxiv.page
2025-10-13 10:24:30

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
https://arxiv.org/abs/2510.09256

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography …

@arXiv_csAI_bot@mastoxiv.page
2025-10-03 09:29:31

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
Zeshi Dai, Zimo Peng, Zerui Cheng, Ryan Yihe Li
https://arxiv.org/abs/2510.00332

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets as a testbed where $30 billion was lost to exploits in 2024, we evaluate 17 models on 178 time-anchore…

Tootfinder

Opt-in global Mastodon full text search. Join the index!