
2025-09-18 16:36:00
Huawei says DeepSeek-R1-Safe, which was trained on 1,000 of its Ascend AI chips, is "nearly 100% successful" in preventing discussion of politically sensitive topics (Eduardo Baptista/Reuters)
https://www.reuters.com/business/media-tel
DeepSeek details V3.1 and says it surpasses R1 on key benchmarks and is customized to work with next-gen Chinese-made AI chips, after unveiling it on August 19 (Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-21/deep…
The Emperor's New Chain-of-Thought: Probing Reasoning Theater Bias in Large Reasoning Models
Qian Wang, Yubo Fan, Zhenheng Tang, Nuo Chen, Wenxuan Wang, Bingsheng He
https://arxiv.org/abs/2507.13758
In a peer-reviewed Nature article, DeepSeek says it has spent $294,000 on training its R1 model and used 512 Nvidia H800 chips (Eduardo Baptista/Reuters)
https://www.reuters.com/world/china/chinas-deepseek-says-its-hit-ai-model-cos…
The geopolitical #aiarmsrace seems largely unimpressed by people proclaiming that #LLMs have plateaued and #AGI is never coming.
Such assessments are relevant for the market, but not so much for count…
A Study on Thinking Patterns of Large Reasoning Models in Code Generation
Kevin Halim, Sin G. Teo, Ruitao Feng, Zhenpeng Chen, Yang Gu, Chong Wang, Yang Liu
https://arxiv.org/abs/2509.13758
Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang, Edith Aurora Graf
https://arxiv.org/abs/2508.09932
MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, Wieland Brendel
https://arxiv.org/abs/2510.11653
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Rick Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra
https://arxiv.org/abs/2509.24945
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
https://
Base Models Know How to Reason, Thinking Models Learn When
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
https://arxiv.org/abs/2510.07364
SemiAnalysis launches InferenceMAX, an open-source benchmark that automatically tracks LLM inference performance across AI models and frameworks every night (Kimbo Chen/SemiAnalysis)
https://newsletter.semianalysis.com/p/inferencemax-open-source-inference
Overview of the Plagiarism Detection Task at PAN 2025
André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast
https://arxiv.org/abs/2510.06805
Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs
Julian Geheeb, Farhan Abid Ivan, Daniel Dyrda, Miriam Anschütz, Georg Groh
https://arxiv.org/abs/2509.24730
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
https://arxiv.org/abs/2510.08189
An Iterative LLM Framework for SIBT utilizing RAG-based Adaptive Weight Optimization
Zhuo Xiao, Qinglong Yao, Jingjing Wang, Fugen Zhou, Bo Liu (Image Processing Center, Beihang University, Beijing, China), Haitao Sun (Department of Radiation…
Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
Arina Caraus, Alessio Buscemi, Sumit Kumar, Ion Turcanu
https://arxiv.org/abs/2509.21949
DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin
https://arxiv.org/abs/2508.17803
SLIM: Subtrajectory-Level Elimination for More Effective Reasoning
Xifeng Yao, Chengyuan Ma, Dongyu Lang, Yinhao Ni, Zhiwei Xu, Huarui Xie, Zihao Chen, Guang Shen, Dandan Tu, Yi Bai, Changzheng Zhang
https://arxiv.org/abs/2508.19502
The Impact of Language Mixing on Bilingual LLM Reasoning
Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar
https://arxiv.org/abs/2507.15849