
2025-09-11 10:31:00
Interesting, a lab that wants to build open source (!) attracts a lot of funding 🤔
Reflection AI raises $2B to be America's open frontier AI lab, challenging DeepSeek | TechCrunch https://techcrunch.com/2025/10/09/reflection-…
Malaysian startup Zetrix unveils NurAI, a chatbot for Muslims built using similar techniques to DeepSeek's V3, and plans AI avatars of Islamic scholars (Saritha Rai/Bloomberg)
https://www.bloomberg.com/news/articles/2025-0…
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
https://
Reflection AI, which is developing an open-source AI model to compete with DeepSeek, raised $2B led by Nvidia, valuing it at $8B, up from $545M in March (Michael J. de la Merced/New York Times)
https://www.nytimes.com/2025/10/09/business/dealbook/ref…
Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi, Mohit Gupta, Rahul Bouri
https://arxiv.org/abs/2509.09234 https://arxiv.org/pdf/2509.09…
Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures
Siddharth Shah, Amit Gupta, Aarav Mann, Alexandre Vaz, Benjamin E. Caldwell, Robert Scholz, Peter Awad, Rocky Allemandi, Doug Faust, Harshita Banka, Tony Rousmaniere
https://arxiv.org/abs/2509.08839
DeepSeek cuts API prices by 50 percent and unveils V3.2-Exp
The Chinese startup DeepSeek has unveiled its experimental AI model V3.2-Exp and cut its API prices by more than 50 percent.
ht…
SemiAnalysis launches InferenceMAX, an open-source benchmark that automatically tracks LLM inference performance across AI models and frameworks every night (Kimbo Chen/SemiAnalysis)
https://newsletter.semianalysis.com/p/inferencemax-open-source-inference
Base Models Know How to Reason, Thinking Models Learn When
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
https://arxiv.org/abs/2510.07364 https…
NOWJ@COLIEE 2025: A Multi-stage Framework Integrating Embedding Models and Large Language Models for Legal Retrieval and Entailment
Hoang-Trung Nguyen, Tan-Minh Nguyen, Xuan-Bach Le, Tuan-Kiet Le, Khanh-Huyen Nguyen, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong, Le-Minh Nguyen
https://arxiv.org/abs/2509.08025
An Iterative LLM Framework for SIBT utilizing RAG-based Adaptive Weight Optimization
Zhuo Xiao (Image Processing Center, Beihang University, Beijing, China), Qinglong Yao (Image Processing Center, Beihang University, Beijing, China), Jingjing Wang (Image Processing Center, Beihang University, Beijing, China), Fugen Zhou (Image Processing Center, Beihang University, Beijing, China), Bo Liu (Image Processing Center, Beihang University, Beijing, China), Haitao Sun (Department of Radiation…
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
https://arxiv.org/abs/2510.08189
K2-Think: A Parameter-Efficient Reasoning System
Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
The UAE's Institute of Foundation Models open sources its K2 Think model, trained on only ~2,000 AI chips and designed for math, coding, and science research (Cade Metz/New York Times)
https://www.nytimes.com/2025/09/09/technology/uae-emirates-ai-open-sourc…
Transforming Fashion with AI: A Comparative Study of Large Language Models in Apparel Design
Nusrat Jahan Lamia, Sadia Afrin Mim
https://arxiv.org/abs/2509.04705 https://…
Current pricing remains in effect until the effective date
📈 Service Resources
Service resources have been scaled up to better meet API demand. Users are encouraged to make use of the enhanced service.
📚 Documentation
For more details, visit DeepSeek API Docs:
https://api-docs.deepseek.com
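The docs above describe an OpenAI-compatible interface; a minimal sketch of a call against it is shown below. The base URL and model name are assumptions taken from the public docs at api-docs.deepseek.com and should be checked there before use.

from openai import OpenAI

# Placeholder key; the base_url is the assumed OpenAI-compatible endpoint (verify in the docs).
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-chat",  # assumed chat model id; "deepseek-reasoner" is the documented thinking-mode id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)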
Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems
Everton Guimaraes, Nathalia Nascimento, Chandan Shivalingaiah, Asish Nelapati
https://arxiv.org/abs/2508.03931
Sources: DeepSeek plans to use Huawei's Ascend AI chips to train smaller versions of its upcoming R2 models but will still use Nvidia chips for largest models (The Information)
https://www.theinformation.com/articles/deepseek-opts-huawei-chips-train-models…
KI-Update kompakt: ShadowLeak, Nvidia & OpenAI, Siemens, DeepSeek
Das "KI-Update" liefert drei mal pro Woche eine Zusammenfassung der wichtigsten KI-Entwicklungen.
https://www.
It's not that I'm against this, which I'm not, quite the contrary.
But honestly I don't see how the applicability of these things can be guaranteed.
It's probably just a lack of imagination on my part.
https://jornaleconomico.sapo.p…
DeepSeek performs better than other Large Language Models in Dental Cases
Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang
https://arxiv.org/abs/2509.02036
Don't miss today's Metacurity for the most critical infosec developments you should know, including
--A bona fide self-replicating worm has infected 187 npm packages,
--BreachForums founder hit with new three-year sentence,
--Coinbase breach suspect accused of participating in $500k bribery scheme,
--DHS intelligence arm exposed sensitive database,
--MSFT seized 338 sites linked to Raccoon0365 stealer,
--DeepSeek is biased against Falun Gong and oth…
DeepSeek releases DeepSeek-V3.2-Exp, saying it built the model using a new technique called DeepSeek Sparse Attention, and halves the pricing of its tools (Saritha Rai/Bloomberg)
https://www.bloomberg.com/news/articles/2025-09-2…
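The announcement names DeepSeek Sparse Attention (DSA) without describing it here. As a rough illustration of sparse attention in general, not DSA's actual mechanism, the sketch below keeps only the top-k highest-scoring keys per query and masks out the rest; the function name and the top-k selection rule are illustrative assumptions.

import torch

def topk_sparse_attention(q, k, v, keep=64):
    # q: (heads, q_len, d); k, v: (heads, kv_len, d)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5      # full attention scores
    keep = min(keep, scores.shape[-1])
    top = scores.topk(keep, dim=-1)                            # select the `keep` best keys per query
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top.indices, top.values)               # everything else stays at -inf
    weights = masked.softmax(dim=-1)                           # non-selected keys get zero weight
    return weights @ v

q, k, v = torch.randn(8, 16, 64), torch.randn(8, 128, 64), torch.randn(8, 128, 64)
print(topk_sparse_attention(q, k, v).shape)                    # torch.Size([8, 16, 64])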
xDeepServe: Model-as-a-Service on Huawei CloudMatrix384
Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, Bo Wang, Bo Xu, Boyi Hou, Chan Yang, Changhong Liu, Cheng Cui, Chenyu Zhu, Cong Feng, Daohui Wang, Dayun Lin, Duo Zhao, Fengshao Zou, Fu Wang, Gangqiang Zhang, Gengyuan Dan, Guanjie Chen, Guodong Guan, Guodong Yang, Haifeng Li, Haipei Zhu, Hao Feng, Hao Huang, Hao Xu, Hengrui Ma, Hengtao Fan, Hui Liu, Jia Li, Jiang Liu, Jiang Xu, Jie Meng, Jinhan Xin, Junhao Hu, Juwe…
The geopolitical #aiarmsrace seems largely unimpressed by people proclaiming #LLMs have plateaued and #AGI is never coming.
Such assessments are only relevant for the market, but not so much for count…
Compare and buy tulips! One monthly subscription to get the very best tulip colours for all your tulip needs!
https://store.boingboing.net/sales/chatplayground-ai-basic-plan-lifetime-subscriptions
Z.ai, formerly known as Zhipu and which has raised $1.5B from Tencent and others, releases GLM-4.5, an open source AI model that's cheaper to use than DeepSeek (Evelyn Cheng/CNBC)
https://www.cnbc.com/2025/07/28/chinas-latest-ai-…
Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection
Damian Gnieciak, Tomasz Szandala
https://arxiv.org/abs/2508.04448 htt…
Overview of the Plagiarism Detection Task at PAN 2025
André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast
https://arxiv.org/abs/2510.06805
CoComposer: LLM Multi-agent Collaborative Music Composition
Peiwen Xing, Aske Plaat, Niki van Stein
https://arxiv.org/abs/2509.00132 https://arxiv.org/pdf/…
How much of my children's future is AI going to burn up? That depends on how much we feed the hype beast. *That* is why "don't use AI at all without mentioning the drawbacks & a very good reason" is my stance (and I'm an AI researcher, technically).
Local models that run on your laptop: acceptable if produced by ethical means (including data sourcing & compensation for data filtering) & training costs are mitigated. Are such models way worse than the huge datacenter-scale models? Yes, for now. Deal with it.
ChatGPT, Claude, Copilot, even DeepSeek: get out. You're feeding the beast that is consuming my kids' future. Heck, even talking up these models, or saying "everyone is using them so it's okay" or "they're not going away", is feeding the beast even if you don't touch them.
I wish it weren't like this, because the capabilities of the big models are cool even once you cut past the hype.
#AI
A reporter details how her mother, a kidney transplant patient who lives in China, bonded with DeepSeek's chatbot as her AI doctor, calling it "more humane" (Viola Zhou/Rest of World)
https://restofworld.org/2025/ai-chatbot-china-sick/
Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Shubham Sharma, Sneha Tuli, Narendra Badam
https://arxiv.org/abs/2508.21377
Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
Arina Caraus, Alessio Buscemi, Sumit Kumar, Ion Turcanu
https://arxiv.org/abs/2509.21949 ht…
Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities
Georges Sfeir, Gabriel Nova, Stephane Hess, Sander van Cranenburgh
https://arxiv.org/abs/2507.21790
How an open-source approach helped DeepSeek and other Chinese AI companies; Hugging Face: Alibaba's Qwen is now the world's largest open-source AI ecosystem (South China Morning Post)
https://www.scmp.com/tech/big-tech/article
Will Compute Bottlenecks Prevent an Intelligence Explosion?
Parker Whitfill, Cheryl Wu
https://arxiv.org/abs/2507.23181 https://arxiv.org/pdf/2507.23181
Very nice article about LLM architecture, a bit too complicated for me but probably not for others..
https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
Effectiveness of Large Language Models in Simulating Regional Psychological Structures: An Empirical Examination of Personality and Subjective Well-being
Ke Luoma, Li Zengyi, Liao Jiangqun, Tong Song, Peng Kaiping
https://arxiv.org/abs/2509.25283
Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs
Julian Geheeb, Farhan Abid Ivan, Daniel Dyrda, Miriam Anschütz, Georg Groh
https://arxiv.org/abs/2509.24730
Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks
Rina Mishra, Gaurav Varshney
https://arxiv.org/abs/2507.12185
Despite Musk's claim Apple "makes it impossible" for non-OpenAI AI apps to top its App Store, DeepSeek was #1 in January and Perplexity was #1 in July in India (Henry Chandonnet/Business Insider)
https://www.businessinsider.com/elon-musk-
Chinese Livestreaming 'Virtual Human' Salespeople Are Outselling Their Human Counterparts https://www.404media.co/chinese-livestreaming-virtual-human-salespeople-are-outselling-their-human-counterparts/
SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
Qinglin Wang, Zhihong Sun, Ruyun Wang, Tao Huang, Zhi Jin, Ge Li, Chen Lyu
https://arxiv.org/abs/2509.24507
SLIM: Subtrajectory-Level Elimination for More Effective Reasoning
Xifeng Yao, Chengyuan Ma, Dongyu Lang, Yinhao Ni, Zhiwei Xu, Huarui Xie, Zihao Chen, Guang Shen, Dandan Tu, Yi Bai, Changzheng Zhang
https://arxiv.org/abs/2508.19502
TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
https://arxiv.org/abs/2509.21081
In a peer-reviewed Nature article, DeepSeek says it has spent $294,000 on training its R1 model and used 512 Nvidia H800 chips (Eduardo Baptista/Reuters)
https://www.reuters.com/world/china/chinas-deepseek-says-its-hit-ai-model-cos…
Prompt Engineering and the Effectiveness of Large Language Models in Enhancing Human Productivity
Rizal Khoirul Anam
https://arxiv.org/abs/2507.18638 https://
Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity
Domenico Cotroneo, Cristina Improta, Pietro Liguori
https://arxiv.org/abs/2508.21634
Hallucinating with AI: AI Psychosis as Distributed Delusions
Lucy Osler
https://arxiv.org/abs/2508.19588 https://arxiv.org/pdf/2508.19588
Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang, Edith Aurora Graf
https://arxiv.org/abs/2508.09932 https://arxiv.org/pd…
Tesla plans to roll out in-car voice assistant features powered by DeepSeek and ByteDance's Doubao in China; Tesla vehicles in the US use Grok (Linda Lew/Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-22/tesla-t…
🚀 #DeepSeek #API Upgraded to V3.1 with Dual-Mode Support & #Anthropic Compatibility
#ai …
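"Anthropic compatibility" presumably means the standard Anthropic SDK can be pointed at a DeepSeek endpoint. A hedged sketch under that assumption follows; the base URL, route, and model name are assumptions to verify against api-docs.deepseek.com.

from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_DEEPSEEK_API_KEY",                # placeholder
    base_url="https://api.deepseek.com/anthropic",  # assumed Anthropic-compatible route
)

msg = client.messages.create(
    model="deepseek-chat",                          # assumed model id
    max_tokens=256,
    messages=[{"role": "user", "content": "What does dual-mode support mean here?"}],
)
print(msg.content[0].text)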
Huawei says DeepSeek-R1-Safe, which was trained on 1,000 of its Ascend AI chips, is "nearly 100% successful" in preventing politically sensitive topics (Eduardo Baptista/Reuters)
https://www.reuters.com/business/media-tel
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Rick Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra
https://arxiv.org/abs/2509.24945
DeepSeek details V3.1 and says it surpasses R1 on key benchmarks and is customized to work with next-gen Chinese-made AI chips, after unveiling it on August 19 (Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-21/deep…
From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
Khairul Alam, Banani Roy
https://arxiv.org/abs/2507.20122 https://
Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning
Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel
https://arxiv.org/abs/2509.22472
DeepSeek releases V3.1, adding a longer context window, with few other details; Chinese local media blames CEO Liang Wenfeng's perfectionism for R2's delay (Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-19…
Thought Purity: Defense Paradigm For Chain-of-Thought Attack
Zihao Xue, Zhen Bi, Long Ma, Zhenlin Hu, Yan Wang, Zhenfang Liu, Qing Sheng, Jie Xiao, Jungang Lou
https://arxiv.org/abs/2507.12314
How AI has transformed data center design, with concerns about overspending on AI infrastructure, sparked by DeepSeek, fading amid the ongoing building frenzy (Financial Times)
https://ig.ft.com/ai-data-centres/
The Emperor's New Chain-of-Thought: Probing Reasoning Theater Bias in Large Reasoning Models
Qian Wang, Yubo Fan, Zhenheng Tang, Nuo Chen, Wenxuan Wang, Bingsheng He
https://arxiv.org/abs/2507.13758
A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski
https://arxiv.org/abs/2507.08621
Huawei reports H1 2025 revenue up 3.9% YoY to ~$58.5B, driven by soaring AI compute demand and a rebound in phone sales, and net profit down 32% YoY to ~$5.2B (Bloomberg)
https://www.bloomberg.com/news/articles/2025-08-29/de…
Artificial Finance: How AI Thinks About Money
Orhan Erdem, Ragavi Pobbathi Ashok
https://arxiv.org/abs/2507.10933 https://arxiv.org/p…
DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin
https://arxiv.org/abs/2508.17803
Chinese AI chip designer Cambricon reports 44-fold revenue growth and a profit of ~$144M in H1 2025, after Beijing encouraged companies to use homegrown tech (Rachel Yeo/Bloomberg)
https://www.bloomberg.com/news/articles/20
A Study on Thinking Patterns of Large Reasoning Models in Code Generation
Kevin Halim, Sin G. Teo, Ruitao Feng, Zhenpeng Chen, Yang Gu, Chong Wang, Yang Liu
https://arxiv.org/abs/2509.13758
Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz
https://arxiv.org/abs/2507.10576
The Impact of Language Mixing on Bilingual LLM Reasoning
Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar
https://arxiv.org/abs/2507.15849 htt…
Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams
Zane Witherspoon, Thet Mon Aye, YingYing Hao
https://arxiv.org/abs/2508.09036 https…
Jensen Huang hailed AI models from DeepSeek, Alibaba, and Tencent as "world class" at a Beijing expo and said US licenses for H20 chips "will come very fast" (Reuters)
https://www.reuters.com/world/china/nvidias-huang-hail…
A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
Kian Tohidi, Kia Dashtipour, Simone Rebora, Sevda Pourfaramarz
https://arxiv.org/abs/2509.14922
Sources: DeepSeek R2's launch delay is due to training issues on Huawei Ascend chips, prompting a switch to Nvidia chips for training and Huawei's for inference (Financial Times)
https://www.ft.com/content/eb984646-6320-4bfe-a78d-a1da2274b092
Punctuation and Predicates in Language Models
Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, Nandi Schoots
https://arxiv.org/abs/2508.14067 https://
Chinese open-source AI models from DeepSeek, Alibaba's Qwen, and others gaining global traction spurs US policymakers and companies to respond (Raffaele Huang/Wall Street Journal)
https://www.wsj.com/tech/ai/chinas-l…
Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
Krishna Garg, Firoz Shaikh, Sambaran Bandyopadhyay, Cornelia Caragea
https://arxiv.org/abs/2508.14273
Moonshot's Kimi K2 uses a 1T-parameter MoE architecture with 32B active parameters and outperforms models like GPT-4.1 and DeepSeek-V3 on key benchmarks (Michael Nuñez/VentureBeat)
https://venturebeat.com/ai/moonshot-ais-kimi-k2-outperfor…
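For context on the "1T parameters, 32B active" figures: in a mixture-of-experts layer, a learned router sends each token to a small top-k subset of experts, so only those experts' weights participate in that token's forward pass. The toy layer below uses made-up sizes purely to illustrate the routing; it is not K2's actual architecture.

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)    # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)    # top_k experts chosen per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():    # only the chosen experts ever run
                sel = idx[:, slot] == e
                out[sel] += weights[sel, slot].unsqueeze(-1) * self.experts[e](x[sel])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)                  # torch.Size([4, 64])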
Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
Maciej Skorski, Alina Landowska
https://arxiv.org/abs/2508.13804 https://
Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities
Payam Latifi
https://arxiv.org/abs/2509.12098
Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing, Zimu Wang, Wei Wang, Haiyang Zhang
https://arxiv.org/abs/2509.12876 https://
The Few-shot Dilemma: Over-prompting Large Language Models
Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
https://arxiv.org/abs/2509.13196 https://
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
https://arxiv.org/abs/2507.13300
HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong
Sirui Han, Junqi Zhu, Ruiyuan Zhang, Yike Guo
https://arxiv.org/abs/2507.11502 http…