Tootfinder

@arXiv_csCL_bot@mastoxiv.page
2025-08-22 12:38:52

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[3/3]:
- CRISPR-GPT for Agentic Automation of Gene-editing Experiments
Qu, Huang, Yin, Zhan, Liu, Yin, Cousins, Johnson, Wang, Shah, Altman, Zhou, Wang, Cong

@cdarwin@c.im
2025-06-20 19:15:25

In its December 2023 lawsuit against OpenAI, The New York Times produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories.
In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”
But is it actually a fringe behavior?
And have leading AI companies addressed it?
New research—focusing on books rather than newspaper articles and on different compa…

Study: Meta AI model can reproduce almost half of Harry Potter book
The research could have big implications for generative AI copyright lawsuits.

@arXiv_csSE_bot@mastoxiv.page
2025-06-23 09:16:00

Evaluating the Use of LLMs for Documentation to Code Traceability
Ebube Alor, SayedHassan Khatoonabadi, Emad Shihab
https://arxiv.org/abs/2506.16440 https:…

Evaluating the Use of LLMs for Documentation to Code Traceability
Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key…

@arXiv_csCR_bot@mastoxiv.page
2025-07-22 07:53:50

Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design
Richard M. Charles, James H. Curry, Richard B. Charles
https://arxiv.org/abs/2507.14207

Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design
The integration of Large Language Models (LLMs) in K-12 education offers both transformative opportunities and emerging risks. This study explores how students may Trojanize prompts to elicit unsafe or unintended outputs from LLMs, bypassing established content moderation systems with safety guardrails. Through a systematic experiment involving simulated K-12 queries and multi-turn dialogues, we expose key vulnerabilities in GPT-3.5 and GPT-4. This paper presents our experimental design, detai…

@arXiv_csCV_bot@mastoxiv.page
2025-08-15 10:25:22

Performance of GPT-5 in Brain Tumor MRI Reasoning
Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, Xiaofeng Yang
https://arxiv.org/abs/2508.10865 https://…

Performance of GPT-5 in Brain Tumor MRI Reasoning
Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (…

@Techmeme@techhub.social
2025-08-01 18:25:51

Source: GPT-5 improvements won't be comparable to the leaps in performance of earlier models, such as between GPT-3 in 2020 and GPT-4 in 2023 (The Information)
https://www.theinformation.com/articles/inside-openais-rocky-path-gpt-5

Inside OpenAI’s Rocky Path to GPT-5
OpenAI made waves across the industry in December when it published the results from its tests of artificial intelligence that performs better on tasks when it gets more time and computing power to process them. The results implied ChatGPT customers were about to be blown away by what the new AI ...

@ErikJonker@mastodon.social
2025-06-07 08:07:20

Interesting, "GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter."
https://venturebeat.com/ai/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-met…

How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell
Using a clever solution, researchers find GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

@arXiv_csSE_bot@mastoxiv.page
2025-08-21 09:32:00

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
Abbas Sabra, Olivier Schmitt, Joseph Tyler
https://arxiv.org/abs/2508.14727 https://

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
This study presents a quantitative evaluation of the code quality and security of five prominent Large Language Models (LLMs): Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce…

@arXiv_csCY_bot@mastoxiv.page
2025-07-16 07:41:31

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz
https://arxiv.org/abs/2507.10576

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and the reasons for it are underexplored. We evaluated several open-source and proprietary LLMs -- including GPT-series, Anthropic, DeepSeek, and Llama-3 variants -- on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Amazon Web Services (AWS) Llama 3.1 8B lagged at 0.50 accura…

@jdrm@social.linux.pizza
2025-08-06 09:04:05

We used to laugh that Reagan asked a psychic about policy decisions during his presidency. Well, in Sweden they are on version 3.0 of consulting an oracle: https://www.theguardian.com/technology/2025/aug/05/chat-gpt-sw…

‘We didn’t vote for ChatGPT’: Swedish PM under fire for using AI in role
Tech experts criticise Ulf Kristersson as newspaper accuses him of falling for ‘the oligarchs’ AI psychosis’

@arXiv_csCL_bot@mastoxiv.page
2025-08-21 08:31:50

Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Badrinath Ramakrishnan, Akshaya Balaji
https://arxiv.org/abs/2508.14062 https://

Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2,…

@usul@piaille.fr
2025-06-11 11:31:32

Focus and Context and LLMs | Taras' Blog on AI, Perf, Hacks
#AI

Focus and Context and LLMs
I decided to write down some thoughts on agentic coding and why it’s a very hyped wrong turn. Let me start with some background on my LLM experience. I adopted LLMs into my work in Aug 2020. I was sold when I saw that GPT-3 could generate usable SQL statements. Something that used to take 4-8 hours of RTFMing, now took 15min. I have since worked on chatcraft.org, various RAG frameworks, etc. I use aider heavily for work, frequently switch models, have been struggling with tool calling during …

@arXiv_csCV_bot@mastoxiv.page
2025-07-03 10:32:10

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
https://arxiv.org/abs/2507.01955

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using establ…

@arXiv_csHC_bot@mastoxiv.page
2025-08-01 09:22:31

Exploring LLM-generated Culture-specific Affective Human-Robot Tactile Interaction
Qiaoqiao Ren, Tony Belpaeme
https://arxiv.org/abs/2507.22905 https://arx…

Exploring LLM-generated Culture-specific Affective Human-Robot Tactile Interaction
As large language models (LLMs) become increasingly integrated into robotic systems, their potential to generate socially and culturally appropriate affective touch remains largely unexplored. This study investigates whether LLMs (specifically GPT-3.5, GPT-4, and GPT-4o) can generate culturally adaptive tactile behaviours to convey emotions in human-robot interaction. We produced text-based touch descriptions for 12 distinct emotions across three cultural contexts (Chinese, Belgian, and unspeci…

@jonippolito@digipres.club
2025-07-02 12:40:23

I built a free tool to help students compare the energy/water use of AI tasks—like a 3-sec video gen or 500-word GPT reply—to everyday ones like Netflix, Google, or cloud storage. Try it at https://what-uses-more.com
Adjust variables like prompt complexity or the energy source and climate of local …

What Uses More? Compare the environmental footprint of digital tasks
Compare the energy and water footprint of AI and digital activities.

@arXiv_csSE_bot@mastoxiv.page
2025-06-18 08:59:19

Quality Assessment of Python Tests Generated by Large Language Models
Victor Alves, Carla Bezerra, Ivan Machado, Larissa Rocha, Tássio Virgínio, Publio Silva
https://arxiv.org/abs/2506.14297

Quality Assessment of Python Tests Generated by Large Language Models
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions. Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and Llama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Te…

@ErikJonker@mastodon.social
2025-08-09 18:02:14

GPT-5 may be slightly disappointing, Genie 3 demo blew me away... Watch it.
#ai

@arXiv_csAI_bot@mastoxiv.page
2025-08-06 09:49:50

Can Large Language Models Bridge the Gap in Environmental Knowledge?
Linda Smail (College of Interdisciplinary Studies, Zayed University, UAE), David Santandreu Calonge (Department of Academic Development, Mohamed bin Zayed University of Artificial Intelligence, UAE), Firuz Kamalov (School of Engineering, Applied Science, Technology, Canadian University Dubai, UAE), Nur H. Orak (Department of Environmental Engineering, Marmara University, Türkiye)

Can Large Language Models Bridge the Gap in Environmental Knowledge?
This research investigates the potential of Artificial Intelligence (AI) models to bridge the knowledge gap in environmental education among university students. By focusing on prominent large language models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, the study assesses their effectiveness in conveying environmental concepts and, consequently, facilitating environmental education. The investigation employs a standardized tool, the Environmental Knowledge Test (EK…

@arXiv_csPL_bot@mastoxiv.page
2025-08-07 12:56:16

Replaced article(s) found for cs.PL. https://arxiv.org/list/cs.PL/new
[1/1]:
- RTLCoder: Outperforming GPT-3.5 in Design RTL Generation with Our Open-Source Dataset and Lightwe...
Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, Zhiyao Xie

@arXiv_csCY_bot@mastoxiv.page
2025-06-09 07:25:02

Can LLMs Talk 'Sex'? Exploring How AI Models Handle Intimate Conversations
Huiqian Lai
https://arxiv.org/abs/2506.05514 https://

Can LLMs Talk 'Sex'? Exploring How AI Models Handle Intimate Conversations
This study examines how four prominent large language models (Claude 3.7 Sonnet, GPT-4o, Gemini 2.5 Flash, and Deepseek-V3) handle sexually oriented requests through qualitative content analysis. By evaluating responses to prompts ranging from explicitly sexual to educational and neutral control scenarios, the research reveals distinct moderation paradigms reflecting fundamentally divergent ethical positions. Claude 3.7 Sonnet employs strict and consistent prohibitions, while GPT-4o navigates u…

@arXiv_csIR_bot@mastoxiv.page
2025-06-10 07:52:42

FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Zhongliang Yang, Linna Zhou, Yu Li
https://arxiv.org/abs/2506.06335

FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
In natural language processing (NLP), the focus has shifted from encoder-only tiny language models like BERT to decoder-only large language models (LLMs) such as GPT-3. However, LLMs' practical application in the financial sector has revealed three limitations: (1) LLMs often perform worse than fine-tuned BERT on discriminative tasks, such as market sentiment analysis in financial reports, despite consuming far more computational resources; (2) Application on generative tasks heavily relies on r…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 17:52:02

This https://arxiv.org/abs/2505.18889 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

Security Concerns for Large Language Models: A Survey
Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbat…

@arXiv_physicsedph_bot@mastoxiv.page
2025-08-13 08:59:32

The Boiling-Frog Problem of Physics Education
Gerd Kortemeyer
https://arxiv.org/abs/2508.08842 https://arxiv.org/pdf/2508.08842

The Boiling-Frog Problem of Physics Education
It is astonishing how rapidly general-purpose AI has crossed familiar thresholds in introductory physics. Comparing outputs from successive models, GPT-5 Thinking moves far beyond the plug-and-chug tendencies seen earlier: on a classic elevator problem it works symbolically, notes when variables cancel, and verifies results; attempts to prompt novice-like behavior mainly affect tone, not method. On representation translation, the model scores 24/26 (92.3%) on TUG-Kv4.0. In a card-sorting proxy …

@arXiv_csCY_bot@mastoxiv.page
2025-06-03 07:20:41

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoub far, Seyed Amir Ahmad Safavi-Naini, Ali Soroush
https://arxiv.org/abs/2506.00072

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Exper…

@arXiv_csSE_bot@mastoxiv.page
2025-07-14 08:37:21

Leveraging Large Language Models for Classifying App Users' Feedback
Yasaman Abedini, Abbas Heydarnoori
https://arxiv.org/abs/2507.08250 https://

Leveraging Large Language Models for Classifying App Users' Feedback
In recent years, significant research has been conducted into classifying application (app) user feedback, primarily relying on supervised machine learning algorithms. However, fine-tuning more generalizable classifiers based on existing labeled datasets remains an important challenge, as creating large and accurately labeled datasets often requires considerable time and resources. In this paper, we evaluate the capabilities of four advanced LLMs, including GPT-3.5-Turbo, GPT-4, Flan-T5, and Ll…

@arXiv_csHC_bot@mastoxiv.page
2025-08-08 08:43:02

Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction
Amit Kumar Das, Mohammad Tarun, Klaus Mueller
https://arxiv.org/abs/2508.04842 https:/…

Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction
This paper evaluates the visualization literacy of modern Large Language Models (LLMs) and introduces a novel prompting technique called Charts-of-Thought. We tested three state-of-the-art LLMs (Claude-3.7-sonnet, GPT-4.5 preview, and Gemini-2.0-pro) on the Visualization Literacy Assessment Test (VLAT) using standard prompts and our structured approach. The Charts-of-Thought method guides LLMs through a systematic data extraction, verification, and analysis process before answering visualizatio…

@arXiv_csCL_bot@mastoxiv.page
2025-06-12 09:20:52

Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages
Amel Muminovic, Amela Kadric Muminovic
https://arxiv.org/abs/2506.09992

Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages
Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, G…

@arXiv_csAI_bot@mastoxiv.page
2025-08-11 09:30:00

Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee
https://arxiv.org/abs/2508.06145

Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI's GPT-4o-mini as the base model, and the text-embedding-3-small model for…

@arXiv_csSE_bot@mastoxiv.page
2025-06-13 08:08:42

Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements
Seyed Moein Abtahi, Akramul Azim
https://arxiv.org/abs/2506.10330

Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements
This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) such as OpenAI's GPT-3.5 Turbo and GPT-4o into software development workflows. A static code analysis framework detects issues such as bugs, vulnerabilities, and code smells within a large-scale software project. Detailed information on each issue was extracted and organized to facilitate automated code revision using LLMs. An iterative prompt engineering process is applied to ensure tha…

@arXiv_csCY_bot@mastoxiv.page
2025-06-05 07:16:45

Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner
https://arxiv.org/abs/2506.03655

Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify signific…

@arXiv_csAI_bot@mastoxiv.page
2025-06-03 07:21:03

Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu, Bingbing Wen, Bin Han, Robert Wolfe, Lucy Lu Wang, Bill Howe
https://arxiv.org/abs/2506.00582

Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: less sensitive to task difficulty, and when prompted to answer based on different personas -- e.g., expert vs laym…

@arXiv_csCY_bot@mastoxiv.page
2025-07-29 10:11:51

The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter
https://arxiv.org/abs/2507.20018

The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centr…

@arXiv_csSE_bot@mastoxiv.page
2025-06-10 10:11:13

Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
E. G. Santana Jr, Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida
https://arxiv.org/abs/2506.07594

Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Pyth…

@arXiv_csCL_bot@mastoxiv.page
2025-07-28 13:02:38

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[1/3]:
- Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: ...
Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru
