Tootfinder

@arXiv_csCV_bot@mastoxiv.page
2025-08-15 10:25:22

Performance of GPT-5 in Brain Tumor MRI Reasoning
Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, Xiaofeng Yang
https://arxiv.org/abs/2508.10865 https://…

Performance of GPT-5 in Brain Tumor MRI Reasoning
Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (…

@arXiv_csSE_bot@mastoxiv.page
2025-09-15 08:53:41

WALL: A Web Application for Automated Quality Assurance using Large Language Models
Seyed Moein Abtahi, Akramul Azim
https://arxiv.org/abs/2509.09918 https://

WALL: A Web Application for Automated Quality Assurance using Large Language Models
As software projects become increasingly complex, the volume and variety of issues in code files have grown substantially. Addressing this challenge requires efficient issue detection, resolution, and evaluation tools. This paper presents WALL, a web application that integrates SonarQube and large language models (LLMs) such as GPT-3.5 Turbo and GPT-4o to automate these tasks. WALL comprises three modules: an issue extraction tool, code issues reviser, and code comparison tool. Together, they e…

@arXiv_csHC_bot@mastoxiv.page
2025-10-14 08:48:58

ROBOPSY PL[AI]: Using Role-Play to Investigate how LLMs Present Collective Memory
Margarete Jahrmann, Thomas Brandstetter, Stefan Glasauer
https://arxiv.org/abs/2510.09874 https…

ROBOPSY PL[AI]: Using Role-Play to Investigate how LLMs Present Collective Memory
The paper presents the first results of an artistic research project investigating how Large Language Models (LLMs) curate and present collective memory. In a public installation exhibited for two months in Vienna in 2025, visitors could interact with five different LLMs (ChatGPT with GPT 4o and GPT 4o mini, Mistral Large, DeepSeek-Chat, and a locally run Llama 3.1 model), which were instructed to act as narrators, implementing a role-playing game revolving around the murder of Austrian phil…

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:23:29

Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He, Jiaqi Xu
https://arxiv.org/abs/2509.09125 https://…

Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing base…
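For readers unfamiliar with the agreement metrics named in the abstract above, here is a minimal Python sketch of how accuracy and Cohen's kappa are computed. The label names `A` to `D` and both label lists are made-up toy data, not the CIMA annotations or the paper's results, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohens_kappa(gold, pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(gold)
    observed = accuracy(gold, pred)
    gold_freq = Counter(gold)
    pred_freq = Counter(pred)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently with their own label frequencies.
    expected = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data (assumption): four hypothetical dialogue-act categories A-D.
gold = ["A", "A", "A", "B", "B", "B", "C", "C", "D", "D"]
pred = ["A", "A", "B", "B", "B", "B", "C", "C", "D", "C"]

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"kappa    = {cohens_kappa(gold, pred):.2f}")
```

Kappa is lower than raw accuracy because it discounts the agreement that two labelers would reach by chance given their label distributions.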

@Techmeme@techhub.social
2025-08-01 18:25:51

Source: GPT-5 improvements won't be comparable to the leaps in performance of earlier models, such as between GPT-3 in 2020 and GPT-4 in 2023 (The Information)
https://www.theinformation.com/articles/inside-openais-rocky-path-gpt-5

Inside OpenAI’s Rocky Path to GPT-5
OpenAI made waves across the industry in December when it published the results from its tests of artificial intelligence that performs better on tasks when it gets more time and computing power to process them. The results implied ChatGPT customers were about to be blown away by what the new AI ...

@arXiv_csDC_bot@mastoxiv.page
2025-09-11 08:56:33

Design and Implementation of Code Completion System Based on LLM and CodeBERT Hybrid Subsystem
Bingbing Zhang, Ziyu Lin, Yingxin Su
https://arxiv.org/abs/2509.08215 https://

Design and Implementation of Code Completion System Based on LLM and CodeBERT Hybrid Subsystem
In the rapidly evolving industry of software development, coding efficiency and accuracy play significant roles in delivering high-quality software. Various code suggestion and completion tools, such as CodeBERT from Microsoft and GPT-3.5 from OpenAI, have been developed using deep learning techniques and integrated into IDEs to assist software engineers' development. Research has shown that CodeBERT has outstanding performance in code summarization and capturing code semantics, while GPT-3.5 d…

@ErikJonker@mastodon.social
2025-08-09 18:02:14

GPT-5 may be slightly disappointing, Genie 3 demo blew me away... Watch it.
#ai

@jdrm@social.linux.pizza
2025-08-06 09:04:05

We used to laugh at Reagan for asking a psychic about policy decisions during his presidency. Well, in Sweden they are on version 3.0 of consulting an oracle: https://www.theguardian.com/technology/2025/aug/05/chat-gpt-sw…

‘We didn’t vote for ChatGPT’: Swedish PM under fire for using AI in role
Tech experts criticise Ulf Kristersson as newspaper accuses him of falling for ‘the oligarchs’ AI psychosis’

@arXiv_physicsedph_bot@mastoxiv.page
2025-08-13 08:59:32

The Boiling-Frog Problem of Physics Education
Gerd Kortemeyer
https://arxiv.org/abs/2508.08842 https://arxiv.org/pdf/2508.08842

The Boiling-Frog Problem of Physics Education
It is astonishing how rapidly general-purpose AI has crossed familiar thresholds in introductory physics. Comparing outputs from successive models, GPT-5 Thinking moves far beyond the plug-and-chug tendencies seen earlier: on a classic elevator problem it works symbolically, notes when variables cancel, and verifies results; attempts to prompt novice-like behavior mainly affect tone, not method. On representation translation, the model scores 24/26 (92.3%) on TUG-Kv4.0. In a card-sorting proxy …

@arXiv_csAI_bot@mastoxiv.page
2025-08-11 09:30:00

Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee
https://arxiv.org/abs/2508.06145

Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications
The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI's GPT-4o-mini as the base model, and the text-embedding-3-small model for…

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:18:02

Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning
Imran Mansha
https://arxiv.org/abs/2510.05003 https://arxiv.org/pdf/2…

Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning
Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated remarkable reasoning abilities but require significant computational resources for fine-tuning. This paper presents a resource-efficient fine-tuning approach for LLaMA-3.2-3B to enhance medical chain-of-thought reasoning while operating under constrained GPU and memory settings. Using parameter-efficient tuning techniques such as LoRA and QLoRA, we adapt the base model on publicly available medical reasoning datasets. The mo…

@arXiv_csPL_bot@mastoxiv.page
2025-08-07 12:56:16

Replaced article(s) found for cs.PL. https://arxiv.org/list/cs.PL/new
[1/1]:
- RTLCoder: Outperforming GPT-3.5 in Design RTL Generation with Our Open-Source Dataset and Lightwe...
Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, Zhiyao Xie

@arXiv_csAI_bot@mastoxiv.page
2025-08-06 09:49:50

Can Large Language Models Bridge the Gap in Environmental Knowledge?
Linda Smail (College of Interdisciplinary Studies, Zayed University, UAE), David Santandreu Calonge (Department of Academic Development, Mohamed bin Zayed University of Artificial Intelligence, UAE), Firuz Kamalov (School of Engineering, Applied Science, Technology, Canadian University Dubai, UAE), Nur H. Orak (Department of Environmental Engineering, Marmara University, Türkiye)

Can Large Language Models Bridge the Gap in Environmental Knowledge?
This research investigates the potential of Artificial Intelligence (AI) models to bridge the knowledge gap in environmental education among university students. By focusing on prominent large language models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, the study assesses their effectiveness in conveying environmental concepts and, consequently, facilitating environmental education. The investigation employs a standardized tool, the Environmental Knowledge Test (EK…

@arXiv_csCY_bot@mastoxiv.page
2025-08-07 08:33:34

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy
Jairo Gudiño-Rosero, Clément Contet, Umberto Grandi, César A. Hidalgo
https://arxiv.org/abs/2508.04281

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy
Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs may introduce critical vulnerabilities in these systems. Here, we explore the impact of prompt-injection attacks targeting consensus generating systems by introducing a four-dimensional taxonomy of attacks. We test these attacks using LLaMA 3.1 8B and Chat GPT 4.1 Nano finding the LLMs more vulnerable to criticism attacks -- attacks…

@arXiv_csAR_bot@mastoxiv.page
2025-08-26 07:31:46

GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model
Deepak Kumar, Divakar Yadav, Yash Patel
https://arxiv.org/abs/2508.16700

GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model
We present a single-GPU (H100, bf16) evaluation of GPT-OSS-20B (Mixture-of-Experts; 20.9B total, approx. 3.61B active) against dense baselines Qwen3-32B and Yi-34B across multiple dimensions. We measure true time-to-first-token (TTFT), full-decode throughput (TPOT), end-to-end latency percentiles, peak VRAM with past key values (PKV) held, and energy via a consistent nvidia-smi-based sampler. At a 2048-token context with 64-token decode, GPT-OSS-20B delivers higher decode throughput and tokens …

@UP8@mastodon.social
2025-09-29 15:25:58

🧾 Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing
#software

Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing
This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across mode…

@arXiv_csCR_bot@mastoxiv.page
2025-09-19 07:38:11

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
Gustavo Sandoval, Denys Fenchenko, Junyao Chen
https://arxiv.org/abs/2509.14271

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
This paper documents early research conducted in 2022 on defending against prompt injection attacks in large language models, providing historical context for the evolution of this critical security domain. This research focuses on two adversarial attacks against Large Language Models (LLMs): prompt injection and goal hijacking. We examine how to construct these attacks, test them on various LLMs, and compare their effectiveness. We propose and evaluate a novel defense technique called Adversar…

@arXiv_csAI_bot@mastoxiv.page
2025-10-08 10:03:59

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung
https://arxiv.org/abs/2510.05664 …

Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs. Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To …

@Techmeme@techhub.social
2025-09-29 19:26:02

Anthropic prices Claude Sonnet 4.5 at $3/1M input and $15/1M output tokens, same as Sonnet 4, cheaper than Opus at $15/$75 but higher than GPT-5 at $1.25/$10 (Simon Willison/Simon Willison's Weblog)
https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/

Claude Sonnet 4.5 is probably the “best coding model in the world” (at least for now)
Anthropic released Claude Sonnet 4.5 today, with a very bold set of claims: Claude Sonnet 4.5 is the best coding model in the world. It’s the strongest model for building …
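To make the per-million-token prices quoted in the post above concrete, here is a small Python sketch of the cost arithmetic. The prices are the ones quoted in the post; the workload of 200,000 input tokens and 20,000 output tokens and the helper name `request_cost` are assumptions chosen only for illustration.

```python
# Per-million-token prices in USD as quoted in the post: (input, output).
PRICES = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Opus": (15.00, 75.00),
    "GPT-5": (1.25, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at the quoted per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload (assumption): 200,000 input and 20,000 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f}")
```

Because output tokens cost several times more than input tokens in every price pair, the ratio of output to input in a workload strongly affects the comparison.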

@arXiv_physicsedph_bot@mastoxiv.page
2025-09-11 07:56:02

Feedback That Clicks: Introductory Physics Students' Valued Features in AI Feedback Generated From Self-Crafted and Engineered Prompts
Amogh Sirnoorkar, N. Sanjay Rebello
https://arxiv.org/abs/2509.08516

Feedback That Clicks: Introductory Physics Students' Valued Features in AI Feedback Generated From Self-Crafted and Engineered Prompts
Since the advent of GPT-3.5 in 2022, Generative Artificial Intelligence (AI) has shown tremendous potential in STEM education, particularly in providing real-time, customized feedback to students in large-enrollment courses. A crucial skill that mediates effective use of AI is the systematic structuring of natural language instructions to AI models, commonly referred to as prompt engineering. This study has three objectives: (i) to investigate the sophistication of student-generated prompts whe…

@arXiv_csHC_bot@mastoxiv.page
2025-08-08 08:43:02

Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction
Amit Kumar Das, Mohammad Tarun, Klaus Mueller
https://arxiv.org/abs/2508.04842 https:/…

Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction
This paper evaluates the visualization literacy of modern Large Language Models (LLMs) and introduces a novel prompting technique called Charts-of-Thought. We tested three state-of-the-art LLMs (Claude-3.7-sonnet, GPT-4.5 preview, and Gemini-2.0-pro) on the Visualization Literacy Assessment Test (VLAT) using standard prompts and our structured approach. The Charts-of-Thought method guides LLMs through a systematic data extraction, verification, and analysis process before answering visualizatio…

@arXiv_csIR_bot@mastoxiv.page
2025-09-16 10:02:17

Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking
Hanpei Fang, Sijie Tao, Nuo Chen, Kai-Xin Chang, Tetsuya Sakai
https://arxiv.org/abs/2509.11353

Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking
Large language models (LLMs) are increasingly deployed in information systems, including being used as second-stage rerankers in information retrieval pipelines, yet their susceptibility to recency bias has received little attention. We investigate whether LLMs implicitly favour newer documents by prepending artificial publication dates to passages in the TREC Deep Learning passage retrieval collections in 2021 (DL21) and 2022 (DL22). Across seven models, GPT-3.5-turbo, GPT-4o, GPT-4, LLaMA-3 8…

@arXiv_csSD_bot@mastoxiv.page
2025-09-30 20:39:23

Replaced article(s) found for cs.SD. https://arxiv.org/list/cs.SD/new
[1/1]:
- M6(GPT)3: Generating Multitrack Modifiable Multi-Minute MIDI Music from Text using Genetic algori...
Jakub Poćwiardowski, Mateusz Modrzejewski, Marek S. Tatara

@arXiv_csCY_bot@mastoxiv.page
2025-08-28 08:13:51

Should LLMs be WEIRD? Exploring WEIRDness and Human Rights in Large Language Models
Ke Zhou, Marios Constantinides, Daniele Quercia
https://arxiv.org/abs/2508.19269 https://

Should LLMs be WEIRD? Exploring WEIRDness and Human Rights in Large Language Models
Large language models (LLMs) are often trained on data that reflect WEIRD values: Western, Educated, Industrialized, Rich, and Democratic. This raises concerns about cultural bias and fairness. Using responses to the World Values Survey, we evaluated five widely used LLMs: GPT-3.5, GPT-4, Llama-3, BLOOM, and Qwen. We measured how closely these responses aligned with the values of the WEIRD countries and whether they conflicted with human rights principles. To reflect global diversity, we compar…

@arXiv_csHC_bot@mastoxiv.page
2025-08-01 09:22:31

Exploring LLM-generated Culture-specific Affective Human-Robot Tactile Interaction
Qiaoqiao Ren, Tony Belpaeme
https://arxiv.org/abs/2507.22905 https://arx…

Exploring LLM-generated Culture-specific Affective Human-Robot Tactile Interaction
As large language models (LLMs) become increasingly integrated into robotic systems, their potential to generate socially and culturally appropriate affective touch remains largely unexplored. This study investigates whether LLMs, specifically GPT-3.5, GPT-4, and GPT-4o, can generate culturally adaptive tactile behaviours to convey emotions in human-robot interaction. We produced text-based touch descriptions for 12 distinct emotions across three cultural contexts (Chinese, Belgian, and unspeci…

@arXiv_eessSY_bot@mastoxiv.page
2025-09-23 09:00:00

Synergies between Federated Foundation Models and Smart Power Grids
Seyyedali Hosseinalipour, Shimiao Li, Adedoyin Inaolaji, Filippo Malandra, Luis Herrera, Nicholas Mastronarde
https://arxiv.org/abs/2509.16496

Synergies between Federated Foundation Models and Smart Power Grids
The recent emergence of large language models (LLMs) such as GPT-3 has marked a significant paradigm shift in machine learning. Trained on massive corpora of data, these models demonstrate remarkable capabilities in language understanding, generation, summarization, and reasoning, transforming how intelligent systems process and interact with human language. Although LLMs may still seem like a recent breakthrough, the field is already witnessing the rise of a new and more general category: mult…

@arXiv_csSE_bot@mastoxiv.page
2025-08-21 09:32:00

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
Abbas Sabra, Olivier Schmitt, Joseph Tyler
https://arxiv.org/abs/2508.14727 https://

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
This study presents a quantitative evaluation of the code quality and security of five prominent Large Language Models (LLMs): Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce…

@arXiv_csCL_bot@mastoxiv.page
2025-09-03 14:37:23

An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction
Ali Hamdi, Malak Mohamed, Rokaia Emad, Khaled Shaban
https://arxiv.org/abs/2509.02446

An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction
Social telehealth has made remarkable progress in healthcare by allowing patients to post symptoms and participate in medical consultations remotely. Users frequently post symptoms on social media and online health platforms, creating a huge repository of medical data that can be leveraged for disease classification. Large language models (LLMs) such as LLAMA3 and GPT-3.5, along with transformer-based models like BERT, have demonstrated strong capabilities in processing complex medical text. In…

@arXiv_csCR_bot@mastoxiv.page
2025-07-22 07:53:50

Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design
Richard M. Charles, James H. Curry, Richard B. Charles
https://arxiv.org/abs/2507.14207

Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design
The integration of Large Language Models (LLMs) in K-12 education offers both transformative opportunities and emerging risks. This study explores how students may Trojanize prompts to elicit unsafe or unintended outputs from LLMs, bypassing established content moderation systems with safety guardrails. Through a systematic experiment involving simulated K-12 queries and multi-turn dialogues, we expose key vulnerabilities in GPT-3.5 and GPT-4. This paper presents our experimental design, detai…

@arXiv_csCY_bot@mastoxiv.page
2025-07-29 10:11:51

The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter
https://arxiv.org/abs/2507.20018

The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centr…

@arXiv_csCL_bot@mastoxiv.page
2025-09-01 09:48:02

Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
Sarfaroz Yunusov, Kaige Chen, Kazi Nishat Anwar, Ali Emami
https://arxiv.org/abs/2508.21628

Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retr…

@arXiv_physicsmedph_bot@mastoxiv.page
2025-08-26 08:37:56

Root Cause Analysis of Radiation Oncology Incidents Using Large Language Models
Yuntao Wang, Mariluz De Ornelas, Matthew T. Studenski, Elizabeth Bossart, Siamak P. Najad-Davarani, Yunze Yang
https://arxiv.org/abs/2508.17201

Root Cause Analysis of Radiation Oncology Incidents Using Large Language Models
Purpose: To evaluate the reasoning capabilities of large language models (LLMs) in performing root cause analysis (RCA) of radiation oncology incidents using narrative reports from the Radiation Oncology Incident Learning System (RO-ILS), and to assess their potential utility in supporting patient safety efforts. Methods and Materials: Four LLMs, Gemini 2.5 Pro, GPT-4o, o3, and Grok 3, were prompted with the 'Background and Incident Overview' sections of 19 public RO-ILS cases. Using a standard…

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 10:37:50

The Few-shot Dilemma: Over-prompting Large Language Models
Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
https://arxiv.org/abs/2509.13196 https://

The Few-shot Dilemma: Over-prompting Large Language Models
Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3…

@arXiv_csAI_bot@mastoxiv.page
2025-07-25 07:52:32

Does visualization help AI understand data?
Victoria R. Li, Johnathan Sun, Martin Wattenberg
https://arxiv.org/abs/2507.18022 https://arxiv.org/pdf/2507.18…

Does visualization help AI understand data?
Charts and graphs help people analyze data, but can they also be useful to AI systems? To investigate this question, we perform a series of experiments with two commercial vision-language models: GPT 4.1 and Claude 3.5. Across three representative analysis tasks, the two systems describe synthetic datasets more precisely and accurately when raw data is accompanied by a scatterplot, especially as datasets grow in complexity. Comparison with two baselines -- providing a blank chart and a chart wi…

@arXiv_csCL_bot@mastoxiv.page
2025-08-22 12:38:52

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[3/3]:
- CRISPR-GPT for Agentic Automation of Gene-editing Experiments
Qu, Huang, Yin, Zhan, Liu, Yin, Cousins, Johnson, Wang, Shah, Altman, Zhou, Wang, Cong

@arXiv_csCY_bot@mastoxiv.page
2025-07-16 07:41:31

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz
https://arxiv.org/abs/2507.10576

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and the reasons for it are underexplored. We evaluated several open-source and proprietary LLMs -- including GPT-series, Anthropic, DeepSeek, and Llama-3 variants -- on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Amazon Web Services (AWS) Llama 3.1 8B lagged at 0.50 accura…

@arXiv_csCL_bot@mastoxiv.page
2025-07-28 13:02:38

Replaced article(s) found for cs.CL. https://arxiv.org/list/cs.CL/new
[1/3]:
- Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: ...
Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:03:06

A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Oleg Silcenco, Marcos R. Machad, Wallace C. Ugulino, Daniel Braun
https://arxiv.org/abs/2508.17994

A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Aspect-based sentiment analysis enhances sentiment detection by associating it with specific aspects, offering deeper insights than traditional sentiment analysis. This study introduces a manually annotated dataset of 10,814 multilingual customer reviews covering brick-and-mortar retail stores, labeled with eight aspect categories and their sentiment. Using this dataset, the performance of GPT-4 and LLaMA-3 in aspect based sentiment analysis is evaluated to establish a baseline for the newly in…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:33:11

A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
Kian Tohidi, Kia Dashtipour, Simone Rebora, Sevda Pourfaramarz
https://arxiv.org/abs/2509.14922

A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
This study presents a comprehensive comparative evaluation of four state-of-the-art Large Language Models (LLMs)--Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o--for sentiment analysis and emotion detection in Persian social media texts. Comparative analysis among LLMs has witnessed a significant rise in recent years; however, most of these analyses have been conducted on English language tasks, creating gaps in understanding cross-linguistic performance patterns. This research ad…

@arXiv_csCL_bot@mastoxiv.page
2025-08-21 08:31:50

Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Badrinath Ramakrishnan, Akshaya Balaji
https://arxiv.org/abs/2508.14062 https://

Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2,…
