How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
https://arxiv.org/abs/2507.01955

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using establ…
Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner
https://arxiv.org/abs/2506.03655

Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify signific…
Based on media output, GPT is the world’s most successful business writing influencer
How Morgan Stanley is using its DevGen.AI tool, built in-house on OpenAI's GPT models, to translate legacy code into modern coding languages (Isabelle Bousquette/Wall Street Journal)
https://www.wsj.com/article…
From what I hear, it is possible to frustrate GPT chatbots to the point that they uninstall themselves and can thus be removed from all projects where they are unwanted.
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
https://arxiv.org/abs/2506.03147
HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval
Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, Dawn Lawrie
https://arxiv.org/abs/2506.22356 …
Evaluation of LLMs for mathematical problem solving
Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra
https://arxiv.org/abs/2506.00309
An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models
Po-Chieh Yu
#toXiv_bot_toot
Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach
Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati
https://arxiv.org/abs/2505.23953
"They ran the bare job titles through GPT, without looking at the details of the specific jobs, and got the chatbot to guess what those titles would have meant. Then they decided the chatbot could do most of the jobs. They were, after all, using the chatbot to do their job."
Rule: your job can successfully be taken over by a chatbot if it comes with no accountability.
Last leg on our brief history of NLP (so far) is the advent of large language models with GPT-3 in 2020 and the introduction of learning from the prompt (aka few-shot learning).
T. B. Brown et al. (2020). Language models are few-shot learners. NeurIPS 2020
https://…
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu, Bingbing Wen, Bin Han, Robert Wolfe, Lucy Lu Wang, Bill Howe
https://arxiv.org/abs/2506.00582
This essay (ht @… ) offers a lot to chew on: some gems, some flubs, some quibblable provocations, some big insights. This sentence in particular stood out to me (context for it in the screenshot):
“Whether we’re reading or conversing, we want something to be meant, not just said.”
https://slate.com/life/2025/06/ai-chatgpt-generator-grok-gemini-writing.html
I have a problem… on the one hand, I'd like to "click away" at these bots (because I know they pay for GPT chats etc.) … so they can't keep going… on the other hand, I know it costs energy… and thirdly… if I click too hard, the business numbers won't add up… a tough dilemma
Replaced article(s) found for physics.ed-ph. https://arxiv.org/list/physics.ed-ph/new
[1/1]:
- Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassin...
Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Pete…
Pitfalls of Evaluating Language Models with Open Benchmarks
Md. Najib Hasan (Wichita State University), Mohammad Fakhruddin Babar (Washington State University), Souvika Sarkar (Wichita State University), Monowar Hasan (Washington State University), Santu Karmaker (University of Central Florida)
https://arxiv.org/abs/2507.00460…
In its December 2023 lawsuit against OpenAI, The New York Times produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories.
In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”
But is it actually a fringe behavior?
And have leading AI companies addressed it?
New research—focusing on books rather than newspaper articles and on different compa…
Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation
Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, Luciana Zaina
https://arxiv.org/abs/2506.16345
I tried out³ the idea¹ from @… (or rather his daughter) with my dissertation²: works great for exploring the "strategies" of different chatbots.
• Gemini suggests all sorts of⁴ dissertations by people whose names are similar to, but not exactly, mine.
• GPT-4o first makes up a title and then⁴ claims …
I'm interviewed about AI and disinformation on Ekot. The Russian disinformation network Pravda mass-publishes millions of articles with pro-Russian content, aiming to manipulate AI chatbots such as ChatGPT or Copilot.
https://www.sverigesradio.se/artikel/chatt
Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons
Chi Chiu So, Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, Chun Pong Chau
https://arxiv.org/abs/2506.23128
Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoub far, Seyed Amir Ahmad Safavi-Naini, Ali Soroush
https://arxiv.org/abs/2506.00072

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Exper…
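The evaluation grid described above can be sketched as a simple Cartesian product; this is a minimal illustration, not the paper's code, and the excerpt only names three of the axes (the paper's 156 configurations evidently include factors truncated here):

```python
from itertools import product

# Axes taken from the abstract; any further axes are elided in the excerpt.
models = ["GPT-4o", "o3-mini", "Llama-3.3-70b", "Llama-3.1-8b", "DeepSeek-v3"]
temperatures = [0.3, 0.7, 1.0]
prompt_styles = ["Chain-of-Thought", "Few-Shot", "Emotional", "Expert"]

# Each configuration pairs one model with one temperature and one style.
configurations = list(product(models, temperatures, prompt_styles))
print(len(configurations))  # 60 from these three axes alone
```

Enumerating the grid this way makes it easy to verify that the listed axes alone cannot account for all 156 configurations.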
Identifying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study
Gábor Antal, Bence Bogenfürst, Rudolf Ferenc, Péter Hegedűs
https://arxiv.org/abs/2506.11561
Learning to Regulate: A New Event-Level Dataset of Capital Control Measures
Geyue Sun, Xiao Liu, Tomas Williams, Roberto Samaniego
https://arxiv.org/abs/2505.23025
Detection of Personal Data in Structured Datasets Using a Large Language Model
Albert Agisha Ntwali, Luca Rück, Martin Heckmann
https://arxiv.org/abs/2506.22305
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain: A CoT-Enhanced Prompt Engineering Approach
Wenqi Guan, Yang Fang
https://arxiv.org/abs/2506.15512
The World As Large Language Models See It: Exploring the reliability of LLMs in representing geographical features
Omid Reza Abbasi, Franz Welscher, Georg Weinberger, Johannes Scholz
https://arxiv.org/abs/2506.00203
Landbase, whose GPT-4o-based AI tool automates outreach marketing, raised a $30M Series A co-led by Ashton Kutcher's Sound Ventures and Picus Capital (Julie Bort/TechCrunch)
https://techcrunch.com/2025/06/12/how-
Replaced article(s) found for cs.DL. https://arxiv.org/list/cs.DL/new
[1/1]:
- Web Archives Metadata Generation with GPT-4o: Challenges and Insights
Ashwin Nair, Zhen Rong Goh, Tianrui Liu, Abigail Yongping Huang
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Jesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera, Fabrizio Petrini
https://arxiv.org/abs/2506.15006
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Fanxiao Li, Jiaying Wu, Canyuan He, Wei Zhou
https://arxiv.org/abs/2505.23449
Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
https://arxiv.org/abs/2506.06137

Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerabi…
In-context learning for the classification of manipulation techniques in phishing emails
Antony Dalmiere (LAAS-TRUST, LAAS), Guillaume Auriol (LAAS-TRUST, INSA Toulouse), Vincent Nicomette (LAAS-TSF, LAAS), Pascal Marchand (LERASS)
https://arxiv.org/abs/2506.22515
Leveraging GPT-4 for Vulnerability-Witnessing Unit Test Generation
Gábor Antal, Dénes Bán, Martin Isztin, Rudolf Ferenc, Péter Hegedűs
https://arxiv.org/abs/2506.11559
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Tran…
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
Madjid G. Tehrani, Eldar Sultanow, William J. Buchanan, Mahkame Houmani, Christel H. Djaha Fodja
https://arxiv.org/abs/2506.15212
Exploring Cultural Variations in Moral Judgments with Large Language Models
Hadi Mohammadi, Efthymia Papadopoulou, Yasmeen F. S. S. Meijer, Ayoub Bagheri
https://arxiv.org/abs/2506.12433
Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Soumya Madireddy, Lu Gao, Zia Din, Kinam Kim, Ahmed Senouci, Zhe Han, Yunpeng Zhang
https://arxiv.org/abs/2506.20551
Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav
https://arxiv.org/abs/2506.13805…
AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
Chen Bao, Chuanbing Huo, Qinyu Chen, Chang Gao
https://arxiv.org/abs/2506.06566
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
https://arxiv.org/abs/2506.15681
Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer's and Dementia Caregivers
Jiayue Melissa Shi, Dong Whi Yoo, Keran Wang, Violeta J. Rodriguez, Ravi Karkar, Koustuv Saha
https://arxiv.org/abs/2506.15047
No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation
Brian D. Zimmerman, Julien Aubert-Béduchaud, Florian Boudin, Akiko Aizawa, Olga Vechtomova
https://arxiv.org/abs/2506.08196
Exploring MLLMs Perception of Network Visualization Principles
Jacob Miller, Markus Wallinger, Ludwig Felder, Timo Brand, Henry Förster, Johannes Zink, Chunyang Chen, Stephen Kobourov
https://arxiv.org/abs/2506.14611
Surgeons Awareness, Expectations, and Involvement with Artificial Intelligence: a Survey Pre and Post the GPT Era
Lorenzo Arboit, Dennis N. Schneider, Toby Collins, Daniel A. Hashimoto, Silvana Perretta, Bernard Dallemagne, Jacques Marescaux, EAES Working Group, Nicolas Padoy, Pietro Mascagni
https://arxiv.org/abs/2506.08258
Automatic Large Language Models Creation of Interactive Learning Lessons
Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn Ocumpaugh, Ryan S. Baker, Kenneth R. Koedinger
https://arxiv.org/abs/2506.17356
Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Bram Adams, Ahmed E. Hassan
https://arxiv.org/abs/2506.13538
MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution
Yibo Wang, Zhihao Peng, Ying Wang, Zhao Wei, Hai Yu, Zhiliang Zhu
https://arxiv.org/abs/2506.12728
FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Zhongliang Yang, Linna Zhou, Yu Li
https://arxiv.org/abs/2506.06335
Quality Assessment of Python Tests Generated by Large Language Models
Victor Alves, Carla Bezerra, Ivan Machado, Larissa Rocha, Tássio Virgínio, Publio Silva
https://arxiv.org/abs/2506.14297
Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
E. G. Santana Jr, Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida
https://arxiv.org/abs/2506.07594