How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
https://arxiv.org/abs/2507.01955

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using establ…
Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner
https://arxiv.org/abs/2506.03655

Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify signific…
Based on media output, GPT is the world’s most successful business writing influencer
How Morgan Stanley is using its DevGen.AI tool, built in-house on OpenAI's GPT models, to translate legacy code into modern coding languages (Isabelle Bousquette/Wall Street Journal)
https://www.wsj.com/article…
From what I hear, it is possible to frustrate GPT chatbots to the point that they uninstall themselves and can thus be removed from all projects where they are unwanted.
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
https://arxiv.org/abs/2506.03147
HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval
Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, Dawn Lawrie
https://arxiv.org/abs/2506.22356 …
Evaluation of LLMs for mathematical problem solving
Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra
https://arxiv.org/abs/2506.00309
An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models
Po-Chieh Yu
#toXiv_bot_toot
Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach
Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati
https://arxiv.org/abs/2505.23953
"They ran the bare job titles through GPT, without looking at the details of the specific jobs, and got the chatbot to guess what those titles would have meant. Then they decided the chatbot could do most of the jobs. They were, after all, using the chatbot to do their job."
Rule: your job can successfully be taken over by a chatbot if it comes with no accountability.
Last leg on our brief history of NLP (so far) is the advent of large language models with GPT-3 in 2020 and the introduction of learning from the prompt (aka few-shot learning).
T. B. Brown et al. (2020). Language models are few-shot learners. NeurIPS 2020
https://…
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu, Bingbing Wen, Bin Han, Robert Wolfe, Lucy Lu Wang, Bill Howe
https://arxiv.org/abs/2506.00582
This essay (ht @… ) offers a lot to chew on: some gems, some flubs, some quibblable provocations, some big insights. This sentence in particular stood out to me (context for it in the screenshot):
“Whether we’re reading or conversing, we want something to be meant, not just said.”
https://slate.com/life/2025/06/ai-chatgpt-generator-grok-gemini-writing.html
I have a problem… on the one hand, I'd like to "click away" at these bots (because I know they pay for GPT chats etc.) … so they can't keep going… on the other hand, I know it costs energy… and thirdly… if I click too hard, the business numbers won't add up… a tough dilemma
Replaced article(s) found for physics.ed-ph. https://arxiv.org/list/physics.ed-ph/new
[1/1]:
- Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassin...
Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Pete…
Pitfalls of Evaluating Language Models with Open Benchmarks
Md. Najib Hasan (Wichita State University), Mohammad Fakhruddin Babar (Washington State University), Souvika Sarkar (Wichita State University), Monowar Hasan (Washington State University), Santu Karmaker (University of Central Florida)
https://arxiv.org/abs/2507.00460…
In its December 2023 lawsuit against OpenAI, The New York Times produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories.
In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”
But is it actually a fringe behavior?
And have leading AI companies addressed it?
New research—focusing on books rather than newspaper articles and on different compa…
Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation
Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, Luciana Zaina
https://arxiv.org/abs/2506.16345
I tried out³ the idea¹ from @… (or rather his daughter) with my dissertation²: works great for exploring the "strategies" of different chatbots.
• Gemini suggests all sorts of⁴ dissertations by people whose names are similar to, but not exactly, mine.
• GPT-4o first makes up a title and then⁴ claims …
I'm interviewed about AI and disinformation on Ekot. The Russian disinformation network Pravda mass-publishes millions of articles with pro-Russian content, aiming to manipulate AI chatbots such as ChatGPT or Copilot.
https://www.sverigesradio.se/artikel/chatt
Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons
Chi Chiu So, Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, Chun Pong Chau
https://arxiv.org/abs/2506.23128
Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoub far, Seyed Amir Ahmad Safavi-Naini, Ali Soroush
https://arxiv.org/abs/2506.00072

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs
This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Exper…
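The evaluation grid described above can be sketched as a simple Cartesian product; this is a minimal illustration, not the paper's code, and the excerpt only names three of the axes (the paper's 156 configurations evidently include factors truncated here):

```python
from itertools import product

# Axes taken from the abstract; any further axes are elided in the excerpt.
models = ["GPT-4o", "o3-mini", "Llama-3.3-70b", "Llama-3.1-8b", "DeepSeek-v3"]
temperatures = [0.3, 0.7, 1.0]
prompt_styles = ["Chain-of-Thought", "Few-Shot", "Emotional", "Expert"]

# Each configuration pairs one model with one temperature and one style.
configurations = list(product(models, temperatures, prompt_styles))
print(len(configurations))  # 60 from these three axes alone
```

Enumerating the grid this way makes it easy to verify that the listed axes alone cannot account for all 156 configurations.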
Identifying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study
Gábor Antal, Bence Bogenfürst, Rudolf Ferenc, Péter Hegedűs
https://arxiv.org/abs/2506.11561
Learning to Regulate: A New Event-Level Dataset of Capital Control Measures
Geyue Sun, Xiao Liu, Tomas Williams, Roberto Samaniego
https://arxiv.org/abs/2505.23025
Detection of Personal Data in Structured Datasets Using a Large Language Model
Albert Agisha Ntwali, Luca Rück, Martin Heckmann
https://arxiv.org/abs/2506.22305
Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain: A CoT-Enhanced Prompt Engineering Approach
Wenqi Guan, Yang Fang
https://arxiv.org/abs/2506.15512
The World As Large Language Models See It: Exploring the reliability of LLMs in representing geographical features
Omid Reza Abbasi, Franz Welscher, Georg Weinberger, Johannes Scholz
https://arxiv.org/abs/2506.00203
Landbase, whose GPT-4o-based AI tool automates outreach marketing, raised a $30M Series A co-led by Ashton Kutcher's Sound Ventures and Picus Capital (Julie Bort/TechCrunch)
https://techcrunch.com/2025/06/12/how-
Replaced article(s) found for cs.DL. https://arxiv.org/list/cs.DL/new
[1/1]:
- Web Archives Metadata Generation with GPT-4o: Challenges and Insights
Ashwin Nair, Zhen Rong Goh, Tianrui Liu, Abigail Yongping Huang
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Jesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera, Fabrizio Petrini
https://arxiv.org/abs/2506.15006
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Fanxiao Li, Jiaying Wu, Canyuan He, Wei Zhou
https://arxiv.org/abs/2505.23449
Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
https://arxiv.org/abs/2506.06137

Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerabi…
In-context learning for the classification of manipulation techniques in phishing emails
Antony Dalmiere (LAAS-TRUST, LAAS), Guillaume Auriol (LAAS-TRUST, INSA Toulouse), Vincent Nicomette (LAAS-TSF, LAAS), Pascal Marchand (LERASS)
https://arxiv.org/abs/2506.22515
Leveraging GPT-4 for Vulnerability-Witnessing Unit Test Generation
Gábor Antal, Dénes Bán, Martin Isztin, Rudolf Ferenc, Péter Hegedűs
https://arxiv.org/abs/2506.11559
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Tran…
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
Madjid G. Tehrani, Eldar Sultanow, William J. Buchanan, Mahkame Houmani, Christel H. Djaha Fodja
https://arxiv.org/abs/2506.15212
Exploring Cultural Variations in Moral Judgments with Large Language Models
Hadi Mohammadi, Efthymia Papadopoulou, Yasmeen F. S. S. Meijer, Ayoub Bagheri
https://arxiv.org/abs/2506.12433
Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Soumya Madireddy, Lu Gao, Zia Din, Kinam Kim, Ahmed Senouci, Zhe Han, Yunpeng Zhang
https://arxiv.org/abs/2506.20551
Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav
https://arxiv.org/abs/2506.13805…
AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
Chen Bao, Chuanbing Huo, Qinyu Chen, Chang Gao
https://arxiv.org/abs/2506.06566
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
https://arxiv.org/abs/2506.15681
Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer's and Dementia Caregivers
Jiayue Melissa Shi, Dong Whi Yoo, Keran Wang, Violeta J. Rodriguez, Ravi Karkar, Koustuv Saha
https://arxiv.org/abs/2506.15047
No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation
Brian D. Zimmerman, Julien Aubert-Béduchaud, Florian Boudin, Akiko Aizawa, Olga Vechtomova
https://arxiv.org/abs/2506.08196
Exploring MLLMs Perception of Network Visualization Principles
Jacob Miller, Markus Wallinger, Ludwig Felder, Timo Brand, Henry Förster, Johannes Zink, Chunyang Chen, Stephen Kobourov
https://arxiv.org/abs/2506.14611
Surgeons Awareness, Expectations, and Involvement with Artificial Intelligence: a Survey Pre and Post the GPT Era
Lorenzo Arboit, Dennis N. Schneider, Toby Collins, Daniel A. Hashimoto, Silvana Perretta, Bernard Dallemagne, Jacques Marescaux, EAES Working Group, Nicolas Padoy, Pietro Mascagni
https://arxiv.org/abs/2506.08258
Automatic Large Language Models Creation of Interactive Learning Lessons
Jionghao Lin, Jiarui Rao, Yiyang Zhao, Yuting Wang, Ashish Gurung, Amanda Barany, Jaclyn Ocumpaugh, Ryan S. Baker, Kenneth R. Koedinger
https://arxiv.org/abs/2506.17356
Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Bram Adams, Ahmed E. Hassan
https://arxiv.org/abs/2506.13538
MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution
Yibo Wang, Zhihao Peng, Ying Wang, Zhao Wei, Hai Yu, Zhiliang Zhu
https://arxiv.org/abs/2506.12728
FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Zhongliang Yang, Linna Zhou, Yu Li
https://arxiv.org/abs/2506.06335
Quality Assessment of Python Tests Generated by Large Language Models
Victor Alves, Carla Bezerra, Ivan Machado, Larissa Rocha, Tássio Virgínio, Publio Silva
https://arxiv.org/abs/2506.14297
Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
E. G. Santana Jr, Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida
https://arxiv.org/abs/2506.07594