
2025-07-30 10:18:21
UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
https://arxiv.org/abs/2507.21652 https://arxiv.or…
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security
Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
https://arxiv.org/abs/2507.22037
Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization
Manato Tajiri, Michimasa Inaba
https://arxiv.org/abs/2508.19918 https://
"Using AI (Elicit) to Describe the State of Research: A Critical Field Report" @ Blog "Sozialwissenschaftliche Methodenberatung":
https://sozmethode.hypotheses.org/2943
Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms
Jonathan Nöther, Adish Singla, Goran Radanovic
https://arxiv.org/abs/2508.16481 https://
Towards Deeper Understanding of Natural User Interactions in Virtual Reality Based Assembly Tasks
Ryan Ghamandi, Yahya Hmaiti, Mykola Maslych, Ravi Kiran Kattoju, Joseph J. LaViola Jr
https://arxiv.org/abs/2508.17124
Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code
Muhammad Haseeb
https://arxiv.org/abs/2508.08322 https://
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
https://arxiv.org/abs/2508.16484
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
https://arxiv.org/abs/2508.16406 …
Improving Student-AI Interaction Through Pedagogical Prompting: An Example in Computer Science Education
Ruiwei Xiao, Xinying Hou, Runlong Ye, Majeed Kazemitabaar, Nicholas Diana, Michael Liut, John Stamper
https://arxiv.org/abs/2506.19107
Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[2/6]:
- Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han
TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards
Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West
https://arxiv.org/abs/2507.18618
This https://arxiv.org/abs/2306.11154 has been replaced.
link: https://scholar.google.com/scholar?q=a
Bayesian inference for the learning rate in Generalised Bayesian inference
Jeong Eun Lee, Sitong Liu, Geoff K. Nicholls
https://arxiv.org/abs/2506.12532 ht…
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks
Xiaodong Wu, Xiangman Li, Jianbing Ni
https://arxiv.org/abs/2506.18543 http…
The Emotional Alignment Design Policy
Eric Schwitzgebel, Jeff Sebo
https://arxiv.org/abs/2507.06263 https://arxiv.org/pdf/2507.06263
DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling
Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang
https://arxiv.org/abs/2507.16329
Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
Zhengran Ji, Boyuan Chen
https://arxiv.org/abs/2508.07126 https://
Resa: Transparent Reasoning Models via SAEs
Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
https://arxiv.org/abs/2506.09967
TongSearch-QR: Reinforced Query Reasoning for Retrieval
Xubo Qin, Jun Bai, Jiaqi Li, Zixia Jia, Zilong Zheng
https://arxiv.org/abs/2506.11603 https://
LLM Robustness Leaderboard v1 -- Technical report
Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
https://arxiv.org/abs/2508.06296 https://
To Each Their Own: Heterogeneity in Worker Preferences for Peer Information
Zhi Hao Lim
https://arxiv.org/abs/2508.06162 https://arxiv.org/pdf/2508.06162…
Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design
Richard M. Charles, James H. Curry, Richard B. Charles
https://arxiv.org/abs/2507.14207
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
https://arxiv.org/abs/2508.12726
Legal Requirements Translation from Law
Anmol Singhal, Travis Breaux
https://arxiv.org/abs/2507.02846 https://arxiv.org/pdf/2507.0284…
Markov Regime-Switching Intelligent Driver Model for Interpretable Car-Following Behavior
Chengyuan Zhang, Cathy Wu, Lijun Sun
https://arxiv.org/abs/2506.14762
Iterative Vickrey Auctions via Linear Programming
Sébastien Lahaie, Benjamin Lubin
https://arxiv.org/abs/2507.03252 https://arx…
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
Weiwei Qi, Shuo Shao, Wei Gu, Tianhang Zheng, Puning Zhao, Zhan Qin, Kui Ren
https://arxiv.org/abs/2508.13048
Inter(sectional) Alia(s): Ambiguity in Voice Agent Identity via Intersectional Japanese Self-Referents
Takao Fujii, Katie Seaborn, Madeleine Steeds, Jun Kato
https://arxiv.org/abs/2506.01998
Stochastically Dominant Peer Prediction
Yichi Zhang, Shengwei Xu, David Pennock, Grant Schoenebeck
https://arxiv.org/abs/2506.02259 https://
This https://arxiv.org/abs/2506.02878 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCL_…
Social Robots for People with Dementia: A Literature Review on Deception from Design to Perception
Fan Wang, Giulia Perugia, Yuan Feng, Wijnand IJsselsteijn
https://arxiv.org/abs/2507.00963
This https://arxiv.org/abs/2502.18504 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…
Linearly Decoding Refused Knowledge in Aligned Language Models
Aryan Shrivastava, Ari Holtzman
https://arxiv.org/abs/2507.00239 https://
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy, Andrew Zagula, Nicholas Saban
https://arxiv.org/abs/2507.01020