Looking forward to reading this! “Making BERT Feel at Home. Modelling Domestic Space in 19th-Century British and Irish Fiction”, Journal of Computational Literary Studies 4(1). doi: https://doi.org/10.48694/jcls.4164
By Guhr, S., Monaco, J., Sherman, A., Warner, M. & Algee-Hewitt, M.
A Set of Quebec-French Corpus of Regional Expressions and Terms
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
https://arxiv.org/abs/2510.05026 https://…
Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk
https://arxiv.org/abs/2510.07192…
email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…
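The edge rule described above (node i links to node j if i sent at least one email to j) can be sketched in a few lines. This is only an illustrative sketch, not the dataset's actual build script: the function name `build_email_graph` and the toy addresses are hypothetical, and it assumes repeated emails between the same pair collapse into a single directed link, as the description suggests.

```python
from collections import defaultdict


def build_email_graph(messages):
    """Build a directed, unweighted graph from (sender, recipient) pairs.

    Assumption: repeated emails from the same sender to the same
    recipient count as one link ("at least one email").
    """
    nodes = set()
    out_links = defaultdict(set)  # sender -> set of recipients
    for sender, recipient in messages:
        nodes.add(sender)
        nodes.add(recipient)
        out_links[sender].add(recipient)
    return nodes, dict(out_links)


# Hypothetical toy data, not taken from the Enron corpus.
messages = [
    ("alice@enron.com", "bob@enron.com"),
    ("alice@enron.com", "bob@enron.com"),   # repeat, adds no new link
    ("bob@enron.com", "carol@example.org"),  # non-Enron recipient
]

nodes, out_links = build_email_graph(messages)
edge_count = sum(len(targets) for targets in out_links.values())
print(len(nodes), edge_count)
```

Using a set per sender keeps the graph unweighted; swapping the set for a counter would instead record how many emails each link represents.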
Whoops, a password known to be insecure yields an astonishingly high nine-digit number of hits:
> This password has been seen 179,863,340 times before in data breaches!
https://haveibeenpwned.com/Passwords
AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMs with Audio-visual Cues
Krish Patel, Dingkun Zhou, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli
https://
AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
https://arxiv.org/abs/2509.07459
> if you think about it in the context of the training models—it has a rough sense that you’re like a 37 year old guy on Reddit. That’s the kind of person that it’s doing the continuation for, because that’s a big chunk of the training corpus.
> I often tell people whenever they send me a message like, “a large language model said I should do x, y, z.” what you’re really saying is, “a 37 year old guy on Reddit said it,” and you’ve got roughly the same amount of information
Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
https://arxiv.org/abs/2510.01958
SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
The Grinch did nothing wrong. He wasn't *stealing* #Christmas, he was just gathering a corpus for training his #AI model. Investors are already lining up with their billions to fund the construction of the Whoville Data Center, ignoring concerns from residents.
ArgoLOOM: agentic AI for fundamental physics from quarks to cosmos
S. D. Bakshi, P. Barry, C. Bissolotti, I. Cloet, S. Corrodi, Z. Djurcic, S. Habib, K. Heitmann, T. J. Hobbs, W. Hopkins, S. Joosten, B. Kriesten, N. Ramachandra, A. Wells, M. Zurek
https://arxiv.org/abs/2510.02426
How about “algorithmic stolen corpus derivative” instead of “AI generated”
Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
Stephen Meisenbacher, Svetlozar Nestorov, Peter Norlander
https://arxiv.org/abs/2510.01470
TyDe 2025:
"Generating a corpus of Hazel programs from ill-typed OCaml programs"
presented by Patrick Ferris, joint work with Anil Madhavapeddy.
https://conf.researchr.org/details/icf
The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
Sophie Spliethoff, Sanne Hoeken, Silke Schwandt, Sina Zarrieß, Özge Alaçam
https://arxiv.org/abs/2509.22345
ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova
https://arxiv.org/abs/2509.23519
Rethinking Cross-Corpus Speech Emotion Recognition Benchmarking: Are Paralinguistic Pre-Trained Representations Sufficient?
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Parabattina Bhagath, Pailla Balakrishna Reddy, Arun Balaji Buduru
https://arxiv.org/abs/2509.16182
The Insurrection Act is not a declaration of martial law.
It doesn't shut down the courts.
It doesn't suspend habeas corpus.
It means you can use the military to enforce federal laws,
but the laws themselves remain the same.
https://f…
Fat Bear Week is coming early this year. The annual online competition that normally starts in early October will instead start on Sept. 23.
Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation
Weimin Wu, Alexander C. Furnas, Eddie Yang, Gefei Liu, Akhil Pandey Akella, Xuefeng Song, Dashun Wang, Han Liu
https://arxiv.org/abs/2509.21493
LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control
Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura
https://arxiv.org/abs/2509.15626
I have yet to hear any, much less a solid, argument against my suggestion that Presidential pardons and commutations are revocable by a subsequent president.
Were I elected president I would revoke all of El Cheato's pardons and commutations and let the people involved make arguments (probably in the context of Habeas Corpus proceedings) why those actions are not Constitutional.
My own sense is that the question tends more towards the "not revocable" with regard to …
I hadn't realised that I had any Assembly of Dust but I have Corpus Christi on an old EMusic compilation and very Steely Dan it is. In this case that's a good thing.
#Music #Serendiipity
EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
https://arxiv.org/abs/2510.00514
StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering
Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang
https://arxiv.org/abs/2510.02827
The MSP-Podcast Corpus
Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang-Cheng Chou, Pravin Mote
https://arxiv.org/abs/2509.09791
Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation
Guoqing Hu, An Zhang, Shuchang Liu, Wenyu Mao, Jiancan Wu, Xun Yang, Xiang Li, Lantao Hu, Han Li, Kun Gai, Xiang Wang
https://arxiv.org/abs/2509.26063
WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
Yuhang Dai, Ziyu Zhang, Shuai Wang, Longhao Li, Zhao Guo, Tianlun Zuo, Shuiyuan Wang, Hongfei Xue, Chengyou Wang, Qing Wang, Xin Xu, Hui Bu, Jie Li, Jian Kang, Binbin Zhang, Lei Xie
https://arxiv.org/abs/2509.18004
Evaluating LLM Generated Detection Rules in Cybersecurity
Anna Bertiger, Bobby Filar, Aryan Luthra, Stefano Meschiari, Aiden Mitchell, Sam Scholten, Vivek Sharath
https://arxiv.org/abs/2509.16749
LOTUSDIS: A Thai far-field meeting corpus for robust conversational ASR
Pattara Tipaksorn, Sumonmas Thatphithakkul, Vataya Chunwijitra, Kwanchiva Thangthai
https://arxiv.org/abs/2509.18722
Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[9/10]:
- HARPT: A Corpus for Analyzing Consumers' Trust and Privacy Concerns in Electronic Health Apps
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
https://arxiv.org/abs/2510.10774
Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos
Mohammad Reza Zarei, Barbara Stead-Coyle, Michael Christensen, Sarah Everts, Majid Komeili
https://arxiv.org/abs/2509.20724
Using Medical Algorithms for Task-Oriented Dialogue in LLM-Based Medical Interviews
Rui Reis, Pedro Rangel Henriques, João Ferreira-Coimbra, Eva Oliveira, Nuno F. Rodrigues
https://arxiv.org/abs/2510.12490
On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search
Nick Hagar, Nicholas Diakopoulos, Jeremy Gilbert
https://arxiv.org/abs/2509.25494
Learning Relationships Between Separate Audio Tracks for Creative Applications
Balthazar Bujard (IRCAM, SU), Jérôme Nika (IRCAM), Frédéric Bevilacqua (IRCAM), Nicolas Obin
https://arxiv.org/abs/2509.25296
LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich
https://arxiv.org/abs/2509.21269
Does Generative Retrieval Overcome the Limitations of Dense Retrieval?
Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
https://arxiv.org/abs/2509.22116
CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen
https://arxiv.org/abs/2509.21208
Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[7/7]:
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Enhancing Speaker-Independent Dysarthric Speech Severity Classification with DSSCNet and Cross-Corpus Adaptation
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
https://arxiv.org/abs/2509.13442
A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction
Cameron Morin, Matti Marttinen Larsson
https://arxiv.org/abs/2510.12306
UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
Qiuyang Lu, Fangjian Shen, Zhengkai Tang, Qiang Liu, Hexuan Cheng, Hui Liu, Wushao Wen
https://arxiv.org/abs/2509.15789
Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
https://arxiv.org/abs/2510.09072
Efficient and Versatile Model for Multilingual Information Retrieval of Islamic Text: Development and Deployment in Real-World Scenarios
Vera Pavlova, Mohammed Makhlouf
https://arxiv.org/abs/2509.15380
NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use
Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza
https://arxiv.org/abs/2509.22479
CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
Niharika Hegde, Subarnaduti Paul, Lars Joel-Frey, Manuel Brack, Kristian Kersting, Martin Mundt, Patrick Schramowski
https://arxiv.org/abs/2509.22360
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai
Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus
Chiara Alzetta, Serena Auriemma, Alessandro Bondielli, Luca Dini, Chiara Fazzone, Alessio Miaschi, Martina Miliani, Marta Sartor
https://arxiv.org/abs/2509.19033
TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
Ji Yin, Menglan He, Yujie Zhang, Linshuai Zhang, Tingting Ma, Ce Tian, Jie Wu, Lin Xu, Tao Jiang
https://arxiv.org/abs/2509.19834
CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj, Vandan Mujadia, Baban Gain, Pruthwik Mishra, Arafat Ahsan, Parameswari Krishnamurthy, Ashwath Rao, Gurpreet Singh Josan, Preeti Dubey, Aadil Amin Kak, Anna Rao Kulkarni, Narendra VG, Sunita Arora, Rakesh Balbantray, Prasenjit Majumdar, Karunesh K Arora, Asif Ekbal, Dipti Mishra Sharma
Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini
https://arxiv.org/abs/2509.08824
Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
https://arxiv.org/abs/2509.12961
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
https://arxiv.org/abs/2509.14008
Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
https://arxiv.org/abs/2509.10108
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
https://arxiv.org/abs/2509.10040
KORMo: Korean Open Reasoning Model for Everyone
Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
https://arxiv.org/abs/2510.09426