email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…
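The edge rule in the description (node i links to node j if i sent at least one email to j) can be sketched roughly as below. This is only an illustration, assuming the raw data is available as (sender, recipient) pairs. The `build_edges` helper and the sample addresses are hypothetical and are not part of the dataset or its tooling.

```python
def build_edges(messages):
    """Collapse (sender, recipient) pairs into a set of directed edges.

    Repeated emails between the same ordered pair yield one edge,
    matching the "at least one email" rule in the description.
    """
    edges = set()
    for sender, recipient in messages:
        edges.add((sender, recipient))
    return edges


# Hypothetical sample data, not taken from the Enron corpus.
messages = [
    ("a@enron.com", "b@enron.com"),
    ("a@enron.com", "b@enron.com"),    # repeat: no additional edge
    ("b@enron.com", "c@example.org"),  # link to a non-Enron address
    ("c@example.org", "a@enron.com"),  # link from a non-Enron address
]

edges = build_edges(messages)
nodes = {address for edge in edges for address in edge}
print(len(nodes), "nodes,", len(edges), "edges")
```

Storing edges as a set of ordered pairs keeps direction (who sent to whom) while ignoring how many emails were sent.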
The U.S. Department of Housing and Urban Development is preparing to
💥shut down seven major investigations and cases concerning alleged housing discrimination and segregation, including
⚠️ some where the agency already found civil rights violations, according to HUD records obtained by ProPublica.
The high-profile cases involve allegations that state and local governments across the South and Midwest illegally discriminated against people of color
by placing industrial…
I read 'The Public Interest Corpus Update – NYC Edition'. More work on the project's principles and goals, research and library service use cases, and thinking ahead to prospective year 1-3 and year 4-6 activities https://publicinterestcorpus.org/the-p
Fat Bear Week is coming early this year. The annual online competition that normally starts in early October will instead start on Sept. 23.
Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios
John Hansen, Satwik Dutta, Ellen Grand
https://arxiv.org/abs/2507.12870
Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications
Satwik Dutta, Shruthigna Chandupatla, John Hansen
https://arxiv.org/abs/2507.14451
EmoTale: An Enacted Speech-emotion Dataset in Danish
Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das
https://arxiv.org/abs/2508.14548 https://
AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design
Zihao Chen, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Xinyi Yang, Mingjie Zhou, Zhuoyao Du, Xu Yan, Zhouyang Wu, Zhenyu Xu, Jiangli Huang, Li Shang, Xuan Zeng, Fan Yang
https://arxiv.org/abs/2508.10409
The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words
Lizhi Ma, Tong Zhao, Shuai Zhang, Nirui Song, Hongliang He, Anqi Li, Ran Feng, Huachuan Qiu, Jingsong Ma, Zhenzhong Lan
https://arxiv.org/abs/2507.13839…
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
https://arxiv.org/abs/2508.12726
Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
Branislav Gerazov, Marcello Politi, Sébastien Bratières
https://arxiv.org/abs/2508.12968
MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance
Yi Dong, Yusuke Muraoka, Scott Shi, Yi Zhang
https://arxiv.org/abs/2508.10429 http…
AKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes
Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala
https://arxiv.org/abs/2507.04687
250 years, wars on multiple continents, millions sworn to protect and uphold it, and Mango Zedong decides to Barbara Streisand Effect it by removing key sections.
At least the Seize-her Geezer is bad at being a monstrous dictator.
https://techcrunch.com…
HARPT: A Corpus for Analyzing Consumers' Trust and Privacy Concerns in Mobile Health Apps
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
https://arxiv.org/abs/2506.19268
🔍 how many emojis is too many in Petition for Writ of Habeas Corpus form?
Several sections of Article 1 of the U.S. Constitution
appear to have been removed from the official U.S. government website,
as pointed out by sleuths on the internet and as seen by TechCrunch.
These sections largely relate to the powers that Congress has and does not have,
as well as limitations on the powers of individual states.
The removal includes sections relating to habeas corpus,
the powers that protect citizens from unlawful detention.
S…
MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection
Paulo Mendes (GECAD, ISEP, Polytechnic of Porto, Portugal), Eva Maia (GECAD, ISEP, Polytechnic of Porto, Portugal), Isabel Praça (GECAD, ISEP, Polytechnic of Porto, Portugal)
https://arxiv.org/abs/2507.17978
Grammatical Structure and Grammatical Variations in Non-Metric Iranian Classical Music
Maziar Kanani, Sean O Leary, James McDermott
https://arxiv.org/abs/2507.10708
Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
Mike Gartner
https://arxiv.org/abs/2508.03718
Crosslisted article(s) found for cs.IR. https://arxiv.org/list/cs.IR/new
[1/1]:
- Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of ...
Nathalie Neptune, Josiane Mothe
Key sections of the US Constitution deleted from government’s website;
Changes in Article 1 of the U.S. Constitution: Large parts of Section 8 have been removed, and Sections 9 and 10 have been deleted altogether.
https://techcrunch.com/2025/08/06/key-
The MSP-Podcast Corpus
Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang-Cheng Chou, Pravin Mote
https://arxiv.org/abs/2509.09791
Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification
Anusha M D, Deepthi Vikram, Bharathi Raja Chakravarthi, Parameshwar R Hegde
https://arxiv.org/abs/2508.11166
Even outside the digital transformation of "planning", I'm very interested (as a member of the public) in the development of a rich corpus of public geospatial data. (Some deep unexpressed cartographic instinct?) This blogpost is an update on the status of that:
https://…
It's interesting how much Wikipedia's list of LLM catchphrases comes straight out of academic writing jargon.
Probably as a lot of the corpus that was mined for the LLM training goes back to the academic literature thanks to Sci-Hub et al.
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_AI_Cleanup/AI_catchphrases#Language_and_tone
AI, AGI, and learning efficiency
My 4-month-old kid is not DDoSing Wikipedia right now, nor will they ever do so before learning to speak, read, or write. Their entire "training corpus" will not top even 100 million "tokens" before they can speak & understand language, and do so with real intentionality.
Just to emphasize that point: 100 words-per-minute times 60 minutes-per-hour times 12 hours-per-day times 365 days-per-year times 4 years is a mere 105,120,000 words. That's a ludicrously *high* estimate of words-per-minute and hours-per-day, and 4 years old (the age of my other kid) is well after basic speech capabilities are developed in many children, etc. More likely the available "training data" is at least 1 or 2 orders of magnitude less than this.
The point here is that large language models, trained as they are on multiple *billions* of tokens, are not developing their behavioral capabilities in a way that's remotely similar to humans, even if you believe those capabilities are similar (they are by certain very biased ways of measurement; they very much aren't by others). This idea that humans must be naturally good at acquiring language is an old one (see e.g. #AI #LLM #AGI
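The back-of-the-envelope estimate in the post above can be reproduced in a few lines. This is just a sketch of the stated arithmetic; the variable names are made up for illustration, and the inputs are the post's own deliberately high guesses rather than measured values.

```python
# Inputs as stated in the post (described there as deliberately high estimates).
WORDS_PER_MINUTE = 100
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 12
DAYS_PER_YEAR = 365
YEARS = 4

upper_bound_words = (
    WORDS_PER_MINUTE * MINUTES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
)
print(f"Upper bound: {upper_bound_words:,} words")  # 105,120,000

# The post suggests the realistic figure is 1 or 2 orders of magnitude lower.
for orders in (1, 2):
    print(f"{orders} order(s) lower: {upper_bound_words // 10**orders:,} words")
```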
Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro
https://arxiv.org/abs/2508.11818
Storage places in diplomatic texts (7th-13th centuries). Lexical, semantic, and digital investigation
Nicolas Perreaux (LAMOP, CNRS)
https://arxiv.org/abs/2509.12230 https://
Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter
Theodora Moldovan, Arianna Pera, Davide Vega, Luca Maria Aiello
https://arxiv.org/abs/2509.13197
Enhancing Speaker-Independent Dysarthric Speech Severity Classification with DSSCNet and Cross-Corpus Adaptation
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
https://arxiv.org/abs/2509.13442
Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval
Amanda Chan, James Jiayu Liu, He Kai, Onno P. Kampman
https://arxiv.org/abs/2509.13626 …
Over the last three years, the head of a small charter school network that serves fewer than 1,000 students has taken home up to $870,000 annually,
a startling amount that appears to be the highest for any public school superintendent in Texas and among the top in the nation.
Valere Public Schools Superintendent Salvador Cavazos’ compensation to run three campuses in Austin, Corpus Christi and Brownsville
exceeds the less than $450,000 that New York City’s chancellor makes …
Active Learning for Text-to-Speech Synthesis with Informative Sample Collection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
https://arxiv.org/abs/2507.08319
Beyond Bars: Distribution of Edit Operations in Historical Prints
Adrian Nachtwey, Fabian C. Moss, Anna Viktoria Katrin Plaksin
https://arxiv.org/abs/2509.12786 https://
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
https://arxiv.org/abs/2509.14008
About 9 million years ago, natural interbreeding in the wild between tomato plants and a potato-like plant species in present-day South America gave way to what we know as the potato.
This new (and nutritious) plant arose from an evolutionary event that triggered the formation of the tuber, the underground structure that plants like potatoes, yams, and taros use to store food.
The findings are detailed in a study published July 31 in the journal Cell.
ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data
Vladislav Stankov, Matyáš Kopp, Ondřej Bojar
https://arxiv.org/abs/2509.06675 https://…
Revealing the Hidden Temporal Structure of HubertSoft Embeddings based on the Russian Phonetic Corpus
Anastasia Ananeva, Anton Tomilov, Marina Volkova
https://arxiv.org/abs/2507.06794
IdolSongsJp Corpus: A Multi-Singer Song Corpus in the Style of Japanese Idol Groups
Hitoshi Suda, Junya Koguchi, Shunsuke Yoshida, Tomohiko Nakamura, Satoru Fukayama, Jun Ogata
https://arxiv.org/abs/2507.01349
Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
https://arxiv.org/abs/2509.12961
GeoGPT.RAG Technical Report
Fei Huang, Fan Wu, Zeqing Zhang, Qihao Wang, Long Zhang, Grant Michael Boquet, Hongyang Chen
https://arxiv.org/abs/2509.09686 https://
StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba, Tomasz Walkowiak
https://arxiv.org/abs/2507.12064
SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
Luca Mariotti, Veronica Guidetti, Federica Mandreoli
https://arxiv.org/abs/2507.06895
WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Longhao Li, Zhao Guo, Hongjie Chen, Yuhang Dai, Ziyu Zhang, Hongfei Xue, Tianlun Zuo, Chengyou Wang, Shuiyuan Wang, Jie Li, Xin Xu, Hui Bu, Binbin Zhang, Ruibin Yuan, Ziya Zhou, Wei Xue, Lei Xie
https://arxiv.org/abs/2509.03959
MDC-R: The Minecraft Dialogue Corpus with Reference
Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Mladen Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio
https://arxiv.org/abs/2506.22062
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen
https://arxiv.org/abs/2508.08088
Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini
https://arxiv.org/abs/2509.08824
JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko
https://arxiv.org/abs/2506.18296
The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks
Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich
https://arxiv.org/abs/2508.16371
Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
https://arxiv.org/abs/2509.10108
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
https://arxiv.org/abs/2509.10040
DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
https://arxiv.org/abs/2507.08606
Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach
Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello
https://arxiv.org/abs/2507.08487
Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study
Monica Gonzalez-Machorro, Uwe Reichel, Pascal Hecker, Helly Hammer, Hesam Sagha, Florian Eyben, Robert Hoepner, Björn W. Schuller
https://arxiv.org/abs/2508.18092
SinLlama - A Large Language Model for Sinhala
H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
https://arxiv.org/abs/2508.09115
OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
https://arxiv.org/abs/2508.16048
ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
Phuong-Nam Dang, Kieu-Linh Nguyen, Thanh-Hieu Pham
https://arxiv.org/abs/2509.09131
A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Oleg Silcenco, Marcos R. Machad, Wallace C. Ugulino, Daniel Braun
https://arxiv.org/abs/2508.17994
AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
https://arxiv.org/abs/2509.07459
Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Matthew Anderson Hendricks, Alice Cicirello
https://arxiv.org/abs/2507.06803
SIGIR 2025 -- LiveRAG Challenge Report
David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory
https://arxiv.org/abs/2507.04942 h…
Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
https://arxiv.org/abs/2506.20269