Tootfinder

Opt-in global Mastodon full text search. Join the index!

@netzschleuder@social.skewed.de
2025-09-21 10:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…
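
A minimal sketch of the edge rule described above, in Python. The sample addresses, the `build_edges` helper, and the `enron.com` domain check are illustrative assumptions, not part of the Netzschleuder dataset or its tooling. Repeated emails between the same ordered pair collapse into a single directed edge, and links between two external addresses are dropped because the post says they are not observed.

```python
from typing import Iterable


def build_edges(
    emails: Iterable[tuple[str, str]],
    internal_domain: str = "enron.com",  # assumed domain, for illustration only
) -> set[tuple[str, str]]:
    """Return directed edges (sender, recipient), one per ordered pair."""
    suffix = "@" + internal_domain
    edges: set[tuple[str, str]] = set()
    for sender, recipient in emails:
        # Keep a message only if at least one endpoint is an internal address.
        if sender.endswith(suffix) or recipient.endswith(suffix):
            edges.add((sender, recipient))
    return edges


# Hypothetical messages as (sender, recipient) pairs.
sample = [
    ("alice@enron.com", "bob@enron.com"),
    ("alice@enron.com", "bob@enron.com"),       # repeat email, same edge
    ("bob@enron.com", "carol@example.org"),
    ("dave@example.org", "carol@example.org"),  # external to external: unobserved
]

edges = build_edges(sample)
nodes = {address for edge in edges for address in edge}
print(len(nodes), len(edges))
```

Using a set keeps the sketch unweighted: the number of emails sent between two addresses does not affect the result, only whether at least one was sent.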

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 09:51:00

Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
Hanna Woloszyn, Benjamin Gagl
arxiv.org/abs/2508.13769

@arXiv_csIR_bot@mastoxiv.page
2025-08-22 09:46:30

Test-time Corpus Feedback: From Retrieval to RAG
Mandeep Rathee, Venktesh V, Sean MacAvaney, Avishek Anand
arxiv.org/abs/2508.15437 arxiv.o…

The U.S. Department of Housing and Urban Development is preparing to
💥shut down seven major investigations and cases concerning alleged housing discrimination and segregation, including
⚠️ some where the agency already found civil rights violations, according to HUD records obtained by ProPublica.
The high-profile cases involve allegations that state and local governments across the South and Midwest illegally discriminated against people of color
by placing industrial…

@arXiv_csDL_bot@mastoxiv.page
2025-08-22 08:10:31

Guidelines for the Enhancement of the Corpus and the Verismo Vocabulary
Michael Bassi, Giovanni Salucci
arxiv.org/abs/2508.15645 arxiv.org/…

@mia@hcommons.social
2025-08-13 16:11:14

I read 'The Public Interest Corpus Update – NYC Edition'. More work on the project's principles and goals, research and library service use cases, and thinking ahead to prospective year 1-3 and year 4-6 activities publicinterestcorpus.org/the-p

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:43:50

ding-01 :ARG0: An AMR Corpus for Spontaneous French Dialogue
Jeongwoo Kang, Maria Boritchev, Maximin Coavoux
arxiv.org/abs/2508.12819 arxiv…

@johl@mastodon.xyz
2025-09-19 20:11:00

Fat Bear Week is coming early this year. The annual online competition that normally starts in early October will instead start on Sept. 23.

@arXiv_csCY_bot@mastoxiv.page
2025-08-22 08:01:11

Systematic Review Of Collaborative Learning Activities For Promoting AI Literacy
Ashish Hingle, Aditya Johri
arxiv.org/abs/2508.15111 arxiv…

@arXiv_csSD_bot@mastoxiv.page
2025-07-18 09:22:42

Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios
John Hansen, Satwik Dutta, Ellen Grand
arxiv.org/abs/2507.12870

@arXiv_csAI_bot@mastoxiv.page
2025-08-19 10:38:50

An LLM ASP Workflow for Joint Entity-Relation Extraction
Trang Tran, Trung Hoang Le, Huiping Cao, Tran Cao Son
arxiv.org/abs/2508.12611 a…

@arXiv_csRO_bot@mastoxiv.page
2025-08-19 10:52:30

Energy Efficiency in Robotics Software: A Systematic Literature Review (2020-2024)
Aryan Gupta
arxiv.org/abs/2508.12170 arxiv.org/pdf/2508.…

@arXiv_eessAS_bot@mastoxiv.page
2025-07-22 07:58:20

Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications
Satwik Dutta, Shruthigna Chandupatla, John Hansen
arxiv.org/abs/2507.14451

@arXiv_csCL_bot@mastoxiv.page
2025-08-21 09:54:20

EmoTale: An Enacted Speech-emotion Dataset in Danish
Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das
arxiv.org/abs/2508.14548

@mia@hcommons.social
2025-07-19 09:23:06

#DH2025 thanks @flochiff.bsky.social for sharing this link as I wanted to follow up on Pandore! 'Pandore: automating text-processing workflows for humanities researchers' from Sorbonne Université and ObTIC - Observatoire des textes, des idées et des corpus

@arXiv_csSI_bot@mastoxiv.page
2025-09-16 08:58:27

YTCommentVerse: A Multi-Category Multi-Lingual YouTube Comment Corpus
Hridoy Sankar Dutta, Biswadeep Khan
arxiv.org/abs/2509.11057 arxiv.or…

@arXiv_csCL_bot@mastoxiv.page
2025-08-21 09:59:20

The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation
Shubham Pundhir, Ganesh Bagler
arxiv.org/abs/2508.14718

@BBC3MusicBot@mastodonapp.uk
2025-09-22 06:20:09

🇺🇦 #NowPlaying on BBCRadio3's #Breakfast
Tenebrae, Wolfgang Amadeus Mozart, The Chamber Orchestra of Europe & Nigel Short:
🎵 Ave verum corpus, K 618
#Tenebrae #WolfgangAmadeusMozart #TheChamberOrchestraofEurope
open.spotify.com/track/6nYkXGz

@arXiv_csAR_bot@mastoxiv.page
2025-08-15 07:44:32

AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design
Zihao Chen, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Xinyi Yang, Mingjie Zhou, Zhuoyao Du, Xu Yan, Zhouyang Wu, Zhenyu Xu, Jiangli Huang, Li Shang, Xuan Zeng, Fan Yang
arxiv.org/abs/2508.10409

@arXiv_csCL_bot@mastoxiv.page
2025-07-21 09:46:30

The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words
Lizhi Ma, Tong Zhao, Shuai Zhang, Nirui Song, Hongliang He, Anqi Li, Ran Feng, Huachuan Qiu, Jingsong Ma, Zhenzhong Lan
arxiv.org/abs/2507.13839

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 09:14:10

More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition
James Tavernor, Emily Mower Provost
arxiv.org/abs/2509.12295

@arXiv_csCL_bot@mastoxiv.page
2025-08-19 11:40:20

DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
arxiv.org/abs/2508.12726

@arXiv_eessAS_bot@mastoxiv.page
2025-08-19 08:45:09

Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
Branislav Gerazov, Marcello Politi, Sébastien Bratières
arxiv.org/abs/2508.12968

@arXiv_csAI_bot@mastoxiv.page
2025-08-15 09:21:52

MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance
Yi Dong, Yusuke Muraoka, Scott Shi, Yi Zhang
arxiv.org/abs/2508.10429

@arXiv_csIR_bot@mastoxiv.page
2025-07-16 08:03:31

Extracting Document Relations from Search Corpus by Marginalizing over User Queries
Yuki Iwamoto, Kaoru Tsunoda, Ken Kaneiwa
arxiv.org/abs/2507.10726

@arXiv_csDB_bot@mastoxiv.page
2025-07-08 08:51:30

AKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes
Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala
arxiv.org/abs/2507.04687

@davidaugust@mastodon.online
2025-08-06 22:21:27

250 years, wars on multiple continents, millions sworn to protect and uphold it, and Mango Zedong decides to Barbara Streisand Effect it by removing key sections.
At least the Seize-her Geezer is bad at being a monstrous dictator.

techcrunch.com…

@arXiv_csHC_bot@mastoxiv.page
2025-06-25 08:34:50

HARPT: A Corpus for Analyzing Consumers' Trust and Privacy Concerns in Mobile Health Apps
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
arxiv.org/abs/2506.19268

@catsalad@infosec.exchange
2025-08-01 01:46:09

🔍 how many emojis is too many in Petition for Writ of Habeas Corpus form?

Several sections of Article 1 of the U.S. Constitution
appear to have been removed from the official U.S. government website,
as pointed out by sleuths on the internet and as seen by TechCrunch.
These sections largely relate to the powers that Congress has and does not have,
as well as limitations on the powers of individual states.
The removal includes sections relating to habeas corpus,
the powers that protect citizens from unlawful detention. 
S…

@arXiv_csCR_bot@mastoxiv.page
2025-07-25 08:29:21

MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection
Paulo Mendes (GECAD, ISEP, Polytechnic of Porto, Portugal), Eva Maia (GECAD, ISEP, Polytechnic of Porto, Portugal), Isabel Praça (GECAD, ISEP, Polytechnic of Porto, Portugal)
arxiv.org/abs/2507.17978

@arXiv_csNE_bot@mastoxiv.page
2025-07-16 08:57:21

Grammatical Structure and Grammatical Variations in Non-Metric Iranian Classical Music
Maziar Kanani, Sean O Leary, James McDermott
arxiv.org/abs/2507.10708

@arXiv_csCY_bot@mastoxiv.page
2025-08-07 07:32:33

Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
Mike Gartner
arxiv.org/abs/2508.03718

@arXiv_csIR_bot@mastoxiv.page
2025-09-18 11:08:43

Crosslisted article(s) found for cs.IR. arxiv.org/list/cs.IR/new
[1/1]:
- Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of ...
Nathalie Neptune, Josiane Mothe

@ubuntourist@mastodon.social
2025-08-07 00:41:14

Key sections of the US Constitution deleted from government’s website;
Changes in Article 1 of the U.S. Constitution: Large parts of Section 8 have been removed, and Sections 9 and 10 have been deleted altogether.
techcrunch.com/2025/08/06/key-

@arXiv_eessAS_bot@mastoxiv.page
2025-09-15 08:20:11

The MSP-Podcast Corpus
Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang-Cheng Chou, Pravin Mote
arxiv.org/abs/2509.09791

@arXiv_csCL_bot@mastoxiv.page
2025-08-18 09:22:10

Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification
Anusha M D, Deepthi Vikram, Bharathi Raja Chakravarthi, Parameshwar R Hegde
arxiv.org/abs/2508.11166

@gwire@mastodon.social
2025-07-08 10:38:27

Even outside the digital transformation of "planning", I'm very interested (as a member of the public) in the development of a rich corpus of public geospatial data. (Some deep unexpressed cartographic instinct?) This blogpost is an update on the status of that:

@gedankenstuecke@scholar.social
2025-07-01 02:39:32

It's interesting how much Wikipedia's list of LLM catchphrases comes straight out of academic writing jargon.
Probably as a lot of the corpus that was mined for the LLM training goes back to the academic literature thanks to Sci-Hub et al.
en.wikipedia.org/wiki/Wikipedi

@tiotasram@kolektiva.social
2025-07-19 07:51:05

AI, AGI, and learning efficiency
My 4-month-old kid is not DDoSing Wikipedia right now, nor will they ever do so before learning to speak, read, or write. Their entire "training corpus" will not top even 100 million "tokens" before they can speak & understand language, and do so with real intentionality.
Just to emphasize that point: 100 words-per-minute times 60 minutes-per-hour times 12 hours-per-day times 365 days-per-year times 4 years is a mere 105,120,000 words. That's a ludicrously *high* estimate of words-per-minute and hours-per-day, and 4 years old (the age of my other kid) is well after basic speech capabilities are developed in many children, etc. More likely the available "training data" is at least 1 or 2 orders of magnitude less than this.
The point here is that large language models, trained as they are on multiple *billions* of tokens, are not developing their behavioral capabilities in a way that's remotely similar to humans, even if you believe those capabilities are similar (they are by certain very biased ways of measurement; they very much aren't by others). This idea that humans must be naturally good at acquiring language is an old one (see e.g. #AI #LLM #AGI
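
The upper-bound arithmetic in the post above can be checked with a short Python sketch. The constant names are illustrative. The figures are the post's own, and the last two lines only apply its "1 or 2 orders of magnitude less" remark.

```python
# Figures taken from the post: a deliberately generous upper bound.
WORDS_PER_MINUTE = 100
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 12
DAYS_PER_YEAR = 365
YEARS = 4

upper_bound = (
    WORDS_PER_MINUTE * MINUTES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
)
print(f"{upper_bound:,}")  # 105,120,000

# The post suggests the realistic figure is 1 or 2 orders of magnitude lower.
print(f"{upper_bound // 10:,}")   # 10,512,000
print(f"{upper_bound // 100:,}")  # 1,051,200
```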

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:06:59

Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro
arxiv.org/abs/2508.11818

@arXiv_csDL_bot@mastoxiv.page
2025-09-17 07:40:50

Storage places in diplomatic texts (7th-13th centuries). Lexical, semantic, and digital investigation
Nicolas Perreaux (LAMOP, CNRS)
arxiv.org/abs/2509.12230

@arXiv_csSI_bot@mastoxiv.page
2025-09-17 09:25:19

Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter
Theodora Moldovan, Arianna Pera, Davide Vega, Luca Maria Aiello
arxiv.org/abs/2509.13197

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:33:21

Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney
arxiv.org/abs/2509.14926 arxiv.org/pdf/2509.149…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-18 08:13:31

Enhancing Speaker-Independent Dysarthric Speech Severity Classification with DSSCNet and Cross-Corpus Adaptation
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
arxiv.org/abs/2509.13442

@arXiv_csAI_bot@mastoxiv.page
2025-09-16 08:15:26

AI Answer Engine Citation Behavior An Empirical Analysis of the GEO16 Framework
Arlen Kumar, Leanid Palkhouski
arxiv.org/abs/2509.10762 arx…

@arXiv_csIR_bot@mastoxiv.page
2025-09-18 08:00:21

Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval
Amanda Chan, James Jiayu Liu, He Kai, Onno P. Kampman
arxiv.org/abs/2509.13626

Over the last three years, the head of a small charter school network that serves fewer than 1,000 students has taken home up to $870,000 annually,
a startling amount that appears to be the highest for any public school superintendent in Texas and among the top in the nation.
Valere Public Schools Superintendent Salvador Cavazos’ compensation to run three campuses in Austin, Corpus Christi and Brownsville
exceeds the less than $450,000 that New York City’s chancellor makes …

@arXiv_csSD_bot@mastoxiv.page
2025-07-14 08:12:52

Active Learning for Text-to-Speech Synthesis with Informative Sample Collection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
arxiv.org/abs/2507.08319

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:23:29

Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He, Jiaqi Xu
arxiv.org/abs/2509.09125

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 07:44:21

Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
Jinfan Frank Hu
arxiv.org/abs/2509.14238

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 09:28:39

Beyond Bars: Distribution of Edit Operations in Historical Prints
Adrian Nachtwey, Fabian C. Moss, Anna Viktoria Katrin Plaksin
arxiv.org/abs/2509.12786

@arXiv_csIR_bot@mastoxiv.page
2025-07-17 07:58:10

Context-Aware Search and Retrieval Over Erasure Channels
Sara Ghasvarianjahromi, Yauhen Yakimenka, Jörg Kliewer
arxiv.org/abs/2507.11894

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:10:51

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
arxiv.org/abs/2509.14008

@arXiv_csDL_bot@mastoxiv.page
2025-09-03 08:23:43

A World in Print: Introducing a Danish-Norwegian corpus of historical newspapers
Johan Heinsen, Camilla Bøgeskov
arxiv.org/abs/2509.02356

About 9 million years ago, natural interbreeding in the wild between tomato plants and a potato-like plant species in present-day South America gave rise to what we know as the potato.
This new (and nutritious) plant arose from an evolutionary event that triggered the formation of the tuber, the underground structure that plants like potatoes, yams, and taros use to store food.
The findings are detailed in a study published July 31 in the journal Cell.

@arXiv_csCL_bot@mastoxiv.page
2025-09-09 12:02:42

ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data
Vladislav Stankov, Matyáš Kopp, Ondřej Bojar
arxiv.org/abs/2509.06675

@arXiv_csSD_bot@mastoxiv.page
2025-07-10 08:43:21

Revealing the Hidden Temporal Structure of HubertSoft Embeddings based on the Russian Phonetic Corpus
Anastasia Ananeva, Anton Tomilov, Marina Volkova
arxiv.org/abs/2507.06794

@arXiv_eessAS_bot@mastoxiv.page
2025-07-03 08:54:50

IdolSongsJp Corpus: A Multi-Singer Song Corpus in the Style of Japanese Idol Groups
Hitoshi Suda, Junya Koguchi, Shunsuke Yoshida, Tomohiko Nakamura, Satoru Fukayama, Jun Ogata
arxiv.org/abs/2507.01349

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 10:31:20

Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
arxiv.org/abs/2509.12961

@arXiv_csIR_bot@mastoxiv.page
2025-09-15 08:39:01

GeoGPT.RAG Technical Report
Fei Huang, Fan Wu, Zeqing Zhang, Qihao Wang, Long Zhang, Grant Michael Boquet, Hongyang Chen
arxiv.org/abs/2509.09686

@arXiv_csCL_bot@mastoxiv.page
2025-07-17 10:04:30

StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba, Tomasz Walkowiak
arxiv.org/abs/2507.12064

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 10:02:31

SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
Luca Mariotti, Veronica Guidetti, Federica Mandreoli
arxiv.org/abs/2507.06895

@arXiv_csSD_bot@mastoxiv.page
2025-09-05 07:50:51

WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Longhao Li, Zhao Guo, Hongjie Chen, Yuhang Dai, Ziyu Zhang, Hongfei Xue, Tianlun Zuo, Chengyou Wang, Shuiyuan Wang, Jie Li, Xin Xu, Hui Bu, Binbin Zhang, Ruibin Yuan, Ziya Zhou, Wei Xue, Lei Xie
arxiv.org/abs/2509.03959

@arXiv_csCL_bot@mastoxiv.page
2025-06-30 10:20:00

MDC-R: The Minecraft Dialogue Corpus with Reference
Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Mladen Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio
arxiv.org/abs/2506.22062

@arXiv_csSD_bot@mastoxiv.page
2025-08-28 08:13:21

The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music
Sepideh Shafiei, Shapour Hakam
arxiv.org/abs/2508.19876 arxiv.…

@arXiv_csIR_bot@mastoxiv.page
2025-08-12 10:17:43

HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen
arxiv.org/abs/2508.08088

@arXiv_csCL_bot@mastoxiv.page
2025-09-11 10:00:13

Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini
arxiv.org/abs/2509.08824

@arXiv_csSD_bot@mastoxiv.page
2025-06-24 10:26:40

JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko
arxiv.org/abs/2506.18296

@arXiv_csCL_bot@mastoxiv.page
2025-08-15 10:04:02

A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona
Daniel Huang, Hyoun-A Joo
arxiv.org/abs/2508.10246

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:05:20

The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks
Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich
arxiv.org/abs/2508.16371

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:04:50

JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus
Masaaki Nagata, Katsuki Chousa, Norihito Yasuda
arxiv.org/abs/2508.16303

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:54:31

Prominence-aware automatic speech recognition for conversational speech
Julian Linke, Barbara Schuppler
arxiv.org/abs/2509.10116 arxiv.org/…

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:56:01

Benchmark of stylistic variation in LLM-generated texts
Jiří Milička, Anna Marklová, Václav Cvrček
arxiv.org/abs/2509.10179

@arXiv_csCL_bot@mastoxiv.page
2025-08-15 10:15:32

Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages
Nasma Chaoui, Richard Khoury
arxiv.org/abs/2508.10683

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:54:21

Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
arxiv.org/abs/2509.10108

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:51:31

!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
arxiv.org/abs/2509.10040

@arXiv_csCL_bot@mastoxiv.page
2025-08-04 09:51:40

DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models
Shantanu Thorat, Andrew Caines
arxiv.org/abs/2508.00619 arxi…

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:58:42

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
arxiv.org/abs/2507.08606

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:56:02

Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach
Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello
arxiv.org/abs/2507.08487

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:04:56

Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study
Monica Gonzalez-Machorro, Uwe Reichel, Pascal Hecker, Helly Hammer, Hesam Sagha, Florian Eyben, Robert Hoepner, Björn W. Schuller
arxiv.org/abs/2508.18092

@arXiv_csCL_bot@mastoxiv.page
2025-08-13 10:18:52

SinLlama - A Large Language Model for Sinhala
H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
arxiv.org/abs/2508.09115

@arXiv_csCL_bot@mastoxiv.page
2025-08-25 10:01:40

OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
arxiv.org/abs/2508.16048

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:23:49

ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
Phuong-Nam Dang, Kieu-Linh Nguyen, Thanh-Hieu Pham
arxiv.org/abs/2509.09131

@arXiv_csCL_bot@mastoxiv.page
2025-08-26 12:03:06

A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Oleg Silcenco, Marcos R. Machad, Wallace C. Ugulino, Daniel Braun
arxiv.org/abs/2508.17994

@arXiv_csCL_bot@mastoxiv.page
2025-09-11 09:45:53

LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge
Dima Galat, Diego Molla-Aliod
arxiv.org/abs/2509.08596

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:04:21

AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
arxiv.org/abs/2509.07459

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 10:00:01

Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Matthew Anderson Hendricks, Alice Cicirello
arxiv.org/abs/2507.06803

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 08:51:41

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
Amal Chebbi, Babajide Kolade
arxiv.org/abs/2509.07177 arxiv.org…

@arXiv_csCL_bot@mastoxiv.page
2025-07-08 14:00:21

SIGIR 2025 -- LiveRAG Challenge Report
David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory
arxiv.org/abs/2507.04942

@arXiv_csCL_bot@mastoxiv.page
2025-06-26 09:14:30

Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
arxiv.org/abs/2506.20269