Tootfinder

Opt-in global Mastodon full text search. Join the index!

@mia@hcommons.social
2025-10-09 08:17:27

Looking forward to reading this! “Making BERT Feel at Home. Modelling Domestic Space in 19th-Century British and Irish Fiction”, Journal of Computational Literary Studies 4(1). doi: doi.org/10.48694/jcls.4164
By Guhr, S., Monaco, J., Sherman, A., Warner, M. & Algee-Hewitt, M.

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:20:22

A Set of Quebec-French Corpus of Regional Expressions and Terms
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
arxiv.org/abs/2510.05026

@arXiv_csLG_bot@mastoxiv.page
2025-10-09 10:52:01

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk
arxiv.org/abs/2510.07192

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:22:21

Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Maria Levchenko
arxiv.org/abs/2510.06743 arx…

@netzschleuder@social.skewed.de
2025-10-04 00:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@Marwe@troet.cafe
2025-11-07 13:06:57

Oops, a password known to be insecure yields an astonishingly high nine-digit number of hits:
> This password has been seen 179,863,340 times before in data breaches!
haveibeenpwned.com/Passwords

@arXiv_csMM_bot@mastoxiv.page
2025-10-10 09:10:09

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues
Krish Patel, Dingkun Zhou, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

@arXiv_csLG_bot@mastoxiv.page
2025-10-09 10:43:11

Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation
Aryan Golbaghi, Shuo Zhou
arxiv.org/abs/2510.07052

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:04:21

AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
arxiv.org/abs/2509.07459

@gray17@mastodon.social
2025-12-06 13:43:20

> if you think about it in the context of the training models—it has a rough sense that you’re like a 37 year old guy on Reddit. That’s the kind of person that it’s doing the continuation for, because that’s a big chunk of the training corpus.
> I often tell people whenever they send me a message like, “a large language model said I should do x, y, z.” what you’re really saying is, “a 37 year old guy on Reddit said it,” and you’ve got roughly the same amount of information

@netzschleuder@social.skewed.de
2025-10-03 21:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csIR_bot@mastoxiv.page
2025-10-06 07:36:49

Less LLM, More Documents: Searching for Improved RAG
Jingjie Ning, Yibo Kong, Yunfan Long, Jamie Callan
arxiv.org/abs/2510.02657 arxiv.org/…

@arXiv_csSD_bot@mastoxiv.page
2025-10-03 09:05:11

Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
arxiv.org/abs/2510.01958

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 08:51:41

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
Amal Chebbi, Babajide Kolade
arxiv.org/abs/2509.07177 arxiv.org…

@arXiv_csCV_bot@mastoxiv.page
2025-10-06 10:12:09

SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan

@arXiv_csCY_bot@mastoxiv.page
2025-10-07 10:04:52

Quantifying Gender Stereotypes in Japan between 1900 and 1999 with Word Embeddings
Shintaro Sakai, Haewoon Kwak, Jisun An, Akira Matsui
arxiv.org/abs/2510.03905

@pavelasamsonov@mastodon.social
2025-12-01 13:48:36

The Grinch did nothing wrong. He wasn't *stealing* #Christmas, he was just gathering a corpus for training his #AI model. Investors are already lining up with their billions to fund the construction of the Whoville Data Center, ignoring concerns from residents.

@arXiv_hepph_bot@mastoxiv.page
2025-10-06 08:12:19

ArgoLOOM: agentic AI for fundamental physics from quarks to cosmos
S. D. Bakshi, P. Barry, C. Bissolotti, I. Cloet, S. Corrodi, Z. Djurcic, S. Habib, K. Heitmann, T. J. Hobbs, W. Hopkins, S. Joosten, B. Kriesten, N. Ramachandra, A. Wells, M. Zurek
arxiv.org/abs/2510.02426

@thomasfuchs@hachyderm.io
2025-10-27 15:49:29

How about “algorithmic stolen corpus derivative” instead of “AI generated”

@arXiv_csGR_bot@mastoxiv.page
2025-10-07 09:18:22

Neon: Negative Extrapolation From Self-Training Improves Image Generation
Sina Alemohammad, Zhangyang Wang, Richard G. Baraniuk
arxiv.org/abs/2510.03597

@felwert@fedihum.org
2025-09-24 12:36:01

Very sad not to be at #FORGE25, but glad that Philipp Tögel is representing our @… so well. For anyone who wants a quick look at our approach to TEI modelling of heterogeneous multilingual text corpora:

@arXiv_csCY_bot@mastoxiv.page
2025-10-03 08:17:21

Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
Stephen Meisenbacher, Svetlozar Nestorov, Peter Norlander
arxiv.org/abs/2510.01470

@patrikja@functional.cafe
2025-10-12 07:05:09

TyDe 2025:
"Generating a corpus of Hazel programs from ill-typed OCaml programs"
presented by Patrick Ferris, joint work with Anil Madhavapeddy.
conf.researchr.org/details/icf

First slide of "Generating a corpus of Hazel programs from ill-typed OCaml programs" presented by Patrick Ferris.

@rachel@norfolk.social
2025-10-29 22:40:29

I do love hanging around on a social platform where there simply isn’t any value in all this shit
theverge.com/news/809349/meta-

@netzschleuder@social.skewed.de
2025-10-25 12:00:04

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@avalon@jazztodon.com
2025-10-30 02:47:45

Habeas Corpus - Darcy James Argue - Secret Society
#infernalmachines
youtube.com/watch?v=b0Cw_pLDflY

@arXiv_csIR_bot@mastoxiv.page
2025-10-02 09:42:21

On Listwise Reranking for Corpus Feedback
Soyoung Yoon, Jongho Kim, Daeyong Kwon, Avishek Anand, Seung-won Hwang
arxiv.org/abs/2510.00887 a…

@arXiv_csCL_bot@mastoxiv.page
2025-09-29 11:14:17

The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
Sophie Spliethoff, Sanne Hoeken, Silke Schwandt, Sina Zarrieß, Özge Alaçam
arxiv.org/abs/2509.22345

@arXiv_csCR_bot@mastoxiv.page
2025-09-30 11:07:51

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova
arxiv.org/abs/2509.23519

@awinkler@openbiblio.social
2025-09-24 14:57:46

research questions by Barbara McGillivray from @… at the end of her presentation in Aarhus. She has recently won funding for the project 'Computational Corpus Annotation for Quantitative Analysis of Latin Lexical Semantics' (COALA), cf.

@arXiv_eessAS_bot@mastoxiv.page
2025-09-22 09:31:01

Rethinking Cross-Corpus Speech Emotion Recognition Benchmarking: Are Paralinguistic Pre-Trained Representations Sufficient?
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Parabattina Bhagath, Pailla Balakrishna Reddy, Arun Balaji Buduru
arxiv.org/abs/2509.16182

The Insurrection Act is not a declaration of martial law.
It doesn't shut down the courts.
It doesn't suspend habeas corpus.
It means you can use the military to enforce federal laws,
but the laws themselves remain the same.
f…

@johl@mastodon.xyz
2025-09-19 20:11:00

Fat Bear Week is coming early this year. The annual online competition that normally starts in early October will instead start on Sept. 23.

@arXiv_csCL_bot@mastoxiv.page
2025-09-29 11:23:57

ArabJobs: A Multinational Corpus of Arabic Job Ads
Mo El-Haj
arxiv.org/abs/2509.22589 arxiv.org/pdf/2509.22589

@arXiv_csCE_bot@mastoxiv.page
2025-09-29 08:19:57

Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation
Weimin Wu, Alexander C. Furnas, Eddie Yang, Gefei Liu, Akhil Pandey Akella, Xuefeng Song, Dashun Wang, Han Liu
arxiv.org/abs/2509.21493

@netzschleuder@social.skewed.de
2025-09-21 10:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csSD_bot@mastoxiv.page
2025-09-22 09:34:21

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control
Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura
arxiv.org/abs/2509.15626

@arXiv_csSI_bot@mastoxiv.page
2025-09-16 08:58:27

YTCommentVerse: A Multi-Category Multi-Lingual YouTube Comment Corpus
Hridoy Sankar Dutta, Biswadeep Khan
arxiv.org/abs/2509.11057 arxiv.or…

@karlauerbach@sfba.social
2025-11-15 22:21:20

I have yet to hear any, much less a solid, argument against my suggestion that Presidential pardons and commutations are revocable by a subsequent president.
Were I elected president I would revoke all of El Cheato's pardons and commutations and let the people involved make arguments (probably in the context of Habeas Corpus proceedings) why those actions are not Constitutional.
My own sense is that the question tends more towards the "not revocable" with regard to …

@arXiv_csCL_bot@mastoxiv.page
2025-10-02 10:36:01

Tenyidie Syllabification corpus creation and deep learning applications
Teisovi Angami, Kevisino Khate
arxiv.org/abs/2510.00629 arxiv.org/p…

@simon_lucy@mastodon.social
2025-11-12 12:27:29

I hadn't realised that I had any Assembly of Dust but I have Corpus Christi on an old EMusic compilation and very Steely Dan it is. In this case that's a good thing.
#Music #Serendiipity

@netzschleuder@social.skewed.de
2025-09-20 03:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csCL_bot@mastoxiv.page
2025-10-02 10:28:51

EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
arxiv.org/abs/2510.00514

@netzschleuder@social.skewed.de
2025-11-20 02:00:04

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@kexpmusicbot@mastodonapp.uk
2025-10-12 07:22:59

🇺🇦 #NowPlaying on KEXP's #SeekAndDestroy
Corpus Offal:
🎵 Gorging Gastric Decedent
#CorpusOffal
listen.20buckspin.com/track/go
open.spotify.com/track/2UmFZG5

@arXiv_csAI_bot@mastoxiv.page
2025-09-16 08:15:26

AI Answer Engine Citation Behavior: An Empirical Analysis of the GEO16 Framework
Arlen Kumar, Leanid Palkhouski
arxiv.org/abs/2509.10762 arx…

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 10:17:59

StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering
Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang
arxiv.org/abs/2510.02827

@arXiv_qbioGN_bot@mastoxiv.page
2025-10-01 08:01:27

DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification
Darren King, Yaser Atlasi, Gholamreza Rafiee
arxiv.org/abs/2509.25274

@arXiv_eessAS_bot@mastoxiv.page
2025-09-15 08:20:11

The MSP-Podcast Corpus
Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang-Cheng Chou, Pravin Mote
arxiv.org/abs/2509.09791

@netzschleuder@social.skewed.de
2025-09-16 22:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csIR_bot@mastoxiv.page
2025-10-01 08:37:47

Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation
Guoqing Hu, An Zhang, Shuchang Liu, Wenyu Mao, Jiancan Wu, Xun Yang, Xiang Li, Lantao Hu, Han Li, Kun Gai, Xiang Wang
arxiv.org/abs/2509.26063

@arXiv_csSD_bot@mastoxiv.page
2025-10-02 08:12:21

Unpacking Musical Symbolism in Online Communities: Content-Based and Network-Centric Approaches
Kajwan Ziaoddini
arxiv.org/abs/2510.00006 a…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:57:21

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
Yuhang Dai, Ziyu Zhang, Shuai Wang, Longhao Li, Zhao Guo, Tianlun Zuo, Shuiyuan Wang, Hongfei Xue, Chengyou Wang, Qing Wang, Xin Xu, Hui Bu, Jie Li, Jian Kang, Binbin Zhang, Lei Xie
arxiv.org/abs/2509.18004

@arXiv_csCR_bot@mastoxiv.page
2025-09-23 10:16:30

Evaluating LLM Generated Detection Rules in Cybersecurity
Anna Bertiger, Bobby Filar, Aryan Luthra, Stefano Meschiari, Aiden Mitchell, Sam Scholten, Vivek Sharath
arxiv.org/abs/2509.16749

@netzschleuder@social.skewed.de
2025-11-14 05:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csCL_bot@mastoxiv.page
2025-09-24 10:31:14

LOTUSDIS: A Thai far-field meeting corpus for robust conversational ASR
Pattara Tipaksorn, Sumonmas Thatphithakkul, Vataya Chunwijitra, Kwanchiva Thangthai
arxiv.org/abs/2509.18722

@arXiv_csLG_bot@mastoxiv.page
2025-09-23 20:12:16

Replaced article(s) found for cs.LG. arxiv.org/list/cs.LG/new
[9/10]:
- HARPT: A Corpus for Analyzing Consumers' Trust and Privacy Concerns in Electronic Health Apps
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:10:48

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
arxiv.org/abs/2510.10774

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:36:52

SwissGPC v1.0 -- The Swiss German Podcasts Corpus
Samuel Stucki, Mark Cieliebak, Jan Deriu
arxiv.org/abs/2509.19866 arxiv.org/pdf/2509.1986…

@netzschleuder@social.skewed.de
2025-11-14 04:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron

@arXiv_csSI_bot@mastoxiv.page
2025-09-26 08:07:31

Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos
Mohammad Reza Zarei, Barbara Stead-Coyle, Michael Christensen, Sarah Everts, Majid Komeili
arxiv.org/abs/2509.20724

@arXiv_csAI_bot@mastoxiv.page
2025-10-15 10:14:41

Using Medical Algorithms for Task-Oriented Dialogue in LLM-Based Medical Interviews
Rui Reis, Pedro Rangel Henriques, João Ferreira-Coimbra, Eva Oliveira, Nuno F. Rodrigues
arxiv.org/abs/2510.12490

@arXiv_csIR_bot@mastoxiv.page
2025-10-01 07:45:27

On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search
Nick Hagar, Nicholas Diakopoulos, Jeremy Gilbert
arxiv.org/abs/2509.25494

@arXiv_eessAS_bot@mastoxiv.page
2025-09-23 10:21:30

BeepBank-500: A Synthetic Earcon Mini-Corpus for UI Sound Research and Psychoacoustics Research
Mandip Goswami
arxiv.org/abs/2509.17277 arx…

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:54:10

SiDiaC: Sinhala Diachronic Corpus
Nevidu Jayatilleke, Nisansa de Silva
arxiv.org/abs/2509.17912 arxiv.org/pdf/2509.17912

@arXiv_csSD_bot@mastoxiv.page
2025-10-01 08:31:37

Learning Relationships Between Separate Audio Tracks for Creative Applications
Balthazar Bujard (IRCAM, SU), Jérôme Nika (IRCAM), Fédéric Bevilacqua (IRCAM), Nicolas Obin
arxiv.org/abs/2509.25296

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:15:41

LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich
arxiv.org/abs/2509.21269

@arXiv_csIR_bot@mastoxiv.page
2025-09-29 09:35:27

Does Generative Retrieval Overcome the Limitations of Dense Retrieval?
Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
arxiv.org/abs/2509.22116

@arXiv_csCL_bot@mastoxiv.page
2025-09-24 10:57:14

SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data
Erik Božík, Marek Šuppa
arxiv.org/abs/2509.19270 arx…

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 09:14:10

More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition
James Tavernor, Emily Mower Provost
arxiv.org/abs/2509.12295

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:14:01

CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen
arxiv.org/abs/2509.21208

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 14:37:21

Replaced article(s) found for cs.LG. arxiv.org/list/cs.LG/new
[7/7]:
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

@arXiv_eessAS_bot@mastoxiv.page
2025-09-18 08:13:31

Enhancing Speaker-Independent Dysarthric Speech Severity Classification with DSSCNet and Cross-Corpus Adaptation
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
arxiv.org/abs/2509.13442

@arXiv_csSD_bot@mastoxiv.page
2025-09-29 09:43:47

Cross-Dialect Bird Species Recognition with Dialect-Calibrated Augmentation
Jiani Ding, Qiyang Sun, Alican Akman, Björn W. Schuller
arxiv.org/abs/2509.22317

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:35:21

A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction
Cameron Morin, Matti Marttinen Larsson
arxiv.org/abs/2510.12306

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:11:51

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
Qiuyang Lu, Fangjian Shen, Zhengkai Tang, Qiang Liu, Hexuan Cheng, Hui Liu, Wushao Wen
arxiv.org/abs/2509.15789

@arXiv_csSD_bot@mastoxiv.page
2025-10-13 09:27:50

Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
arxiv.org/abs/2510.09072

@arXiv_csIR_bot@mastoxiv.page
2025-09-22 07:50:31

Efficient and Versatile Model for Multilingual Information Retrieval of Islamic Text: Development and Deployment in Real-World Scenarios
Vera Pavlova, Mohammed Makhlouf
arxiv.org/abs/2509.15380

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:40:30

Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval
Yu Wang, Tianhao Tan, Yifei Wang
arxiv.org/abs/2510.09553

@arXiv_csCL_bot@mastoxiv.page
2025-09-12 09:23:29

Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He, Jiaqi Xu
arxiv.org/abs/2509.09125

@arXiv_csCL_bot@mastoxiv.page
2025-09-29 11:18:37

NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use
Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza
arxiv.org/abs/2509.22479

@arXiv_csCL_bot@mastoxiv.page
2025-09-29 11:15:07

CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
Niharika Hegde, Subarnaduti Paul, Lars Joel-Frey, Manuel Brack, Kristian Kersting, Martin Mundt, Patrick Schramowski
arxiv.org/abs/2509.22360

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:19:01

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai

@arXiv_csCL_bot@mastoxiv.page
2025-09-24 10:45:44

Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus
Chiara Alzetta, Serena Auriemma, Alessandro Bondielli, Luca Dini, Chiara Fazzone, Alessio Miaschi, Martina Miliani, Marta Sartor
arxiv.org/abs/2509.19033

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:35:22

TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
Ji Yin, Menglan He, Yujie Zhang, Linshuai Zhang, Tingting Ma, Ce Tian, Jie Wu, Lin Xu, Tao Jiang
arxiv.org/abs/2509.19834

@arXiv_csCL_bot@mastoxiv.page
2025-09-23 12:55:51

HICode: Hierarchical Inductive Coding with LLMs
Mian Zhong, Pristina Wang, Anjalie Field
arxiv.org/abs/2509.17946 arxiv.org/pdf/2509.17946

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:43:12

Causal Understanding by LLMs: The Role of Uncertainty
Oscar Lithgow-Serrano, Vani Kanjirangat, Alessandro Antonucci
arxiv.org/abs/2509.20088

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:38:52

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj, Vandan Mujadia, Baban Gain, Pruthwik Mishra, Arafat Ahsan, Parameswari Krishnamurthy, Ashwath Rao, Gurpreet Singh Josan, Preeti Dubey, Aadil Amin Kak, Anna Rao Kulkarni, Narendra VG, Sunita Arora, Rakesh Balbantray, Prasenjit Majumdar, Karunesh K Arora, Asif Ekbal, Dipti Mishra Sharma

@arXiv_csCL_bot@mastoxiv.page
2025-09-11 10:00:13

Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini
arxiv.org/abs/2509.08824

@arXiv_csCL_bot@mastoxiv.page
2025-09-22 10:09:21

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment
Ke Wang, Wenning Wei, Yan Deng, Lei He, Sheng Zhao
arxiv.org/abs/2509.15701

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 10:33:21

Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney
arxiv.org/abs/2509.14926 arxiv.org/pdf/2509.149…

@arXiv_csCL_bot@mastoxiv.page
2025-09-19 07:44:21

Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
Jinfan Frank Hu
arxiv.org/abs/2509.14238

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 10:31:20

Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
arxiv.org/abs/2509.12961

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:10:51

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
arxiv.org/abs/2509.14008

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:54:31

Prominence-aware automatic speech recognition for conversational speech
Julian Linke, Barbara Schuppler
arxiv.org/abs/2509.10116 arxiv.org/…

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:56:01

Benchmark of stylistic variation in LLM-generated texts
Jiří Milička, Anna Marklová, Václav Cvrček
arxiv.org/abs/2509.10179

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:54:21

Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
arxiv.org/abs/2509.10108

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:51:31

!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
arxiv.org/abs/2509.10040

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:36:00

KORMo: Korean Open Reasoning Model for Everyone
Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
arxiv.org/abs/2510.09426