Tootfinder

No exact results. Similar results found.

@arXiv_csCL_bot@mastoxiv.page
2025-09-09 12:09:52

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme
https://arxiv.org/abs/2509.06888

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research on encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature…

@v_i_o_l_a@openbiblio.social
2025-09-08 08:37:37

"Multilingual Scholarly Publishing and Artificial Intelligence Translation Tools: Weighing Social Justice and Climate Justice"
https://doi.org/10.3998/jep.7100

The use of English as a lingua franca for scholarly publishing has created inequities and is leading to a social justice movement to develop a more multilingual scholarly publishing ecosystem. However, implementing multilingualism is complex, and researchers and publishers are investigating the potential of AI translation tools for supporting linguistic diversity. At the same time, the climate justice movement is beginning to reveal some of the environmental and human costs associated with AI t…

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 09:57:21

Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
https://arxiv.org/abs/2507.06774 https://

Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, a…

@arXiv_csCL_bot@mastoxiv.page
2025-09-10 10:04:21

AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
https://arxiv.org/abs/2509.07459

Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech a…

@arXiv_csCL_bot@mastoxiv.page
2025-09-08 10:10:30

OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
Wei Chu, Yuanzhe Dong, Ke Tan, Dong Han, Xavier Menendez-Pidal, Ruchao Fan, Chenfeng Miao, Chanwoo Kim, Bhiksha Raj, Rita Singh
https://arxiv.org/abs/2509.04702

The OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset ser…

@arXiv_csCL_bot@mastoxiv.page
2025-09-09 11:52:42

Do LLMs exhibit the same commonsense capabilities across languages?
Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt
https://arxiv.org/abs/2509.06401 https:…

This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation co…

@arXiv_csCL_bot@mastoxiv.page
2025-09-08 10:14:30

PRIM: Towards Practical In-Image Multilingual Machine Translation
Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo
https://arxiv.org/abs/2509.05146

In-Image Machine Translation (IIMT) aims to translate images containing text from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (II…

@arXiv_csCL_bot@mastoxiv.page
2025-08-11 10:02:49

Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
https://arxiv.org/abs/2508.06435

Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain character…

@arXiv_csCL_bot@mastoxiv.page
2025-08-06 10:18:30

fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval
Pranshu Rastogi
https://arxiv.org/abs/2508.03475 https://a…

SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Succes…

@arXiv_csCL_bot@mastoxiv.page
2025-09-08 10:12:40

Using LLMs for Multilingual Clinical Entity Linking to ICD-10
Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva
https://arxiv.org/abs/2509.04868 https://arx…

The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases, 10th Revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare prof…
