Tootfinder

@arXiv_csCL_bot@mastoxiv.page
2025-06-30 10:21:50

Refining Czech GEC: Insights from a Multi-Experiment Approach
Petr Pechman, Milan Straka, Jana Strakov\'a, Jakub N\'aplava
https://arxiv.org/abs/2506.22402

Refining Czech GEC: Insights from a Multi-Experiment Approach
We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synth…

@arXiv_csIR_bot@mastoxiv.page
2025-06-27 08:08:29

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
https://arxiv.org/abs/2506.20963

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our met…

@arXiv_csSD_bot@mastoxiv.page
2025-06-23 10:33:30

Universal Music Representations? Evaluating Foundation Models on World Music Corpora
Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
https://arxiv.org/abs/2506.17055

Universal Music Representations? Evaluating Foundation Models on World Music Corpora
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models' cross-cultural capabilities: probing to assess inherent representations…

@arXiv_csCL_bot@mastoxiv.page
2025-06-26 07:31:50

CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
Deepon Halder, Thanmay Jayakumar, Raj Dabre
https://arxiv.org/abs/2506.19952

CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high quality machine translation (MT). However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach leveraging LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill involves iteratively generating …

@arXiv_physicssocph_bot@mastoxiv.page
2025-06-25 12:57:34

Replaced article(s) found for physics.soc-ph. https://arxiv.org/list/physics.soc-ph/new
[1/1]:
- Entropy and type-token ratio in gigaword corpora
Pablo Rosillo-Rodes, Maxi San Miguel, David Sanchez

@arXiv_econGN_bot@mastoxiv.page
2025-06-19 08:39:22

Identifying economic narratives in large text corpora -- An integrated approach using Large Language Models
Tobias Schmidt, Kai-Robin Lange, Matthias Reccius, Henrik M\"uller, Michael Roos, Carsten Jentsch
https://arxiv.org/abs/2506.15041

Identifying economic narratives in large text corpora -- An integrated approach using Large Language Models
As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts. Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task. While effective on foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tas…

@arXiv_csSE_bot@mastoxiv.page
2025-06-19 08:37:13

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu
https://arxiv.org/abs/2506.15655

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve actuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees …

@arXiv_csIR_bot@mastoxiv.page
2025-06-03 07:38:39

When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR
Dayoon Ko, Jinyoung Kim, Sohyeon Kim, Jinhyuk Kim, Jaehoon Lee, Seonghak Song, Minyoung Lee, Gunhee Kim
https://arxiv.org/abs/2506.01877

When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR
Dense retrievers encode texts into embeddings to efficiently retrieve relevant documents from large databases in response to user queries. However, real-world corpora continually evolve, leading to a shift from the original training distribution of the retriever. Without timely updates or retraining, indexing newly emerging documents can degrade retrieval performance for future queries. Thus, identifying when a dense retriever requires an update is critical for maintaining robust retrieval syst…

@arXiv_csCL_bot@mastoxiv.page
2025-06-26 09:36:40

Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
Tong Zhou
https://arxiv.org/abs/2506.20476 https://arxiv.org/pd…

Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
This paper presents Team Marikarp's solution for the SIGIR 2025 LiveRAG competition. The competition's evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M documents subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG …

@arXiv_csLG_bot@mastoxiv.page
2025-06-12 10:00:11

NDCG-Consistent Softmax Approximation with Accelerated Convergence
Yuanhao Pu, Defu Lian, Xiaolong Chen, Xu Huang, Jin Chen, Enhong Chen
https://arxiv.org/abs/2506.09454

NDCG-Consistent Softmax Approximation with Accelerated Convergence
Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM …

@arXiv_csCR_bot@mastoxiv.page
2025-06-09 08:14:02

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
https://arxiv.org/abs/2506.06151

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the ret…

@arXiv_csIR_bot@mastoxiv.page
2025-06-24 11:43:30

Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation
Jingming Liu, Yumeng Li, Wei Shi, Yao-Xiang Ding, Hui Su, Kun Zhou
https://arxiv.org/abs/2506.18670

Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation
Recent studies have proposed leveraging Large Language Models (LLMs) as information retrievers through query rewriting. However, for challenging corpora, we argue that enhancing queries alone is insufficient for robust semantic matching; the LLM should also have sufficient understanding of the corpus by directly handling and augmenting the documents themselves. To this end, we present an LLM-based retriever empowered to augment both user queries and corpus documents, with its policy fully explo…

@arXiv_csCL_bot@mastoxiv.page
2025-06-17 09:33:51

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
https://arxiv.org/abs/2506.12229

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora -- counting string appearances and retrieving the enclosing documents -- yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-ind…

@arXiv_statME_bot@mastoxiv.page
2025-06-09 09:42:12

Testing Hypotheses of Covariate Effects on Topics of Discourse
Gabriel Phelan, David A. Campbell
https://arxiv.org/abs/2506.05570 https://

Testing Hypotheses of Covariate Effects on Topics of Discourse
We introduce an approach to topic modelling with document-level covariates that remains tractable in the face of large text corpora. This is achieved by de-emphasizing the role of parameter estimation in an underlying probabilistic model, assuming instead that the data come from a fixed but unknown distribution whose statistical functionals are of interest. We propose combining a convex formulation of non-negative matrix factorization with standard regression techniques as a fast-to-compute and…

@arXiv_csSD_bot@mastoxiv.page
2025-06-19 08:36:13

TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
https://arxiv.org/abs/2506.15614

TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as ``dark data,'' such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-bas…

@arXiv_csSE_bot@mastoxiv.page
2025-06-16 10:22:19

Retrieval-Augmented Code Review Comment Generation
Hyunsun Hong, Jongmoon Baik
https://arxiv.org/abs/2506.11591 https://arxiv.org/pdf…

Retrieval-Augmented Code Review Comment Generation
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes. Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval-based (IR), reusing comments from similar past examples. While generation-based methods leverage code-specific pretraining on large code-natural language corpora to learn semantic relationships between code and natural language, they oft…

@arXiv_csIT_bot@mastoxiv.page
2025-06-03 16:20:54

This https://arxiv.org/abs/2504.02712 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csIT_…

TeleMoM: Consensus-Driven Telecom Intelligence via Mixture of Models
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches like retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coor…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-16 08:30:59

S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder
Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao
https://arxiv.org/abs/2506.11160

S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder
Multilingual speech-to-speech translation (S2ST) aims to directly convert spoken utterances from multiple source languages into natural and intelligible speech in a target language. Despite recent progress, significant challenges remain: (1) achieving high-quality and low-latency S2ST remains a critical hurdle; (2) existing S2ST approaches heavily rely on large-scale parallel speech corpora, which are extremely difficult to collect. To address these issues, we propose S2ST-Omni, an efficient an…

@arXiv_csIR_bot@mastoxiv.page
2025-06-23 09:44:00

eSapiens: A Real-World NLP Framework for Multimodal Document Understanding and Enterprise Knowledge Processing
Isaac Shi, Zeyuan Li, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
https://arxiv.org/abs/2506.16768

eSapiens: A Real-World NLP Framework for Multimodal Document Understanding and Enterprise Knowledge Processing
We introduce eSapiens, a unified question-answering system designed for enterprise settings, which bridges structured databases and unstructured textual corpora via a dual-module architecture. The system combines a Text-to-SQL planner with a hybrid Retrieval-Augmented Generation (RAG) pipeline, enabling natural language access to both relational data and free-form documents. To enhance answer faithfulness, the RAG module integrates dense and sparse retrieval, commercial reranking, and a citatio…

@arXiv_csLG_bot@mastoxiv.page
2025-06-10 19:17:32

This https://arxiv.org/abs/2504.13416 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership-i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an origi…

@arXiv_csDL_bot@mastoxiv.page
2025-06-05 07:17:16

Comparing Retrieval Strategies to Capture Interdisciplinary Scientific Research: A Bibliometric Evaluation of the Integration of Neuroscience and Computer Science
Malena Mendez Isla, Agustin Mauro, Diego Kozlowski
https://arxiv.org/abs/2506.03187

Comparing Retrieval Strategies to Capture Interdisciplinary Scientific Research: A Bibliometric Evaluation of the Integration of Neuroscience and Computer Science
Interdisciplinary scientific research is increasingly important in knowledge production, funding policies, and academic discussions on scholarly communication. While many studies focus on interdisciplinary corpora defined a priori - usually through keyword-based searches within assumed interdisciplinary domains - few explore interdisciplinarity as an emergent intersection between two distinct fields. Thus, methodological proposals for building databases at the intersection of two fields of know…

@arXiv_qbioOT_bot@mastoxiv.page
2025-06-16 14:57:43

Replaced article(s) found for q-bio.OT. https://arxiv.org/list/q-bio.OT/new
[1/1]:
English dictionaries, gold and silver standard corpora for biomedical natural language processing...

@arXiv_csSE_bot@mastoxiv.page
2025-06-16 10:25:39

Classification of Quality Characteristics in Online User Feedback using Linguistic Analysis, Crowdsourcing and LLMs
Eduard C. Groen, Fabiano Dalpiaz, Martijn van Vliet, Boris Winter, Joerg Doerr, Sjaak Brinkkemper
https://arxiv.org/abs/2506.11722

Classification of Quality Characteristics in Online User Feedback using Linguistic Analysis, Crowdsourcing and LLMs
Software qualities such as usability or reliability are among the strongest determinants of mobile app user satisfaction and constitute a significant portion of online user feedback on software products, making it a valuable source of quality-related feedback to guide the development process. The abundance of online user feedback warrants the automated identification of quality characteristics, but the online user feedback's heterogeneity and the lack of appropriate training corpora limit the a…

@arXiv_csIR_bot@mastoxiv.page
2025-06-23 08:03:49

Architecture is All You Need: Improving LLM Recommenders by Dropping the Text
Kevin Foley, Shaghayegh Agah, Kavya Priyanka Kakinada
https://arxiv.org/abs/2506.15833

Architecture is All You Need: Improving LLM Recommenders by Dropping the Text
In recent years, there has been an explosion of interest in the applications of large pre-trained language models (PLMs) to recommender systems, with many studies showing strong performance of PLMs on common benchmark datasets. PLM-based recommender models benefit from flexible and customizable prompting, an unlimited vocabulary of recommendable items, and general ``world knowledge'' acquired through pre-training on massive text corpora. While PLM-based recommenders show promise in settings whe…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 16:55:16

This https://arxiv.org/abs/2408.16028 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data
Supervised-learning-based vulnerability detectors often fall short due to limited labelled training data. In contrast, Large Language Models (LLMs) like GPT-4 are trained on vast unlabelled code corpora, yet perform only marginally better than coin flips when directly prompted to detect vulnerabilities. In this paper, we reframe vulnerability detection as anomaly detection, based on the premise that vulnerable code is rare and thus anomalous relative to patterns learned by LLMs. We introduce AN…

@arXiv_csCL_bot@mastoxiv.page
2025-06-19 08:16:54

PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
Yuhui Shi, Yehan Yang, Qiang Sheng, Hao Mi, Beizhe Hu, Chaoxi Xu, Juan Cao
https://arxiv.org/abs/2506.15683

PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performanc…

@arXiv_csSD_bot@mastoxiv.page
2025-06-16 08:04:09

Enabling automatic transcription of child-centered audio recordings from real-world environments
Daniil Kocharov, Okko R\"as\"anen
https://arxiv.org/abs/2506.11747

Enabling automatic transcription of child-centered audio recordings from real-world environments
Longform audio recordings obtained with microphones worn by children-also known as child-centered daylong recordings-have become a standard method for studying children's language experiences and their impact on subsequent language development. Transcripts of longform speech audio would enable rich analyses at various linguistic levels, yet the massive scale of typical longform corpora prohibits comprehensive manual annotation. At the same time, automatic speech recognition (ASR)-based transcri…

@arXiv_csCL_bot@mastoxiv.page
2025-06-17 10:04:25

Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs
Erica Cai, Brendan O'Connor
https://arxiv.org/abs/2506.12367

Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs
Knowledge graphs (KGs) are useful for analyzing social structures, community dynamics, institutional memberships, and other complex relationships across domains from sociology to public health. While recent advances in large language models (LLMs) have improved the scalability and accessibility of automated KG extraction from large text corpora, the impacts of extraction errors on downstream analyses are poorly understood, especially for applied scientists who depend on accurate KGs for real-wo…

@arXiv_csCR_bot@mastoxiv.page
2025-06-03 17:36:14

This https://arxiv.org/abs/2502.11191 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…

Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-06 09:39:10

This https://arxiv.org/abs/2505.22251 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_ees…

Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with or w…

@arXiv_csCL_bot@mastoxiv.page
2025-06-10 19:06:51

This https://arxiv.org/abs/2506.06266 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCL_…

Cartridges: Lightweight and general-purpose long context representations via self-study
Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we loa…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-03 07:57:39

Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric
Mattson Ogg, Caitlyn Bishop, Han Yi, Sarah Robinson
https://arxiv.org/abs/2506.01655

Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric
Methods for automatically assessing speech quality are critical for many human language technologies. Behavioral ratings provided by human raters (e.g., mean opinion scores; MOS) are considered the gold standard, but they are susceptible to variability between individual raters, cannot easily be generalized across corpora, and are labor-intensive to collect, thus limiting the acoustic challenges they can quantify. Here, we present a new, scalable method for automatically assessing speech qualit…

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:19:57

Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs
Jiandong Shao, Yao Lu, Jianfei Yang
https://arxiv.org/abs/2506.01734 https://

Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs
Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law -- a statistical pattern where lower digits occur more frequently as leading digits -- we hypothesize that the long-tailed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whe…

Tootfinder

Opt-in global Mastodon full text search. Join the index!