Tootfinder

Opt-in global Mastodon full text search.

@arXiv_csCR_bot@mastoxiv.page
2025-07-25 08:29:21

MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection
Paulo Mendes (GECAD, ISEP, Polytechnic of Porto, Portugal), Eva Maia (GECAD, ISEP, Polytechnic of Porto, Portugal), Isabel Praça (GECAD, ISEP, Polytechnic of Porto, Portugal)
arxiv.org/abs/2507.17978

@arXiv_csHC_bot@mastoxiv.page
2025-06-25 08:34:50

HARPT: A Corpus for Analyzing Consumers' Trust and Privacy Concerns in Mobile Health Apps
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
arxiv.org/abs/2506.19268

@arXiv_csSD_bot@mastoxiv.page
2025-06-24 10:26:40

JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko
arxiv.org/abs/2506.18296

@arXiv_csIR_bot@mastoxiv.page
2025-06-24 11:43:30

Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation
Jingming Liu, Yumeng Li, Wei Shi, Yao-Xiang Ding, Hui Su, Kun Zhou
arxiv.org/abs/2506.18670

@bici@mastodon.social
2025-07-25 20:27:37

HathiTrust was founded in 2008 as a not-for-profit collaborative of academic and research libraries now preserving 18 million digitized items in the HathiTrust Digital Library. We offer reading access to the fullest extent allowable by U.S. and international copyright law, text and data mining tools for the entire corpus, and other emerging services based on the combined collection.

The image features a stylized brown line drawing of an elephant's head and trunk. The elephant's head is depicted with a simple, curved outline, and a small square represents the eye. The trunk is elegantly curved, with a hook-like end, suggesting the elephant's trunk. The lines are smooth and continuous, creating a minimalist design. The background is plain white, emphasizing the brown lines of the elephant. The overall design is abstract and modern, focusing on the essential features of the e…
@netzschleuder@social.skewed.de
2025-07-20 20:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron
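
The edge rule described in this post (node i links to node j if i sent at least one email to j) can be sketched in a few lines of Python. This is only a minimal illustration, assuming the input is a list of (sender, recipient) pairs; the sample addresses and the `build_email_graph` helper are hypothetical and are not taken from the dataset or its tooling. The sketch also assumes repeated emails between the same pair collapse into a single link.

```python
from collections import defaultdict


def build_email_graph(messages):
    """Map each sender to the set of addresses it emailed at least once."""
    graph = defaultdict(set)
    for sender, recipient in messages:
        # A set keeps one link per (sender, recipient) pair,
        # no matter how many emails were sent.
        graph[sender].add(recipient)
    return dict(graph)


# Hypothetical sample data, not taken from the Enron corpus.
messages = [
    ("alice@enron.com", "bob@enron.com"),
    ("alice@enron.com", "bob@enron.com"),   # repeat, adds no new link
    ("bob@enron.com", "carol@example.org"),  # link to a non-Enron address
]

graph = build_email_graph(messages)
nodes = {address for pair in messages for address in pair}
edge_count = sum(len(targets) for targets in graph.values())
print(len(nodes), "nodes,", edge_count, "edges")
```

Because links are directed, a reply from bob back to alice would count as a separate edge.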
@davidaugust@mastodon.online
2025-07-23 16:08:33

“‘It’s obvious that you don’t respect Copyright Law and Artist Rights any more than you respect Habeas Corpus and Due Process rights, not to mention the separation of Church and State per the US Constitution. For the record, we hereby order dhsgov [US Department of Homeland Security] to cease and desist the use of our recording and demand that you immediately pull down your video.’”
“They added: ‘Oh, and go f… yourselves.’”

@gedankenstuecke@scholar.social
2025-07-23 17:07:48

At least people across academic disciplines are selling themselves out to "AI" bullshit.
Here: Let's get some software to make wrong autocompletions of Latin texts and call it scholarship, while probably poisoning our knowledge corpus for years to come…
archive.ph/U0ePh

@david_colquhoun@mstdn.social
2025-05-11 18:57:07

"White House deputy chief of staff Stephen Miller says President Donald Trump is looking for ways to expand its legal power to deport migrants who are in the United States illegally. To achieve that, he says the administration is “actively looking at” suspending habeas corpus, . ."

@sascha_wolfer@fediscience.org
2025-07-23 06:22:41

- We investigate the distribution of the feature variables (corpus frequency, dictionary views, part-of-speech, polysemy) over CEFR levels.
- Variable importance analyses show us how important each variable was for the classification of each level.
We conclude: "Thus, our semi-automatic approach offers a practical solution to the limitations of existing CEFR lists, providing a framework for expanding these lists in a systematic and data-driven manner. However, our findings also reveal the importance of human oversight in the process."
Supplementary material contains an ensemble approach to classification and all used/generated data: osf.io/6s9y7/

@arXiv_csAI_bot@mastoxiv.page
2025-07-23 10:06:42

Identifying Pre-training Data in LLMs: A Neuron Activation-Based Detection Framework
Hongyi Tang, Zhihao Zhu, Yi Yang
arxiv.org/abs/2507.16414

@arXiv_csSE_bot@mastoxiv.page
2025-07-23 09:51:42

Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support
Fangjian Lei, Mariam El Mezouar, Shayan Noei, Ying Zou
arxiv.org/abs/2507.16754

@berlinbuzzwords@floss.social
2025-05-22 11:00:28

Join Radu Pop and Pietro Mele at Berlin Buzzwords as they discuss building an extensible hybrid search solution with Elasticsearch. They will cover functional modeling, cluster architecture, and practical insights on managing billions of vectors in real-world scenarios. Radu and Pietro will also address hybrid reranking challenges and the limitations of standard fusion techniques, explaining their innovative approach.
Learn more:

Session title: Hybrid search on hybrid models, at scale
Radu Pop
Pietro Mele
Join us from June 15-17 in Berlin or participate online / berlinbuzzwords.de

The U.S. Department of Housing and Urban Development is preparing to
💥shut down seven major investigations and cases concerning alleged housing discrimination and segregation, including
⚠️ some where the agency already found civil rights violations, according to HUD records obtained by ProPublica.
The high-profile cases involve allegations that state and local governments across the South and Midwest illegally discriminated against people of color
by placing industrial…

@arXiv_csSI_bot@mastoxiv.page
2025-07-24 08:11:49

Disaster Informatics after the COVID-19 Pandemic: Bibliometric and Topic Analysis based on Large-scale Academic Literature
Ngan Tran, Haihua Chen, Ana Cleveland, Yuhan Zhou
arxiv.org/abs/2507.16820

@egallager@social.treehouse.systems
2025-05-20 19:07:22

Since Kristi Noem doesn't appear to know what habeas corpus is, I'm updating my fantasy version of the Constitution (where I imagine how I might edit it) to spell it out a bit more explicitly:
github.com/cooljeanius/w_Const

@arXiv_mathCO_bot@mastoxiv.page
2025-06-23 08:10:49

Prüfer codes on vertex-colored rooted trees
R. W. R. Darling, Grant Fickes
arxiv.org/abs/2506.15796 arxiv.or…

@mia@hcommons.social
2025-07-19 09:23:06

#DH2025 thanks @flochiff.bsky.social for sharing this link as I wanted to follow up on Pandore! 'Pandore: automating text-processing workflows for humanities researchers' from Sorbonne Université and ObTIC - Observatoire des textes, des idées et des corpus

@arXiv_csCL_bot@mastoxiv.page
2025-07-21 09:46:30

The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words
Lizhi Ma, Tong Zhao, Shuai Zhang, Nirui Song, Hongliang He, Anqi Li, Ran Feng, Huachuan Qiu, Jingsong Ma, Zhenzhong Lan
arxiv.org/abs/2507.13839

@arXiv_csSD_bot@mastoxiv.page
2025-07-23 07:59:22

A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio)
David Fiala (CESR), Laurent Pugin (KNAW), Marnix van Berchum (KNAW), Martha Thomae (NOVA), Kévin Roger (CESR, UL, CRULH)
arxiv.org/abs/2507.15991

@mgorny@pol.social
2025-06-18 12:49:32

Corpus Christie
#SłowaNaOpak

@lilmikesf@c.im
2025-06-25 18:23:24

#Drumpf #DOJ suing all 15 judges on the #Maryland #federal court bench over imposition of mandatory 2 day delay policy in any

@arXiv_csNE_bot@mastoxiv.page
2025-06-23 08:50:10

Neural Cellular Automata for ARC-AGI
Kevin Xu, Risto Miikkulainen
arxiv.org/abs/2506.15746 arxiv.org/pdf/2506.15746…

@clongclongmoo@social.bau-ha.us
2025-06-19 13:33:50

The Ambient Hermit – Lauda Sion (Reinterpreted)
#ambient

@simon_brooke@mastodon.scot
2025-06-16 12:04:00

Over the past twenty-one years, I've posted 365 posts to my blog, on average one every twenty-one days. They total almost half a million words.
That's quite some corpus of work.
#Blog
#Blogging

@arXiv_physicsmedph_bot@mastoxiv.page
2025-07-24 08:27:09

From Fiber Tracts to Tumor Spread: Biophysical Modeling of Butterfly Glioma Growth Using Diffusion Tensor Imaging
Jonas Weidner, Ivan Ezhov, Michal Balcerak, André Datchev, Lucas Zimmer, Daniel Rueckert, Björn Menze, Benedikt Wiestler
arxiv.org/abs/2507.17707

@pbloem@sigmoid.social
2025-06-06 09:51:57

An 8TB corpus of copyright-free text for training AI models. Comes with two 7B models trained as proof-of-concept.
github.com/r-three/common-pile
I'm glad somebody has finally done this. The "we need to break copyrig…

A diagram showing the breakdown of the Common Pile corpus. It shows large chunks coming from Code, wikimedia, stackexchange etc.
@netzschleuder@social.skewed.de
2025-06-08 18:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron
@trochee@dair-community.social
2025-06-12 00:21:24

OH in corpus-linguistic UX convo:
> He who go too far down long tail end up wagging dog

@lysander07@sigmoid.social
2025-05-13 16:25:32

Last week, our students learned how to conduct a proper evaluation for an NLP experiment. To this end, we introduced a small text corpus with sentences about Joseph Fourier, who counts as one of the discoverers of the greenhouse effect, responsible for global warming.

Slide of the Information Service ENgineering lecture 03, Natural Language Processing 02, section 2.6: Evaluation, Precision, and Recall
Headline: Experiment
Let's consider the following text corpus (FOURIERCORPUS):
 1
In 1807, Fourier's work on heat transfer laid the foundation for understanding the greenhouse effect.
2
Joseph Fourier's energy balance analysis showed atmosphere's heat-trapping role.
3
Fourrier's calculations, though rudimentary, suggested that the atmosphere acts as an insulato…
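
The slide's section title mentions precision and recall, the two measures usually computed in this kind of evaluation. The following is a minimal sketch, assuming sentences are identified by number and that a set of relevant sentence ids is known; the `precision_recall` helper and both id sets are hypothetical and do not come from the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved items that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: three sentences are relevant, the system finds two.
relevant_ids = {1, 2, 3}
retrieved_ids = {1, 2}

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case every retrieved sentence is relevant, so precision is perfect, while recall drops because one relevant sentence was missed.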
@arXiv_csSD_bot@mastoxiv.page
2025-06-24 09:06:40

From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training
Mingyang Yao, Ke Chen
arxiv.org/abs/2506.17497

@midtsveen@social.linux.pizza
2025-06-11 21:10:58

Warframe shows us a world full of bosses and big corporations crushing the little people. The Corpus exploit workers, the Grineer enforce control with brute force, and the Solaris fight back to survive.
It’s cool to see the game hint at worker solidarity and rebellion. But at the same time, the game makes me grind endlessly or pay up real cash, which feels a lot like the very system it tries to criticize.
Warframe tells a story about breaking chains, but still keeps me locked in …

@arXiv_eessAS_bot@mastoxiv.page
2025-07-22 07:58:20

Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications
Satwik Dutta, Shruthigna Chandupatla, John Hansen
arxiv.org/abs/2507.14451

@arXiv_csDB_bot@mastoxiv.page
2025-07-08 08:51:30

AKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes
Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala
arxiv.org/abs/2507.04687

@arXiv_csCL_bot@mastoxiv.page
2025-06-30 10:20:00

MDC-R: The Minecraft Dialogue Corpus with Reference
Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Mladen Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio
arxiv.org/abs/2506.22062

@arXiv_csIR_bot@mastoxiv.page
2025-07-16 08:03:31

Extracting Document Relations from Search Corpus by Marginalizing over User Queries
Yuki Iwamoto, Kaoru Tsunoda, Ken Kaneiwa
arxiv.org/abs/2507.10726

@frankel@mastodon.top
2025-06-08 16:15:43

I’ve been eying @… for some time, but I haven’t had time to play with it yet. In case you never heard about OpenRewrite, OpenRewrite takes care of refactoring your codebase to newer language, framework, and paradigm versions.
Using OpenRewrite is pretty straightforward. It already provides a large corpus of existing recipes, some of which are free.…

@arXiv_csSD_bot@mastoxiv.page
2025-07-18 09:22:42

Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios
John Hansen, Satwik Dutta, Ellen Grand
arxiv.org/abs/2507.12870

@netzschleuder@social.skewed.de
2025-07-03 11:00:05

email_enron: Email network (Enron corpus)
The Enron email corpus, containing all the email communication from the Enron corporation, which was made public as a result of legal action. Nodes are email addresses and node i links to node j if i sent at least one email to address j. Non-Enron email addresses are also present, but only their links to/from Enron addresses are observed.
This network has 36692 nodes and 367662 edges.
Tags: Social, Communication, Unweighted, Multigr…

email_enron: Email network (Enron corpus). 36692 nodes, 367662 edges. https://networks.skewed.de/net/email_enron
@arXiv_csAI_bot@mastoxiv.page
2025-06-18 08:06:21

Don't throw the baby out with the bathwater: How and why deep learning for ARC
Jack Cole, Mohamed Osman
arxiv.org/abs/2506.14276

@BBC3MusicBot@mastodonapp.uk
2025-06-22 13:52:07

🇺🇦 #NowPlaying on BBCRadio3's #MusicMap
Wolfgang Amadeus Mozart, Víkingur Ólafsson & Franz Liszt:
🎵 Ave verum corpus
#WolfgangAmadeusMozart #VíkingurÓlafsson #FranzLiszt
open.spotify.com/track/6HPnobf

@arXiv_eessIV_bot@mastoxiv.page
2025-06-13 08:56:10

Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches
Andrea Moglia (Politecnico di Milano), Matteo Leccardi (Politecnico di Milano), Matteo Cavicchioli (Politecnico di Milano), Alice Maccarini (Università di Pavia), Marco Marcon (Politecnico di Milano), Luca Mainardi (Politecnico di Milano), Pietro Cerveri (Politecnico di Milano, Università di Pavia)

@dingsextrem@mas.to
2025-06-01 18:20:39

Adopted, because no alttext:
#Ukraine

a russian airplane, someone added a so-called 'cope cage' like they are used on tanks for drone protection, literally a cage around the corpus of the plane
@gwire@mastodon.social
2025-07-08 10:38:27

Even outside the digital transformation of "planning", I'm very interested (as a member of the public) in the development of a rich corpus of public geospatial data. (Some deep unexpressed cartographic instinct?) This blogpost is an update on the status of that:

@fanf@mendeddrum.org
2025-06-27 14:42:03

from my link log —
10 years of pomological watercolors.
parkerhiggins.net/2025/04/10-y
saved 2025-04-13

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 10:02:31

SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
Luca Mariotti, Veronica Guidetti, Federica Mandreoli
arxiv.org/abs/2507.06895

@cellfourteen@social.petertoushkov.eu
2025-05-31 17:13:46

The sheer confidence in her false answer defines insanity. It's unbelievable how resilient the US separation of powers is ->
Kristi Noem Falsely Defines Habeas Corpus as Trump’s Right to Deport People | The New York Times
youtube.com/shorts/aNf3Yb5N4J4

@arXiv_csSD_bot@mastoxiv.page
2025-06-19 08:36:13

TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
arxiv.org/abs/2506.15614

@arXiv_csAR_bot@mastoxiv.page
2025-07-10 07:38:41

Towards LLM-based Root Cause Analysis of Hardware Design Failures
Siyu Qiu, Muzhi Wang, Raheel Afsharmazayejani, Mohammad Moradi Shahmiri, Benjamin Tan, Hammond Pearce
arxiv.org/abs/2507.06512

@arXiv_eessAS_bot@mastoxiv.page
2025-07-03 08:54:50

IdolSongsJp Corpus: A Multi-Singer Song Corpus in the Style of Japanese Idol Groups
Hitoshi Suda, Junya Koguchi, Shunsuke Yoshida, Tomohiko Nakamura, Satoru Fukayama, Jun Ogata
arxiv.org/abs/2507.01349

@arXiv_csCR_bot@mastoxiv.page
2025-06-02 07:17:41

HoneySat: A Network-based Satellite Honeypot Framework
Efrén López-Morales (Texas A&M University-Corpus Christi), Ulysse Planta (CISPA Helmholtz Center for Information Security), Gabriele Marra (CISPA Helmholtz Center for Information Security), Carlos González (German Aerospace Center), Jacob Hopkins (Texas A&M University-Corpus Christi), Majid Garoosi (CISPA Helmholtz Center for Information Security), Elías Obreque (Universidad de Chile), Carlos Rubio-M…

@arXiv_csCL_bot@mastoxiv.page
2025-06-19 08:16:24

Oldies but Goldies: The Potential of Character N-grams for Romanian Texts
Dana Lupsa, Sanda-Maria Avram
arxiv.org/abs/2506.15650

@sascha_wolfer@fediscience.org
2025-07-23 06:21:53

Just published:
Supplementing CEFR-graded vocabulary lists for language learners by leveraging information on dictionary views, corpus frequency, part-of-speech, and polysemy
A machine-learning method to suggest word candidates for CEFR-graded vocabulary lists.
#CEFR level of previously unlabeled words
#linguistics #CEFR #frequency #dictionary #LanguageLearning

@gedankenstuecke@scholar.social
2025-07-01 02:39:32

It's interesting how much Wikipedia's list of LLM catchphrases comes straight out of academic writing jargon.
Probably as a lot of the corpus that was mined for the LLM training goes back to the academic literature thanks to Sci-Hub et al.
en.wikipedia.org/wiki/Wikipedi

@tiotasram@kolektiva.social
2025-07-19 07:51:05

AI, AGI, and learning efficiency
My 4-month-old kid is not DDoSing Wikipedia right now, nor will they ever do so before learning to speak, read, or write. Their entire "training corpus" will not top even 100 million "tokens" before they can speak & understand language, and do so with real intentionality.
Just to emphasize that point: 100 words-per-minute times 60 minutes-per-hour times 12 hours-per-day times 365 days-per-year times 4 years is a mere 105,120,000 words. That's a ludicrously *high* estimate of words-per-minute and hours-per-day, and 4 years old (the age of my other kid) is well after basic speech capabilities are developed in many children, etc. More likely the available "training data" is at least 1 or 2 orders of magnitude less than this.
The point here is that large language models, trained as they are on multiple *billions* of tokens, are not developing their behavioral capabilities in a way that's remotely similar to humans, even if you believe those capabilities are similar (they are by certain very biased ways of measurement; they very much aren't by others). This idea that humans must be naturally good at acquiring language is an old one (see e.g. #AI #LLM #AGI
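
The back-of-the-envelope estimate in this post can be checked directly. The sketch below simply multiplies the figures quoted in the post; the constant names are illustrative, and treating one word as one token follows the post's own loose framing rather than any real tokenizer.

```python
# Figures quoted in the post (a deliberately high estimate).
WORDS_PER_MINUTE = 100
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 12
DAYS_PER_YEAR = 365
YEARS = 4

total_words = (
    WORDS_PER_MINUTE
    * MINUTES_PER_HOUR
    * HOURS_PER_DAY
    * DAYS_PER_YEAR
    * YEARS
)
print(f"{total_words:,} words")  # 105,120,000 words
```

Even this upper bound stays close to the 100 million mark, well below the billions of tokens the post attributes to large language models.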

@arXiv_csNE_bot@mastoxiv.page
2025-07-16 08:57:21

Grammatical Structure and Grammatical Variations in Non-Metric Iranian Classical Music
Maziar Kanani, Sean O Leary, James McDermott
arxiv.org/abs/2507.10708

@arXiv_csDS_bot@mastoxiv.page
2025-06-04 07:21:59

Labelling Data with Unknown References
Adrian de Wynter
arxiv.org/abs/2506.03083 arxiv.org/pdf/2506.03083

@arXiv_csIR_bot@mastoxiv.page
2025-06-18 08:22:48

XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
arxiv.org/abs/2506.13782

@arXiv_csCL_bot@mastoxiv.page
2025-06-03 08:19:55

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Pierre-Carl Langlais, Carlos Rosas Hinostroza, Mattia Nee, Catherine Arnett, Pavel Chizhov, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov
arxiv.org/abs/2506.01732

@arXiv_csSI_bot@mastoxiv.page
2025-06-02 10:02:57

This arxiv.org/abs/2505.08052 has been replaced.
initial toot: mastoxiv.page/@arXiv_csSI_…

@arXiv_csCR_bot@mastoxiv.page
2025-06-02 09:57:08

This arxiv.org/abs/2409.17275 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCR_…

@arXiv_csCL_bot@mastoxiv.page
2025-06-19 08:12:19

Approximating Language Model Training Data from Weights
John X. Morris, Junjie Oscar Yin, Woojeong Kim, Vitaly Shmatikov, Alexander M. Rush
arxiv.org/abs/2506.15553

@arXiv_physicscompph_bot@mastoxiv.page
2025-06-02 07:33:47

Potential Effects of Loading Terminal Locations on Surface Trajectories of Oil Spill Transport
Shoshana Reich, Edward Buskey, Clint Dawson, Eirik Valseth
arxiv.org/abs/2505.24610

@arXiv_csIR_bot@mastoxiv.page
2025-07-17 07:58:10

Context-Aware Search and Retrieval Over Erasure Channels
Sara Ghasvarianjahromi, Yauhen Yakimenka, Jörg Kliewer
arxiv.org/abs/2507.11894

@arXiv_csCL_bot@mastoxiv.page
2025-06-18 09:08:18

Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
arxiv.org/abs/2506.14702

@arXiv_eessAS_bot@mastoxiv.page
2025-06-03 07:32:45

Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis
Miao Zhang, Aref Farhadipour, Annie Baker, Jiachen Ma, Bogdan Pricop, Eleanor Chodroff
arxiv.org/abs/2506.00733

@arXiv_csSD_bot@mastoxiv.page
2025-07-14 08:12:52

Active Learning for Text-to-Speech Synthesis with Informative Sample Collection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
arxiv.org/abs/2507.08319

@arXiv_csAI_bot@mastoxiv.page
2025-07-04 08:29:11

What Neuroscience Can Teach AI About Learning in Continuously Changing Environments
Daniel Durstewitz, Bruno Averbeck, Georgia Koppe
arxiv.org/abs/2507.02103

@arXiv_csCL_bot@mastoxiv.page
2025-06-10 19:06:51

This arxiv.org/abs/2506.06266 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCL_…

@arXiv_csSD_bot@mastoxiv.page
2025-07-10 08:43:21

Revealing the Hidden Temporal Structure of HubertSoft Embeddings based on the Russian Phonetic Corpus
Anastasia Ananeva, Anton Tomilov, Marina Volkova
arxiv.org/abs/2507.06794

@arXiv_csCL_bot@mastoxiv.page
2025-07-17 10:04:30

StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba, Tomasz Walkowiak
arxiv.org/abs/2507.12064

@arXiv_csCR_bot@mastoxiv.page
2025-06-09 08:14:02

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
arxiv.org/abs/2506.06151

@arXiv_csIR_bot@mastoxiv.page
2025-06-27 08:08:29

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
arxiv.org/abs/2506.20963

@arXiv_csSD_bot@mastoxiv.page
2025-06-18 08:49:38

SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling
Tawsif Ahmed, Andrej Radonjic, Gollam Rabby
arxiv.org/abs/2506.14293

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:58:42

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
arxiv.org/abs/2507.08606

@arXiv_csIR_bot@mastoxiv.page
2025-06-11 07:41:03

Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models
Wentao Shi, Yiqing Shen
arxiv.org/abs/2506.08352

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:56:02

Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach
Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello
arxiv.org/abs/2507.08487

@arXiv_csCL_bot@mastoxiv.page
2025-06-10 19:00:41

This arxiv.org/abs/2506.01495 has been replaced.
initial toot: mastoxiv.page/@arXiv_csCL_…

@arXiv_csIR_bot@mastoxiv.page
2025-06-05 09:40:29

This arxiv.org/abs/2503.06474 has been replaced.
initial toot: mastoxiv.page/@arXiv_csIR_…

@arXiv_csSD_bot@mastoxiv.page
2025-06-12 07:54:31

SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research
Ahmed Adel Attia, Jing Liu, Carl Espy-Wilson
arxiv.org/abs/2506.09206

@arXiv_csCL_bot@mastoxiv.page
2025-06-26 09:14:30

Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
arxiv.org/abs/2506.20269

@arXiv_eessAS_bot@mastoxiv.page
2025-07-03 08:18:50

Hello Afrika: Speech Commands in Kinyarwanda
George Igwegbe, Martins Awojide, Mboh Bless, Nirel Kadzo
arxiv.org/abs/2507.01024

@arXiv_csCL_bot@mastoxiv.page
2025-07-10 10:00:01

Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Matthew Anderson Hendricks, Alice Cicirello
arxiv.org/abs/2507.06803

@arXiv_csIR_bot@mastoxiv.page
2025-06-04 13:36:32

This arxiv.org/abs/2505.12574 has been replaced.
initial toot: mastoxiv.page/@arXiv_csIR_…

@arXiv_csCL_bot@mastoxiv.page
2025-07-08 14:00:21

SIGIR 2025 -- LiveRAG Challenge Report
David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory
arxiv.org/abs/2507.04942

@arXiv_csIR_bot@mastoxiv.page
2025-07-04 08:10:01

When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search
William A. Ingram, Bipasha Banerjee, Edward A. Fox
arxiv.org/abs/2507.02139

@arXiv_csIR_bot@mastoxiv.page
2025-06-03 07:21:20

Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models
Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
arxiv.org/abs/2506.00037

@arXiv_csCL_bot@mastoxiv.page
2025-07-03 10:03:00

Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results
Matteo Di Cristofaro
arxiv.org/abs/2507.01764

@arXiv_csIR_bot@mastoxiv.page
2025-06-30 09:51:40

UiS-IAI@LiveRAG: Retrieval-Augmented Information Nugget-Based Generation of Responses
Weronika Łajewska, Ivica Kostric, Gabriel Iturra-Bocaz, Mariam Arustashvili, Krisztian Balog
arxiv.org/abs/2506.22210

@arXiv_csIR_bot@mastoxiv.page
2025-05-30 09:56:03

This arxiv.org/abs/2505.22299 has been replaced.
initial toot: mastoxiv.page/@arXiv_csIR_…

@arXiv_csCL_bot@mastoxiv.page
2025-06-26 09:36:40

Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
Tong Zhou
arxiv.org/abs/2506.20476 arxiv.org/pd…