Tootfinder

@arXiv_qbioQM_bot@mastoxiv.page
2025-07-24 12:53:58

Replaced article(s) found for q-bio.QM. https://arxiv.org/list/q-bio.QM/new
[1/1]:
- Comparative analysis of computational approaches for predicting Transthyretin (TTR) transcription...
Mariya L. Ivanova, Nicola Russo, Gueorgui Mihaylov, Konstantin Nikolic

@arXiv_eessAS_bot@mastoxiv.page
2025-07-24 08:00:59

Towards Robust Speech Recognition for Jamaican Patois Music Transcription
Jordan Madden, Matthew Stone, Dimitri Johnson, Daniel Geddez
https://arxiv.org/abs/2507.16834 https://

Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance…

@arXiv_physicshistph_bot@mastoxiv.page
2025-07-24 08:50:00

Slow neutrons in Palermo: a forgotten conference by Enrico Fermi
Emanuele Goldoni, Ledo Stefanini
https://arxiv.org/abs/2507.16928 https://arxiv.org/pdf/25…

On October 22, 1934, in a famous experiment, Enrico Fermi and his colleagues discovered that a significant increase in induced radioactivity can be obtained when neutrons are slowed down by means of hydrogen atoms. This discovery and its explanation earned him the 1938 Nobel Prize in Physics. One year later, in October 1935, Fermi held a public speech in Palermo, Italy, presenting his findings at the 24th congress of the Italian Society for the Progress of Sciences. The transcription of his spe…

@arXiv_csSD_bot@mastoxiv.page
2025-08-20 07:44:00

Is Transfer Learning Necessary for Violin Transcription?
Yueh-Po Peng, Ting-Kang Wang, Li Su, Vincent K. M. Cheung
https://arxiv.org/abs/2508.13516 https://

Automatic music transcription (AMT) has achieved remarkable progress for instruments such as the piano, largely due to the availability of large-scale, high-quality datasets. In contrast, violin AMT remains underexplored due to limited annotated data. A common approach is to fine-tune pretrained models for other downstream tasks, but the effectiveness of such transfer remains unclear in the presence of timbral and articulatory differences. In this work, we investigate whether training from scra…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 07:46:40

Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT
Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu, Christian Fuegen
https://arxiv.org/abs/2508.13358

This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation…

@arXiv_csCC_bot@mastoxiv.page
2025-08-20 08:25:00

Analog computation with transcriptional networks
David Doty, Mina Latifi, David Soloveichik
https://arxiv.org/abs/2508.14017 https://arxiv.org/pdf/2508.14…

Transcriptional networks represent one of the most extensively studied types of systems in synthetic biology. Although the completeness of transcriptional networks for digital logic is well-established, *analog* computation plays a crucial role in biological systems and offers significant potential for synthetic biology applications. While transcriptional circuits typically rely on cooperativity and highly non-linear behavior of transcription factors to regulate *production* of proteins, they a…

@arXiv_mathNA_bot@mastoxiv.page
2025-07-18 07:56:52

Keep the beat going: Automatic drum transcription with momentum
Alisha L. Foster, Robert J. Webber
https://arxiv.org/abs/2507.12596 https://

A simple, interpretable way to perform automatic drum transcription is by factoring the magnitude spectrogram of a recorded musical piece using a partially fixed nonnegative matrix factorization. There are two natural ways to optimize the nonnegative matrix factorization, including a multiplicative update rule and projected gradient descent with momentum. The methods differ in their empirical accuracies and theoretical convergence guarantees. This paper summarizes the methods and their time com…
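As a rough illustration of the multiplicative update rule mentioned in this abstract, here is a minimal plain-Python sketch. It is not the authors' code: it assumes the simplest case, where every drum template (the columns of `W`) is fixed and only the activations `H` are updated, whereas the paper's factorization is only partially fixed. The toy matrices and all function names are hypothetical.

```python
# Illustrative sketch only: fixed drum templates W, multiplicative updates on H.
# All names here (update_activations, W, H, V) are hypothetical, not from the paper.

def matmul(A, B):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    R = matmul(W, H)
    return sum((v - r) ** 2 for v_row, r_row in zip(V, R) for v, r in zip(v_row, r_row))

def update_activations(V, W, H, eps=1e-12):
    """One multiplicative step: H <- H * (W^T V) / (W^T W H), elementwise."""
    Wt = transpose(W)
    numer = matmul(Wt, V)
    denom = matmul(matmul(Wt, W), H)
    return [[h * n / (d + eps) for h, n, d in zip(h_row, n_row, d_row)]
            for h_row, n_row, d_row in zip(H, numer, denom)]

# Toy "spectrogram": 3 frequency bins, 4 time frames, 2 drum templates.
W = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
H_true = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
V = matmul(W, H_true)

H = [[0.5] * 4 for _ in range(2)]          # nonnegative starting point
error_before = squared_error(V, W, H)
for _ in range(200):
    H = update_activations(V, W, H)
error_after = squared_error(V, W, H)
```

Because each step only multiplies by nonnegative ratios, the activations stay nonnegative without any explicit projection, which is the main practical appeal of this rule over projected gradient descent.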

@arXiv_csRO_bot@mastoxiv.page
2025-08-18 09:28:50

A Comparative Study of Floating-Base Space Parameterizations for Agile Whole-Body Motion Planning
Evangelos Tsiatsianas, Chairi Kiourt, Konstantinos Chatzilygeroudis
https://arxiv.org/abs/2508.11520

Automatically generating agile whole-body motions for legged and humanoid robots remains a fundamental challenge in robotics. While numerous trajectory optimization approaches have been proposed, there is no clear guideline on how the choice of floating-base space parameterization affects performance, especially for agile behaviors involving complex contact dynamics. In this paper, we present a comparative study of different parameterizations for direct transcription-based trajectory optimizati…

@arXiv_csSD_bot@mastoxiv.page
2025-06-19 08:35:53

Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper
Jaza Syed, Ivan Meresman Higgs, Ondřej Cífka, Mark Sandler
https://arxiv.org/abs/2506.15514

Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which c…

@arXiv_csHC_bot@mastoxiv.page
2025-07-08 12:28:30

Dude, where's my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection
Videep Venkatesha, Mariah Bradford, Nathaniel Blanchard
https://arxiv.org/abs/2507.04454

Collaborative Problem-Solving (CPS) markers capture key aspects of effective teamwork, such as staying on task, avoiding interruptions, and generating constructive ideas. An AI system that reliably detects these markers could help teachers identify when a group is struggling or demonstrating productive collaboration. Such a system requires an automated pipeline composed of multiple components. In this work, we evaluate how CPS detection is impacted by automating two critical components: transcr…

@arXiv_csSD_bot@mastoxiv.page
2025-06-18 08:45:12

Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription
Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann
https://arxiv.org/abs/2506.14223

Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoder-decoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges…

@arXiv_csDL_bot@mastoxiv.page
2025-07-08 07:49:20

An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand
Joshua D. Isom
https://arxiv.org/abs/2507.04132 https://…

This article presents and validates an ideal, four-stage workflow for the high-accuracy transcription and analysis of challenging medieval legal documents. The process begins with a specialized Handwritten Text Recognition (HTR) model, itself created using a novel "Clean Ground Truth" curation method where a Large Language Model (LLM) refines the training data. This HTR model provides a robust baseline transcription (Stage 1). In Stage 2, this baseline is fed, along with the original document i…

@arXiv_csCL_bot@mastoxiv.page
2025-07-14 09:59:12

The Impact of Automatic Speech Transcription on Speaker Attribution
Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews
https://arxiv.org/abs/2507.08660 …

Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced b…

@arXiv_csCY_bot@mastoxiv.page
2025-06-11 07:28:33

Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia
Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke
https://arxiv.org/abs/2506.08846

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing pro…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-20 10:32:06

Crosslisted article(s) found for eess.AS. https://arxiv.org/list/eess.AS/new
[1/1]:
- Is Transfer Learning Necessary for Violin Transcription?
Yueh-Po Peng, Ting-Kang Wang, Li Su, Vincent K. M. Cheung

@arXiv_csCL_bot@mastoxiv.page
2025-08-14 09:50:22

Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription
Abdul Rehman Antall, Naveed Akhtar
https://arxiv.org/abs/2508.09865 https://

This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Wh…
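For readers unfamiliar with the word error rate (WER) metric named in this abstract, the sketch below shows the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a generic illustration, not the paper's evaluation code, and it assumes plain whitespace tokenization with no text normalization. The function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (unit cost per edit)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Since insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.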

@arXiv_physicsbioph_bot@mastoxiv.page
2025-06-12 14:29:51

Replaced article(s) found for physics.bio-ph. https://arxiv.org/list/physics.bio-ph/new/
[1/1]:
- Design principles of transcription factors with intrinsically disordered regions

@arXiv_csAI_bot@mastoxiv.page
2025-07-29 18:02:41

Replaced article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[7/8]:
- Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering
Chowdhury, Aukkapinyo, Fujimura, Woo, Wasusatein, Ghourabi

@arXiv_csSD_bot@mastoxiv.page
2025-07-17 09:03:20

RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
Sungkyun Chang, Simon Dixon, Emmanouil Benetos
https://arxiv.org/abs/2507.12175

This study introduces RUMAA, a transformer-based framework for music performance analysis that unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner. Unlike prior methods addressing these tasks separately, RUMAA integrates them using pre-trained score and audio encoders and a novel tri-stream decoder capturing task interdependencies through proxy tasks. It aligns human-readable MusicXML scores with repeat symbols to full-length p…

@arXiv_csFL_bot@mastoxiv.page
2025-07-01 07:34:53

Programmable Co-Transcriptional Splicing: Realizing Regular Languages via Hairpin Deletion
Da-Jung Cho, Szilárd Zsolt Fazekas, Shinnosuke Seki, Max Wiedenhöft
https://arxiv.org/abs/2506.23384

RNA co-transcriptionality, where RNA is spliced or folded during transcription from DNA templates, offers promising potential for molecular programming. It enables programmable folding of nano-scale RNA structures and has recently been shown to be Turing universal. While post-transcriptional splicing is well studied, co-transcriptional splicing is gaining attention for its efficiency, though its unpredictability still remains a challenge. In this paper, we focus on engineering co-transcriptiona…

@arXiv_qbioMN_bot@mastoxiv.page
2025-07-08 09:45:50

Fast decisions with biophysically constrained gene promoter architectures
Tarek Tohme, Massimo Vergassola, Thierry Mora, Aleksandra M. Walczak
https://arxiv.org/abs/2507.03720

Cells integrate signals and make decisions about their future state in short amounts of time. A lot of theoretical effort has gone into asking how to best design gene regulatory circuits that fulfill a given function, yet little is known about the constraints that performing that function in a small amount of time imposes on circuit architectures. Using an optimization framework, we explore the properties of a class of promoter architectures that distinguish small differences in transcription f…

@arXiv_csSD_bot@mastoxiv.page
2025-08-12 10:35:23

Joint Transcription of Acoustic Guitar Strumming Directions and Chords
Sebastian Murgul, Johannes Schimper, Michael Heizmann
https://arxiv.org/abs/2508.07973 https://

Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of r…

@arXiv_csSD_bot@mastoxiv.page
2025-06-16 08:04:09

Enabling automatic transcription of child-centered audio recordings from real-world environments
Daniil Kocharov, Okko Räsänen
https://arxiv.org/abs/2506.11747

Longform audio recordings obtained with microphones worn by children (also known as child-centered daylong recordings) have become a standard method for studying children's language experiences and their impact on subsequent language development. Transcripts of longform speech audio would enable rich analyses at various linguistic levels, yet the massive scale of typical longform corpora prohibits comprehensive manual annotation. At the same time, automatic speech recognition (ASR)-based transcri…

@arXiv_qbioQM_bot@mastoxiv.page
2025-06-03 07:58:28

Comparative analysis of computational approaches for predicting Transthyretin transcription activators and human dopamine D1 receptor antagonists
Mariya L. Ivanova, Nicola Russo, Konstantin Nikolic
https://arxiv.org/abs/2506.01137

The study expands the application of scikit-learn-based machine learning (ML) to the prediction of small biomolecule functionalities based on Carbon 13 isotope (13C) NMR spectroscopy data derived from Simplified Molecular Input Line Entry System (SMILES) notations. The methodology previously demonstrated by predicting dopamine D1 receptor antagonists was upgraded with addition of new molecular features derived from the PubChem database. The enhanced ML model obtained 75.8% Accuracy, 84.2% Preci…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-12 10:12:53

Score-Informed BiLSTM Correction for Refining MIDI Velocity in Automatic Piano Transcription
Zhanhong He, Roberto Togneri, Defeng (David) Huang
https://arxiv.org/abs/2508.07757

MIDI is a modern standard for storing music, recording how musical notes are played. Many piano performances have corresponding MIDI scores available online. Some of these are created by the original performer, recording on an electric piano alongside the audio, while others are through manual transcription. In recent years, automatic music transcription (AMT) has rapidly advanced, enabling machines to transcribe MIDI from audio. However, these transcriptions often require further correction. A…

@arXiv_physicsbioph_bot@mastoxiv.page
2025-07-04 09:18:51

Modelling transcriptional silencing and its coupling to 3D genome organisation
Massimiliano Semeraro, Giuseppe Negro, Davide Marenduzzo, Giada Forte
https://arxiv.org/abs/2507.02150

Timely up- or down-regulation of gene expression is crucial for cellular differentiation and function. While gene upregulation via transcriptional activators has been extensively investigated, gene silencing remains understudied, especially by modelling. This study employs 3D simulations to study the biophysics of a chromatin fibre where active transcription factors compete with repressors for binding to transcription units along the fibre, and investigates how different silencing mechanisms af…

@arXiv_csSD_bot@mastoxiv.page
2025-08-12 10:37:23

Exploring Procedural Data Generation for Automatic Acoustic Guitar Fingerpicking Transcription
Sebastian Murgul, Michael Heizmann
https://arxiv.org/abs/2508.07987 https://

Automatic transcription of acoustic guitar fingerpicking performances remains a challenging task due to the scarcity of labeled training data and legal constraints connected with musical recordings. This work investigates a procedural data generation pipeline as an alternative to real audio recordings for training transcription models. Our approach synthesizes training data through four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling us…

@arXiv_csHC_bot@mastoxiv.page
2025-07-25 07:41:11

A Custom-Built Ambient Scribe Reduces Cognitive Load and Documentation Burden for Telehealth Clinicians
Justin Morse, Kurt Gilbert, Kyle Shin, Rick Cooke, Peyton Rose, Jack Sullivan, Angelo Sisante
https://arxiv.org/abs/2507.17754

Clinician burnout has motivated the growing adoption of ambient medical scribes in the clinic. In this work, we introduce a custom-built ambient scribe application integrated into the EHR system at Included Health, a personalized all-in-one healthcare company offering telehealth services. The application uses Whisper for transcription and a modular in-context learning pipeline with GPT-4o to automatically generate SOAP notes and patient instructions. Testing on mock visit data shows that the no…

@arXiv_csSD_bot@mastoxiv.page
2025-06-17 10:10:41

Methods for pitch analysis in contemporary popular music: multiple pitches from harmonic tones in Vitalic's music
Emmanuel Deruty, David Meredith, Maarten Grachten, Pascal Arbez-Nicolas, Andreas Hasselholt Jørgensen, Oliver Søndermølle Hansen, Magnus Stensli, Christian Nørkær Petersen
https://arxiv.org/abs/25…

Aims. This study suggests that the use of multiple perceived pitches arising from a single harmonic complex tone is an active and intentional feature of contemporary popular music. The phenomenon is illustrated through examples drawn from the work of electronic artist Vitalic and others. Methods. Two listening tests were conducted: (1) evaluation of the number of simultaneous pitches perceived from single harmonic tones, and (2) manual pitch transcription of sequences of harmonic tones. Relat…

@arXiv_csDL_bot@mastoxiv.page
2025-07-28 07:59:01

Comparing OCR Pipelines for Folkloristic Text Digitization
Octavian M. Machidon, Alina L. Machidon
https://arxiv.org/abs/2507.19092 https://arxiv.org/pdf/2…

The digitization of historical folkloristic materials presents unique challenges due to diverse text layouts, varying print and handwriting styles, and linguistic variations. This study explores different optical character recognition (OCR) approaches for Slovene folkloristic and historical text digitization, integrating both traditional methods and large language models (LLMs) to improve text transcription accuracy while maintaining linguistic and structural integrity. We compare single-stage …

@arXiv_csSD_bot@mastoxiv.page
2025-06-16 07:53:59

Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting
Guillaume Wisniewski (LLF - UMR7110), Séverine Guillaume (LACITO), Clara Rosina Fernández (LACITO)
https://arxiv.org/abs/2506.11096

Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. While widely observed, the impact of this property on downstream tasks remains unclear. This work evaluates anisotropy in keyword spotting for computational documentary linguistics. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription. Our results highlight the robustness …
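The Dynamic Time Warping step mentioned in this abstract can be sketched generically as follows. This is not the authors' pipeline: it assumes each utterance is already a sequence of frame embeddings (short toy vectors here rather than real wav2vec2 outputs) and uses cosine distance as the frame-level cost. The function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dtw_cost(xs, ys, dist=cosine_distance):
    """Accumulated cost of the best monotonic alignment between two frame sequences."""
    n, m = len(xs), len(ys)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(xs[i - 1], ys[j - 1])
            # Extend the cheapest of: advance in xs only, in ys only, or in both.
            D[i][j] = step + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy query, a time-stretched copy of it, and an unrelated sequence.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stretched = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
other = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

same_word_cost = dtw_cost(query, stretched)
different_word_cost = dtw_cost(query, other)
```

The stretched copy aligns at (numerically) zero cost because DTW may repeat frames, which is why the method tolerates differences in speaking rate. With strongly anisotropic embeddings all cosine distances shrink toward zero, so what matters for keyword spotting is the relative ranking of costs rather than their absolute size.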

@arXiv_physicsbioph_bot@mastoxiv.page
2025-07-02 08:38:40

Topological weight and structural diversity of polydisperse chromatin loop networks
Andrea Bonato, Enrico Carlon, Sergey Kitaev, Davide Marenduzzo, Enzo Orlandini
https://arxiv.org/abs/2507.00520

Current biophysical models for transcriptionally active chromatin view this as a polymer with sticky sites, mimicking transcription units such as promoters and enhancers which interact via the binding of multivalent complexes of chromatin-binding proteins. It has been demonstrated that this model spontaneously leads to microphase separation, resulting in the formation of a network of loops with transcription units serving as anchors. Here, we demonstrate how to compute the topological weights o…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-07 08:30:24

LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness
Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Krish Patel, Haodong Li, Hwi Joo Park, Chenxu Guo, Shuhe Li, Sam Wang, Cheol Jun Cho, Zoe Ezzes, Jet M. J. Vonk, Brittany T. Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli
htt…

Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. While Connectionist Temporal Classification (CTC) is a widely used approach for such tasks due to its efficiency, it often falls short in recognition performance, especially under unclear and nonfluent speech. In this work, we propose LCS-CTC, a two-stage framework for phoneme-level speech recognition that combines a similarity-aware local alignment algorithm with a constrained CTC …

@arXiv_csSD_bot@mastoxiv.page
2025-08-15 09:07:22

Motive-level Analysis of Form-functions Association in Korean Folk song
Danbinaerin Han, Dasaem Jeong, Juhan Nam
https://arxiv.org/abs/2508.10472 https://a…

Computational analysis of folk song audio is challenging due to structural irregularities and the need for manual annotation. We propose a method for automatic motive segmentation in Korean folk songs by fine-tuning a speech transcription model on audio lyric with motif boundary annotation. Applying this to 856 songs, we extracted motif count and duration entropy as structural features. Statistical analysis revealed that these features vary systematically according to the social function of the…

@arXiv_csSD_bot@mastoxiv.page
2025-07-10 08:13:31

STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
Wenxiang Guo, Yu Zhang, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Zhetao Chen, Wenhao Xu, Fei Wu, Zhou Zhao
https://arxiv.org/abs/2507.06670

Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignme…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-12 09:39:03

A Survey on Non-Intrusive ASR Refinement: From Output-Level Correction to Full-Model Distillation
Mohammad Reza Peyghan, Fatemeh Rajabi, Saman Soleimani Roudi, Saeedreza Zouashkiani, Sajjad Amini, Shahrokh Ghaemmaghami
https://arxiv.org/abs/2508.07285

Automatic Speech Recognition (ASR) has become an integral component of modern technology, powering applications such as voice-activated assistants, transcription services, and accessibility tools. Yet ASR systems continue to struggle with the inherent variability of human speech, such as accents, dialects, and speaking styles, as well as environmental interference, including background noise. Moreover, domain-specific conversations often employ specialized terminology, which can exacerbate tran…

@arXiv_csSD_bot@mastoxiv.page
2025-08-08 09:07:32

SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
https://arxiv.org/abs/2508.05554

We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the d…

@arXiv_physicsbioph_bot@mastoxiv.page
2025-05-27 13:48:26

This https://arxiv.org/abs/2404.19158 has been replaced.
initial toot: https://mastoxiv.page/@arX…

Protein-DNA Co-condensation is Prewetting to a Collapsed Polymer
The three-dimensional organization of chromatin is thought to play an important role in controlling gene expression. Specificity in expression is achieved through the interaction of transcription factors and other nuclear proteins with particular sequences of DNA. At unphysiological concentrations many of these nuclear proteins can phase-separate in the absence of DNA, and it has been hypothesized that, in vivo, the thermodynamic forces driving these phases help determine chromosomal organizati…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-04 07:34:43

Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss
Jiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha
https://arxiv.org/abs/2506.02339

Automatic Lyrics Transcription (ALT) aims to recognize lyrics from singing voices, similar to Automatic Speech Recognition (ASR) for spoken language, but faces added complexity due to domain-specific properties of the singing voice. While foundation ASR models show robustness in various speech tasks, their performance degrades on singing voice, especially in the presence of musical accompaniment. This work focuses on this performance gap and explores Low-Rank Adaptation (LoRA) for ALT, investig…

@arXiv_csSD_bot@mastoxiv.page
2025-08-12 08:26:13

Whisfusion: Parallel ASR Decoding via a Diffusion Transformer
Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Nam-Joon Kim, Jangchan Kim, Hyun Gon Ryu, Hyuk-Jae Lee
https://arxiv.org/abs/2508.07048

Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion…

@arXiv_csSD_bot@mastoxiv.page
2025-08-11 09:37:39

Improved Dysarthric Speech to Text Conversion via TTS Personalization
Péter Mihajlik, Éva Székely, Piroska Barta, Máté Soma Kádár, Gergely Dobsinszki, László Tóth
https://arxiv.org/abs/2508.06391

We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech wi…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-09 08:03:02

Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models
Yuke Lin, Ming Cheng, Ze Li, Beilong Tang, Ming Li
https://arxiv.org/abs/2506.05796

Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models
Multi-speaker automatic speech recognition (MS-ASR) faces significant challenges in transcribing overlapped speech, a task critical for applications like meeting transcription and conversational analysis. While serialized output training (SOT)-style methods serve as common solutions, they often discard absolute timing information, limiting their utility in time-sensitive scenarios. Leveraging recent advances in large language models (LLMs) for conversational audio processing, we propose a novel…

@arXiv_csSD_bot@mastoxiv.page
2025-08-11 09:33:19

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li
https://arxiv.org/abs/2508.06372

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling…

@arXiv_csSD_bot@mastoxiv.page
2025-07-11 09:29:21

Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models
Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Xiaopeng Zhang
https://arxiv.org/abs/2507.07877

Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models
Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and infere…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-26 09:28:50

Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR
Aleš Pražák, Marie Kunešová, Josef Psutka
https://arxiv.org/abs/2506.20288

Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR
Overlapping speech remains a major challenge for automatic speech recognition (ASR) in real-world applications, particularly in broadcast media with dynamic, multi-speaker interactions. We propose a lightweight, target-speaker-based extension to an existing streaming ASR system to enable practical transcription of overlapping speech with minimal computational overhead. Our approach combines a speaker-independent (SI) model for standard operation with a speaker-conditioned (SC) model selectivel…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-03 07:59:36

DNCASR: End-to-End Training for Speaker-Attributed ASR
Xianrui Zheng, Chao Zhang, Philip C. Woodland
https://arxiv.org/abs/2506.01916 https://

DNCASR: End-to-End Training for Speaker-Attributed ASR
This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained und…

@arXiv_csSD_bot@mastoxiv.page
2025-06-03 07:27:19

Improving Code Switching with Supervised Fine Tuning and GELU Adapters
Linh Pham
https://arxiv.org/abs/2506.00291 https://arxiv.org/p…

Improving Code Switching with Supervised Fine Tuning and GELU Adapters
There are few code switching datasets, labeled or unlabeled, that exist today. As a result, ASR requires new methods to utilize the vast monolingual data and models that exist. This paper uses OpenAI's open source ASR model, Whisper, which has been pre-trained on 680K hours of audio to perform monolingual ASR tasks. In Part 1, this paper examines how exploiting Whisper's monolingual ability to individually tokenize training text, called "Switching Tokenizers Method", improves transcription accur…

@arXiv_csSD_bot@mastoxiv.page
2025-07-02 08:27:39

Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
Sebastian Murgul, Michael Heizmann
https://arxiv.org/abs/2507.00466

Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
Beat tracking in musical performance MIDI is a challenging and important task for notation-level music transcription and rhythmical analysis, yet existing methods primarily focus on audio-based approaches. This paper proposes an end-to-end transformer-based model for beat and downbeat tracking in performance MIDI, leveraging an encoder-decoder architecture for sequence-to-sequence translation of MIDI input to beat annotations. Our approach introduces novel data preprocessing techniques, includi…

@arXiv_eessAS_bot@mastoxiv.page
2025-07-01 08:30:33

Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation
Frank Cwitkowitz, Zhiyao Duan
https://arxiv.org/abs/2506.23371

Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation
Multi-Pitch Estimation (MPE) continues to be a sought after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both …

@arXiv_csSD_bot@mastoxiv.page
2025-07-01 09:47:03

You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim
https://arxiv.org/abs/2506.23367

You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these eff…
