Tootfinder

@jamesthebard@social.linux.pizza
2025-08-20 19:05:28

So, been wondering why my mic audio has been absolutely wretched over the last few videos and such. Come to find out, if you accidentally wreck almost all of the settings on the compressor/gate then things are going to go very, very poorly. Did a quick test run in OBS and everything sounded so much better.
Current setup for the microphone: Earthworks Audio Ethos -> Universal Audio SOLO/610 -> dbx 286s -> [insert generic audio interface] -> Linux rig

A picture of the Earthworks Audio Ethos microphone on a microphone arm, the top of an audio rack showing an Allen&Heath CQ-12T mixer beside a Universal Audio SOLO/610. Below those is a dbx 286s audio processor strip (1U).

@v_i_o_l_a@openbiblio.social
2025-08-19 20:25:07

#hamstercontent on WDR ZeitZeichen: "Lab hero and children's favorite: the triumph of the golden hamster" https://www1.wdr.de/mediathek/audio/ze

Lab hero and children's favorite: the triumph of the golden hamster
All golden hamsters worldwide descend from four animals, an inbreeding experiment. The first litter in captivity was born on 18 August 1930 in Jerusalem.

@Techmeme@techhub.social
2025-08-20 16:40:49

Google plans to update the Pixel Buds Pro 2 in September with adaptive audio, background noise reduction when using Gemini Live, and a new "moonstone" color (Victoria Song/The Verge)
https://www.theverge.com/news/762048/googles-pixel-buds-pro-2…

Google’s Pixel Buds Pro 2 are getting new AI-powered features in September
A software update will bring Adaptive Audio, improved audio processing for background noise, and more to the Pixel Buds Pro 2.

@stiefkind@mastodon.social
2025-08-21 11:17:50

"Nuclear energy produces no CO2 and is therefore the solution to our energy problem." Of course, it is not quite as simple as this kind of populist slogan suggests. The 2022 feature "Milliardengrab Atomkraft" (50 min) sheds light on a few problematic corners. Unfortunately, those who ought to listen closely will consistently tune out here as well. Podcast recommendation nonetheless:

Milliardengrab Atomkraft - documentary on the incalculable costs of nuclear energy
At the end of 2022, the last German nuclear power plants are due to go offline. Yet startups around the world are working on new kinds of nuclear plants. How expensive is nuclear power really, and what costs will future generations have to bear?

@NFL@darktundra.xyz
2025-10-19 19:46:34

Panthers-Jets broadcast temporarily affected by audio problems https://www.nytimes.com/athletic/6731135/2025/10/19/jets-panthers-broadcast-audio-issues/

Panthers-Jets broadcast temporarily affected by audio problems
Technical difficulties became a talking point during the first half of Jets-Panthers, after the Fox broadcast temporarily went silent.

@lightweight@mastodon.nzoss.nz
2025-10-20 18:28:46

Content warning: NZ Labour's sovereign wealth fund

Just caught the end of this discussion: https://www.rnz.co.nz/national/programmes/morningreport/audio/2019009292/labour-announces-new-sovereign-wealth-fund - it annoys me that RNZ doesn't identify Eric Crampton as b…

Labour announces new sovereign wealth fund
Labour has laid out its plan to grow businesses while also giving the Government a new revenue stream, with a proposed sovereign wealth fund. Simplicity co-founder Sam Stubbs and chief economist from The New Zealand Initiative, Eric Crampton spoke to Corin Dann.

@heiseonline@social.heise.de
2025-08-19 11:40:00

Zahlen, bitte! 74 minutes: the odd playing time of the audio CD
Stories surround the CD's 74-minute playing time: officially it was chosen so that Beethoven's 9th Symphony could be played in full, but the real reason is probably more mundane.
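
For a sense of scale, the 74-minute figure can be turned into a data size. This is only a back-of-the-envelope sketch: it assumes standard CD audio parameters (44.1 kHz sample rate, 16-bit samples, two channels), which the post itself does not state, and it ignores error correction and subcode overhead.

```python
# Assumed CD audio parameters (not given in the post).
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 2              # stereo


def cd_audio_bytes(minutes: float) -> int:
    """Raw PCM payload in bytes for a given playing time."""
    seconds = minutes * 60
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


size = cd_audio_bytes(74)
print(size)                    # 783216000 bytes of raw audio
print(round(size / 2**20, 1))  # 746.9 (MiB)
```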

@matthiasott@mastodon.social
2025-09-18 14:00:31

There are few investments more valuable than a set of good speakers and headphones. It makes such a vast difference when you’re listening to your favourite music every day. 😍🎶
My setup atm:
🎧 Austrian Audio Hi-X60
https://austrian.audio/product/hi-x60/
🔈 ADAM Audio T5V

@arXiv_csSD_bot@mastoxiv.page
2025-08-20 07:54:30

DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu
https://arxiv.org/abs/2508.13786

DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challeng…

@arXiv_csCV_bot@mastoxiv.page
2025-08-20 10:22:00

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei
https://arxiv.org/abs/2508.14033

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while en…

@fluchtkapsel@nerdculture.de
2025-10-20 21:54:29

Content warning: tech, linux, bash

@jredlund@social.linux.pizza
2025-08-21 17:08:43

AVL Drumkits and New Jams
#music #improvisation I have been using the Hydrogen drum sequencer to make drum grooves. I was saving them as audio files because when I saved them as midi files, the drum mapping was usually off for the drum kit plugins I had in Reaper. The disadvantages of this approach are a) I can't change the tempo, and b) I can't modify the groove by addin…

@Techmeme@techhub.social
2025-09-18 14:35:45

Inside Apple's Audio Labs, where the company develops features like spatial audio and uses tuning studios as reference points for AirPods and other products (Lance Ulanoff/TechRadar)
https://www.

Inside Apple's magical audio Labs, where the AirPods Pro sound you hear comes to life
Inside Apple Audio Labs

@michaels@mstdn.nursing.unibas.ch
2025-10-17 17:53:56

Can you transcribe from audio files? #AI #mistralai

Screenshot reads:

can you transcribe from audio files?

Yes, I can transcribe audio files! If you upload an audio file (such as a recording, voice memo, or any audio clip), I can transcribe the spoken content into text for you. Just upload the file, and I'll handle the rest.

What file format?

I can transcribe audio files in common formats, including:
• MP3
• WAV
• M4A
• OGG
• FLAC
• AAC
• WebM (audio only)
If your file is in one of these formats, you can upload it, and I'll transcribe the au…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-20 08:13:00

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha K…

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Audio comprehension, including speech, non-speech sounds, and music, is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each…

@ErikJonker@mastodon.social
2025-10-21 05:36:31

Trying out https://www.huxe.com/, a tool for AI-generated personal podcasts about anything you want. It's not perfect, but it works great for certain categories, like a podcast about today's great Hacker News post or the main AI industry headline. Very convenient when commuting.

Huxe | Content that exists because you do
Huxe transforms your daily information into personalized audio intelligence. While you commute, exercise, or just need a screen break — stay ahead without the endless scroll.

@Mediagazer@mstdn.social
2025-07-21 00:05:36

A look at the video podcasting surge; study: ~75% of podcast consumers play video episodes and ~30% play video episodes minimized or in background mode (Joseph Bernstein/New York Times)
https://www.nytimes.com/2025/07/20/style/p

@arXiv_csCR_bot@mastoxiv.page
2025-09-19 09:46:41

Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework
Sergio Benlloch-Lopez, Miquel Viel-Vazquez, Javier Naranjo-Alcazar, Jordi Grau-Haro, Pedro Zuccarello
https://arxiv.org/abs/2509.14657

Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework
The rapid proliferation of IoT nodes equipped with microphones and capable of performing on-device audio classification exposes highly sensitive data while operating under tight resource constraints. To protect against this, we present a defence-in-depth architecture comprising a security protocol that treats the edge device, cellular network and cloud backend as three separate trust domains, linked by TPM-based remote attestation and mutually authenticated TLS 1.3. A STRIDE-driven threat model…

@lightweight@mastodon.nzoss.nz
2025-08-19 01:23:37

This is a very good interview - https://www.rnz.co.nz/national/programmes/ninetonoon/audio/2019000546/is-big-tech-more-powerful-than-government - Prof Alexandra Andhov is spot on: a) bigtech is more powerful than our gov…

Is Big Tech more powerful than government?
How much of your life is ruled by an algorithm? It's possibly more than you think. Everyday algorithms are aiding us to make decisions - and increasingly making decisions about us.

@thomasfuchs@hachyderm.io
2025-10-15 16:44:47

PSA about common connectors on audio and phone equipment:
The "audio jack" on headphones etc. is called a "phone connector" (it was first used for phone switchboards all the way back in the 1800s).
The "phone jack" on telephones (looks like a smaller Ethernet jack) is called a "registered jack".
The "audio connector" on analog audio and video cables is called an "RCA plug".
Now you know jack shit.

@publicvoit@graz.social
2025-10-11 08:36:40

In case you need some #silence to relax: https://github.com/anars/blank-audio

GitHub - anars/blank-audio: Set of blank MP3 audio files
Set of blank MP3 audio files. Contribute to anars/blank-audio development by creating an account on GitHub.
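
If you'd rather generate silence locally than download it, here is a minimal sketch using Python's standard `wave` module. Note the assumptions: it writes uncompressed WAV rather than MP3 (the standard library has no MP3 encoder), and the function name `write_silence` and its defaults are made up for illustration; none of this comes from the linked repository.

```python
import wave


def write_silence(path: str, seconds: float, sample_rate: int = 44_100) -> int:
    """Write a mono 16-bit WAV file containing only zero samples.

    Returns the number of audio frames written.
    """
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        # Each 16-bit frame of silence is two zero bytes.
        wav.writeframes(b"\x00\x00" * n_frames)
    return n_frames
```

Calling `write_silence("silence.wav", 5)` would produce five seconds of silence; the file name is just an example.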

@mariyadelano@hachyderm.io
2025-10-20 20:41:14

I’ve worked over the past year to reduce the amount of noise in my consciousness on a daily basis.
By that I mean - information noise, not literal sounds “noise”. (That problem was solved long ago by some good earplugs and noise canceling earphones.)
I’ve gotten used to spending less time on social media, regularly blocking most apps on my devices (anything with a feed news, most work communication apps, etc.), putting my phone and other devices aside for extended periods of time. Often go to work places with my iPad explicitly having its WiFi turned off and selecting cafes that don’t offer WiFi at all.
Negotiated better boundaries at work and in personal life where I exchange messages with people less often but try to make those interactions more meaningful, and people rarely expect me to respond to requests in less than 24 hours. Spent a lot of time setting up custom notification settings on all apps that would allow it, so I get fewer pings. With software, choosing fewer cloud-based options and using tools that are simple and require as few interruptions as possible.
Accustomed myself to lower-tech versions of doing things I like to do: reading on paper, writing by hand, drawing in physical sketchbooks, got a typewriter for typing without a screen. Choosing to call people on audio more, trying to make more of an effort to see people in person. Going to museums to look at art instead of browsing Pinterest. Defaulting to the library when looking for information.
I’m commenting on this now for two reasons:
1. I am pretty proud of myself for how much I’ve actually managed to reduce the constant stream of modern life esp. as a remote worker in tech!
2. Now that I’ve reached a breaking point of reducing enough noise that it’s NOTICEABLE - I am struck by the silence. I don’t know what to do with it. I don’t know how to navigate it and fill it. I made this space to be able to read and write and think more deeply - for now I feel stuck in limbo where I’m just reacquainting myself with the concept of having any space in my mind at all.

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 09:16:00

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
https://arxiv.org/abs/2509.12591

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our metho…

@muz4now@mastodon.world
2025-10-15 23:12:01

There's Nothing Fluffy About This Audio
#inspiration #Sounds4MusicMaking
https://muz4no…

There's Nothing Fluffy About This Audio
Stan Stewart - @muz4now

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:19:20

Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method
Yuhang Jia, Hui Wang, Xin Nie, Yujie Guo, Lianru Gao, Yong Qin
https://arxiv.org/abs/2508.11966

Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method
Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major challenge for both assessing audio editing quality and improving the task itself. In this work, we propose a novel approach for audio editing task by incorporating expert knowledge into both the evaluation and dataset constructio…

@arXiv_csMM_bot@mastoxiv.page
2025-08-21 07:37:49

FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics
Chen Chen, Runze Li, Zejun Zhang, Pukun Zhao, Fanqing Zhou, Longxiang Wang, Haojian Huang
https://arxiv.org/abs/2508.14581 …

FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics
FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system…

@arXiv_csSI_bot@mastoxiv.page
2025-08-19 08:31:30

MAD: A Benchmark for Multi-Turn Audio Dialogue Fact-Checking
Chaewan Chun, Lysandre Terrisse, Delvin Ce Zhang, Dongwon Lee
https://arxiv.org/abs/2508.12186 https://

MAD: A Benchmark for Multi-Turn Audio Dialogue Fact-Checking
Despite the growing popularity of audio platforms, fact-checking spoken content remains significantly underdeveloped. Misinformation in speech often unfolds across multi-turn dialogues, shaped by speaker interactions, disfluencies, overlapping speech, and emotional tone, factors that complicate both claim detection and verification. Existing datasets fall short by focusing on isolated sentences or text transcripts, without modeling the conversational and acoustic complexity of spoken misinformat…

@vform@openbiblio.social
2025-09-03 19:49:11

New audio-sharing features for Android thanks to LE Audio
https://stadt-bremerhaven.de/neue-audio-sharing-funktionen-fuer-android-dank-le-audio/
"Following the introduction of LE Audio Auracast for hearing aids at the beginning of the year..."…

New audio-sharing features for Android thanks to LE Audio
Google is expanding the options for sharing audio content on Android. The new Bluetooth generation, LE Audio, makes it possible, ...

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 11:23:47

Crosslisted article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[4/6]:
- End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

@guerda@ruhr.social
2025-08-16 09:17:15

Cool feature. I'm not a big AI fan, but I do see real potential in transcription. And being able to generate SRT files seamlessly and offline is great.
FFmpeg 8.0 integrates Whisper: local audio transcription without the cloud | heise online

FFmpeg 8.0 integrates Whisper: local audio transcription without the cloud
The upcoming FFmpeg version 8.0 optionally includes Whisper. OpenAI's AI transcription can be used flexibly.
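
For anyone unfamiliar with the SRT subtitle format mentioned above, here is a rough sketch of what such a file contains: numbered cues, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the text, and a blank line. The code does not call FFmpeg or Whisper; the `(start, end, text)` segment tuples and the function names are hypothetical stand-ins for whatever a transcriber returns.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical transcription segments, for illustration only.
example = [(0.0, 1.25, " Hello "), (1.25, 3.0, "and welcome.")]
print(to_srt(example))
```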

@jamesthebard@social.linux.pizza
2025-09-19 19:01:58

Okay, got the new version finished. The BNC connectors are for RGB Sync, and the other two BNC connectors were removed and replaced with a DB9 connector for the glory that is CGA. Kept the RCA audio port on the side of the board because audio will mostly go in directly to the PVM. Overall I'm much more satisfied with this layout and connector choice than the previous version.
#electronics

A CMPTR signal breakout board that exposes RGB via BNC connectors, CGA via a DB9 port, and a lone audio port on the side.

@arXiv_csSE_bot@mastoxiv.page
2025-08-21 09:31:00

Towards a DSL to Formalize Multimodal Requirements
Marcos Gomez-Vazquez, Jordi Cabot
https://arxiv.org/abs/2508.14631 https://arxiv.org/pdf/2508.14631

Towards a DSL to Formalize Multimodal Requirements
Multimodal systems, which process multiple input types such as text, audio, and images, are becoming increasingly prevalent in software systems, enabled by the huge advancements in Machine Learning. This triggers the need to easily define the requirements linked to these new types of user interactions, potentially involving more than one modality at the same time. This remains an open challenge due to the lack of languages and methods adapted to the diverse nature of multimodal interactions, wi…

@malik@Mastodon.Social
2025-08-19 09:23:40

Do you need audio or stage equipment? We buy a lot at DJ-checkpoint.de (mainly lamps and cables, but they also have microphones and everything else); with the code DARKK you get 5% off (we get nothing from it). The boss, Daniel, is simply a good guy and worth supporting.
Maybe you have a few musician friends who would be interested. 🚀

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 09:44:21

Spatial Audio Motion Understanding and Reasoning
Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
https://arxiv.org/abs/2509.14666 https://arxiv.org/pdf/2509…

Spatial Audio Motion Understanding and Reasoning
Spatial audio reasoning enables machines to interpret auditory scenes by understanding events and their spatial attributes. In this work, we focus on spatial audio understanding with an emphasis on reasoning about moving sources. First, we introduce a spatial audio encoder that processes spatial audio to detect multiple overlapping events and estimate their spatial attributes, Direction of Arrival (DoA) and source distance, at the frame level. To generalize to unseen events, we incorporate an a…

@arXiv_csCV_bot@mastoxiv.page
2025-09-18 10:25:01

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo, Zhenbo Li, Wenwu Wang
https://arxiv.org/abs/2509.14097

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via a…
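
The abstract mentions an exponential moving average (EMA) teacher. As a generic illustration of that idea only, and not the paper's implementation, an EMA update blends the teacher's weights with the student's at each step. The flat lists of floats, the function name, and the `decay` value below are assumptions chosen for the sketch.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights: the teacher drifts slowly toward the student.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

A high decay keeps the teacher stable, which is the usual motivation for using its predictions as pseudo labels.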

@cdamian@rls.social
2025-08-19 18:08:07

LEHMANN Audio 007 - Raphael Dincsoy b2b Tamara Wirth by Lehmann Club / Labor Lehmann
https://on.soundcloud.com/PrAOEQSO0h2uIRJx0E
Raphael Dincsoy:
SoundCloud:

@seeingwithsound@mas.to
2025-10-17 08:18:25

Does spatialized audio enhance the creation of mental representations? Spoiler: No (for their SnapStick-based setup) https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2025.1660373/full "seven blind individuals and se…

Frontiers | Does spatialized audio enhance the creation of mental representations?
Navigating unfamiliar environments without vision is a considerable challenge for blind individuals, as it requires constructing accurate cognitive maps. Bin...

@v_i_o_l_a@openbiblio.social
2025-08-20 06:41:37

"DigiPop: a knowledge platform as a living document"
https://doi.org/10.1515/abitech-2025-0049
"DigiPop (Digitales Propädeutikum) is an innovative online knowledge platform that promotes digital skills in the humanities. It offers students and …

DigiPop: a knowledge platform as a living document
Abstract: DigiPop (Digitales Propädeutikum) is an innovative online knowledge platform that promotes digital skills in the humanities. It offers students and teachers practice-oriented access to digital tools, methods, and resources tailored specifically to the needs of humanities disciplines. The content is organized into eight modalities covering topics such as text, image, audio, and video editing as well as social media and …

@nobodyinperson@fosstodon.org
2025-08-19 17:27:21

What's the equivalent of #Immich but for audio and video files?
I'm looking for a self-hostable #FOSS thing that I can throw the rest of my media library in the face (movies, music, downloaded youtube videos, etc.) and it'll sort everything nicely, maybe even query databases to find metada…

@Techmeme@techhub.social
2025-07-21 00:01:14

A look at the video podcasting surge; study: ~75% of podcast consumers play video episodes and ~30% play video episodes minimized or in background mode (Joseph Bernstein/New York Times)
https://www.nytimes.com/2025/07/20/style/p

@danyork@mastodon.social
2025-08-18 09:28:18

New audio - What is happening with the Linux Professional Institute (LPI) certification program? Over the weekend I sat down with Matt Rice, LPI's Executive Director. Way back in 1998/99, I was one of the co-founders of LPI, but haven't been involved in 20 years until I was recently inducted into LPI's "Hall of Fellows". Matt visited me in Vermont to give me the physical award, and so I pulled out the microphones to learn more. 🙂

@ncoca@social.coop
2025-09-15 23:40:19

The last few days, #Slack, which I only use via a browser, keeps asking for video and audio permission, even though I'm not joining a call or using their audio feature.
Is slack trying to steal my #data and I'm only noticing because I monitor and track all such requests due to

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:28:31

Aligning Audio Captions with Human Preferences
Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser
https://arxiv.org/abs/2509.14659 https://arxiv.org/pdf/2…

Aligning Audio Captions with Human Preferences
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To effectively capture nuanced human preferences, we train a Contrastive Language-Audio Pretraining (CLAP)-based reward model using human-labeled pairwise…

@berlinbuzzwords@floss.social
2025-09-19 11:04:02

At Berlin Buzzwords 2025, Dhrubo Saha discussed how OpenSearch pipelines are integrating ML inference processors for powerful multi-modal search. Learn to search directly within images, audio, and text – locally on your own hardware!
Watch the full session: https://youtu.be/eg3NoKFxvzA?si=cP3o1p5tTQo-UZSA
Berlin Buzzwords returns on 7-9 June 2026! Get 36% off with our Trust Us Ticket: https://tickets.plainschwarz.com/bbuzz26/c/8Hvk0ZvJA/

@arXiv_csSD_bot@mastoxiv.page
2025-08-20 07:44:19

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao
https://arxiv.org/abs/2508.13624

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal …

@heiseonline@social.heise.de
2025-09-11 09:15:00

YouTube: multi-language audio now available to everyone
For two years, YouTube tested providing custom translations with selected channels. Soon all creators are to receive the feature.
http…

@arXiv_csCL_bot@mastoxiv.page
2025-08-20 08:09:29

Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
Duygu Altinok
https://arxiv.org/abs/2508.13376 https://arxiv.org/pdf/2508.1…

Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of…

@arXiv_csAI_bot@mastoxiv.page
2025-08-21 12:33:04

Replaced article(s) found for cs.AI. https://arxiv.org/list/cs.AI/new
[4/4]:
- When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Kim, Dingeto, Kwon, Choi, Lee, Park, Lee, Shin

@Techmeme@techhub.social
2025-10-21 18:10:52

Google Fi plans an AI call quality feature to filter out background sounds like wind next month, and will bring full RCS support on the web in December (Ryan Whitwam/Ars Technica)
https://arstechnica.com/gadgets/2025/10/google-…

Google Fi is getting enhanced web calls and messaging, AI bill summaries
Google’s MVNO gets better web support, clearer audio, and yes, more AI.

@arXiv_csMM_bot@mastoxiv.page
2025-08-19 07:49:40

Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation
Zhilin Gao, Yunhao Li, Sijing Wu, Yuqin Cao, Huiyu Duan, Guangtao Zhai
https://arxiv.org/abs/2508.12020

Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation
The Audio-to-3D-Gesture (A2G) task has enormous potential for various applications in virtual reality and computer graphics, etc. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail at reflecting the human preference of the generated 3D gestures. To cope with this problem, exploring human preference and an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly significant. In this paper, we introduce the Ges-QA …

@arXiv_csSD_bot@mastoxiv.page
2025-07-21 08:49:00

OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder
Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe
https://arxiv.org/abs/2507.14129

OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder
Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application for general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:54:20

FoleySpace: Vision-Aligned Binaural Spatial Audio Generation
Lei Zhao, Rujin Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
https://arxiv.org/abs/2508.12918 https://

FoleySpace: Vision-Aligned Binaural Spatial Audio Generation
Recently, with the advancement of AIGC, deep learning-based video-to-audio (V2A) technology has garnered significant attention. However, existing research mostly focuses on mono audio generation that lacks spatial perception, while the exploration of binaural spatial audio generation technologies, which can provide a stronger sense of immersion, remains insufficient. To solve this problem, we propose FoleySpace, a framework for video-to-binaural audio generation that produces immersive and spat…

@Mediagazer@mstdn.social
2025-09-04 19:55:55

The New York Times is shuttering its NYT Audio app, launched in May 2023, and integrating podcast archives and audio journalism into its News app (Mark Stenberg/Adweek)
https://www.adweek.com/media/new-york-times-audio-app-shuts-down/

EXCLUSIVE: The New York Times Is Shutting Down Its Audio App
The standalone product will shutter in October, as the publisher doubles down on audio in its flagship News app

@matthiasott@mastodon.social
2025-10-19 22:12:54

✍️ New post: Acoustic Room Treatment and Building Sound Panels, Part 1: Planning
#blogtober #audio #recording #sound

Acoustic Room Treatment and Building Sound Panels, Part 1: Planning · Matthias Ott
Matthias Ott is an independent user experience designer and web design engineer from Stuttgart, Germany. He also teaches Interface Prototyping at the Muthesius Academy of Fine Arts and Design.

@arXiv_eessAS_bot@mastoxiv.page
2025-08-20 07:53:40

End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao
https://arxiv.org/abs/2508.13576

End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
The cochlear implant (CI) is a remarkable biomedical device that successfully enables individuals with severe-to-profound hearing loss to perceive sound by converting speech into electrical stimulation signals. Despite advancements in the performance of recent CI systems, speech comprehension in noisy or reverberant conditions remains a challenge. Recent and ongoing developments in deep learning reveal promising opportunities for enhancing CI sound coding capabilities, not only through replicat…

@jamesthebard@social.linux.pizza
2025-08-18 17:49:25

Finally cleaned up and reorganized the consoles along with some of the audio stuff. I really need to dust more often...
#retroconsoles

A picture showing (from left-to-right): a 19" audio rack with a mixer and microphone amp on top, a compressor/expander (dbx 286s), a shelf holding miniDiscs and a mini PC, a TASCAM MD-350 MiniDisc player, a 4U rack-mounted PC (my main rig), and a Sony MDS-E12 MiniDisc player. A shelf full of retro consoles including an NES, Sega Master System, PS1, GBA (top shelf), SNES, Dreamcast, PS2, and Gamecube (middle shelf), Sega Saturn, Sega Genesis/Mega Drive (bottom shelf). Beside the shelves are tw…

@arXiv_csSD_bot@mastoxiv.page
2025-07-21 07:45:40

Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis
Yixiao Zhang, Haonan Chen, Ju-Chiang Wang, Jitong Chen
https://arxiv.org/abs/2507.13572

Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a tempor…

@arXiv_csCL_bot@mastoxiv.page
2025-09-18 10:11:41

Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
Sami Ul Haq, Sheila Castilho, Yvette Graham
https://arxiv.org/abs/2509.14023 https://ar…

Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 10:13:11

Exploring How Audio Effects Alter Emotion with Foundation Models
Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou
https://arxiv.org/abs/2509.15151

Exploring How Audio Effects Alter Emotion with Foundation Models
Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models - large-scale neural architectures pretrained on multimodal data - can be leveraged to analyze these effects. Such mo…

@arXiv_csMM_bot@mastoxiv.page
2025-09-19 08:28:41

MMED: A Multimodal Micro-Expression Dataset based on Audio-Visual Fusion
Junbo Wang, Yan Zhao, Shuo Li, Shibo Wang, Shigang Wang, Jian Wei
https://arxiv.org/abs/2509.14592 https…

MMED: A Multimodal Micro-Expression Dataset based on Audio-Visual Fusion
Micro-expressions (MEs) are crucial leakages of concealed emotion, yet their study has been constrained by a reliance on silent, visual-only data. To solve this issue, we introduce two principal contributions. First, MMED, to our knowledge, is the first dataset capturing the spontaneous vocal cues that co-occur with MEs in ecologically valid, high-stakes interactions. Second, the Asymmetric Multimodal Fusion Network (AMF-Net) is a novel method that effectively fuses a global visual summary with…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:46:01

Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance
Francisco Messina, Francesca Ronchini, Luca Comanducci, Paolo Bestagini, Fabio Antonacci
https://arxiv.org/abs/2509.14934

Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance
A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designe…

@arXiv_csCV_bot@mastoxiv.page
2025-08-12 12:48:03

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
https://arxiv.org/abs/2508.08248

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main…

@Techmeme@techhub.social
2025-09-10 22:51:06

YouTube rolls out multi-language audio after a two-year pilot, saying creators saw 25% of watch time coming from views in their video's non-primary language (Lauren Forristal/TechCrunch)
https://techcrunch.com/2025/09/10/yout

YouTube’s multi-language audio feature for dubbing videos rolls out to all creators | TechCrunch
YouTube's multi-language audio feature is rolling out to millions of creators, helping them reach a wider global audience.

@arXiv_csSD_bot@mastoxiv.page
2025-08-18 07:39:10

LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters
Haomin Zhang, Kristin Qi, Shuxin Yang, Zihao Chen, Chaofan Ding, Xinhan Di
https://arxiv.org/abs/2508.11074

LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters
Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models and it …

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:29:31

Not in Sync: Unveiling Temporal Bias in Audio Chat Models
Jiayu Yao, Shenghua Liu, Yiwei Wang, Rundong Cheng, Lingrui Mei, Baolong Bi, Zhen Xiong, Xueqi Cheng
https://arxiv.org/abs/2510.12185

Not in Sync: Unveiling Temporal Bias in Audio Chat Models
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through control…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-18 09:27:51

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models
Kevin Wilkinghoff, Zheng-Hua Tan
https://arxiv.org/abs/2509.13927 https://

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models
Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the type of sound events, as well as the direction and distance of their corresponding sources. Accomplishing this with a single audio encoder is demanding as the information required for each of these tasks is mostly independent of each other. As a result, the per…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 10:02:51

Spatial-CLAP: Learning Spatially-Aware audio-text Embeddings for Multi-Source Conditions
Kentaro Seki, Yuki Okamoto, Kouei Yamaoka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
https://arxiv.org/abs/2509.14785

Spatial-CLAP: Learning Spatially-Aware audio-text Embeddings for Multi-Source Conditions
Contrastive language-audio pretraining (CLAP) has achieved remarkable success as an audio-text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatial information lies in multi-source conditions, where the correct correspondence between each sound source and its location is required. To tackle this problem, we propose Spatial-CLAP, which introduces a content-aware s…

@Techmeme@techhub.social
2025-08-07 21:46:03

Source: Meta has acquired WaveForms AI, which is working on AI that understands and mimics emotion in audio and debuted in December with a $40M seed led by a16z (Kalley Huang/The Information)
https://www.theinformation.com/articles/meta-acquires-ai-audio-star…

Meta Acquires AI Audio Startup WaveForms
Meta Platforms has acquired WaveForms AI, a small startup working on artificial intelligence capable of understanding emotion and mimicking it in audio form, according to a person with direct knowledge. The acquisition comes as Meta continues to overhaul its efforts in AI, after stumbles earlier ...

@arXiv_csSD_bot@mastoxiv.page
2025-09-17 10:02:09

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
Han Yin, Jung-Woo Choi
https://arxiv.org/abs/2509.13148 https:/…

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate the LALM's audio understanding performance, researchers have proposed different benchmarks. However, key aspects for real-world interactions are underexplored in existing benchmarks, i.e., audio signals typically contain both speech and non-speech components, and energy levels of these components can vary signifi…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:53:50

MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning
Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid
https://arxiv.org/abs/2508.12709

MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning
Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL), especially for general audio and music representation learning. While recent methods have demonstrated strong performance, the role of the predictor module used at the output of such SSL systems remains mainly overlooked, despite being crucial for solving the pretext task at hand. In particular, this module should be able to deal with the ambiguity inherent in audio content, especially when it is compo…

@arXiv_eessAS_bot@mastoxiv.page
2025-08-18 08:34:00

ASAudio: A Survey of Advanced Spatial Audio Research
Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao
https://arxiv.org/abs/2508.10924 https://…

ASAudio: A Survey of Advanced Spatial Audio Research
With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and syste…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:46:50

Cross-Modal Knowledge Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
Qing Wang, Ya Jiang, Hang Chen, Sabato Marco Siniscalchi, Jun Du, Jianqing Gao
https://arxiv.org/abs/2508.12334

Cross-Modal Knowledge Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
This work presents a cross-modal knowledge distillation (CMKD) framework combined with multi-level data augmentation for low-resource audio-visual (AV) sound event localization and detection (SELD). An audio-only SELD model acts as the teacher, transferring knowledge to an AV student model through both output responses and intermediate feature representations. To enhance learning, data augmentation is applied by mixing features randomly selected from multiple network layers and associated loss …

@Techmeme@techhub.social
2025-09-10 11:11:16

Spotify plans to roll out lossless, offering 24-bit / 44.1 kHz FLAC audio, to all Premium users in the next two months, without needing a new higher-priced tier (Terrence O'Brien/The Verge)
https://www.theverge.com/spotify/775189/spotify-lossless-streaming-f…

Spotify adds lossless streaming after 8 years of teasing
Spotify is finally adding support for streaming lossless 24-bit / 44.1 kHz FLAC audio to its Premium plan.

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:33:21

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
https://arxiv.org/abs/2509.14684

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms …

@arXiv_csSD_bot@mastoxiv.page
2025-09-18 08:25:51

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing
Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang
https://arxiv.org/abs/2509.14003

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing
Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods that rely on full-caption or costly optimization often struggle with complex editing or lack practicality. In this work, we p…

@Techmeme@techhub.social
2025-09-18 00:31:08

Meta announces the Oakley Meta Vanguard, with a 12MP wide-angle camera, Garmin and Strava integrations, and immersive audio, launching on October 21 for $499 (Aisha Malik/TechCrunch)
https://techcrunch.com/2025/09/17/meta-unveils-its-ne…

Meta unveils its new Oakley Meta Vanguard smart glasses for athletes | TechCrunch
The new smart glasses are geared toward runners, cyclists, and other athletes.

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:27:51

Enhancing Situational Awareness in Wearable Audio Devices Using a Lightweight Sound Event Localization and Detection System
Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi, Zhen-Ting Ong, Woon-Seng Gan
https://arxiv.org/abs/2509.14650

Enhancing Situational Awareness in Wearable Audio Devices Using a Lightweight Sound Event Localization and Detection System
Wearable audio devices with active noise control (ANC) enhance listening comfort but often at the expense of situational awareness. However, this auditory isolation may mask crucial environmental cues, posing significant safety risks. To address this, we propose an environmental intelligence framework that combines Acoustic Scene Classification (ASC) with Sound Event Localization and Detection (SELD). Our system first employs a lightweight ASC model to infer the current environment. The scene p…

@arXiv_csSD_bot@mastoxiv.page
2025-09-18 08:43:31

Comprehensive Evaluation of CNN-Based Audio Tagging Models on Resource-Constrained Devices
Jordi Grau-Haro, Ruben Ribes-Serrano, Javier Naranjo-Alcazar, Marta Garcia-Ballesteros, Pedro Zuccarello
https://arxiv.org/abs/2509.14049

Comprehensive Evaluation of CNN-Based Audio Tagging Models on Resource-Constrained Devices
Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in audio tagging tasks. However, deploying these models on resource-constrained devices like the Raspberry Pi poses challenges related to computational efficiency and thermal management. In this paper, a comprehensive evaluation of multiple convolutional neural network (CNN) architectures for audio tagging on the Raspberry Pi is conducted, encompassing all 1D and 2D models from the Pretrained Audio Neural Networks (P…

@arXiv_csSD_bot@mastoxiv.page
2025-08-18 08:01:00

Pretrained Conformers for Audio Fingerprinting and Retrieval
Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
https://arxiv.org/abs/2508.11609 https://arx…

Pretrained Conformers for Audio Fingerprinting and Retrieval
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost…

@Techmeme@techhub.social
2025-09-03 18:15:55

Google Pixel 8 and newer models can now stream audio to multiple headphones at once with expanded Bluetooth LE Audio and Auracast support (Terrence O'Brien/The Verge)
https://www.theverge.com/news/770233/google-android-bluetooth-audio-le-multipl…

Pixel phones can now stream music for you and a friend at the same time
Google’s Pixel phones are getting expanded Bluetooth LE Audio support for streaming to multiple headphones at once.

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:49:21

Discrete optimal transport is a strong audio adversarial attack
Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan
https://arxiv.org/abs/2509.14959 https://

Discrete optimal transport is a strong audio adversarial attack
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Evaluated on ASVspoof2019 and ASVspoof5 with AASIST baselines, DOT yields consisten…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:06:59

Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro
https://arxiv.org/abs/2508.11818

Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
Chain-of-thought reasoning has demonstrated significant improvements in large language models and vision language models, yet its potential for audio language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare training corpus for sound reasoning…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 08:47:41

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
Yochai Yemini, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya
https://arxiv.org/abs/2509.14379

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the spee…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:25:39

Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
Bing Han, Anbai Jiang, Xinhu Zheng, Wei-Qiang Zhang, Jia Liu, Pingyi Fan, Yanmin Qian
https://arxiv.org/abs/2508.12230

Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
Machine anomalous sound detection (ASD) is a valuable technique across various applications. However, its generalization performance is often limited due to challenges in data collection and the complexity of acoustic environments. Inspired by the success of large pre-trained models in numerous fields, this paper introduces a robust ASD model that leverages self-supervised pre-trained models trained on large-scale speech and audio datasets. Although there are inconsistencies between the pre-tra…

@arXiv_csSD_bot@mastoxiv.page
2025-10-15 08:43:22

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping
https://arxiv.org/abs/2510.12000

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that dir…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 10:09:51

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic event Classification
Yuanjian Chen, Yang Xiao, Jinjie Huang
https://arxiv.org/abs/2509.14893 https://…

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic event Classification
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat audio and visual streams separately, fusing features later with contrastive or mutual information objectives. Recent advances explore multimodal graph learning, but most fail to distinguish between intra- and inter-modal t…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 09:33:01

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen
https://arxiv.org/abs/2509.14579

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges for flow-matching-based TTS models to remove audio prompt transcripts are identifying word boundaries during training…

@arXiv_csSD_bot@mastoxiv.page
2025-08-18 07:36:20

Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection
Chongyang Gao, Marco Postiglione, Isabel Gortner, Sarit Kraus, V. S. Subrahmanian
https://arxiv.org/abs/2508.10949

Perturbed Public Voices (P$^{2}$V): A Dataset for Robust Audio Deepfake Detection
Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset capturing three critical aspects of malicious deepfakes: (1) identity-consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state-of-the-art voice cloning (2020-2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trai…

@arXiv_csSD_bot@mastoxiv.page
2025-10-15 08:13:41

SeeingSounds: Learning Audio-to-Visual Alignment via Text
Simone Carnemolla, Matteo Pennisi, Chiara Russo, Simone Palazzo, Daniela Giordano, Concetto Spampinato
https://arxiv.org/abs/2510.11738

SeeingSounds: Learning Audio-to-Visual Alignment via Text
We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder, and, contextually grounded into the vis…

@arXiv_csSD_bot@mastoxiv.page
2025-09-16 07:37:56

Combining Audio and Non-Audio Inputs in Evolved Neural Networks for Ovenbird
Sergio Poo Hernandez, Vadim Bulitko, Erin Bayne
https://arxiv.org/abs/2509.10566 https://

Combining Audio and Non-Audio Inputs in Evolved Neural Networks for Ovenbird
In the last several years, the use of neural networks as tools to automate species classification from digital data has increased. This has been due in part to the high classification accuracy of image classification through Convolutional Neural Networks (CNN). In the case of audio data, CNN-based recognizers are used to automate the classification of species in audio recordings by using information from sound visualization (i.e., spectrograms). It is common for these recognizers to use the spect…

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 11:35:48

Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee
https://arxiv.org/abs/2510.11454 https://

Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro -- a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning p…

@arXiv_csSD_bot@mastoxiv.page
2025-10-15 08:56:42

Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis
Junnuo Wang
https://arxiv.org/abs/2510.12175 https://

Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis
Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signal…

@arXiv_csSD_bot@mastoxiv.page
2025-08-19 09:02:50

Prediction of Spotify Chart Success Using Audio and Streaming Features
Ian Jacob Cabansag, Paul Ntegeka
https://arxiv.org/abs/2508.11632 https://arxiv.org/…

Prediction of Spotify Chart Success Using Audio and Streaming Features
Spotify's streaming charts offer a real-time lens into music popularity, driving discovery, playlists, and even revenue potential. Understanding what influences a song's rise in ranks on these charts, especially early on, can guide marketing efforts, investment decisions, and even artistic direction. In this project, we developed a classification pipeline to predict a song's chart success based on its musical characteristics and early engagement data. Using all 2024 U.S. Top 200 Spotify Daily Cha…

@arXiv_csSD_bot@mastoxiv.page
2025-09-19 08:00:31

A long-form single-speaker real-time MRI speech dataset and benchmark
Sean Foley, Jihwan Lee, Kevin Huang, Xuan Shi, Yoonjeong Lee, Louis Goldstein, Shrikanth Narayanan
https://arxiv.org/abs/2509.14479

A long-form single-speaker real-time MRI speech dataset and benchmark
We release the USC Long Single-Speaker (LSS) dataset containing real-time MRI video of the vocal tract dynamics and simultaneous audio obtained during speech production. This unique dataset contains roughly one hour of video and audio data from a single native speaker of American English, making it one of the longer publicly available single-speaker datasets of real-time MRI speech data. Along with the articulatory and acoustic raw data, we release derived representations of the data that are s…

@arXiv_csSD_bot@mastoxiv.page
2025-08-20 10:30:18

Crosslisted article(s) found for cs.SD. https://arxiv.org/list/cs.SD/new
[1/1]:
- End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

@arXiv_csSD_bot@mastoxiv.page
2025-10-15 08:31:42

Audio-Guided Visual Perception for Audio-Visual Navigation
Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng
https://arxiv.org/abs/2510.11760 https://

Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and correspo…

@arXiv_csSD_bot@mastoxiv.page
2025-09-16 09:16:46

Neural Audio Codecs for Prompt-Driven Universal Source Separation
Adhiraj Banerjee, Vipul Arora
https://arxiv.org/abs/2509.11717 https://arxiv.org/pdf/2509…

Text-guided source separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM param…

@arXiv_csSD_bot@mastoxiv.page
2025-09-16 07:41:06

STASE: A spatialized text-to-audio synthesis engine for music generation
Tutti Chi, Letian Gao, Yixiao Zhang
https://arxiv.org/abs/2509.11124 https://arxiv…

While many text-to-audio systems produce monophonic or fixed-stereo outputs, generating audio with user-defined spatial properties remains a challenge. Existing deep learning-based spatialization methods often rely on latent-space manipulations, which can limit direct control over psychoacoustic parameters critical to spatial perception. To address this, we introduce STASE, a system that leverages a Large Language Model (LLM) as an agent to interpret spatial cues from text. A key feature of STA…

@arXiv_csSD_bot@mastoxiv.page
2025-09-16 07:39:46

Emoanti: audio anti-deepfake with refined emotion-guided representations
Xiaokang Li, Yicheng Gong, Dinghao Zou, Xin Cao, Sunbowen Lee
https://arxiv.org/abs/2509.10781 https://

Audio deepfake is so sophisticated that the lack of effective detection methods is fatal. While most detection systems primarily rely on low-level acoustic features or pretrained speech representations, they frequently neglect high-level emotional cues, which can offer complementary and potentially anti-deepfake information to enhance generalization. In this work, we propose a novel audio anti-deepfake system that utilizes emotional features (EmoAnti) by exploiting a pretrained Wav2Vec2 (W2V2) …

@arXiv_csSD_bot@mastoxiv.page
2025-10-14 10:45:08

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao
https://arxiv.org/abs/2510.1039…

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance re…
