Tootfinder

@EarthOrgUK@mastodon.energy
2025-10-14 09:51:03

On Website Technicals (2025-06) - Tech updates: Junited - Rigby to Buttersafe - GPTBot badness, captions, diversion delay, under-volt, X11 fossil. #Junited2025 - https://www.earth.org.uk/note-on-site-tech


@chris@mstdn.chrisalemany.ca
2025-09-13 15:11:58

Reading about Baldur von Schirach. Sounds familiar.
“In February 1928 he became a university group leader of the National Socialist German Students' League.”
“He worked to broaden the Nazi Party's appeal to the bourgeoisie. Schirach was supported by Hitler in internal elections, who also wanted the Nazi Party to have a broad social base.”
“Schirach was skilled at bureaucratic power struggles. He founded the School Children's Leagues (Schülerbünde) to create competition to the Hitler Youth. He made an ally of Joseph Goebbels.”
“Schirach was named national youth leader of the party in 1931.”
“With Heinrich Hoffmann, Schirach produced several propaganda books of Hoffmann's photographs, including "Hitler As No One Knows Him", "Youth Around Hitler", and "Hitler in His Mountains". Schirach wrote the captions. The books sold hundreds of thousands of copies, earning Schirach and Hoffmann substantial royalties.”
“On 16 June 1932, he was made Reichsführer of the Party's Hitler Youth organization, and resigned from the Student League. Under Schirach, the Hitler Youth stewarded NSDAP events, and 21 members died in 1932. Schirach described these deaths as "blood sacrifice" for propaganda purposes. One example was Herbert Norkus, a fifteen-year-old boy who was stabbed to death by Communists. In a 31 May 1932 speech, Schirach recounted Norkus's death and called for a "National Socialist dictatorship". Schirach gave a memorial speech on the third anniversary of Norkus's death in January 1935.”
#hitleryouth #fascism #theAmericanFascist

@whitequark@mastodon.social
2025-10-12 17:23:13

none of these words are in the bible

Make Meet calls with Google Meet
Important: Legacy calls upgrade to Meet calls, which have expanded features like live captions, in-call chat, stackable effects, cloud encryption, screen sharing and more.

As users move over to Meet calling, some legacy calling features are being upgraded. A few features, like Family Mode, Moments and Knock Knock, are no longer available.

To use the new calling experience, update your Meet app to the latest version.

When all parties in the call use the latest…

@yaya@jorts.horse
2025-10-10 10:16:55

:bighonk: https://mastodon.social/@closedcaptionsbot/115349289439806600

Closed Captions (@closedcaptionsbot@mastodon.social)
[horn honking]

@arXiv_csSD_bot@mastoxiv.page
2025-08-07 08:29:24

MiDashengLM: Efficient Audio Understanding with General Audio Captions
Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou
https://arxiv.org/abs/2508.03983

MiDashengLM: Efficient Audio Understanding with General Audio Captions
Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through the use of general audio captions using our novel ACAVCaps training dataset. MiDashengLM exclusively relies on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring fu…

@arXiv_csHC_bot@mastoxiv.page
2025-08-28 09:49:21

CapTune: Adapting Non-Speech Captions With Anchored Generative Models
Jeremy Zhengqi Huang, Caluã de Lacerda Pataca, Liang-Yuan Wu, Dhruv Jain
https://arxiv.org/abs/2508.19971

CapTune: Adapting Non-Speech Captions With Anchored Generative Models
Non-speech captions are essential to the video experience of deaf and hard of hearing (DHH) viewers, yet conventional approaches often overlook the diversity of their preferences. We present CapTune, a system that enables customization of non-speech captions based on DHH viewers' needs while preserving creator intent. CapTune allows caption authors to define safe transformation spaces using concrete examples and empowers viewers to personalize captions across four dimensions: level of detail, e…

@UP8@mastodon.social
2025-08-05 14:10:56

🤯 Interpretable EEG-to-Image Generation with Semantic Prompts
#eeg #ai

Interpretable EEG-to-Image Generation with Semantic Prompts
Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through…

@Migurski@mastodon.social
2025-10-08 20:02:39

Auto-captioning at this tech conference is having a tough time with background room chitchat

Screen with an empty podium and a bunch of weird English gobbledygook in the captions

@davidaugust@mastodon.online
2025-08-06 17:55:39

#USpol

Screenshot of a post by Thomas Massie @RepThomasMassie: A meme featuring two panels with captions. In the top panel, there is a scene from a movie showing a driver looking shocked inside a vehicle; caption reads: "Democrats leaving Texas to protect their district." In the bottom panel, there is an image of Speaker of the House Johnson driving the other way, looking out from a vehicle; caption reads: "Republicans leaving D.C. to protect the Epstein files." Aug 6, 2025 1:54pm UTC

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:34:31

Addressing the ID-Matching Challenge in Long Video Captioning
Zhantao Yang, Huangji Wang, Ruili Feng, Han Zhang, Yuting Hu, Shangwen Zhu, Junyan Li, Yu Liu, Fan Cheng
https://arxiv.org/abs/2510.06973

Addressing the ID-Matching Challenge in Long Video Captioning
Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have, usually suffer from limited generalization and depend on point-wis…

@samvarma@fosstodon.org
2025-10-02 19:36:23

It is really calling watch a guy with a heavy German accent on a YouTube video and see that the automatically generated captions are basically perfect, and I can't dictate a single sentence in perfect English into my $1400 flagship device without making a correction
*galling
#iOS26

@grumpybozo@toad.social
2025-08-30 23:07:19

Meta meta meta...
WTF is with every video having word-flash captions? The one in this toot is an example of one of multiple constant-flux caption styles. THAT'S NOT HOW PEOPLE READ!
I can barely watch such videos. https://journa.host/@lolgop/115119135266797749

@EarthOrgUK@mastodon.energy
2025-09-30 09:51:02

On Website Technicals (2020-02) - Tech updates: GSC Review annoyance, CSS dark mode, video captions, lazy loading, srcset issues. - https://www.earth.org.uk/note-on-site-technicals-33.html


@aardrian@toot.cafe
2025-07-22 20:43:16

Reason #2608 I do not trust “AI” to generate captions or transcripts:
“Complete silence is always hallucinated as 'ترجمة نانسي قنقر' in Arabic which translates as 'Translation by Nancy Qunqar'”
More examples in replies.
#a11y #accessibility

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:49:01

MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
https://arxiv.org/abs/2510.07310

MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether n…

@arXiv_csLG_bot@mastoxiv.page
2025-10-01 11:57:57

Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
John Gkountouras, Ivan Titov
https://arxiv.org/abs/2509.26594 https://a…

Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
Recent text-only models demonstrate remarkable mathematical reasoning capabilities. Extending these to visual domains requires vision-language models to translate images into text descriptions. However, current models, trained to produce captions for human readers, often omit the precise details that reasoning systems require. This creates an interface mismatch: reasoners often fail not due to reasoning limitations but because they lack access to critical visual information. We propose Adaptive…

@arXiv_csCL_bot@mastoxiv.page
2025-09-01 09:40:52

BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida
https://arxiv.org/abs/2508.21294

BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing …

@matthiasott@mastodon.social
2025-09-18 13:46:52

Had an amazing time speaking about Web Design Engineering at @… Freiburg last week! 🎉 It was an honour to be invited and to meet so many wonderful people and good friends there – a truly smashing experience! Thank you, everyone! 🤗💚🎈
📸 Photos by @…

Matthias on stage at Smashing Conf Freiburg, talking to the audience, with a monitor behind me displaying live captions.

Me on stage, viewed from afar with a truckload of modern CSS properties and functions on the screen behind me.

Vitaly Friedman and I talking on a red sofa during the Q&A after the talk.

@seeingwithsound@mas.to
2025-09-21 15:26:21

(YouTube, Chinese w/o captions but graphical English subtitles) Blind patient treated with ZM-02 optogenetic gene therapy #RP

@arXiv_eessAS_bot@mastoxiv.page
2025-08-29 08:41:41

Sound event detection with audio-text models and heterogeneous temporal annotations
Manu Harju, Annamaria Mesaros
https://arxiv.org/abs/2508.20703 https://…

Sound event detection with audio-text models and heterogeneous temporal annotations
Recent advances in generating synthetic captions based on audio and related metadata allow using the information contained in natural language as input for other audio tasks. In this paper, we propose a novel method to guide a sound event detection system with free-form text. We use machine-generated captions as complementary information to the strong labels for training, and evaluate the systems using different types of textual inputs. In addition, we study a scenario where only part of the tr…

@arXiv_csCV_bot@mastoxiv.page
2025-07-29 12:16:11

Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun, Xingxun Jiang, Haoyu Chen, Yante Li, Zheng Lian, Biu Liu, Yuan Zong, Wenming Zheng, Jukka M. Leppänen, Guoying Zhao
https://arxiv.org/abs/2507.21015

Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging s…

@EarthOrgUK@mastodon.energy
2025-07-22 19:51:03

On Website Technicals (2025-06) - Tech updates: Junited - Rigby to Buttersafe - GPTBot badness, captions, diversion delay, under-volt, X11 fossil. #Junited2025 - https://www.earth.org.uk/note-on-site-tech


@arXiv_csCV_bot@mastoxiv.page
2025-07-28 10:14:11

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang
https://arxiv.org/abs/2507.19362

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases …

@arXiv_csMM_bot@mastoxiv.page
2025-09-22 08:36:11

Jamendo-QA: A Large-Scale Music Question Answering Dataset
Junyoung Koh, Soo Yong Kim, Yongwon Choi, Gyu Hyeong Choi
https://arxiv.org/abs/2509.15662 https://

Jamendo-QA: A Large-Scale Music Question Answering Dataset
We introduce Jamendo-QA, a large-scale dataset for Music Question Answering (Music-QA). The dataset is built on freely licensed tracks from the Jamendo platform and is automatically annotated using the Qwen-Omni model. Jamendo-QA provides question-answer pairs and captions aligned with music audio, enabling both supervised training and zero-shot evaluation. Our resource aims to fill the gap of music-specific QA datasets and foster further research in music understanding, retrieval, and generati…

@arXiv_csCV_bot@mastoxiv.page
2025-10-06 10:03:49

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
https://arxiv.org/abs/2510.02898 …

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present \frameworkName{}, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of a…

@arXiv_csSD_bot@mastoxiv.page
2025-09-29 09:37:08

Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment
Yunyi Liu, Shaofan Yang, Kai Li, Xu Li
https://arxiv.org/abs/2509.21919 https://

Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment
Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for generating moving sounds given text prompts in a controllable fashion. To enable training, we construct a synthetic dataset that records moving sounds in binaural format, their spatial trajectories, and text captions about the sound event and spatial motion. Using this…

@arXiv_csCL_bot@mastoxiv.page
2025-09-17 09:16:00

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
https://arxiv.org/abs/2509.12591

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our metho…

@arXiv_eessAS_bot@mastoxiv.page
2025-09-19 09:28:31

Aligning Audio Captions with Human Preferences
Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser
https://arxiv.org/abs/2509.14659 https://arxiv.org/pdf/2…

Aligning Audio Captions with Human Preferences
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To effectively capture nuanced human preferences, we train a Contrastive Language-Audio Pretraining (CLAP)-based reward model using human-labeled pairwise…

@arXiv_csCV_bot@mastoxiv.page
2025-09-24 11:09:54

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva
https://arxiv.org/abs/2509.19207 …

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhance…

@arXiv_csCV_bot@mastoxiv.page
2025-10-02 10:55:51

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
Siheng Wan, Zhengtao Yao, Zhengdao Li, Junhao Dong, Yanshu Li, Yikai Li, Linshan Li, Haoyan Xu, Yijiang Li, Zhikang Dong, Huacan Wang, Jifeng Shen
https://arxiv.org/abs/2510.00974

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while main…

@ubuntourist@mastodon.social
2025-09-18 21:33:32

From the Ministry of Truth:
#resist #authoritarianism #fascism #news

EDITORIAL CARTOON:

Official seals for the Department of Defense, the Department of Health
and Human Services and the Department of Justice.

CAPTIONS:

* Department of War

* Department of War on Science

* Department of War on Democrats

Signed: Bramhall'25 (NYDN)

@arXiv_csCV_bot@mastoxiv.page
2025-09-29 11:16:57

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Song Fei, Tian Ye, Lujia Wang, Lei Zhu
https://arxiv.org/abs/2509.22414 https://

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics -- conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to …

@arXiv_eessAS_bot@mastoxiv.page
2025-07-24 08:00:59

Towards Robust Speech Recognition for Jamaican Patois Music Transcription
Jordan Madden, Matthew Stone, Dimitri Johnson, Daniel Geddez
https://arxiv.org/abs/2507.16834 https://

Towards Robust Speech Recognition for Jamaican Patois Music Transcription
Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance…

@arXiv_csCV_bot@mastoxiv.page
2025-07-25 10:21:02

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim
https://arxiv.org/abs/2507.18616

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in…

@arXiv_csCV_bot@mastoxiv.page
2025-07-23 10:31:22

Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation
Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
https://arxiv.org/abs/2507.16716

Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation
The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requ…
