Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun
https://arxiv.org/abs/2506.06095
Unisoma: A Unified Transformer-based Solver for Multi-Solid Systems
Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu
https://arxiv.org/abs/2506.06021
teaching a transformer net to approximate ray-tracing
input is a sequence of triangle data (limited to 4k triangles), and a sequence of 8x8 radiance pixels representing the camera (limited to 512x512 resolution)
output is a sequence of 8x8 pixels of the rendered scene. training took ~10 days; generation takes ~100ms (NVIDIA A100)
the results are pretty good. scaling will be a challenge, since attention is quadratic in sequence length
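a minimal back-of-envelope sketch of the setup described above (hypothetical shapes and helper names, not the paper's actual architecture): triangles and 8x8 pixel patches both become tokens, so the attention score matrix, and with it cost, grows quadratically in their combined count.

```python
# Hypothetical token accounting for a triangle-to-pixel-patch transformer.
# The 4k-triangle and 512x512 limits come from the post; everything else
# (function names, patch tokenization) is an illustrative assumption.

def num_tokens(n_triangles: int, width: int, height: int, patch: int = 8) -> int:
    """One token per triangle plus one token per 8x8 pixel patch."""
    return n_triangles + (width // patch) * (height // patch)

def attention_cost(seq_len: int) -> int:
    """Pairwise attention scores form a seq_len x seq_len matrix."""
    return seq_len * seq_len

# At the stated limits: 4096 triangles, 512x512 output.
tokens = num_tokens(4096, 512, 512)      # 4096 + 64*64 = 8192 tokens
cost = attention_cost(tokens)            # ~67M attention scores per layer

# Doubling resolution to 1024x1024 quadruples the patch tokens,
# which is why scaling resolution is the hard part.
tokens_2x = num_tokens(4096, 1024, 1024)  # 4096 + 128*128 = 20480
```

the quadratic blow-up (8192 -> 20480 tokens is a ~6x increase in attention cost) is what the scaling caveat above refers to.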
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
https://arxiv.org/abs/2506.06276
Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports?
Aziida Nanyonga, Joiner Keith, Turhan Ugur, Wild Graham
https://arxiv.org/abs/2506.06328
Mitigating Catastrophic Forgetting with Adaptive Transformer Block Expansion in Federated Fine-Tuning
Yujia Huo, Jianchun Liu, Hongli Xu, Zhenguo Ma, Shilong Wang, Liusheng Huang
https://arxiv.org/abs/2506.05977
SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery
Zhenyu Yu, Mohd. Yamani Idna Idris, Pei Wang, Yuelong Xia, Fei Ma, Rizwan Qureshi
https://arxiv.org/abs/2506.06176
Efficient Tactile Perception with Soft Electrical Impedance Tomography and Pre-trained Transformer
Huazhi Dong, Ronald B. Liu, Sihao Teng, Delin Hu, Peisan E (Sharel), Francesco Giorgio-Serchi, Yunjie Yang
https://arxiv.org/abs/2506.02824
External Attention Transformer: A Robust AI Model for Identifying Initial Eccentricity Signatures in Binary Black Hole Events in Simulated Advanced LIGO Data
Elahe Khalouei, Cristiano G. Sabiu, Hyung Mok Lee, A. Gopakumar
https://arxiv.org/abs/2506.03634
Optimizing Software Defined Battery Systems for Transformer Protection
Sonia Martin, Obidike Nnorom Jr., Philip Levis, Ram Rajagopal
https://arxiv.org/abs/2506.03439
A Transformer-Based Neural Network for Optimal Deterministic-Allocation and Anonymous Joint Auction Design
Zhen Zhang, Luowen Liu, Wanzhi Zhang, Zitian Guo, Kun Huang, Qi Qi, Qiang Liu, Xingxing Wang
https://arxiv.org/abs/2506.02435
TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans
Yueqian Guo, Tianzhao Li, Xin Lyu, Jiehaolin Chen, Zhaohan Wang, Sirui Xiao, Yurun Chen, Yezi He, Helin Li, Fan Zhang
https://arxiv.org/abs/2506.01077
Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution
Chang Eun Song, Priyansh Bhatnagar, Zihan Xia, Nam Sung Kim, Tajana Rosing, Mingu Kang
https://arxiv.org/abs/2506.00020
Identifying interactions across brain areas while accounting for individual-neuron dynamics with a Transformer-based variational autoencoder
Qi Xin, Robert E. Kass
https://arxiv.org/abs/2506.02263
Uncertainty-Aware Genomic Classification of Alzheimer's Disease: A Transformer-Based Ensemble Approach with Monte Carlo Dropout
Taeho Jo, Eun Hye Lee, Alzheimer's Disease Sequencing Project
https://arxiv.org/abs/2506.00662
Diffusion Transformer-based Universal Dose Denoising for Pencil Beam Scanning Proton Therapy
Yuzhen Ding, Jason Holmes, Hongying Feng, Martin Bues, Lisa A. McGee, Jean-Claude M. Rwigema, Nathan Y. Yu, Terence S. Sio, Sameer R. Keole, William W. Wong, Steven E. Schild, Jonathan B. Ashman, Sujay A. Vora, Daniel J. Ma, Samir H. Patel, Wei Liu
https://…
Transformative or Conservative? Conservation laws for ResNets and Transformers
Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré
https://arxiv.org/abs/2506.06194
SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
https://
Tug-of-war between idiom's figurative and literal meanings in LLMs
Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg
https://arxiv.org/abs/2506.01723
Discharge dynamics in a cylindrical SDBD prototype reactor under ns-pulsed and sinusoidal AC operation
Konstantinos Giotis (HVL, ECE, LSPM), Dimitrios Stefas (LSPM), Yanis Agha (LSPM), Hans Höft (INP), Xavier Duten (LSPM), Panagiotis Svarnas (HVL, ECE), Guillaume Lombardi (LSPM), Kristaq Gazeli (LSPM)
https://arxiv.org/ab…
Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
Yongqi Wang, Chunlei Zhang, Hangting Chen, Zhou Zhao, Dong Yu
https://arxiv.org/abs/2506.02997
Trading Under Uncertainty: A Distribution-Based Strategy for Futures Markets Using FutureQuant Transformer
Wenhao Guo, Yuda Wang, Zeqiao Huang, Changjiang Zhang, Shumin Ma
https://arxiv.org/abs/2505.05595
Generalizable, real-time neural decoding with hybrid state-space models
Avery Hee-Woon Ryoo, Nanda H. Krishna, Ximeng Mao, Mehdi Azabou, Eva L. Dyer, Matthew G. Perich, Guillaume Lajoie
https://arxiv.org/abs/2506.05320
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
https://arxiv.org/abs/2505.23416
Sleep Brain and Cardiac Activity Predict Cognitive Flexibility and Conceptual Reasoning Using Deep Learning
Boshra Khajehpiri, Eric Granger, Massimiliano de Zambotti, Fiona C. Baker, Mohamad Forouzanfar
https://arxiv.org/abs/2506.00279
BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses
Shadman Rohan, Ishita Sur Apan, Muhtasim Ibteda Shochcho, Md Fahim, Mohammad Ashfaq Ur Rahman, AKM Mahbubur Rahman, Amin Ahsan Ali
https://arxiv.org/abs/2506.01817
I have released LLama2.c64 - an LLM running on a C64 with 2MB REU. It runs the Llama2 LLM architecture, using the tokenizer and weights from the Tinystories 260K model.
It's a storytelling model that tries its best to spin your prompt into a story, as if told by a kindergarten child. It will generate one output token about every 8 minutes.
…
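a quick worked example of what the stated rate above implies in practice (the 100-token story length is an assumption for illustration, not a figure from the post):

```python
# Back-of-envelope timing for an LLM generating roughly one token
# every 8 minutes, as stated for LLama2.c64 above.
minutes_per_token = 8
story_tokens = 100            # assumed story length, for illustration

total_minutes = minutes_per_token * story_tokens
hours = total_minutes / 60    # 800 minutes -> ~13.3 hours per story
```

so even a short story is an overnight job on the C64, which is part of the charm.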
CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection
Woojin Shin, Donghwa Kang, Byeongyun Park, Brent Byunghoon Kang, Jinkyu Lee, Hyeongboo Baek
https://arxiv.org/abs/2505.23317
MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
https://arxiv.org/abs/2506.00385
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
https://arxiv.org/abs/2505.21925
Phi-Omni-ST: A multimodal language model for direct speech-to-speech translation
Yuxuan Hu, Haibin Wu, Ruchao Fan, Xiaofei Wang, Heng Lu, Yao Qian, Jinyu Li
https://arxiv.org/abs/2506.04392
DeepPlantCRE: A Transformer-CNN Hybrid Framework for Plant Gene Expression Modeling and Cross-Species Generalization
Yingjun Wu, Jingyun Huang, Liang Ming, Pengcheng Deng, Maojun Wang, Zeyu Zhang
https://arxiv.org/abs/2505.09883
Automatic detection of abnormal clinical EEG: comparison of a finetuned foundation model with two deep learning models
Aurore Bussalb, François Le Gac, Guillaume Jubien, Mohamed Rahmouni, Ruggero G. Bettinardi, Pedro Marinho R. de Oliveira, Phillipe Derambure, Nicolas Gaspard, Jacques Jonas, Louis Maillard, Laurent Vercueil, Hervé Vespignani, Philippe Laval, Laurent Koessler, Ulysse Gimenez
Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment
Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna
https://arxiv.org/abs/2506.00861
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)
Subba Reddy Oota, Akshett Jindal, Ishani Mondal, Khushbu Pahwa, Satya Sai Srinath Namburi, Manish Shrivastava, Maneesh Singh, Bapi S. Raju, Manish Gupta
https://arxiv.org/abs/2505.20029