Tootfinder

@arXiv_csLG_bot@mastoxiv.page
2025-06-12 10:00:11

NDCG-Consistent Softmax Approximation with Accelerated Convergence
Yuanhao Pu, Defu Lian, Xiaolong Chen, Xu Huang, Jin Chen, Enhong Chen
https://arxiv.org/abs/2506.09454

NDCG-Consistent Softmax Approximation with Accelerated Convergence
Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM …

@arXiv_statML_bot@mastoxiv.page
2025-06-02 10:19:47

This https://arxiv.org/abs/2405.06003 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_sta…

Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Softmax distributions are widely used in machine learning, including Large Language Models (LLMs), where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing in the setting of softmax models. That is, given an unknown softmax model, which is known to be one…

@arXiv_csLG_bot@mastoxiv.page
2025-06-10 19:20:05

This https://arxiv.org/abs/2505.17282 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Attention with Trained Embeddings Provably Selects Important Tokens
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j…

@arXiv_eessSP_bot@mastoxiv.page
2025-06-06 07:24:07

Joint Beamforming and Integer User Association using a GNN with Gumbel-Softmax Reparameterizations
Qing Lyu, Mai Vu
https://arxiv.org/abs/2506.05241 https:…

Joint Beamforming and Integer User Association using a GNN with Gumbel-Softmax Reparameterizations
Machine learning (ML) models can effectively optimize a multi-cell wireless network by designing the beamforming vectors and association decisions. Existing ML designs, however, often need to approximate the integer association variables with a probability distribution output. We propose a novel graph neural network (GNN) structure that jointly optimizes beamforming vectors and user association while guaranteeing association output as integers. The integer association constraints are satisfied …

@arXiv_csSD_bot@mastoxiv.page
2025-07-09 08:45:02

Differentiable Reward Optimization for LLM based TTS system
Changfeng Gao, Zhihao Du, Shiliang Zhang
https://arxiv.org/abs/2507.05911 https://

Differentiable Reward Optimization for LLM based TTS system
This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language model-based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereb…
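
As a rough illustration of the Gumbel-Softmax technique named in the abstract above, the following is a minimal sketch of the sampling step only. It is not the DiffRO implementation: the function name `gumbel_softmax_sample`, the parameter `tau`, and the example logits are assumptions made for illustration. In practice an autodiff framework would be needed for gradients to flow through the relaxed sample.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from the given logits.

    Hypothetical helper for illustration: adds Gumbel(0, 1) noise to each
    logit, divides by the temperature tau, and applies a softmax.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    noisy = []
    for x in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((x + gumbel_noise) / tau)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Seeded generator so repeated runs draw the same noise.
rng = random.Random(0)
sample = gumbel_softmax_sample([1.0, 0.5, -1.0], tau=0.5, rng=rng)
```

Lower values of `tau` push the sample closer to a one-hot vector, while higher values make it closer to uniform.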

@arXiv_mathNA_bot@mastoxiv.page
2025-06-19 09:06:37

Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity
Oluwadamilola Fasina, Ruben V. C. Pohle, Pei-Chun Su, Ronald R. Coifman
https://arxiv.org/abs/2506.15541

Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity
We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for invariance of the self-attention mechanism to softmax activation is obtained by appealing to paradifferential calculus (and is supported by computational examples), which relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for hierarchical organization of tensors …

@arXiv_csLG_bot@mastoxiv.page
2025-06-03 17:43:13

This https://arxiv.org/abs/2303.17475 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…

Learning distributed representations with efficient SoftMax normalization
Learning distributed representations, or embeddings, that encode the relational similarity patterns among objects is a relevant task in machine learning. A popular method to learn the embedding matrices $X, Y$ is optimizing a loss function of the term ${\rm SoftMax}(XY^T)$. The complexity required to calculate this term, however, runs quadratically with the problem size, making it a computationally heavy solution. In this article, we propose a linear-time heuristic approximation to compute the …

@arXiv_statML_bot@mastoxiv.page
2025-06-13 09:52:10

Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration
Kyohei Atarashi, Satoshi Oyama, Hiromi Arai, Hisashi Kashima
https://arxiv.org/abs/2506.10572

Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration
Controlling the output probabilities of softmax-based models is a common problem in modern machine learning. Although the $\mathrm{Softmax}$ function provides soft control via its temperature parameter, it lacks the ability to enforce hard constraints, such as box constraints, on output probabilities, which can be critical in certain applications requiring reliable and trustworthy models. In this work, we propose the box-constrained softmax ($\mathrm{BCSoftmax}$) function, a novel generalizatio…
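
The "soft control via its temperature parameter" mentioned above can be sketched as follows. This shows only the standard temperature-scaled softmax that the abstract contrasts against, not the proposed BCSoftmax function; the function name `softmax` with a `temperature` argument and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (illustrative helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)  # low temperature: more peaked
flat = softmax(logits, temperature=5.0)   # high temperature: closer to uniform
```

Temperature only reshapes the distribution as a whole. It cannot guarantee that any individual probability stays within a fixed lower or upper bound, which is the limitation the abstract points to.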

@arXiv_qfinGN_bot@mastoxiv.page
2025-06-06 07:38:03

Neural Jumps for Option Pricing
Duosi Zheng, Hanzhong Guo, Yanchu Liu, Wei Huang
https://arxiv.org/abs/2506.05137 https://arxiv.org/p…

Neural Jumps for Option Pricing
Recognizing the importance of jump risk in option pricing, we propose a neural jump stochastic differential equation model in this paper, which integrates neural networks as parameter estimators in the conventional jump diffusion model. To overcome the problem that the backpropagation algorithm is not compatible with the jump process, we use the Gumbel-Softmax method to make the jump parameter gradient learnable. We examine the proposed model using both simulated data and S&P 500 index options.…

@arXiv_csCV_bot@mastoxiv.page
2025-06-25 10:32:30

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
https://arxiv.org/abs/2506.19852…

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increases, akin to the physical decay of signal or waves over space and time …

@arXiv_csAR_bot@mastoxiv.page
2025-05-29 07:15:28

Refining Datapath for Microscaling ViTs
Can Xiao, Jianyi Cheng, Aaron Zhao
https://arxiv.org/abs/2505.22194 https://arxiv.org/pdf/250…

Refining Datapath for Microscaling ViTs
Vision Transformers (ViTs) leverage the transformer architecture to effectively capture global context, demonstrating strong performance in computer vision tasks. A major challenge in ViT hardware acceleration is that the model family contains complex arithmetic operations that are sensitive to model accuracy, such as the Softmax and LayerNorm operations, which cannot be mapped onto efficient hardware with low precision. Existing methods only exploit parallelism in the matrix multiplication ope…

@arXiv_csSD_bot@mastoxiv.page
2025-07-04 09:01:41

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
https://arxiv.org/abs/2507.02666

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appro…

@arXiv_csIR_bot@mastoxiv.page
2025-06-19 08:21:24

Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution
Shengjia Zhang, Jiawei Chen, Changdong Li, Sheng Zhou, Qihao Shi, Yan Feng, Chun Chen, Can Wang
https://arxiv.org/abs/2506.15120

Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution
Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths -- both can be viewed as augmentations of traditional losses with Distributional Robust Optimization (DRO), enhancing robustness to…

@arXiv_eessAS_bot@mastoxiv.page
2025-06-17 10:12:25

Evaluating Logit-Based GOP Scores for Mispronunciation Detection
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
https://arxiv.org/abs/2506.12067

Evaluating Logit-Based GOP Scores for Mispronunciation Detection
Pronunciation assessment relies on goodness of pronunciation (GOP) scores, traditionally derived from softmax-based posterior probabilities. However, posterior probabilities may suffer from overconfidence and poor phoneme separation, limiting their effectiveness. This study compares logit-based GOP scores with probability-based GOP scores for mispronunciation detection. We conducted our experiment on two L2 English speech datasets spoken by Dutch and Mandarin speakers, assessing classification …

@arXiv_csAR_bot@mastoxiv.page
2025-07-16 08:38:01

SystolicAttention: Fusing FlashAttention within a Single Systolic Array
Jiawei Lin, Guokai Chen, Yuanlong Li, Thomas Bourgeat
https://arxiv.org/abs/2507.11331

SystolicAttention: Fusing FlashAttention within a Single Systolic Array
Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays can only achieve high utilization for consecutive and large matrix multiplications. In contrast, FlashAttention requires frequently interleaved matrix multiplications and softmax operations. The frequent data swaps between the systolic array…
