NSF PAR Search | NSF Public Access Repository


Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective

Yan, Fanqi; Nguyen, Huy; Akbarian, Pedram; Ho, Nhat; Rinaldo, Alessandro (May 2025, cs.LG)

At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we represent the self-attention matrix as a mixture of experts and show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
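The token competition described in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the score values and the helper names `softmax` and `sigmoid` are assumptions chosen for illustration. Raising one token's score lowers every other softmax weight, while the element-wise sigmoid weights of the other tokens stay unchanged.

```python
import math

def softmax(scores):
    """Row-wise softmax: all weights share one normalizer, so they are coupled."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Element-wise sigmoid: each weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical attention scores for three tokens (illustrative values only).
scores = [1.0, 0.5, -0.2]
# Same scores, except the first token's score is raised.
boosted = [3.0, 0.5, -0.2]

softmax_before, softmax_after = softmax(scores), softmax(boosted)
sigmoid_before, sigmoid_after = sigmoid(scores), sigmoid(boosted)

print("softmax weight of token 2:", softmax_before[1], "->", softmax_after[1])
print("sigmoid weight of token 2:", sigmoid_before[1], "->", sigmoid_after[1])
```

Only the softmax weight of the second token changes between the two runs, because it is the only one that passes through a shared normalizer.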
Free, publicly-accessible full text available May 27, 2026
On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Nguyen, Huy; Doan, Thong T; Pham, Quang Pham; Bui, Nghi D. Q.; Ho, Nhat; Rinaldo, Alessandro (June 2025, cs.LG)

Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
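The two features named in this abstract can be sketched as follows. This is one plausible reading, stated as an assumption rather than the paper's exact formulation: sigmoid scores are computed per routed expert, the top-k scores are renormalized to sum to one, and a shared expert is always added. The function names, the scalar experts, and the affinity values are all hypothetical.

```python
import math

def normalized_sigmoid_gate(affinities, top_k):
    """Return {expert_index: gate} for the top_k experts.

    Assumed form: sigmoid score per expert, then renormalization
    over the selected experts so the gates sum to one.
    """
    scores = [1.0 / (1.0 + math.exp(-a)) for a in affinities]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}

def moe_output(x, shared_expert, routed_experts, affinities, top_k):
    """Shared expert is always active; routed experts are gated."""
    gates = normalized_sigmoid_gate(affinities, top_k)
    routed = sum(g * routed_experts[i](x) for i, g in gates.items())
    return shared_expert(x) + routed

# Hypothetical scalar experts and router affinities (illustrative only).
shared = lambda x: x
experts = [lambda x: 2.0 * x, lambda x: 3.0 * x, lambda x: -x]
affinities = [2.0, -1.0, 0.5]

gates = normalized_sigmoid_gate(affinities, top_k=2)
y = moe_output(1.0, shared, experts, affinities, top_k=2)
print("selected experts and gates:", gates)
print("layer output:", y)
```

In a real layer the affinities would be produced by a learned router from the input; they are passed in directly here to keep the sketch short.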
Free, publicly-accessible full text available June 12, 2026
Convergence Rates for Softmax Gating Mixture of Experts

Nguyen, Huy; Ho, Nhat; Rinaldo, Alessandro (March 2025, https://doi.org/10.48550/arXiv.2503.03213)

Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed strong identifiability condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.
Free, publicly-accessible full text available March 5, 2026
Global Optimality of the EM Algorithm for Mixtures of Two-Component Linear Regressions

https://doi.org/10.1109/TIT.2024.3435522

Kwon, Jeongyeol; Qian, Wei; Chen, Yudong; Caramanis, Constantine; Davis, Damek; Ho, Nhat (September 2024, IEEE Transactions on Information Theory)

Full Text Available
A Diffusion Process Perspective on Posterior Contraction Rates for Parameters

https://doi.org/10.1137/22M1516038

Mou, Wenlong; Ho, Nhat; Wainwright, Martin; Bartlett, Peter L; Jordan, Michael (June 2024, SIAM Journal on Mathematics of Data Science)

Full Text Available
Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models

Jin, Qiujiang; Ren, Tongzheng; Ho, Nhat; Mokhtari, Aryan (May 2024, Transactions on machine learning research)

The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $$\mathcal{O}(d^3)$$, where $$d$$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-squares loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $$n/d$$.
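The $$\mathcal{O}(d^2)$$ per-iteration cost mentioned in this abstract comes from updating an inverse-Hessian approximation with rank-two corrections instead of solving a linear system. The sketch below shows the textbook BFGS inverse update in plain Python. It is not the authors' implementation, and the function name `bfgs_inverse_update` and the test vectors are assumptions made for illustration.

```python
def bfgs_inverse_update(H, s, y):
    """Textbook BFGS update of a symmetric inverse-Hessian approximation H.

    s is the step (x_new - x_old) and y is the gradient difference
    (g_new - g_old). Only matrix-vector and outer products are used,
    so the cost is O(d^2) for dimension d.
    """
    d = len(s)
    sy = sum(si * yi for si, yi in zip(s, y))
    if sy <= 0.0:
        raise ValueError("curvature condition s^T y > 0 is violated")
    rho = 1.0 / sy
    # H @ y, one matrix-vector product
    Hy = [sum(H[i][j] * y[j] for j in range(d)) for i in range(d)]
    yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
    coef = rho * rho * yHy + rho
    # Expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    return [
        [H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + coef * s[i] * s[j]
         for j in range(d)]
        for i in range(d)
    ]

# Hypothetical step and gradient difference in two dimensions.
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.5]
y = [2.0, 0.25]
H1 = bfgs_inverse_update(H0, s, y)

# The updated matrix satisfies the secant condition H1 @ y = s.
H1y = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
print("H1 @ y =", H1y, "should match s =", s)
```

A Newton step would instead form and factor the full Hessian, which is where the $$\mathcal{O}(d^3)$$ cost arises.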
Full Text Available
A Diffusion Process Perspective on Posterior Contraction Rates for Parameters

Mou, Wenlong; Ho, Nhat; Wainwright, Martin; Bartlett, Peter L.; Jordan, Michael (May 2024, SIAM journal on mathematics of data science)

Full Text Available
A Primal-Dual Framework for Transformers and Neural Networks

Nguyen, Tan Minh; Nguyen, Tam; Ho, Nhat; Bertozzi, Andrea L.; Baraniuk, Richard; Osher, Stanley (February 2023, The Eleventh International Conference on Learning Representations (ICLR), 2023)

Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
Full Text Available
Joint Self-Supervised Image-Volume Representation Learning with Intra-inter Contrastive Clustering

https://doi.org/10.1609/aaai.v37i12.26687

Nguyen, Duy M.; Nguyen, Hoang; Mai, Truong T.; Cao, Tri; Nguyen, Binh T.; Ho, Nhat; Swoboda, Paul; Albarqouni, Shadi; Xie, Pengtao; Sonntag, Daniel (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, which allows incorporating long-range dependencies between slices inside 3D volumes. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.
Full Text Available
Revisiting Fixed Support Wasserstein Barycenter: Computational Hardness and Efficient Algorithms

Lin, Tianyi; Ho, Nhat; Chen, Xi; Cuturi, Marco; Jordan, Michael I. (December 2020, Advances in neural information processing systems)
Full Text Available
