
In this work, we revisit the generalization error of stochastic mirror descent for quadratically bounded losses studied in Telgarsky (2022). Quadratically bounded losses form a broad class of loss functions, capturing both Lipschitz and smooth functions, for both regression and classification problems. We study high-probability generalization for this class of losses on linear predictors in both realizable and non-realizable cases when the data are sampled i.i.d. or from a Markov chain. The prior work relies on an intricate coupling argument between the iterates of the original problem and those projected onto a bounded domain. This approach enables black-box application of concentration inequalities, but also leads to suboptimal guarantees due in part to the use of a union bound across all iterations. In this work, we depart significantly from the prior work of Telgarsky (2022) and introduce a novel approach for establishing high-probability generalization guarantees. In contrast to the prior work, we directly analyze the moment generating function of a novel supermartingale sequence and leverage the structure of stochastic mirror descent. As a result, we obtain improved bounds in all the aforementioned settings. Specifically, in the realizable case and the non-realizable case with light-tailed sub-Gaussian data, we improve the bounds by a $\log T$ factor, matching the correct rates of $1/T$ and $1/\sqrt{T}$, respectively. In the more challenging case of heavy-tailed polynomial data, we improve the existing bound by a $\mathrm{poly}(T)$ factor.
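For illustration only (not reproduced from the paper): the update underlying this analysis is the standard stochastic mirror descent step $\nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta_t g_t$. The following minimal Python sketch assumes an unconstrained Euclidean mirror map (which reduces to SGD) and a hypothetical squared-loss linear model; the paper's actual loss class, step sizes, and data model are not reproduced here.

```python
import numpy as np

def stochastic_mirror_descent(grad_fn, w0, step_sizes, mirror_grad, mirror_grad_inv):
    """Unconstrained SMD: nabla_psi(w_{t+1}) = nabla_psi(w_t) - eta_t * g_t."""
    w = np.array(w0, dtype=float)
    for t, eta in enumerate(step_sizes):
        g = grad_fn(w, t)                         # stochastic gradient at step t
        theta = mirror_grad(w) - eta * g          # step in the dual (mirror) space
        w = mirror_grad_inv(theta)                # map back to the primal space
    return w

# Euclidean mirror map psi(w) = 0.5 * ||w||^2: both maps are the identity,
# so the update reduces to plain SGD.
identity = lambda w: w

# Hypothetical data model: squared loss of a linear predictor on i.i.d. samples,
# a smooth (hence quadratically bounded) loss.
rng = np.random.default_rng(0)
def sq_loss_grad(w, t):
    x = rng.normal(size=w.shape)
    y = rng.normal()
    return (x @ w - y) * x

w_hat = stochastic_mirror_descent(sq_loss_grad, np.zeros(5),
                                  step_sizes=[0.1 / np.sqrt(t + 1) for t in range(1000)],
                                  mirror_grad=identity, mirror_grad_inv=identity)
```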

The free boundary problem for a two-dimensional fluid permeating a porous medium is studied. This is known as the one-phase Muskat problem and is mathematically equivalent to the vertical Hele-Shaw problem driven by gravity. We prove that if the initial free boundary is the graph of a periodic Lipschitz function, then there exists a global-in-time Lipschitz solution in the strong sense, and it is the unique viscosity solution. The proof requires quantitative estimates for layer potentials and pointwise elliptic regularity in Lipschitz domains. This is the first construction of unique global strong solutions for the Muskat problem with initial data of arbitrary size.
We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur suboptimal error or have high communication and/or runtime complexity. We propose a new algorithmic framework, ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a $1+o(1)$ factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an optimal algorithm such as PrivUnitG in the lower-dimensional space. In addition, we show that, by appropriately correlating the random projection matrices across devices, we can achieve fast server runtime. We mathematically analyze the error of the algorithm in terms of properties of the random projections, and study two instantiations. Lastly, our experiments for private mean estimation and private federated learning demonstrate that our algorithms empirically obtain nearly the same utility as optimal ones while having significantly lower communication and computational cost.
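A minimal sketch of the local-randomizer structure described above, assuming a Gaussian random projection and a trivial placeholder in place of PrivUnitG; the actual privatization mechanism, the debiasing/scaling constants, and the correlated-projection construction for fast server runtime are omitted, so this only illustrates the project-normalize-randomize pipeline.

```python
import numpy as np

def proj_unit_randomizer(x, proj, local_randomizer):
    """Project the input to a random low-dimensional subspace, normalize the
    result, then run a lower-dimensional private randomizer on it."""
    z = proj @ x                                  # random projection to k dims
    z = z / max(np.linalg.norm(z), 1e-12)         # renormalize to the unit sphere
    return local_randomizer(z)                    # e.g., PrivUnitG in k dims

def server_estimate(reports, projs):
    """Lift each k-dimensional report back to d dimensions and average."""
    lifted = [P.T @ r for P, r in zip(projs, reports)]
    return np.mean(lifted, axis=0)

# Hypothetical usage with a trivial, non-private placeholder randomizer.
d, k, n = 1000, 64, 50
rng = np.random.default_rng(1)
projs = [rng.normal(size=(k, d)) / np.sqrt(k) for _ in range(n)]
data = [x / np.linalg.norm(x) for x in rng.normal(size=(n, d))]   # unit-ball inputs
reports = [proj_unit_randomizer(x, P, local_randomizer=lambda z: z)
           for x, P in zip(data, projs)]
mean_hat = server_estimate(reports, projs)
```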

In this work, we study the convergence in high probability of clipped gradient methods when the noise distribution has heavy tails, i.e., with bounded $p$-th moments for some $1 < p \leq 2$. Prior works in this setting follow the same recipe of using concentration inequalities and an inductive argument with a union bound to control the iterates across all iterations. This method results in an increase in the failure probability by a factor of $T$, where $T$ is the number of iterations. We instead propose a new analysis approach based on bounding the moment generating function of a well-chosen supermartingale sequence. We improve the dependency on $T$ in the convergence guarantee for a wide range of algorithms with clipped gradients, including stochastic (accelerated) mirror descent for convex objectives and stochastic gradient descent for nonconvex objectives. Our high-probability bounds achieve the optimal convergence rates and match the best currently known in-expectation bounds. Our approach naturally allows the algorithms to use time-varying step sizes and clipping parameters when the time horizon is unknown, which appears difficult or even impossible using existing techniques from prior works. Furthermore, we show that in the case of clipped stochastic mirror descent, several problem constants, including the initial distance to the optimum, are not required when setting step sizes and clipping parameters.
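A minimal sketch of a clipped stochastic gradient step with time-varying step sizes and clipping thresholds; the schedules shown are hypothetical placeholders for illustration and are not the parameter settings analyzed in the paper.

```python
import numpy as np

def clipped_sgd(grad_fn, w0, num_steps, step_size_fn, clip_fn):
    """SGD with per-step gradient clipping; both the step size and the
    clipping threshold may vary with t (no horizon T needed in advance)."""
    w = np.array(w0, dtype=float)
    for t in range(num_steps):
        g = grad_fn(w, t)                 # (possibly heavy-tailed) stochastic gradient
        tau = clip_fn(t)                  # clipping threshold at step t
        norm = np.linalg.norm(g)
        if norm > tau:
            g = g * (tau / norm)          # clip the gradient to norm at most tau
        w = w - step_size_fn(t) * g
    return w

# Hypothetical horizon-free schedules, for illustration only.
step_size_fn = lambda t: 0.1 / np.sqrt(t + 1)
clip_fn = lambda t: 10.0 * np.sqrt(t + 1)
```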


Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al. (2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle which predicts which elements will appear many times in the stream. We give a novel algorithm which, in some parameter regimes, already theoretically outperforms the learning-based algorithm of Hsu et al. without the use of any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.
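A minimal sketch of the learned-oracle architecture of Hsu et al., in which predicted heavy hitters are counted exactly and the remaining items go into a standard Count-Min sketch; the oracle, hashing scheme, and table sizes are illustrative assumptions, and this is not the new algorithm proposed in the paper.

```python
import numpy as np

class LearnedCountMin:
    """Learning-augmented frequency sketch: items the (hypothetical) heavy-hitter
    oracle flags are counted exactly; everything else goes into a Count-Min table."""

    def __init__(self, width, depth, oracle, seed=0):
        self.counts = np.zeros((depth, width), dtype=np.int64)
        self.width, self.depth = width, depth
        self.oracle = oracle                      # predicts whether an item is heavy
        self.exact = {}                           # exact counters for predicted heavy hitters
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=depth)

    def _bucket(self, item, row):
        return hash((item, int(self.salts[row]))) % self.width

    def update(self, item, count=1):
        if self.oracle(item):
            self.exact[item] = self.exact.get(item, 0) + count
        else:
            for r in range(self.depth):
                self.counts[r, self._bucket(item, r)] += count

    def estimate(self, item):
        if item in self.exact:
            return self.exact[item]
        # Count-Min estimate: minimum over rows, an overestimate in the worst case.
        return min(self.counts[r, self._bucket(item, r)] for r in range(self.depth))
```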

In this work, we study the problem of privately maximizing a submodular function in the streaming setting. Extensive work has been done on privately maximizing submodular functions in the general case when the function depends upon the private data of individuals. However, when the size of the data stream drawn from the domain of the objective function is large or arrives very fast, one must privately optimize the objective within the constraints of the streaming setting. We establish fundamental differentially private baselines for this problem and then derive better trade-offs between privacy and utility for the special case of decomposable submodular functions. A submodular function is decomposable when it can be written as a sum of submodular functions; this structure arises naturally when each summand function models the utility of an individual and the goal is to study the total utility of the whole population, as in the well-known Combinatorial Public Projects Problem. Finally, we complement our theoretical analysis with experimental corroboration.
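For concreteness, a decomposable objective has the following form (notation assumed here for illustration, not taken from the paper):

```latex
% Decomposable submodular objective: a sum of per-individual submodular utilities f_i,
% as in the Combinatorial Public Projects setting.
f(S) = \sum_{i=1}^{n} f_i(S),
\qquad \text{each } f_i \text{ submodular: }
f_i(S \cup \{e\}) - f_i(S) \;\ge\; f_i(T \cup \{e\}) - f_i(T)
\quad \text{for all } S \subseteq T \text{ and } e \notin T.
```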