NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications

Jaiswal, Ajay; Wang, Yifan; Yin, Lu; Liu, Shiwei; Chen; Runjin; Zhao, Jiawei; Grama, Ananth; Tian, Yuandong; Wang, Zhangyang (July 2025, International Conference on Machine Learning (ICML))

Free, publicly-accessible full text available July 13, 2026
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, Jiawei; Zhang, Zhenyu; Chen, Beidi; Wang, Zhangyang; Anandkumar, Anima; Tian, Yuandong (July 2024, International Conference on Machine Learning (ICML))

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
more » « less
Full Text Available
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, Jiawei; Zhang, Zhenyu; Chen, Beidi; Wang, Zhangyang; Anandkumar, Anima; Tian, Yuandong (July 2024, International Conference on Machine Learning (ICML))

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
more » « less
Full Text Available
EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS

Xiao, Guangxuan; Tian, Yuandong; Chen, Beidi; Han, Song; Lewis, Mike (May 2024, The Twelfth International Conference on Learning Representations)

Full Text Available
Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games

Zhao, Yulai; Tian, Yuandong; Lee, Jason D; Du, Simon S (January 2022, h International Conference on Artificial Intelligence and Statistics (AISTATS))

Full Text Available
Multi-objective Optimization by Learning Space Partitions

Zhao, Yiyang; Wang, Linnan; Yang, Kevin; Zhang, Tianjun; Guo, Tian; Tian, Yuandong (January 2022, International Conference on Learning Representations (ICLR'22))

Full Text Available
Network planning with deep reinforcement learning

https://doi.org/10.1145/3452296.3472902

Zhu, Hang; Gupta, Varun; Ahuja, Satyajeet Singh; Tian, Yuandong; Zhang, Ying; Jin, Xin (August 2021, Proceedings of the 2021 ACM SIGCOMM 2021 Conference)

Network planning is critical to the performance, reliability and cost of web services. This problem is typically formulated as an Integer Linear Programming (ILP) problem. Today's practice relies on hand-tuned heuristics from human experts to address the scalability challenge of ILP solvers. In this paper, we propose NeuroPlan, a deep reinforcement learning (RL) approach to solve the network planning problem. This problem involves multi-step decision making and cost minimization, which can be naturally cast as a deep RL problem. We develop two important domain-specific techniques. First, we use a graph neural network (GNN) and a novel domain-specific node-link transformation for state encoding, in order to handle the dynamic nature of the evolving network topology during planning decision making. Second, we leverage a two-stage hybrid approach that first uses deep RL to prune the search space and then uses an ILP solver to find the optimal solution. This approach resembles today's practice, but avoids human experts with an RL agent in the first stage. Evaluation on real topologies and setups from large production networks demonstrates that NeuroPlan scales to large topologies beyond the capability of ILP solvers, and reduces the cost by up to 17% compared to hand-tuned heuristics.
more » « less
Full Text Available
MADE: Exploration via Maximizing Deviation from Explored Regions

Zhang, Tianjun; Rashidinejad, Paria; Jiao, Jiantao; Tian, Yuandong; Gonzalez, Joseph E; Russell, Stuart (January 2021, Advances in Neural Information Processing Systems 34 (NeurIPS 2021))

Full Text Available
FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

https://doi.org/10.1109/CVPR42600.2020.01298

Wan, Alvin; Dai, Xiaoliang; Zhang, Peizhao; He, Zijian; Tian, Yuandong; Xie, Saining; Wu, Bichen; Yu, Matthew; Xu, Tao; Chen, Kan; et al (June 2020, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
null (Ed.)
Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy, 15% fewer FLOPs than MobileNetV3-Small; and with similar accuracy but 20% fewer FLOPs than Efficient-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.
more » « less
Full Text Available

Search for: All records