Memory-Efficient LLM Training with Online Subspace Descent

Liang, Kaizhao; Liu, Bo; Chen, Lizhang; Liu, Qiang

Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. This work provides the first convergence guarantee for arbitrary update rules of projection matrices, generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including common ones such as LION and Adam. Inspired by this theoretical understanding, the authors propose Online Subspace Descent, a new family of subspace descent optimizers that do not rely on SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates it with online PCA. This approach is flexible and introduces minimal overhead to training. Experiments show that for pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across settings, narrowing the gap with full-rank baselines.
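The two ideas in the abstract, keeping optimizer state in a low-rank subspace and refreshing the projection matrix without SVD, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the plain momentum optimizer, and the choice of an Oja-style step followed by QR re-orthonormalization as one simple form of online PCA. The authors' actual update rule may differ.

```python
import numpy as np


def reconstruction_error(P, G):
    """Squared Frobenius error of G after projecting onto the columns of P."""
    R = G - P @ (P.T @ G)
    return float(np.sum(R * R))


def online_pca_update(P, G, step=1.0):
    """Hypothetical online PCA step (Oja-style), used here in place of SVD.

    P: (m, r) projection matrix with orthonormal columns.
    G: (m, n) current gradient.
    """
    # Normalize by the gradient energy so that `step` is scale-free.
    scale = np.sum(G * G) + 1e-12
    # Pull the columns of P toward the dominant left subspace of G.
    P_new = P + step * (G @ (G.T @ P)) / scale
    # Re-orthonormalize with a reduced QR factorization (no SVD needed).
    Q, _ = np.linalg.qr(P_new)
    return Q


def subspace_momentum_step(W, G, P, M, lr=1e-2, beta=0.9):
    """Momentum update whose state M lives in the r-dimensional subspace.

    W: (m, n) weights, G: (m, n) gradient, M: (r, n) projected momentum.
    Assumption: plain momentum stands in for Adam or LION to keep this short.
    """
    # The optimizer state is (r, n) instead of (m, n), which saves memory.
    M = beta * M + (1.0 - beta) * (P.T @ G)
    # Map the low-rank update back to the full parameter space.
    W = W - lr * (P @ M)
    return W, M
```

In a training loop under these assumptions, each iteration would call `subspace_momentum_step` with the current gradient and then `online_pca_update` to refresh `P`, so the projection tracks the gradient subspace continuously rather than being recomputed by a periodic SVD.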
