RADIUS: RANGE-BASED GRADIENT SPARSITY FOR LARGE FOUNDATION MODEL PRE-TRAINING

Zheng, M; Zhang, Z

This content will become publicly available on May 17, 2026

We present Radius, a gradient sparsity algorithm and system to accelerate large foundation model (FM) training while preserving downstream task performance. Radius leverages two key insights in large FM pre-training: 1) only a small portion of gradients contribute to the model updates in each iteration, and 2) the spatial distribution of the gradients with large magnitude is stable over time. Radius overcomes the scaling problem of existing top-k sparsity methods, as it maintains the structure of sparse gradients and thus avoids dense communication. We examine the convergence and speed of Radius when pre-training GPT models (355M and 2.0B parameters) in a data-parallel setting and compare it with baseline top-k sparsification methods. Our results show that the existing top-k method with the AdamW optimizer fails to converge, and that its training speed improvement from sparse communication is marginal. In contrast, Radius with 40% sparsity reduces per-step training time by 21% (19% for overall training time) across 64 NVIDIA A100 GPUs connected by the Slingshot 11 interconnect, while preserving downstream task performance.
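
The two insights in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and the abstract does not specify how Radius selects or refreshes coordinates. The sketch assumes a simple scheme in which all data-parallel workers share one index set, chosen by top-k magnitude on a dense step and reused for several following steps. The names `topk_mask`, `sparse_allreduce`, `train_steps`, `density`, and `refresh_every` are hypothetical.

```python
def topk_mask(grad, density):
    """Return the sorted indices of the largest-magnitude entries of grad.

    `density` is the fraction of coordinates kept (an assumed parameter).
    """
    k = max(1, int(len(grad) * density))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return sorted(order[:k])


def sparse_allreduce(worker_grads, indices):
    """Average only the selected coordinates across workers.

    Every worker uses the same index set, so each contributes exactly
    len(indices) values and the result keeps the same sparse structure.
    """
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n for i in indices]


def train_steps(grads_per_step, density, refresh_every):
    """Yield (indices, averaged values) per step.

    The index set is recomputed from a dense average every `refresh_every`
    steps (an assumed schedule) and reused in between, relying on the
    observation that large-magnitude gradients stay in stable locations.
    """
    indices = None
    for step, worker_grads in enumerate(grads_per_step):
        if step % refresh_every == 0:
            n = len(worker_grads)
            dense = [sum(col) / n for col in zip(*worker_grads)]
            indices = topk_mask(dense, density)
        yield indices, sparse_allreduce(worker_grads, indices)
```

Because the index set is identical on every worker between refreshes, the reduction operates on fixed-size buffers. Per-worker top-k selection, by contrast, can produce a different index set on each worker, so the union of indices grows with the number of workers.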

Award ID(s): 2340011

PAR ID: 10591330

Author(s) / Creator(s): Zheng, M; Zhang, Z

Publisher / Repository: Eighth Conference on Machine Learning and Systems

Date Published: 2025-05-17

Format(s): Medium: X

Location: Santa Clara, CA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
The DOI is not currently available.