

This content will become publicly available on June 5, 2026

Title: HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS converges up to 7.5x faster than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD, matching or exceeding its accuracy on standard language modeling and downstream benchmarks, while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
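The abstract describes a two-level parameter-server hierarchy: workers push updates to a local parameter server (LPS) inside their region, and each LPS periodically folds its accumulated update into a global parameter server (GPS) that applies server-side momentum. The NumPy sketch below illustrates that control flow only; the class names, the momentum-based merge rule, the learning rates, and the sync interval are illustrative assumptions, not the algorithm specified in the paper.

    import numpy as np

    class GlobalParameterServer:
        """Merges accumulated regional deltas using server-side momentum (assumed rule)."""
        def __init__(self, dim, global_lr=0.5, momentum=0.5):
            self.params = np.zeros(dim)
            self.velocity = np.zeros(dim)
            self.global_lr = global_lr
            self.momentum = momentum

        def merge(self, regional_delta):
            # Fold one region's accumulated update into the global model.
            self.velocity = self.momentum * self.velocity + regional_delta
            self.params += self.global_lr * self.velocity
            return self.params.copy()            # fresh global model sent back to that region

    class LocalParameterServer:
        """Per-region server: applies worker gradients immediately (asynchronously),
        accumulates the resulting drift, and only rarely talks across regions."""
        def __init__(self, gps, dim, local_lr=0.1, sync_every=8):
            self.gps = gps
            self.params = gps.params.copy()
            self.accum = np.zeros(dim)            # accumulated local update since the last global sync
            self.local_lr = local_lr
            self.sync_every = sync_every
            self.steps = 0

        def push_gradient(self, grad):
            delta = -self.local_lr * grad
            self.params += delta                  # cheap, fast intra-region update
            self.accum += delta
            self.steps += 1
            if self.steps % self.sync_every == 0: # the expensive inter-region sync happens rarely
                self.params = self.gps.merge(self.accum)
                self.accum[:] = 0.0

    # Toy run: two regions asynchronously optimizing f(x) = ||x - 1||^2.
    dim, rng = 4, np.random.default_rng(0)
    gps = GlobalParameterServer(dim)
    regions = [LocalParameterServer(gps, dim) for _ in range(2)]
    for step in range(1000):
        lps = regions[step % 2]                   # interleave regions to mimic asynchronous arrivals
        grad = 2.0 * (lps.params - 1.0) + 0.01 * rng.normal(size=dim)
        lps.push_gradient(grad)
    print("distance to optimum:", np.linalg.norm(gps.params - 1.0))

The point the sketch tries to convey is that the costly cross-region round-trip happens only once every sync_every local steps, while every worker gradient is absorbed immediately by the fast intra-region server.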
Award ID(s): 2505865
PAR ID: 10631424
Publisher / Repository: https://doi.org/10.48550/arXiv.2506.04531
arXiv ID: 2506.04531
Sponsoring Org: National Science Foundation
More Like This
  1. Federated learning (FL) involves training a model across a massive number of distributed devices while keeping the training data localized and private. This form of collaborative learning exposes new tradeoffs among model convergence speed, model accuracy, balance across clients, and communication cost, with new challenges including: (1) the straggler problem, where clients lag due to data or resource (computing and network) heterogeneity, and (2) the communication bottleneck, where a large number of clients communicate their local updates to a central server and overwhelm it. Many existing FL methods optimize along only a single dimension of this tradeoff space. Existing solutions use asynchronous model updating or tiering-based synchronous mechanisms to tackle the straggler problem. However, asynchronous methods can easily create a communication bottleneck, while tiering may introduce biases that favor faster tiers with shorter response latencies. To address these issues, we present FedAT, a novel Federated learning system with Asynchronous Tiers under Non-i.i.d. training data. FedAT synergistically combines synchronous intra-tier training with asynchronous cross-tier training. By bridging synchronous and asynchronous training through tiering, FedAT minimizes the straggler effect while improving convergence speed and test accuracy. FedAT uses a straggler-aware, weighted aggregation heuristic to steer and balance training across clients for further accuracy improvement. FedAT compresses uplink and downlink communication with an efficient, polyline-encoding-based compression algorithm, minimizing communication cost. Results show that FedAT improves prediction performance by up to 21.09% and reduces communication cost by up to 8.5× compared to state-of-the-art FL methods.
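FedAT's cross-tier aggregation weights tiers so that slower tiers, which report less often, are not drowned out by fast ones. The short sketch below shows one plausible straggler-aware weighted average, with weights inversely proportional to each tier's update count; the actual heuristic in FedAT may differ, so treat the rule and names as illustrative.

    import numpy as np

    def straggler_aware_average(tier_models, tier_update_counts):
        # Give tiers that have reported fewer times (the stragglers) proportionally
        # larger weights, counteracting the bias toward fast tiers.
        # Illustrative rule, not necessarily FedAT's exact heuristic.
        counts = np.asarray(tier_update_counts, dtype=float)
        weights = 1.0 / np.maximum(counts, 1.0)
        weights /= weights.sum()
        return np.average(np.stack(tier_models), axis=0, weights=weights)

    # Toy usage: three tiers; the fastest tier has reported 10 times, the slowest twice.
    models = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
    print(straggler_aware_average(models, [10, 5, 2]))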
  2. Data-parallel frameworks have become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters through pre-defined synchronization barriers. However, when a worker computes significantly more slowly than the others, waiting for that slow worker wastes computing resources. In this paper, we propose a novel proactive data-parallel (PDP) framework. PDP enables the parameter server to initiate model-parameter updates: an update can be performed at any time, without pre-defined update points. PDP not only initiates updates but also determines when to perform them, and this global decision on update frequency accelerates training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates, and we theoretically prove its convergence. We implement a distributed PDP framework and evaluate PDP with several popular machine learning algorithms, including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP can achieve up to a 20x speedup over the BSP model and scale to large clusters.
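The central idea above is that the parameter server, rather than a pre-defined barrier, decides when to apply an update. A minimal sketch of such a server-initiated (proactive) update loop follows; the trigger rule used here, enough worker contributions or an elapsed time budget, is an assumption for illustration rather than PDP's actual decision policy.

    import time
    import numpy as np

    class ProactiveParameterServer:
        """Applies updates whenever the server itself decides to, with no fixed barrier.
        The trigger (enough contributions or a timeout) is illustrative, not PDP's policy."""
        def __init__(self, dim, lr=0.1, min_contributions=4, max_wait_s=0.5):
            self.params = np.zeros(dim)
            self.buffer = []                                  # partial gradients received so far
            self.lr = lr
            self.min_contributions = min_contributions
            self.max_wait_s = max_wait_s
            self.last_update = time.monotonic()

        def receive(self, grad):
            self.buffer.append(np.asarray(grad, dtype=float))
            if self._should_update():
                self._apply_update()

        def _should_update(self):
            enough = len(self.buffer) >= self.min_contributions
            timed_out = time.monotonic() - self.last_update > self.max_wait_s
            return enough or (timed_out and bool(self.buffer))

        def _apply_update(self):
            # Fold whatever has arrived into the model and reset; slow workers are not waited for.
            self.params -= self.lr * np.mean(self.buffer, axis=0)
            self.buffer.clear()
            self.last_update = time.monotonic()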
  3. In this paper, we address the challenges of asynchronous gradient descent in distributed learning environments, focusing in particular on stale gradients and the need for extensive communication resources. We develop a novel communication-efficient framework that incorporates a gradient evaluation algorithm to assess and utilize delayed gradients based on their quality, ensuring efficient and effective model updates while significantly reducing communication overhead. Our proposed algorithm requires agents to send only the norm of their gradients rather than the computed gradient itself. The server then accepts a gradient only if the ratio between the gradient's norm and the distance between the global model parameters and the local model parameters exceeds a certain threshold. With a proper choice of threshold, we show that the convergence rate achieves the same order as synchronous stochastic gradient descent without depending on the staleness value, unlike most existing works. Given the computational complexity of the initial algorithm, we introduce a simplified variant that prioritizes practical applicability without compromising the convergence rate. Our simulations demonstrate that the proposed algorithms outperform existing state-of-the-art methods, offering improved convergence rates, stability, and accuracy, as well as lower resource consumption.
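The acceptance rule described above translates directly into code: the worker ships only the norm of its gradient, and the server accepts the delayed gradient when that norm, divided by the distance between the current global parameters and the worker's stale local parameters, exceeds a threshold. The sketch below follows that description; the threshold value and variable names are illustrative choices, not values from the paper.

    import numpy as np

    def accept_stale_gradient(grad_norm, global_params, local_params, tau=1.0, eps=1e-12):
        # Keep a delayed gradient only if its norm is large relative to how far the
        # global model has drifted from the point where the gradient was computed.
        # The threshold tau is an illustrative choice.
        drift = np.linalg.norm(np.asarray(global_params) - np.asarray(local_params))
        return grad_norm / (drift + eps) >= tau

    # Toy usage: the worker reports only ||g||, never the full gradient vector.
    global_w = np.array([1.0, 2.0, 3.0])
    worker_w = np.array([1.1, 2.0, 2.9])   # snapshot at which the worker computed its gradient
    print(accept_stale_gradient(grad_norm=0.5, global_params=global_w, local_params=worker_w))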
  4. Distributed Deep Neural Network (DDNN) training on cloud spot instances is increasingly compelling because it can significantly reduce the user's cost. To handle unexpected instance revocations, provisioning a heterogeneous cluster under an asynchronous parallel mechanism has become the dominant approach to DDNN training with spot instances. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance, mainly because bottlenecks arise on the parameter server's network bandwidth and PCIe bandwidth, and because cluster heterogeneity may be inadequate. To address these challenges, we propose spotDNN, a heterogeneity-aware spot-instance provisioning framework that provides predictable performance for DDNN training in the cloud. By explicitly considering contention for bottleneck resources, we first build an analytical performance model of DDNN training in heterogeneous clusters; it uses the weighted average batch size and a convergence coefficient to quantify DDNN training loss in heterogeneous clusters. Through lightweight workload profiling, we further design a cost-efficient instance provisioning strategy that incorporates bounds calculation and sliding-window techniques to effectively guarantee training-performance service-level objectives (SLOs). We have implemented a prototype of spotDNN and conducted extensive experiments on Amazon EC2. Results show that spotDNN delivers predictable DDNN training performance while reducing monetary cost by up to 68.1% compared to existing solutions, with acceptable runtime overhead.
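The performance model above is built around a weighted average batch size for a heterogeneous cluster. One plausible reading, sketched below, weights each spot instance's batch size by the rate at which that instance contributes updates; this weighting and the function name are assumptions for illustration, not spotDNN's published formula.

    def weighted_average_batch_size(batch_sizes, steps_per_second):
        # Weight each instance's batch size by how often it pushes updates, so the
        # metric reflects the batch size the model effectively sees per update.
        # Illustrative reading of the metric, not spotDNN's exact definition.
        total_rate = sum(steps_per_second)
        return sum(b * r for b, r in zip(batch_sizes, steps_per_second)) / total_rate

    # Toy usage: three heterogeneous spot instances.
    print(weighted_average_batch_size(batch_sizes=[32, 64, 128],
                                      steps_per_second=[4.0, 2.0, 1.0]))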