skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Synchronous vs. Asynchronous GPU Graph Frameworks
Recent node-level GPU accelerated graph processing frameworks have separately chosen synchronous and asynchronous architectures. Which is better under which circumstances, and why? We focus on Gunrock (a synchronous framework) vs. Groute (an asynchronous framework) with 3 primitives on 3 different datasets. We identify load balance, kernel count, and communication latency and bandwidth as quantities of particular interest.  more » « less
Award ID(s):
1629657
PAR ID:
10027619
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
The 7th Workshop on Multi-core and Rack-scale Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Rothblum, Guy; Wee, Hoeteck (Ed.)
    It is well known that without randomization, Byzantine agreement (BA) requires a linear number of rounds in the synchronous setting, while it is flat out impossible in the asynchronous setting. The primitive which allows to bypass the above limitation is known as oblivious common coin (OCC). It allows parties to agree with constant probability on a random coin, where agreement is oblivious, i.e., players are not aware whether or not agreement has been achieved. The starting point of our work is the observation that no known protocol exists for information-theoretic multi-valued OCC with optimal resiliency in the asynchronous setting (with eventual message delivery). This apparent hole in the literature is particularly problematic, as multi-valued OCC is implicitly or explicitly used in several constructions. In this paper, we present the first information-theoretic multi-valued OCC protocol in the asynchronous setting with optimal resiliency, i.e., tolerating up to n/3 corruptions, thereby filling this important gap. Further, our protocol efficiently implements OCC with an exponential-size domain, a property which is not even achieved by known constructions in the simpler, synchronous setting. We then turn to the problem of round-preserving parallel composition of asynchronous BA. A protocol for this task was proposed by Ben-Or and El-Yaniv [Distributed Computing ’03]. Their construction, however, is flawed in several ways. Thus, as a second contribution, we provide a simpler, more modular protocol for the above task. Finally, and as a contribution of independent interest, we provide proofs in Canetti’s Universal Composability framework; this makes our work the first one offering composability guarantees, which are important as BA is a core building block of secure multi-party computation protocols. 
    more » « less
  2. null (Ed.)
    ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current bulk-processing cloud engines do not provide a robust support for asynchrony and history. With introducing three main modules and bookkeeping system-specific and application parameters, ASYNC provides practitioners with a framework to implement asynchronous machine learning methods. To demonstrate ease-of-implementation in ASYNC, the synchronous and asynchronous variants of two well-known optimization methods, stochastic gradient descent and SAGA, are demonstrated in ASYNC. 
    more » « less
  3. Insects have developed diverse flight actuation mechanisms, including synchronous and asynchronous musculature. Indirect actuation, used by insects with both synchronous and asynchronous musculature, transforms thorax exoskeletal deformation into wing rotation. Though thorax deformation is often attributed exclusively to muscle tension, the inertial and aerodynamic forces generated by the flapping wings may also contribute. In this study, a tethered flight experiment was used to simultaneously measure thorax deformation and the inertial/aerodynamic forces acting on the thorax generated by the flapping wing. Compared to insects with synchronous musculature, insects with asynchronous muscle deformed their thorax 60% less relative to their thorax diameter and their wings generated 2.8 times greater forces relative to their body weight. In a second experiment, dorsalventral thorax stiffness was measured across species. Accounting for weight and size, the asynchronous thorax was on average 3.8 times stiffer than the synchronous thorax in the dorsalventral direction. Differences in thorax stiffness and forces acting at the wing hinge led us to hypothesize about differing roles of series and parallel elasticity in the thoraxes of insects with synchronous and asynchronous musculature. Specifically, wing hinge elasticity may contribute more to wing motion in insects with asynchronous musculature than in those with synchronous musculature. 
    more » « less
  4. Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments. 
    more » « less
  5. Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD-matching or exceeding accuracy on standard language modeling and downstream benchmarks-while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments. 
    more » « less