skip to main content


Title: SHMEMGraph: Efficient and Balanced Graph Processing Using One-Sided Communication
State-of-the-art synchronous graph processing frameworks face both inefficiency and imbalance issues that cause their performance to be suboptimal. These issues include the inefficiency of communication and the imbalanced graph computation/communication costs in an iteration. We propose to replace their conventional two-sided communication model with the one-sided counterpart. Accordingly, we design SHMEMGraph, an efficient and balanced graph processing framework that is formulated across a global memory space and takes advantage of the flexibility and efficiency of one-sided communication for graph processing. Through an efficient one-sided communication channel, SHMEMGraph utilizes the high-performance operations with RDMA while minimizing the resource contention within a computer node. In addition, SHMEMGraph synthesizes a number of optimizations to address both computation imbalance and communication imbalance. By using a graph of 1 billion edges, our evaluation shows that compared to the state-of-the-art Gemini framework, SHMEMGraph achieves an average improvement of 35.5% in terms of job completion time for five representative graph algorithms.  more » « less
Award ID(s):
1744336 1564647 1561041
NSF-PAR ID:
10063470
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Page Range / eLocation ID:
513 to 522
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present Atos, a dynamic scheduling framework for multi-node-GPU systems that supports PGAS-style lightweight one-sided memory operations within and between nodes. Atos's lightweight GPU-to-GPU communication enables latency hiding and can smooth the interconnection usage for bisection-limited problems. These benefits are significant for dynamic, irregular applications that often involve fine-grained communication at unpredictable times and without predetermined patterns. Some principles for high performance: (1) do not involve the CPU in the communication control path; (2) allow GPU communication within kernels, addressing memory consistency directly rather than relying on synchronization with the CPU; (3) perform dynamic communication aggregation when interconnections have limited bandwidth. By lowering the overhead of communication and allowing it within GPU kernels, we support large, high-utilization GPU kernels but with more frequent communication. We evaluate Atos on two irregular problems: Breadth-First-Search and PageRank. Atos outperforms the state-of-the-art graph libraries Gunrock, Groute and Galois on both single-node-multi-GPU and multi-node-GPU settings. 
    more » « less
  2. In this paper, we present GraphTM, an efficient and scalable framework for processing transactions in a distributed environment. The distributed environment is modeled as a graph where each node of the graph is a processing node that issues transactions. The objects that transactions use to execute are also on the graph nodes (the initial placement may be arbitrary). The transactions execute on the nodes which issue them after collecting all the objects that they need following the data-flow model of computation. This collection is done by issuing the requests for the objects as soon as transaction starts and wait until all required objects for the transaction come to the requesting node. The challenge is on how to schedule the transactions so that two crucial performance metrics, namely (i) total execution time to commit all the transactions, and (ii) total communication cost involved in moving the objects to the requesting nodes, are minimized. We implemented GraphTM in Java and assessed its performance through 3 micro-benchmarks and 5 complex benchmarks from STAMP benchmark suite on 5 different network topologies, namely, clique, line, grid, cluster, and star, that make an underlying communication network for a representative set of distributed systems commonly used in practice. The results show the efficiency and scalability of our approach. 
    more » « less
  3. Cas Cremers and Engin Kirda (Ed.)
    Vector Oblivious Linear Evaluation (VOLE) supports fast and scalable interactive Zero-Knowledge (ZK) proofs. Despite recent improvements to VOLE-based ZK, compiling proof statements to a control-flow oblivious form (e.g., a circuit) continues to lead to expensive proofs. One useful setting where this inefficiency stands out is when the statement is a disjunction of clauses $\mathcal{L}_1 \lor \cdots \lor \mathcal{L}_B$. Typically, ZK requires paying the price to handle all $B$ branches. Prior works have shown how to avoid this price in communication, but not in computation. Our main result, $\mathsf{Batchman}$, is asymptotically and concretely efficient VOLE-based ZK for batched disjunctions, i.e. statements containing $R$ repetitions of the same disjunction. This is crucial for, e.g., emulating CPU steps in ZK. Our prover and verifier complexity is only $\bigO(RB+R|\C|+B|\C|)$, where $|\C|$ is the maximum circuit size of the $B$ branches. Prior works' computation scales in $RB|\C|$. For non-batched disjunctions, we also construct a VOLE-based ZK protocol, $\mathsf{Robin}$, which is (only) communication efficient. For small fields and for statistical security parameter $\lambda$, this protocol's communication improves over the previous state of the art ($\mathsf{Mac'n'Cheese}$, Baum et al., CRYPTO'21) by up to factor $\lambda$. Our implementation outperforms prior state of the art. E.g., we achieve up to $6\times$ improvement over $\mathsf{Mac'n'Cheese}$ (Boolean, single disjunction), and for arithmetic batched disjunctions our experiments show we improve over $\mathsf{QuickSilver}$ (Yang et al., CCS'21) by up to $70\times$ and over $\mathsf{AntMan}$ (Weng et al., CCS'22) by up to $36\times$. 
    more » « less
  4. null (Ed.)
    Graph processing frameworks are typically designed to optimize the evaluation of a single graph query. However, in practice, we often need to respond to multiple graph queries, either from different users or from a single user performing a complex analytics task. Therefore in this paper we develop SimGQ, a system that optimizes simultaneous evaluation of a group of vertex queries that originate at different source vertices (e.g., multiple shortest path queries originating at different source vertices) and delivers substantial speedups over a conventional framework that evaluates and responds to queries one by one. The performance benefits are achieved via batching and sharing. Batching fully utilizes system resources to evaluate a batch of queries and amortizes runtime overheads incurred due to fetching vertices and edge lists, synchronizing threads, and maintaining computation frontiers. Sharing dynamically identifies shared queries that substantially represent subcomputations in the evaluation of different queries in a batch, evaluates the shared queries, and then uses their results to accelerate the evaluation of all queries in the batch. With four input power-law graphs and four graph algorithms SimGQ achieves speedups of up to 45.67 × with batch sizes of up to 512 queries over the baseline implementation that evaluates the queries one by one using the state of the art Ligra system. Moreover, both batching and sharing contribute substantially to the speedups. 
    more » « less
  5. A central theme in federated learning (FL) is the fact that client data distributions are often not independent and identically distributed (IID), which has strong implications on the training process. While most existing FL algorithms focus on the conventional non-IID setting of class imbalance or missing classes across clients, in practice, the distribution differences could be more complex, e.g., changes in class conditional (domain) distributions. In this paper, we consider this complex case in FL wherein each client has access to only one domain distribution. For tasks such as domain generalization, most existing learning algorithms require access to data from multiple clients (i.e., from multiple domains) during training, which is prohibitive in FL. To address this challenge, we propose a federated domain translation method that generates pseudodata for each client which could be useful for multiple downstream learning tasks. We empirically demonstrate that our translation model is more resource-efficient (in terms of both communication and computation) and easier to train in an FL setting than standard domain translation methods. Furthermore, we demonstrate that the learned translation model enables use of state-of-the-art domain generalization methods in a federated setting, which enhances accuracy and robustness to increases in the synchronization period compared to existing methodology. 
    more » « less