We present BurstZ, a bandwidth-efficient accelerator platform for scientific computing. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once the working set becomes larger than the on-board memory capacity, causing performance to become bottlenecked by the communication bandwidth between the host and the accelerator. Compression has not been very useful in solving this issue because scientific data largely consists of floating-point numbers, which are difficult to compress efficiently: most compression algorithms are either ineffective on floating-point data or impose a high performance overhead. BurstZ is an FPGA-based accelerator platform that addresses the bandwidth issue via a novel hardware-optimized floating-point compression algorithm, which we call sZFP. We demonstrate that BurstZ can completely remove the communication bottleneck for accelerators, using a 3D stencil-code accelerator implemented on a prototype BurstZ system. Evaluated against hand-optimized stencil-code accelerators of the same architecture, our BurstZ prototype outperformed an accelerator without compression by almost 4X, and even an accelerator with enough on-board memory for the entire dataset by over 2X. BurstZ improved communication efficiency so much that our prototype even outperformed, by over 2X, the projected upper-bound performance of an optimized stencil core with ideal memory access characteristics.
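To make the bandwidth argument concrete, here is a small back-of-the-envelope sketch (not from the paper) of when on-the-fly compression removes the host-accelerator link as the bottleneck. The link bandwidth, kernel traffic demand, and 4:1 compression ratio are illustrative assumptions, not measured BurstZ numbers.

```python
# Toy model: does compressing data across the host-accelerator link make a
# kernel compute-bound instead of communication-bound? All figures below are
# illustrative assumptions, not values from the BurstZ paper.

def effective_link_bandwidth(link_gbps: float, compression_ratio: float) -> float:
    """Bandwidth seen by the kernel when operands cross the link compressed."""
    return link_gbps * compression_ratio

def bottleneck(kernel_demand_gbps: float, link_gbps: float, ratio: float) -> str:
    """Report whether the accelerator is compute- or communication-bound."""
    eff = effective_link_bandwidth(link_gbps, ratio)
    return "compute-bound" if eff >= kernel_demand_gbps else "communication-bound"

# Hypothetical figures: a stencil kernel needing 40 GB/s of operand traffic
# over a 16 GB/s PCIe link, with a 4:1 lossy floating-point compression ratio.
print(bottleneck(kernel_demand_gbps=40.0, link_gbps=16.0, ratio=4.0))
# -> "compute-bound": 16 GB/s * 4 = 64 GB/s effective, above the 40 GB/s demand
```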
Node-Aware Stencil Communication for Heterogeneous Supercomputers
High-performance distributed computing systems increasingly feature nodes with multiple CPU sockets and multiple GPUs. The communication bandwidth between these components is non-uniform, and different component pairs can expose different communication capabilities. For communication-heavy applications, using these capabilities optimally is challenging and essential for performance. Bespoke codes with optimized communication may not be portable across run-time/software/hardware configurations, and existing stencil frameworks neglect optimized communication. This work presents node-aware approaches for automatic data placement and communication implementation for 3D stencil codes on multi-GPU nodes with non-homogeneous communication performance and capabilities. Benchmarking results on the Summit system show that placement choices can yield a 20% improvement in single-node exchange, and that communication specialization can yield a further 6x improvement in single-node exchange time and a 16% improvement at 1536 GPUs.
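For readers unfamiliar with the communication pattern being optimized, the following is a minimal mpi4py sketch of a 3D stencil halo exchange; it is not the paper's node-aware code. The paper's contributions (data placement and per-link communication specialization) sit on top of a pattern like this; the grid size, process-grid factorization, and periodic boundaries below are assumptions made to keep the sketch short.

```python
# Minimal 3D halo-exchange sketch: ranks form a Cartesian grid and trade
# one-cell-deep faces with their neighbors. Illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 3)        # factor ranks into a 3D grid
cart = comm.Create_cart(dims, periods=[True] * 3)  # periodic keeps the sketch short

n = 32                                             # interior points per dimension (assumed)
grid = np.zeros((n + 2, n + 2, n + 2))             # +2 for a one-deep halo on each face

def exchange_x() -> None:
    """Exchange the two faces perpendicular to the first grid axis."""
    lo_nbr, hi_nbr = cart.Shift(0, 1)              # (source, dest) for a +1 shift
    recv = np.empty((1, n + 2, n + 2))
    # Send my high owned face up, receive my low halo from below.
    cart.Sendrecv(np.ascontiguousarray(grid[n:n + 1]), dest=hi_nbr,
                  recvbuf=recv, source=lo_nbr)
    grid[0:1] = recv
    # Send my low owned face down, receive my high halo from above.
    cart.Sendrecv(np.ascontiguousarray(grid[1:2]), dest=lo_nbr,
                  recvbuf=recv, source=hi_nbr)
    grid[n + 1:n + 2] = recv

exchange_x()   # the y and z faces are exchanged the same way (with packing)
```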
- Award ID(s): 1725729
- PAR ID: 10190061
- Date Published:
- Journal Name: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops
- Page Range / eLocation ID: 796 to 805
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this paper, we present a method for predicting training throughput from profiling traces collected on a single-node configuration. Our approach models the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node. By accounting for the dependencies between received parts and pending computations, we predict overlaps between computation and communication and generate synthetic execution traces for configurations with multiple nodes. We validate our approach on TensorFlow training jobs for popular image classification neural networks, on AWS and on our in-house cluster, using nodes equipped with GPUs or only with CPUs. We also investigate the effects of the data transmission policies used in TensorFlow and the accuracy of our approach when combined with optimizations of the transmission schedule.
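The saturation effect described above can be illustrated with a toy analytic model (not the paper's method, which replays synthetic execution traces): N workers overlap computation with transfers to a parameter server whose network link they share. The step time, model size, and link bandwidth below are placeholder assumptions.

```python
# Toy model of why aggregate SGD throughput saturates: the parameter server's
# link is shared by all workers, so its service rate caps the cluster.
# All constants are illustrative assumptions, not profiled values.

def steps_per_second(n_workers: int, compute_s: float,
                     bytes_per_step: float, server_bw_bytes_s: float) -> float:
    """Aggregate update throughput assuming perfect compute/communication overlap."""
    comm_alone_s = bytes_per_step / server_bw_bytes_s      # one worker, link to itself
    per_worker_rate = 1.0 / max(compute_s, n_workers * comm_alone_s)
    return n_workers * per_worker_rate

for n in (1, 2, 4, 8, 16, 32):
    r = steps_per_second(n, compute_s=0.25,                # 250 ms of compute per step
                         bytes_per_step=100e6,             # 100 MB of gradients/params
                         server_bw_bytes_s=10e9 / 8)       # 10 Gbit/s server link
    print(f"{n:2d} workers -> {r:6.1f} updates/s")
# Throughput grows linearly until n * (transfer time) exceeds the compute time,
# then flattens at server_bw / bytes_per_step updates per second.
```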
Interactive proof systems allow a resource-bounded verifier to decide an intractable language (or compute a hard function) by communicating with a powerful but untrusted prover. Such systems guarantee that the prover can only convince the verifier of true statements. In the context of centralized computation, a celebrated result shows that interactive proofs are extremely powerful, allowing polynomial-time verifiers to decide any language in PSPACE. In this work we initiate the study of interactive distributed proofs: a network of nodes interacts with a single untrusted prover, who sees the entire network graph, to decide whether the graph satisfies some property. We focus on the communication cost of the protocol: the number of bits the nodes must exchange with the prover and each other. Our model can also be viewed as a generalization of the various models of “distributed NP” (proof labeling schemes, etc.) which received significant attention recently: while these models only allow the prover to present each network node with a string of advice, our model allows for back-and-forth interaction. We prove both upper and lower bounds for the new model. We show that for some problems, interaction can exponentially decrease the communication cost compared to a non-interactive prover, but on the other hand, some problems retain non-trivial cost even with interaction.
In this paper, we present GraphTM, an efficient and scalable framework for processing transactions in a distributed environment. The distributed environment is modeled as a graph where each node is a processing node that issues transactions. The objects that transactions access also reside on the graph nodes (the initial placement may be arbitrary). Following the data-flow model of computation, a transaction executes on the node that issued it after collecting all the objects it needs: the node issues requests for the objects as soon as the transaction starts and waits until all required objects arrive. The challenge is how to schedule the transactions so that two crucial performance metrics, namely (i) the total execution time to commit all the transactions, and (ii) the total communication cost involved in moving the objects to the requesting nodes, are minimized. We implemented GraphTM in Java and assessed its performance through 3 micro-benchmarks and 5 complex benchmarks from the STAMP benchmark suite on 5 different network topologies, namely clique, line, grid, cluster, and star, which form the underlying communication network for a representative set of distributed systems commonly used in practice. The results show the efficiency and scalability of our approach.
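The data-flow execution rule described above can be sketched in a few lines; the Python below is an illustration of the rule only, not GraphTM's Java implementation, and the node names, object placement, and per-move cost table are hypothetical.

```python
# Sketch of the data-flow rule: pull every object a transaction needs to the
# issuing node, then execute locally. Cost accounting is a stand-in for the
# paper's communication-cost metric.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    objects: set = field(default_factory=set)     # objects currently homed here

def run_transaction(issuer: Node, needed: set, nodes: list, cost: dict) -> int:
    """Move all needed objects to `issuer`, then 'run' the transaction.
    Returns the communication cost of the object moves."""
    total = 0
    for obj in needed:
        owner = next(n for n in nodes if obj in n.objects)
        if owner is not issuer:
            owner.objects.remove(obj)
            issuer.objects.add(obj)               # objects migrate to the requester
            total += cost[(owner.name, issuer.name)]
    # All operands are now local; the transaction body would execute here.
    return total

a, b = Node("a", {"x"}), Node("b", {"y", "z"})
print(run_transaction(a, {"x", "y", "z"}, [a, b], {("b", "a"): 1}))  # -> 2
```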
To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a "multi-rail"/"multi-HCA" network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non-trivial. The Message Passing Interface (MPI) is a dominant model for high-performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we provide a thorough performance analysis of existing multi-rail solutions and their implications on collectives, and present the necessity for further enhancement. Specifically, we propose novel designs for hierarchical, multi-HCA-aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, we see large inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, for 1024 processes. Because Allgather is used in Ring-Allreduce, our designs also improve its performance by 56% and 44% compared to HPC-X and MVAPICH2-X, respectively. At the application level, our enhanced Allgather shows an improvement in a matrix-vector multiplication kernel when compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2-X.
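The hierarchical structure such designs build on can be illustrated with a plain two-level Allgather in mpi4py: gather within each node, allgather across node leaders, then broadcast within the node. This sketch is not the proposed multi-HCA design (it uses a single leader, does not stripe across adapters, does not overlap the levels, and assumes an equal number of ranks per node); it only shows the intra-node/inter-node split that such designs exploit.

```python
# Two-level (node-aware) Allgather sketch, illustrative only.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)        # ranks sharing a node
leaders = world.Split(color=0 if node.Get_rank() == 0 else MPI.UNDEFINED,
                      key=world.Get_rank())          # one leader per node

mine = np.full(4, world.Get_rank(), dtype=np.float64)   # this rank's contribution

# Level 1: gather all contributions onto the node leader (intra-node).
node_buf = np.empty(4 * node.Get_size()) if node.Get_rank() == 0 else None
node.Gather(mine, node_buf, root=0)

# Level 2: leaders allgather the per-node blocks (inter-node links; assumes an
# equal number of ranks per node so block sizes match).
if leaders != MPI.COMM_NULL:
    full = np.empty(node_buf.size * leaders.Get_size())
    leaders.Allgather(node_buf, full)
else:
    full = None

# Level 3: leader broadcasts the assembled result inside the node (intra-node).
full = node.bcast(full, root=0)
print(world.Get_rank(), full[:8])
```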