skip to main content


Title: Accelerating communication with multi‐HCA aware collectives in MPI
Summary To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a “multi‐rail”/“multi‐HCA” network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non‐trivial. The Message Passing Interface (MPI) is a dominant model for high‐performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we provide a thorough performance analysis of existing multirail solutions and their implications on collectives and present the necessity for further enhancement. Specifically, we propose novel designs for hierarchical, multi‐HCA‐aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter‐node and intra‐node communication. At the micro‐benchmark level, we see large inter‐node improvements up to 62% and 61% better than HPC‐X and MVAPICH2‐X for 1024 processes. Because Allgather is used in Ring‐Allreduce, our designs also improve its performance by 56% and 44% compared to HPC‐X and MVAPICH2‐X, respectively. At the application level, our enhanced Allgather shows and improvement in a matrix‐vector multiplication kernel when compared to HPC‐X and MVAPICH2‐X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2‐X.  more » « less
Award ID(s):
2018627
NSF-PAR ID:
10464671
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Concurrency and Computation: Practice and Experience
ISSN:
1532-0626
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The increase in scale and heterogeneity of high-performance computing (HPC) systems predispose the performance of Message Passing Interface (MPI) collective communications to be susceptible to noise, and to adapt to a complex mix of hardware capabilities. The designs of state of the art MPI collectives heavily rely on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such design philosophy must be reconsidered to efficiently and robustly run on the large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI, using event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations, while mamtaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state of the art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations. 
    more » « less
  2. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
    more » « less
  3. As High Performance Computing (HPC) applications with data security requirements are increasingly moving to execute in the public cloud, there is a demand that the cloud infrastructure for HPC should support privacy and integrity. Incorporating privacy and integrity mechanisms in the communication infrastructure of today's public cloud is challenging because recent advances in the networking infrastructure in data centers have shifted the communication bottleneck from the network links to the network end points and because encryption is computationally intensive. In this work, we consider incorporating encryption to support privacy and integrity in the Message Passing Interface (MPI) library, which is widely used in HPC applications. We empirically study four contemporary cryptographic libraries, OpenSSL, BoringSSL, Libsodium, and CryptoPP using micro-benchmarks and NAS parallel benchmarks to evaluate their overheads for encrypting MPI messages on two different networking technologies, 10Gbps Ethernet and 40Gbps InfiniBand. The results indicate that (1) the performance differs drastically across cryptographic libraries, and (2) effectively supporting privacy and integrity in MPI communications on high speed data center networks is challenging-even with the most efficient cryptographic library, encryption can still introduce very significant overheads in some scenarios such as a single MPI communication operation on InfiniBand, but (3) the overall overhead may not be prohibitive for practical uses since there can be multiple concurrent communications. 
    more » « less
  4. In recent times, geospatial datasets are growing in terms of size, complexity and heterogeneity. High performance systems are needed to analyze such data to produce actionable insights in an efficient manner. For polygonal a.k.a vector datasets, operations such as I/O, data partitioning, communication, and load balancing becomes challenging in a cluster environment. In this work, we present MPI-Vector-IO, a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well Known Text. It makes MPI aware of spatial data, spatial primitives and provides support for spatial data types embedded within collective computation and communication using MPI message-passing library. These abstractions along with parallel I/O support are useful for parallel Geographic Information System (GIS) application development on HPC platforms. Performance evaluation is done on Lustre and GPFS filesystems. MPI-Vector-IO scales well with MPI processes and file size and achieves bandwidth up to 22 GB/s for common spatial data access patterns. We observed that independent file read functions performed better than collective functions in MPI-IO for contiguous access pattern on Lustre. In general, the I/O is improved by one to two orders of magnitude over real-world datasets using up to 1152 CPU cores. Spatial Join query is used as an exemplar to demonstrate an end-to-end application using MPI-Vector-IO. 
    more » « less
  5. In this paper, we consider a remote inference system, where a neural network is used to infer a time-varying target (e.g., robot movement), based on features (e.g., video clips) that are progressively received from a sensing node (e.g., a camera). Each feature is a temporal sequence of sensory data. The inference error is determined by (i) the timeliness and (ii) the sequence length of the feature, where we use Age of Information (AoI) as a metric for timeliness. While a longer feature can typically provide better inference performance, it often requires more channel resources for sending the feature. To minimize the time-averaged inference error, we study a learning and communication co-design problem that jointly optimizes feature length selection and transmission scheduling. When there is a single sensor-predictor pair and a single channel, we develop low-complexity optimal co-designs for both the cases of time-invariant and time-variant feature length. When there are multiple sensor-predictor pairs and multiple channels, the co-design problem becomes a restless multi-arm multi-action bandit problem that is PSPACE-hard. For this setting, we design a low-complexity algorithm to solve the problem. Trace-driven evaluations demonstrate the potential of these co-designs to reduce inference error by up to 10000 times. 
    more » « less