skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: View-aware Message Passing Through the Integration of Kokkos and ExaMPI
Kokkos provides in-memory advanced data structures, concurrency, and algorithms to support performance portable C++ parallel programming across CPUs and GPUs. The Message Passing Interface (MPI) provides the most widely used message passing model for inter-node communication. Many programmers use both Kokkos and MPI together. In this paper, Kokkos is integrated within an MPI implementation for ease of use in applications that use both Kokkos and MPI, without sacrificing performance. For instance, this model allows passing first-class Kokkos objects directly to extended C++-based MPI APIs. We prototype this integrated model using ExaMPI, a C++17- based subset implementation of MPI-4.We then demonstrate use of our C++-friendly APIs and Kokkos extensions through benchmarks and a mini-application.We explain why direct use of Kokkos within certain parts of the MPI implementation is crucial to performance and enhanced expressivity. Although the evaluation in this paper focuses on CPU-based examples, we also motivate why making Kokkos memory spaces visible to the MPI implementation generalizes the idea of “CPU memory” and “GPU memory” in ways that enable further optimizations in heterogeneous Exascale architectures. Finally, we describe future goals and show how these mesh both with a possible future C++ API for MPI-5 as well as the potential to accelerate MPI on such architectures.  more » « less
Award ID(s):
2405142 2412182
PAR ID:
10513373
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the 30th European MPI Users' Group Meetin
ISBN:
9798400709135
Page Range / eLocation ID:
1 to 10
Subject(s) / Keyword(s):
Exascale Kokkos High-Performance Computing MPI
Format(s):
Medium: X
Location:
Bristol United Kingdom
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The Message Passing Interface (MPI) has been the dominant message passing solution for scientific computing for decades. MPI point-to-point communications are highly efficient mechanisms for process-to- process communication. However, MPI performance is slowed by concurrency protections in the MPI library when processes utilize multiple threads. MPI’s current thread-level interface imposes these overheads throughout the library when thread safety is needed. While much work has been done to reduce multithreading overheads in MPI, a solution is needed that reduces the number of messages exchanged in a threaded environment. Partitioned communication is included in the MPI 4.0 standard as an alternative that addresses the challenges of multithreaded communication in MPI today. Partitioned communication reduces overall message volume by creating a buffer-sharing mechanism between threads such that they can indicate when portions of a communication buffer are available to be sent. Separation of the control and data planes in MPI is enabled by allowing persistent initialization and single occurrence message buffer matching from the indication that the data is ready to be sent. This enables the usage commands (destination, size, etc.) can be set up prior to data buffer readiness with readiness triggered by a simple doorbell/counter later. This approach is useful for future development of MPI operations in environments where traditional networking commands can have performance challenges, like accelerators (GPUs, FPGAs). In this paper,we detail the design and implementation of a layered library (built on top of MPI-3.1) and an integrated Open MPI solution that supports the new, MPI-4.0 partitioned communication feature set. The library will enable applications to use currently released MPI implementations and older legacy libraries to provide partitioned communication support while also enabling further exploration of this new communication model in new applications and use cases. We will compare the designs of the library and native Open MPI support, provide performance results and comparisons between the two approaches, and lessons learned from the implementation of partitioned communication in both library and native forms. We find that the native implementation and library have similar performance with a percentage difference under 0.94% in microbenchmarks and performance within 5% for a partitioned communication enabled proxy application. 
    more » « less
  2. The Message Passing Interface (MPI) is a software platform that can utilize the parallel capabilities of most multi-processors, making it useful for teaching students about parallel and distributed computing (PDC). MPI provides language bindings for Fortran and C/C++, but many university instructors lack expertise in these languages, preventing them from using MPI in their courses. OpenMPI is a free implementation of MPI that also provides Java bindings, allowing instructors who know Java but not C/C++ or Fortran to teach PDC. However, Java has a reputation as a “slow” language, so some say it is unsuitable for teaching PDC. This paper gives a head-to-head comparison of the performance of OpenMPI's Java and C bindings. Our study shows that by default, Java can be faster than C unless one takes special measures, and it exhibits similar speedup, efficiency, and scalability. We conclude that Java is a suitable language for teaching PDC. 
    more » « less
  3. Ghafoor, Sheikh; Prasad, Sushil K. (Ed.)
    The ACM/IEEE CS 2013 curriculum recommendations state that every undergraduate CS major should learn about parallel and distributed computing (PDC). One way to accomplish this is to teach students about the Message Passing Interface (MPI), a platform that is commonly used on modern supercomputers and Beowulf clusters, but can also be used on a Network of Workstations (NoW), or a multicore laptop or desktop. MPI incorporates many PDC concepts and can serve as a platform for hands-on learning activities in which students must apply those concepts. The MPI standard defines language bindings for Fortran and C/C++, but many university instructors lack expertise in these languages, preventing them from using MPI in their courses. OpenMPI is a free implementation of the MPI standard that also provides Java bindings for MPI. This paper describes how to install OpenMPI with these Java bindings; to illustrate the use of these bindings, the paper also presents several patternlets—minimalist example programs—that show how to implement PDC design patterns using OpenMPI and Java. This provides a new means of introducing students to PDC concepts. 
    more » « less
  4. Gurfinkel, Arie; Ganesh, Vijay (Ed.)
    Procedure contracts are a well-known approach for specifying programs in a modular way. We investigate a new contract theory for collective procedures in parallel message-passing programs. As in the sequential setting, one can verify that a procedure f conforms to its contract using only the contracts, and not the implementations, of the collective procedures called by f. We apply this approach to C programs that use the Message Passing Interface (MPI), introducing a new contract language that extends the ANSI/ISO C Specification Language. We present contracts for the standard MPI collective functions, as well as many user-defined collective functions. A prototype verification system has been implemented using the CIVL model checker for checking contract satisfaction within small bounds on the number of processes. 
    more » « less
  5. The increase in scale and heterogeneity of high-performance computing (HPC) systems predispose the performance of Message Passing Interface (MPI) collective communications to be susceptible to noise, and to adapt to a complex mix of hardware capabilities. The designs of state of the art MPI collectives heavily rely on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. Therefore, such design philosophy must be reconsidered to efficiently and robustly run on the large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI, using event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations, while mamtaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries of different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance against noise compared to other state of the art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations. 
    more » « less