The Message Passing Interface (MPI) has been the dominant message-passing
solution for scientific computing for decades. MPI point-to-point
communications are highly efficient mechanisms for process-to-
process communication. However, MPI performance is slowed
by concurrency protections in the MPI library when processes utilize
multiple threads. MPI’s current thread-level interface imposes
these overheads throughout the library when thread safety is needed.
While much work has been done to reduce multithreading overheads
in MPI, a solution is needed that reduces the number of messages
exchanged in a threaded environment.
Partitioned communication is included in the MPI 4.0 standard
as an alternative that addresses the challenges of multithreaded
communication in MPI today. Partitioned communication reduces
overall message volume by creating a buffer-sharing mechanism
between threads such that they can indicate when portions of a
communication buffer are available to be sent. Separation of the
control and data planes in MPI is enabled by decoupling persistent
initialization and single-occurrence message buffer matching from
the indication that the data is ready to be sent. Communication
commands (destination, size, etc.) can thus be set up prior to data buffer
readiness, with readiness triggered later by a simple doorbell/counter.
This approach is useful for future development of MPI operations
in environments where traditional networking commands can have
performance challenges, like accelerators (GPUs, FPGAs).
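The doorbell/counter idea can be illustrated with a minimal C sketch. This is not MPI code and not either implementation described in this paper: the names part_request, part_init, part_ready, and part_transfers are hypothetical, and the sketch assumes the simplest policy, in which the whole buffer goes out as a single transfer once every partition has been marked ready.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration only: not the MPI API and not the
 * library or Open MPI implementation discussed in the text. */

enum { MAX_PARTITIONS = 64 };

typedef struct {
    /* Control plane: fixed once, before any data is ready. */
    int dest;
    int tag;
    int partitions;
    size_t bytes_per_partition;
    /* Data plane: per-partition flags plus a doorbell counter. */
    atomic_bool ready[MAX_PARTITIONS];
    atomic_int ready_count;
    atomic_int transfers; /* number of simulated wire messages */
} part_request;

/* Set up destination, tag, and sizes ahead of time (control plane). */
void part_init(part_request *req, int dest, int tag,
               int partitions, size_t bytes_per_partition)
{
    assert(partitions > 0 && partitions <= MAX_PARTITIONS);
    req->dest = dest;
    req->tag = tag;
    req->partitions = partitions;
    req->bytes_per_partition = bytes_per_partition;
    for (int i = 0; i < MAX_PARTITIONS; ++i)
        atomic_init(&req->ready[i], false);
    atomic_init(&req->ready_count, 0);
    atomic_init(&req->transfers, 0);
}

/* Ring the doorbell for one partition (data plane). Safe to call
 * from several threads. Returns true only for the call that marks
 * the last outstanding partition and so triggers the one transfer. */
bool part_ready(part_request *req, int partition)
{
    assert(partition >= 0 && partition < req->partitions);
    /* A repeated mark is ignored here; this is a choice of the sketch. */
    if (atomic_exchange(&req->ready[partition], true))
        return false;
    int seen = atomic_fetch_add(&req->ready_count, 1) + 1;
    if (seen == req->partitions) {
        atomic_fetch_add(&req->transfers, 1); /* one message for all partitions */
        return true;
    }
    return false;
}

/* Number of simulated messages sent so far. */
int part_transfers(part_request *req)
{
    return atomic_load(&req->transfers);
}
```

In this sketch, four threads that each fill one partition would each call part_ready once, and only a single transfer is counted. In MPI-4.0 itself, the corresponding roles are played by MPI_Psend_init and MPI_Precv_init for setup, MPI_Pready for marking a partition, and MPI_Parrived for checking arrival on the receive side. An implementation may also choose to transfer partitions before all of them are ready.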
In this paper, we detail the design and implementation of a layered
library (built on top of MPI-3.1) and an integrated Open MPI solution
that supports the new MPI-4.0 partitioned communication feature
set. The library will enable applications to use currently released MPI
implementations and older legacy libraries to provide partitioned
communication support while also enabling further exploration of
this new communication model in new applications and use cases.
We compare the designs of the library and native Open MPI
support, provide performance results and comparisons between the
two approaches, and report lessons learned from the implementation of
partitioned communication in both library and native forms.
We find that the native implementation and the library have similar
performance, with a percentage difference under 0.94% in microbenchmarks
and performance within 5% for a proxy application that uses
partitioned communication.