Title: Checkpointing OpenSHMEM Programs Using Compiler Analysis
The importance of fault tolerance continues to increase for HPC applications. The continued growth in size and complexity of HPC systems, and of the applications themselves, is leading to an increased likelihood of failures during execution. However, most HPC programming models do not have a built-in fault tolerance mechanism. Instead, application developers usually rely on external support such as application-level checkpoint-restart (C/R) libraries to make their codes fault tolerant. However, this increases the burden on the application developer, who must use the libraries carefully to ensure correct behavior and to minimize the overheads. The C/R routines will be employed to save the values of all needed program variables at the places in the code where they are invoked. It is important for correctness that the program data is in a consistent state at these places. It is non-trivial to determine such points in OpenSHMEM, which relies upon single-sided communications to provide high performance. The amount of data to be collected, and the frequency with which this is performed, must also be carefully tuned, as the overheads introduced by C/R calls can be extremely high. There is very little prior work on checkpoint-restart support in the context of the OpenSHMEM programming interface. In this paper, we introduce OpenSHMEM and describe the challenges it poses for checkpointing. We identify the safest places for inserting C/R calls in an OpenSHMEM program and describe a straightforward approach for identifying the data that needs to be checkpointed at these positions in the code. We provide these two functionalities in a tool that exploits compiler analyses to propose checkpoints, and the sets of data to save at them, to the application developer.
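As an illustration of the application-level C/R pattern the abstract describes, here is a minimal sketch in Python. It is not the paper's tool and does not use an OpenSHMEM API; the names `save_checkpoint`, `load_checkpoint`, and `run`, the checkpoint file format, and the toy computation are all assumptions made for this example. It saves only the variables needed to resume (`step` and `values`) at the end of an iteration, which stands in for a point where program data is consistent.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the selected variables atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over any older checkpoint


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Iterate, checkpointing every `interval` steps; optionally simulate a failure."""
    state = load_checkpoint(path) or {"step": 0, "values": [1.0, 2.0, 3.0]}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        # One unit of work: only `values` and `step` are needed to resume.
        state["values"] = [v * 1.5 + 1.0 for v in state["values"]]
        state["step"] += 1
        # The end of an iteration is the consistent point in this sketch. In an
        # OpenSHMEM program, a barrier such as shmem_barrier_all() would
        # typically precede the save so that outstanding puts have completed.
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["values"]
```

Calling `run` again with the same path after a failure resumes from the last saved step instead of step 0. The `interval` parameter is the knob for the trade-off the abstract mentions between checkpoint frequency and overhead.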
Award ID(s):
1725499
NSF-PAR ID:
10285753
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Page Range / eLocation ID:
51 to 60
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Checkpoint/restart (C/R) provides fault-tolerant computing capability, enables long-running applications, and gives computing centers the scheduling flexibility to support diverse workloads with different priorities. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et al., is a transparent checkpointing tool that was selected for its MPI-agnostic and network-agnostic approach. However, MANA was originally written as a proof-of-concept code and was not ready for use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs that were exposed by the top applications at NERSC, adding new features to address system changes, and evaluating C/R overhead at scale. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers and HPC end-users alike.
  2. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
  3. Lawall, Julia ; Williams, Dan (Ed.)
    Persistent memory (PMEM) allows direct access to fast storage at byte granularity. Previously, processor caches backed by persistent memory were not persistent, complicating the design of persistent applications and reducing their performance. A new generation of systems with flush-on-fail semantics effectively offers persistent caches, with the potential for much simpler, faster PMEM programming models. This work proposes Whole Process Persistence (WPP), a new programming model for systems with persistent caches. In the WPP model, all process state is made persistent. On restart after power failure, this state is reloaded and execution resumes in an application-defined interrupt handler. We also describe the Zhuque runtime, which transparently provides WPP by interposing on the C bindings for system calls in userspace. It requires little or no programmer effort to run applications on Zhuque. Our measurements show that Zhuque outperforms state-of-the-art PMEM libraries, demonstrating mean speedups across all benchmarks of 5.24x over PMDK, 3.01x over Mnemosyne, 5.43x over Atlas, and 4.11x over Clobber-NVM. More importantly, unlike existing systems, Zhuque places no restrictions on how applications implement concurrency, allowing us to run a newer version of Memcached on Zhuque and gain more than 7.5x throughput over the fastest existing persistent implementations.
  4. MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency ("network-agnostic") feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely-used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
  5. Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques that reduce the number of processes rolling back have been implemented via message logging. However, the log-based approaches have weaknesses, such as depending on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n^2 for a process count n.