Title: Energy-efficient localised rollback via data flow analysis and frequency scaling
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques reducing the number of rolling-back processes have been implemented via message logging. However, log-based approaches have weaknesses, such as depending on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes and on the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n² for a process count n.
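The 10-12% reduction in energy consumption on idling nodes is obtained via frequency scaling (DVFS): ranks not involved in the localised rollback have little work to do and can run their cores at a lower frequency until the affected ranks catch up. As a rough illustration only, and not the mechanism used in the paper, the C sketch below lowers and later restores a core's frequency through the Linux cpufreq sysfs interface, assuming the "userspace" governor is available and the process may write these files.

    /*
     * Illustrative sketch (not the paper's mechanism): lower the CPU
     * frequency of the core an idling rank is pinned to during a localised
     * rollback, then restore it once the rolled-back ranks have caught up.
     */
    #include <stdio.h>

    /* Write a value into a cpufreq sysfs file for the given core. */
    static int write_cpufreq(int core, const char *file, const char *value)
    {
        char path[256];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/%s", core, file);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        int core = 0;  /* core this MPI rank is pinned to */

        /* Drop to a low frequency (in kHz) while other ranks redo work. */
        write_cpufreq(core, "scaling_governor", "userspace");
        write_cpufreq(core, "scaling_setspeed", "1200000");

        /* ... wait until the rolled-back neighbours have caught up ... */

        /* Restore a high frequency before resuming computation. */
        write_cpufreq(core, "scaling_setspeed", "2400000");
        return 0;
    }

In practice the same effect would typically be achieved through a power-management API or the resource manager rather than by writing sysfs files directly.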
Award ID(s): 1838271 1939076
NSF-PAR ID: 10192062
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: EuroMPI'18: Proceedings of the 25th European MPI Users' Group Meeting
Volume: 25
Issue: 11
Page Range / eLocation ID: 1 to 11
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Summary

    The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault-tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "limited essentially nonoscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using fault-tolerant MPI with user-level failure mitigation to recover from runtime failures and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
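    The fault-tolerant MPI referred to above is user-level failure mitigation (ULFM). As an illustration of the generic ULFM recovery pattern only, and not of the Uintah implementation itself, the C sketch below revokes and shrinks a communicator after a rank failure so that the surviving ranks can continue; in the approach described above, they would then rebuild the lost patches by LENO interpolation from the coarser AMR level.

        /*
         * Illustrative ULFM recovery pattern (not the Uintah implementation):
         * detect a failed rank, revoke the communicator so every survivor sees
         * the failure, shrink to a survivors-only communicator, and continue.
         */
        #include <mpi.h>
        #include <mpi-ext.h>   /* ULFM extensions (MPIX_*) */
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            MPI_Comm comm = MPI_COMM_WORLD;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

            /* ... time-stepping loop; a collective fails if a rank has died ... */
            int rc = MPI_Barrier(comm);
            if (rc != MPI_SUCCESS) {
                MPIX_Comm_revoke(comm);          /* propagate the failure */
                MPI_Comm survivors;
                MPIX_Comm_shrink(comm, &survivors);

                int rank, size;
                MPI_Comm_rank(survivors, &rank);
                MPI_Comm_size(survivors, &size);
                if (rank == 0)
                    printf("continuing on %d surviving ranks\n", size);

                /* Lost patches would be reconstructed here (e.g. by LENO
                 * interpolation from the coarse AMR level) before resuming. */
                comm = survivors;
            }

            MPI_Finalize();
            return 0;
        }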

     
  2. MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency ("network-agnostic") feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
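    MANA-2.0 relies on interposing MPI calls. As a minimal illustration of MPI call interposition in general (MANA-2.0's real wrappers live in its split-process upper half and also virtualise communicators, requests, and other MPI objects), the C sketch below uses the standard PMPI profiling interface to intercept MPI_Send, record the call, and forward it to the underlying implementation.

        /*
         * Illustrative MPI call interposition via the standard PMPI profiling
         * interface: this wrapper is linked ahead of the MPI library, records
         * the call, then forwards it to the real implementation. MANA-2.0's
         * actual wrappers do considerably more.
         */
        #include <mpi.h>
        #include <stdio.h>

        int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm)
        {
            /* A checkpointing layer could log or translate the call here. */
            fprintf(stderr, "[wrapper] MPI_Send of %d elements to rank %d\n",
                    count, dest);
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        }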
  3.
    The advent of Persistent Memory (PM) devices enables systems to actively persist information at low costs, including program state traditionally in volatile memory. However, this trend poses a reliability challenge in which multiple classes of soft faults that go away after restart in traditional systems turn into hard (recurring) faults in PM systems. In this paper, we first characterize this rising problem with an empirical study of 28 real-world bugs. We analyze how they cause hard faults in PM systems. We then propose Arthas, a tool to effectively recover PM systems from hard faults. Arthas checkpoints PM states via fine-grained versioning and uses program slicing of fault instructions to revert problematic PM states to good versions. We evaluate Arthas on 12 real-world hard faults from five large PM systems. Arthas successfully recovers the systems for all cases while discarding 10× less data on average compared to state-of-the-art checkpoint-rollback solutions. 
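    Arthas checkpoints PM state through fine-grained versioning. The C sketch below illustrates only that general idea: each update to a persistent object first appends the object's previous value to a version log, so a recovery step can later revert the object to a known-good version. The names (persist, version_log, versioned_update, revert) are hypothetical and this is not Arthas's code; real PM code would map actual persistent memory (for example with libpmem) and add failure-atomicity checks, and Arthas additionally uses program slicing to decide which states to revert.

        /*
         * Hypothetical sketch of fine-grained versioning for persistent memory
         * (not Arthas's code): each update first appends the old value to a
         * version log so the object can later be reverted to a good version.
         */
        #include <immintrin.h>   /* _mm_clflush, _mm_sfence */
        #include <stddef.h>
        #include <stdint.h>

        /* Flush the cache lines covering [addr, addr+len) and fence. */
        static void persist(const void *addr, size_t len)
        {
            uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
            for (; p < (uintptr_t)addr + len; p += 64)
                _mm_clflush((const void *)p);
            _mm_sfence();
        }

        #define LOG_CAP 1024
        struct version_log {
            uint64_t old_value[LOG_CAP];
            size_t   len;
        };

        /* Update a persistent counter, recording its previous version first. */
        static void versioned_update(uint64_t *pm_obj, uint64_t new_value,
                                     struct version_log *log)
        {
            log->old_value[log->len++] = *pm_obj;   /* remember old version */
            persist(log, sizeof(*log));
            *pm_obj = new_value;                    /* apply the update */
            persist(pm_obj, sizeof(*pm_obj));
        }

        /* Recovery: revert the object to the version recorded n entries back. */
        static void revert(uint64_t *pm_obj, struct version_log *log, size_t n)
        {
            if (n > 0 && n <= log->len) {
                *pm_obj = log->old_value[log->len - n];
                persist(pm_obj, sizeof(*pm_obj));
            }
        }

        int main(void)
        {
            static uint64_t pm_counter;      /* stand-in for a PM-resident object */
            static struct version_log log;

            versioned_update(&pm_counter, 42, &log);
            versioned_update(&pm_counter, 99, &log);
            revert(&pm_counter, &log, 1);    /* back to 42 */
            return 0;
        }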
  4. Abstract. Global water models (GWMs) simulate the terrestrial water cycle on the global scale and are used to assess the impacts of climate change on freshwater systems. GWMs are developed within different modelling frameworks and consider different underlying hydrological processes, leading to varied model structures. Furthermore, the equations used to describe various processes take different forms and are generally accessible only from within the individual model codes. These factors have hindered a holistic and detailed understanding of how different models operate, yet such an understanding is crucial for explaining the results of model evaluation studies, understanding inter-model differences in their simulations, and identifying areas for future model development. This study provides a comprehensive overview of how 16 state-of-the-art GWMs are designed. We analyse water storage compartments, water flows, and human water use sectors included in models that provide simulations for the Inter-Sectoral Impact Model Intercomparison Project phase 2b (ISIMIP2b). We develop a standard writing style for the model equations to enhance model intercomparison, improvement, and communication. In this study, WaterGAP2 used the highest number of water storage compartments, 11, and CWatM used 10 compartments. Six models used six compartments, while four models (DBH, JULES-W1, Mac-PDM.20, and VIC) used the lowest number, three compartments. WaterGAP2 simulates five human water use sectors, while four models (CLM4.5, CLM5.0, LPJmL, and MPI-HM) simulate only water for the irrigation sector. We conclude that, even though hydrological processes are often based on similar equations for various processes, in the end these equations have been adjusted or models have used different values for specific parameters or specific variables. The similarities and differences found among the models analysed in this study are expected to enable us to reduce the uncertainty in multi-model ensembles, improve existing hydrological processes, and integrate new processes.
  5. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
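    The split-process approach places the MPI application (the "upper half") and the MPI library plus network stack (the "lower half") in one address space, so the lower half can be discarded at checkpoint time and re-created, possibly with a different MPI implementation, at restart. The C sketch below is only a loose illustration of that indirection, not MANA's loader-based mechanism: the application reaches the MPI library exclusively through function pointers resolved at run time with dlopen/dlsym.

        /*
         * Loose illustration of split-process indirection (not MANA's actual
         * loader): the application never links against MPI directly; it reaches
         * the library through function pointers resolved at run time, so that
         * "lower half" could be torn down at checkpoint time and re-created,
         * possibly over a different MPI, at restart.
         */
        #include <dlfcn.h>
        #include <stdio.h>

        typedef int (*mpi_init_fn)(int *, char ***);
        typedef int (*mpi_finalize_fn)(void);

        int main(int argc, char **argv)
        {
            /* "Lower half": whichever MPI library this system provides. */
            void *lower = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
            if (!lower) {
                fprintf(stderr, "dlopen failed: %s\n", dlerror());
                return 1;
            }

            mpi_init_fn     init     = (mpi_init_fn)dlsym(lower, "MPI_Init");
            mpi_finalize_fn finalize = (mpi_finalize_fn)dlsym(lower, "MPI_Finalize");
            if (!init || !finalize) {
                fprintf(stderr, "missing MPI symbols\n");
                return 1;
            }

            init(&argc, &argv);
            /* ... application work; at restart the lower half would be
             * reopened and recorded MPI state replayed into it ... */
            finalize();

            dlclose(lower);
            return 0;
        }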