The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restart, although the use of fast intermediate memory may help. Algorithm‐based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault‐tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high‐order interpolation schemes to preserve the physical solution and reconstruct lost data. The …
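The recovery idea the abstract describes, rebuilding a lost fine-mesh patch from a surviving coarse-mesh copy with high-order interpolation while preserving physical constraints, can be sketched in a hypothetical 1-D form. This is an illustrative sketch, not Uintah's actual API; `reconstruct_fine`, `lagrange_cubic`, and the positivity clamp are assumptions for the example.

```python
# Hypothetical 1-D sketch of ABFT-style recovery: a lost fine-mesh patch is
# rebuilt from the surviving coarse-mesh data via cubic Lagrange interpolation,
# then clamped so a physical constraint (here, positivity) is not violated.
# Names are illustrative, not Uintah's real interfaces.

def lagrange_cubic(xs, ys, x):
    """Evaluate the cubic Lagrange polynomial through four points at x."""
    total = 0.0
    for i in range(4):
        term = ys[i]
        for j in range(4):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def reconstruct_fine(coarse, refinement, lower_bound=0.0):
    """Rebuild a fine-mesh patch from coarse values on a unit-spaced grid."""
    n = len(coarse)
    fine = []
    for k in range((n - 1) * refinement + 1):
        x = k / refinement                       # fine-cell coordinate
        i0 = min(max(int(x) - 1, 0), n - 4)      # 4-point stencil start
        xs = list(range(i0, i0 + 4))
        ys = coarse[i0:i0 + 4]
        v = lagrange_cubic(xs, ys, x)
        fine.append(max(v, lower_bound))         # enforce positivity
    return fine

# Example: recover a refined patch from a smooth coarse field (samples of x**2).
coarse = [0.0, 1.0, 4.0, 9.0, 16.0]
fine = reconstruct_fine(coarse, refinement=2)
```

Because the stencil is cubic, smooth fields of polynomial degree up to three are reproduced exactly; the clamp illustrates how boundedness violations introduced by interpolation can be suppressed.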
- Journal Name: EuroMPI'18: Proceedings of the 25th European MPI Users' Group Meeting
- Page Range or eLocation-ID: 1 to 11
- Sponsoring Org: National Science Foundation
More Like this
MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency ("network-agnostic") feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work with any standard MPI running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is its series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
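One idea behind network-agnostic transparent checkpointing, interposing communication calls so the network can be drained to a quiescent state before a snapshot is taken, can be sketched as follows. This is a hedged illustration of the general technique, not MANA's code; the class, its methods, and the message strings are all hypothetical.

```python
# Illustrative sketch (not MANA's implementation): interposed send/recv
# wrappers count in-flight messages, and a checkpoint is taken only after
# outstanding messages are drained into local buffers, so the snapshot never
# captures data "on the wire".

class InterposedComm:
    def __init__(self):
        self.in_flight = 0    # sends not yet matched by a receive
        self.pending = []     # messages still in transit (simulated)

    def send(self, msg):
        self.in_flight += 1
        self.pending.append(msg)

    def recv(self):
        msg = self.pending.pop(0)
        self.in_flight -= 1
        return msg

    def quiescent(self):
        return self.in_flight == 0

    def checkpoint(self, state):
        # Drain the network into local state before snapshotting.
        while not self.quiescent():
            state.setdefault("drained", []).append(self.recv())
        return dict(state)    # consistent local snapshot

comm = InterposedComm()
comm.send("halo-exchange-0")
comm.send("halo-exchange-1")
snap = comm.checkpoint({"step": 42})
```

In a real interposition layer the wrappers would sit between the application and the MPI library (e.g., via the profiling interface), but the invariant is the same: checkpoint only at quiescence.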
The advent of Persistent Memory (PM) devices enables systems to actively persist information at low costs, including program state traditionally in volatile memory. However, this trend poses a reliability challenge in which multiple classes of soft faults that go away after restart in traditional systems turn into hard (recurring) faults in PM systems. In this paper, we first characterize this rising problem with an empirical study of 28 real-world bugs. We analyze how they cause hard faults in PM systems. We then propose Arthas, a tool to effectively recover PM systems from hard faults. Arthas checkpoints PM states via fine-grained versioning and uses program slicing of fault instructions to revert problematic PM states to good versions. We evaluate Arthas on 12 real-world hard faults from five large PM systems. Arthas successfully recovers the systems for all cases while discarding 10× less data on average compared to state-of-the-art checkpoint-rollback solutions.
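The recovery strategy described above, per-state versioning plus selective rollback of only the states implicated by slicing the fault instruction, can be sketched with a toy versioned store. This is an assumed illustration of the technique, not Arthas itself; the class and key names are hypothetical.

```python
# Hedged sketch of fine-grained versioning with selective rollback (not
# Arthas's code): each key keeps its version history, and after a hard fault
# only the keys implicated by program slicing are reverted, so unrelated
# persistent data survives instead of being discarded wholesale.

class VersionedPMStore:
    def __init__(self):
        self.history = {}                     # key -> list of versions

    def put(self, key, value):
        self.history.setdefault(key, []).append(value)

    def get(self, key):
        return self.history[key][-1]

    def revert(self, keys, steps=1):
        """Roll back only the given keys by `steps` versions."""
        for k in keys:
            versions = self.history[k]
            if len(versions) > steps:
                del versions[-steps:]

store = VersionedPMStore()
store.put("index_root", "v1")
store.put("index_root", "corrupted")          # bad write persisted to PM
store.put("user_data", "payload")
# Slicing (assumed here) implicates only "index_root"; revert just that key.
store.revert({"index_root"})
```

Contrast this with whole-store rollback, which would also discard `user_data`; selective reversion is what lets a tool in this style discard far less data than checkpoint-rollback of the entire state.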
Understanding each other's models: an introduction and a standard representation of 16 global water models to support intercomparison, improvement, and communication. Abstract: Global water models (GWMs) simulate the terrestrial water cycle on the global scale and are used to assess the impacts of climate change on freshwater systems. GWMs are developed within different modelling frameworks and consider different underlying hydrological processes, leading to varied model structures. Furthermore, the equations used to describe various processes take different forms and are generally accessible only from within the individual model codes. These factors have hindered a holistic and detailed understanding of how different models operate, yet such an understanding is crucial for explaining the results of model evaluation studies, understanding inter-model differences in their simulations, and identifying areas for future model development. This study provides a comprehensive overview of how 16 state-of-the-art GWMs are designed. We analyse water storage compartments, water flows, and human water use sectors included in models that provide simulations for the Inter-Sectoral Impact Model Intercomparison Project phase 2b (ISIMIP2b). We develop a standard writing style for the model equations to enhance model intercomparison, improvement, and communication. In this study, WaterGAP2 used the highest number of water storage compartments, 11, and CWatM used 10 compartments. Six models used six compartments, while four models (DBH, JULES-W1, Mac-PDM.20, and VIC) used the lowest number, three compartments. WaterGAP2 simulates five human water use sectors, while four models (CLM4.5, CLM5.0, LPJmL, and MPI-HM) simulate only water …
Advances in biomolecular simulation methods and access to large-scale computer resources have led to a massive increase in the amount of data generated. The key enablers have been optimization and parallelization of the simulation codes. However, much of the software used to analyze trajectory data from these simulations is still run in serial, or, in some cases, with multiple threads via shared memory. Here, we describe the addition of multiple levels of parallel trajectory processing to the molecular dynamics simulation analysis software CPPTRAJ. In addition to the existing OpenMP shared-memory parallelism, CPPTRAJ now has two additional levels of message passing (MPI) parallelism involving both across-trajectory processing and across-ensemble processing. All three levels of parallelism can be simultaneously active, leading to significant speed-ups in data analysis of large datasets on the NCSA Blue Waters supercomputer by better leveraging the many available nodes and its parallel file system. © 2018 Wiley Periodicals, Inc.
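The two MPI levels described above, across-ensemble and across-trajectory, amount to partitioning a flat set of ranks into groups and then splitting each trajectory's frames within a group. A minimal sketch of such a mapping, with names and parameters that are illustrative rather than CPPTRAJ's actual scheme:

```python
# Minimal sketch (not CPPTRAJ's code) of a two-level rank decomposition:
# ranks are first grouped across ensemble members, then each rank in a group
# takes a contiguous block of trajectory frames. Assumes nranks divides
# evenly into n_ensemble groups.

def decompose(rank, nranks, n_ensemble, n_frames):
    """Map a flat MPI rank to (ensemble member, frame indices)."""
    per_member = nranks // n_ensemble          # ranks per ensemble member
    member = rank // per_member                # across-ensemble level
    local = rank % per_member                  # across-trajectory level
    chunk = (n_frames + per_member - 1) // per_member  # ceil division
    start = local * chunk
    stop = min(start + chunk, n_frames)
    return member, list(range(start, stop))

# Example: 8 ranks, 2 ensemble members, 10 frames per trajectory.
assignments = [decompose(r, 8, 2, 10) for r in range(8)]
```

In a real MPI code this mapping would typically be realized with communicator splits (one communicator per ensemble member), so collective operations at each level stay confined to the right group of ranks.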