skip to main content


Title: Node failure resiliency for Uintah without checkpointing
Summary

The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm‐based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault‐tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high‐order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a “limited essentially nonoscillatory” (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault‐tolerant MPI‐user‐level failure mitigation to recover from runtime failure and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.

 
more » « less
NSF-PAR ID:
10453870
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Concurrency and Computation: Practice and Experience
Volume:
31
Issue:
20
ISSN:
1532-0626
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper presents a formulation of multiparty session types (MPSTs) for practical fault-tolerant distributed programming. We tackle the challenges faced by session types in the context of distributed systems involving asynchronous and concurrent partial failures – such as supporting dynamic replacement of failed parties and retrying failed protocol segments in an ongoing multiparty session – in the presence of unreliable failure detection. Key to our approach is that we develop a novel model of event-driven concurrency for multiparty sessions. Inspired by real-world practices, it enables us to unify the session-typed handling of regular I/O events with failure handling and the combination of features needed to express practical fault-tolerant protocols. Moreover, the characteristics of our model allow us to prove a global progress property for well-typed processes engaged in multiple concurrent sessions, which does not hold in traditional MPST systems. To demonstrate its practicality, we implement our framework as a toolchain and runtime for Scala, and use it to specify and implement a session-typed version of the cluster management system of the industrial-strength Apache Spark data analytics framework. Our session-typed cluster manager composes with other vanilla Spark components to give a functioning Spark runtime; e.g., it can execute existing third-party Spark applications without code modification. A performance evaluation using the TPC-H benchmark shows our prototype implementation incurs an average overhead below 10%. 
    more » « less
  2. Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpoint and a global rollback to recover. In recent years, techniques reducing the number of rolling back processes have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n2 for a process count n. 
    more » « less
  3. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
    more » « less
  4. The addition of geometric reconfigurability in a cable driven parallel robot (CDPR) introduces kinematic redundancies which can be exploited for manipulating structural and mechanical properties of the robot through redundancy resolution. In the event of a cable failure, a reconfigurable CDPR (rCDPR) can also realign its geometric arrangement to overcome the effects of cable failure and recover the original expected trajectory and complete the trajectory tracking task. In this paper we discuss a fault tolerant control (FTC) framework that relies on an Interactive Multiple Model (IMM) adaptive estimation filter for simultaneous fault detection and diagnosis (FDD) and task recovery. The redundancy resolution scheme for the kinematically redundant CDPR takes into account singularity avoidance, manipulability and wrench quality maximization during trajectory tracking. We further introduce a trajectory tracking methodology that enables the automatic task recovery algorithm to consistently return to the point of failure. This is particularly useful for applications where the planned trajectory is of greater importance than the goal positions, such as painting, welding or 3D printing applications. The proposed control framework is validated in simulation on a planar rCDPR with elastic cables and parameter uncertainties to introduce modeled and unmodeled dynamics in the system as it tracks a complete trajectory despite the occurrence of multiple cable failures. As cables fail one by one, the robot topology changes from an over-constrained to a fully constrained and then an under-constrained CDPR. The framework is applied with a constant-velocity kinematic feedforward controller which has the advantage of generating steady-state inputs despite dynamic oscillations during cable failures, as well as a Linear Quadratic Regulator (LQR) feedback controller to locally dampen these oscillations. 
    more » « less
  5. Abstract Adaptive mesh refinement (AMR) is the art of solving PDEs on a mesh hierarchy with increasing mesh refinement at each level of the hierarchy. Accurate treatment on AMR hierarchies requires accurate prolongation of the solution from a coarse mesh to a newly defined finer mesh. For scalar variables, suitably high-order finite volume WENO methods can carry out such a prolongation. However, classes of PDEs, such as computational electrodynamics (CED) and magnetohydrodynamics (MHD), require that vector fields preserve a divergence constraint. The primal variables in such schemes consist of normal components of the vector field that are collocated at the faces of the mesh. As a result, the reconstruction and prolongation strategies for divergence constraint-preserving vector fields are necessarily more intricate. In this paper we present a fourth-order divergence constraint-preserving prolongation strategy that is analytically exact. Extension to higher orders using analytically exact methods is very challenging. To overcome that challenge, a novel WENO-like reconstruction strategy is invented that matches the moments of the vector field in the faces, where the vector field components are collocated. This approach is almost divergence constraint-preserving, therefore, we call it WENO-ADP. To make it exactly divergence constraint-preserving, a touch-up procedure is developed that is based on a constrained least squares (CLSQ) method for restoring the divergence constraint up to machine accuracy. With the touch-up, it is called WENO-ADPT. It is shown that refinement ratios of two and higher can be accommodated. An item of broader interest in this work is that we have also been able to invent very efficient finite volume WENO methods, where the coefficients are very easily obtained and the multidimensional smoothness indicators can be expressed as perfect squares. We demonstrate that the divergence constraint-preserving strategy works at several high orders for divergence-free vector fields as well as vector fields, where the divergence of the vector field has to match a charge density and its higher moments. We also show that our methods overcome the late time instability that has been known to plague adaptive computations in CED. 
    more » « less