Title: Lightweight Fault Tolerance in Pregel-Like Systems
Pregel-like systems are popular for iterative graph processing thanks to their user-friendly vertex-centric programming model. However, existing Pregel-like systems adopt only a naïve checkpointing approach for fault tolerance, which saves a large amount of data about the state of the computation and significantly degrades failure-free execution performance. Advanced fault-tolerance and recovery techniques remain unexplored in the context of Pregel-like systems. This paper proposes a non-invasive lightweight checkpointing (LWCP) scheme that minimizes the data saved to each checkpoint; the additional data required for recovery are generated online from the saved data. This improvement yields a 10x speedup in checkpointing, and integrating it with a recently proposed log-based recovery approach further speeds up recovery when a failure occurs. Extensive experiments verified that the proposed LWCP techniques significantly improve the performance of both checkpointing and recovery in a Pregel-like system.
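As a rough illustration of the idea, the sketch below contrasts a naïve checkpoint (vertex values, topology, and in-flight messages) with a lightweight one that persists vertex values only and regenerates the needed messages at recovery time. This is not the paper's implementation: the vertex model, file layout, and the generate_messages hook are hypothetical, and the sketch assumes a vertex-centric runtime whose messages can be recomputed from the current vertex values.

```python
# Hypothetical sketch of the lightweight-checkpoint idea: persist only the
# vertex values at a superstep boundary, and regenerate the messages needed
# for recovery instead of writing them to the checkpoint.
import pickle

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.vid = vid
        self.value = value
        self.out_edges = out_edges          # list of neighbor vertex ids

def full_checkpoint(path, vertices, pending_messages):
    """Naive checkpoint: vertex values, topology, and all in-flight messages."""
    with open(path, "wb") as f:
        pickle.dump({"values": {v.vid: v.value for v in vertices},
                     "edges": {v.vid: v.out_edges for v in vertices},
                     "messages": pending_messages}, f)

def lightweight_checkpoint(path, vertices):
    """LWCP-style checkpoint: vertex values only; messages are not saved."""
    with open(path, "wb") as f:
        pickle.dump({v.vid: v.value for v in vertices}, f)

def recover_lightweight(path, vertices, generate_messages):
    """Reload the saved values, then regenerate the checkpointed superstep's
    messages online by re-running the message-producing logic."""
    with open(path, "rb") as f:
        values = pickle.load(f)
    for v in vertices:
        v.value = values[v.vid]
    return generate_messages(vertices)      # hypothetical per-algorithm hook
```

For PageRank, for instance, generate_messages would simply have each vertex re-send value / len(out_edges) to its neighbors, so each checkpoint shrinks from values plus topology plus message queues to a single number per vertex.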
Award ID(s):
1755464
PAR ID:
10140003
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 48th International Conference on Parallel Processing (ICPP) 2019
Page Range / eLocation ID:
1 to 10
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aguilera, Marcos; Yadgar, Gala (Ed.)
    Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a lightweight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
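One way to picture the resumable-iterator idea described above (a generic sketch, not CheckFreq's actual API): derive the per-epoch shuffle from (seed, epoch) so that saving only the epoch number and the count of batches already served is enough to resume mid-epoch without repeating or skipping samples. The class and field names below are illustrative assumptions.

```python
# Sketch of a resumable data-loader iterator that makes iteration-granularity
# checkpoints possible while preserving the "each sample exactly once per
# epoch" invariant.  The permutation depends only on (seed, epoch), so
# (epoch, offset) fully identifies the position within the epoch.
import random

class ResumableLoader:
    def __init__(self, dataset, batch_size, seed=0):
        self.dataset, self.batch_size, self.seed = dataset, batch_size, seed
        self.epoch, self.offset = 0, 0          # offset = batches already served

    def _order(self):
        rng = random.Random(self.seed + 1_000_003 * self.epoch)
        order = list(range(len(self.dataset)))
        rng.shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        start = self.offset * self.batch_size
        for i in range(start, len(order), self.batch_size):
            yield [self.dataset[j] for j in order[i:i + self.batch_size]]
            self.offset += 1
        self.epoch, self.offset = self.epoch + 1, 0    # epoch boundary

    def state_dict(self):               # saved alongside the model weights
        return {"epoch": self.epoch, "offset": self.offset, "seed": self.seed}

    def load_state_dict(self, s):
        self.epoch, self.offset, self.seed = s["epoch"], s["offset"], s["seed"]
```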
  2. Big data systems have evolved beyond scalable storage and rudimentary processing to supporting complex data analytics in near real-time, such as Apache Spark Streaming [31], Comet [14], Incremental Hadoop [17], MapReduce Online [7], Apache Storm [28], StreamScope [19], and IBM Streams [1]. These systems are particularly challenging to build owing to two requirements: low latency and fault tolerance. Many of the above systems evolved from a batch processing design and are thus architected to break down a steady stream of input events into a series of micro-batches and then perform batch-like computations on each successive micro-batch as a micro-batch job. In terms of latency, the systems are expected to respond to each micro-batch in seconds with an output. The constant operation further entails that the systems must be robust to hardware-, software-, and network-level failures. To incorporate fault tolerance, the common approach is to use checkpointing and rollback recovery, whereby a streaming application periodically saves its in-memory state to persistent storage.
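A minimal sketch of that checkpoint-and-rollback pattern for a micro-batch job is shown below. It is a generic illustration rather than the design of any system cited above, and the replayable source with a replay_from method is an assumption (in practice a durable input log plays that role).

```python
# Generic checkpoint-and-rollback sketch for a micro-batch stream processor.
import json, os, tempfile

class MicroBatchJob:
    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.state = {}            # in-memory operator state, e.g. running counts
        self.next_batch = 0        # id of the next micro-batch to process

    def process(self, batch):
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
        self.next_batch += 1

    def checkpoint(self):
        # Write atomically so a crash mid-write cannot corrupt the checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.ckpt_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump({"state": self.state, "next_batch": self.next_batch}, f)
        os.replace(tmp, self.ckpt_path)

    def recover(self, source):
        # Roll back to the last checkpoint, then replay the missed micro-batches
        # from a replayable source (hypothetical replay_from interface).
        with open(self.ckpt_path) as f:
            snap = json.load(f)
        self.state, self.next_batch = snap["state"], snap["next_batch"]
        for batch in source.replay_from(self.next_batch):
            self.process(batch)
```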
  3. Erasure coding is now one of the most significant techniques in cloud storage systems, providing both fast parallel I/O processing and strong fault tolerance for massive data accesses. In these systems, triple-disk-failure-tolerant arrays (3DFTs) are a typical configuration, supported by several classic erasure codes such as Reed-Solomon (RS) codes, Local Reconstruction Codes (LRC), and Minimum Storage Regeneration (MSR) codes. In an online recovery process, foreground application workloads and background recovery workloads are handled simultaneously, which requires a comprehensive understanding of both types of workload characteristics. Although several techniques have been proposed to accelerate the I/O requests of online recovery processes, they are typically one-sided because the two workloads are not considered together to achieve cost-effective performance. To address this problem, we propose Erasure Codes Fusion (EC-Fusion), an efficient hybrid erasure coding framework for cloud storage systems. EC-Fusion is a combination of RS and MSR codes, and it dynamically selects the appropriate code based on workload properties. On one hand, for write-intensive application workloads or recovery workloads with a low risk of data loss, EC-Fusion uses RS codes to reduce computational overhead and storage cost. On the other hand, for read-intensive workloads or data that is frequently reconstructed, MSR codes are the better choice. A better overall application and recovery performance can therefore be achieved in a cost-effective fashion. To demonstrate the effectiveness of EC-Fusion, several experiments are conducted on Hadoop systems. The results show that, compared with traditional hybrid erasure coding techniques, EC-Fusion accelerates application response time by up to 1.77× and reduces reconstruction time by up to 69.10%.
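The abstract does not spell out EC-Fusion's selection algorithm, so the sketch below only illustrates the kind of policy it describes: route write-intensive or low-risk data to RS, and read-intensive or reconstruction-heavy data to MSR. The thresholds and field names are assumptions made for illustration.

```python
# Illustrative code-selection policy only; not EC-Fusion's actual algorithm.
from dataclasses import dataclass

@dataclass
class WorkloadStats:
    write_ratio: float          # fraction of requests that are writes
    reconstruction_rate: float  # reconstructions per hour on this stripe group
    data_loss_risk: float       # estimated probability of loss during recovery

def choose_code(stats: WorkloadStats,
                write_heavy=0.6, recon_heavy=1.0, low_risk=0.01) -> str:
    # Write-intensive workloads or low recovery risk: RS keeps computation
    # and storage cost down.
    if stats.write_ratio >= write_heavy or stats.data_loss_risk <= low_risk:
        return "RS"
    # Read-intensive or frequently reconstructed data: MSR reduces repair traffic.
    if stats.reconstruction_rate >= recon_heavy:
        return "MSR"
    return "RS"

print(choose_code(WorkloadStats(0.2, 3.0, 0.05)))   # -> "MSR"
```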
  4. Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact (by up to 500×) while maintaining negligible overhead (<3%).
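Below is a minimal sketch of asynchronous, non-blocking checkpointing of the kind described, assuming the expensive part is the write to persistent storage: the compute thread only pays for an in-memory copy and a queue push, while a background thread drains snapshots to disk. This is a generic illustration, not the pipeline's implementation, and the class and file-name pattern are assumptions.

```python
# Asynchronous, non-blocking checkpointing via a background writer thread.
import copy, pickle, queue, threading

class AsyncCheckpointer:
    def __init__(self, path_pattern="ckpt_{:06d}.pkl"):
        self.path_pattern = path_pattern
        self.pending = queue.Queue()
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def _drain(self):
        while True:
            step, snapshot = self.pending.get()
            if step is None:                      # shutdown sentinel
                break
            with open(self.path_pattern.format(step), "wb") as f:
                pickle.dump(snapshot, f)          # slow I/O happens off the
            self.pending.task_done()              # compute thread's critical path

    def checkpoint(self, step, state):
        # The only blocking cost on the compute path is the in-memory copy.
        self.pending.put((step, copy.deepcopy(state)))

    def close(self):
        self.pending.put((None, None))
        self.writer.join()
```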
  5. As the design space for high-performance computing (HPC) systems grows larger and more complex, modeling and simulation (MODSIM) techniques become more important to better optimize systems. Furthermore, recent extreme-scale systems and newer technologies can lead to higher system fault rates, which negatively affect system performance and other metrics. Therefore, it is important for system designers to consider the effects of faults and fault-tolerance (FT) techniques on system design through MODSIM. BE-SST is an existing MODSIM methodology and workflow that facilitates preliminary exploration & reduction of large design spaces, particularly by highlighting areas of the space for detailed study and pruning less optimal areas. This paper presents the overall methodology for adding fault-tolerance awareness (FT-awareness) into BE-SST. We present the process used to extend BE-SST, enabling the creation of models that predict the time needed to perform a checkpoint instance for the given system configuration. Additionally, this paper presents a case study where a full HPC system is simulated using BE-SST, including application, hardware, and checkpointing. We validate the models and simulation against actual system measurements, finding an average percent error of less than 17% for the instance models and about 20% for system simulation, a level of accuracy acceptable for initial exploration and pruning of the design space. Finally, we show how FT-aware simulation results are used for comparing FT levels in the design space.
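As a toy example of a checkpoint-instance time model of the kind described, one can assume a simple linear form t = overhead + size / bandwidth and fit both coefficients from profiled runs. BE-SST's actual instance models are not reproduced here, and the sample measurements below are illustrative only.

```python
# Fit and evaluate a simple checkpoint-time model: t = overhead + size / bandwidth.
def fit_checkpoint_model(samples):
    """samples: list of (state_size_bytes, measured_seconds); least-squares fit."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # seconds per byte
    overhead = (sy - slope * sx) / n
    return overhead, 1.0 / slope                        # (seconds, bytes/second)

def predict_checkpoint_time(state_size_bytes, overhead, bandwidth):
    return overhead + state_size_bytes / bandwidth

# Illustrative profiled measurements (size in bytes, time in seconds).
overhead, bw = fit_checkpoint_model([(1e9, 2.1), (4e9, 7.9), (8e9, 15.6)])
print(predict_checkpoint_time(2e9, overhead, bw))       # ~4 seconds
```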