Title: Lightweight Fault Tolerance in Pregel-Like Systems
Pregel-like systems are popular for iterative graph processing thanks to their user-friendly vertex-centric programming model. However, existing Pregel-like systems adopt only a naïve checkpointing approach for fault tolerance, which saves a large amount of data about the state of the computation and significantly degrades failure-free execution performance. Advanced fault-tolerance and recovery techniques remain unexplored in the context of Pregel-like systems. This paper proposes a non-invasive lightweight checkpointing (LWCP) scheme that minimizes the data saved to each checkpoint; the additional data required for recovery are generated online from the saved data. This improvement yields a 10x speedup in checkpointing, and integrating it with a recently proposed log-based recovery approach further speeds up recovery when a failure occurs. Extensive experiments verified that the proposed LWCP techniques significantly improve the performance of both checkpointing and recovery in a Pregel-like system.
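As a rough illustration of the idea, the sketch below contrasts a naïve checkpoint (vertex values, topology, and in-flight messages) with a lightweight one that persists vertex values only and regenerates the needed messages at recovery time. This is not the paper's implementation: the vertex model, file layout, and the generate_messages hook are hypothetical, and the sketch assumes a vertex-centric runtime whose messages can be recomputed from the current vertex values.

```python
# Hypothetical sketch of the lightweight-checkpoint idea: persist only the
# vertex values at a superstep boundary, and regenerate the messages needed
# for recovery instead of writing them to the checkpoint.
import pickle

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.vid = vid
        self.value = value
        self.out_edges = out_edges          # list of neighbor vertex ids

def full_checkpoint(path, vertices, pending_messages):
    """Naive checkpoint: vertex values, topology, and all in-flight messages."""
    with open(path, "wb") as f:
        pickle.dump({"values": {v.vid: v.value for v in vertices},
                     "edges": {v.vid: v.out_edges for v in vertices},
                     "messages": pending_messages}, f)

def lightweight_checkpoint(path, vertices):
    """LWCP-style checkpoint: vertex values only; messages are not saved."""
    with open(path, "wb") as f:
        pickle.dump({v.vid: v.value for v in vertices}, f)

def recover_lightweight(path, vertices, generate_messages):
    """Reload the saved values, then regenerate the checkpointed superstep's
    messages online by re-running the message-producing logic."""
    with open(path, "rb") as f:
        values = pickle.load(f)
    for v in vertices:
        v.value = values[v.vid]
    return generate_messages(vertices)      # hypothetical per-algorithm hook
```

For PageRank, for instance, generate_messages would simply have each vertex re-send value / len(out_edges) to its neighbors, so each checkpoint shrinks from values plus topology plus message queues to a single number per vertex.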
Award ID(s):
1755464
PAR ID:
10140003
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 48th International Conference on Parallel Processing (ICPP) 2019
Page Range / eLocation ID:
1 to 10
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aguilera, Marcos; Yadgar, Gala (Ed.)
    Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a lightweight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
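One way to picture the resumable-iterator idea described above (a generic sketch, not CheckFreq's actual API): derive the per-epoch shuffle from (seed, epoch) so that saving only the epoch number and the count of batches already served is enough to resume mid-epoch without repeating or skipping samples. The class and field names below are illustrative assumptions.

```python
# Sketch of a resumable data-loader iterator that makes iteration-granularity
# checkpoints possible while preserving the "each sample exactly once per
# epoch" invariant.  The permutation depends only on (seed, epoch), so
# (epoch, offset) fully identifies the position within the epoch.
import random

class ResumableLoader:
    def __init__(self, dataset, batch_size, seed=0):
        self.dataset, self.batch_size, self.seed = dataset, batch_size, seed
        self.epoch, self.offset = 0, 0          # offset = batches already served

    def _order(self):
        rng = random.Random(self.seed + 1_000_003 * self.epoch)
        order = list(range(len(self.dataset)))
        rng.shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        start = self.offset * self.batch_size
        for i in range(start, len(order), self.batch_size):
            yield [self.dataset[j] for j in order[i:i + self.batch_size]]
            self.offset += 1
        self.epoch, self.offset = self.epoch + 1, 0    # epoch boundary

    def state_dict(self):               # saved alongside the model weights
        return {"epoch": self.epoch, "offset": self.offset, "seed": self.seed}

    def load_state_dict(self, s):
        self.epoch, self.offset, self.seed = s["epoch"], s["offset"], s["seed"]
```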
  2. Big data systems have evolved beyond scalable storage and rudimentary processing to supporting complex data analytics in near real-time, such as Apache Spark Streaming [31], Comet [14], Incremental Hadoop [17], MapReduce Online [7], Apache Storm [28], StreamScope [19], and IBM Streams [1]. These systems are particularly challenging to build owing to two requirements: low latency and fault tolerance. Many of the above systems evolved from a batch processing design and are thus architected to break down a steady stream of input events into a series of micro-batches and then perform batch-like computations on each successive micro-batch as a micro-batch job. In terms of latency, the systems are expected to respond to each micro-batch in seconds with an output. The constant operation further entails that the systems must be robust to hardware-, software-, and network-level failures. To incorporate fault tolerance, the common approach is to use checkpointing and rollback recovery, whereby a streaming application periodically saves its in-memory state to persistent storage.
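A minimal sketch of that checkpoint-and-rollback pattern for a micro-batch job is shown below. It is a generic illustration rather than the design of any system cited above, and the replayable source with a replay_from method is an assumption (in practice a durable input log plays that role).

```python
# Generic checkpoint-and-rollback sketch for a micro-batch stream processor.
import json, os, tempfile

class MicroBatchJob:
    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.state = {}            # in-memory operator state, e.g. running counts
        self.next_batch = 0        # id of the next micro-batch to process

    def process(self, batch):
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
        self.next_batch += 1

    def checkpoint(self):
        # Write atomically so a crash mid-write cannot corrupt the checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.ckpt_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump({"state": self.state, "next_batch": self.next_batch}, f)
        os.replace(tmp, self.ckpt_path)

    def recover(self, source):
        # Roll back to the last checkpoint, then replay the missed micro-batches
        # from a replayable source (hypothetical replay_from interface).
        with open(self.ckpt_path) as f:
            snap = json.load(f)
        self.state, self.next_batch = snap["state"], snap["next_batch"]
        for batch in source.replay_from(self.next_batch):
            self.process(batch)
```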
  3. Erasure coding is now one of the most significant techniques in cloud storage systems, providing both fast parallel I/O processing and strong fault tolerance for massive data accesses. In these systems, triple-disk-failure-tolerant arrays (3DFTs) are a typical configuration, supported by several classic erasure codes such as Reed-Solomon (RS) codes, Local Reconstruction Codes (LRC), and Minimum Storage Regeneration (MSR) codes. In an online recovery process, foreground application workloads and background recovery workloads are handled simultaneously, which requires a comprehensive understanding of both types of workload characteristics. Although several techniques have been proposed to accelerate the I/O requests of online recovery processes, they are typically one-sided because the two workloads are not considered together to achieve cost-effective performance. To address this problem, we propose Erasure Codes Fusion (EC-Fusion), an efficient hybrid erasure coding framework for cloud storage systems. EC-Fusion is a combination of RS and MSR codes, and it dynamically selects the appropriate code based on workload properties. On one hand, for write-intensive application workloads or recovery workloads with a low risk of data loss, EC-Fusion uses RS codes to reduce computational overhead and storage cost. On the other hand, for read-intensive workloads or data that is frequently reconstructed, MSR codes are the better choice. A better overall application and recovery performance can therefore be achieved in a cost-effective fashion. To demonstrate the effectiveness of EC-Fusion, several experiments are conducted on Hadoop systems. The results show that, compared with traditional hybrid erasure coding techniques, EC-Fusion accelerates application response time by up to 1.77× and reduces reconstruction time by up to 69.10%.
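The abstract does not spell out EC-Fusion's selection algorithm, so the sketch below only illustrates the kind of policy it describes: route write-intensive or low-risk data to RS, and read-intensive or reconstruction-heavy data to MSR. The thresholds and field names are assumptions made for illustration.

```python
# Illustrative code-selection policy only; not EC-Fusion's actual algorithm.
from dataclasses import dataclass

@dataclass
class WorkloadStats:
    write_ratio: float          # fraction of requests that are writes
    reconstruction_rate: float  # reconstructions per hour on this stripe group
    data_loss_risk: float       # estimated probability of loss during recovery

def choose_code(stats: WorkloadStats,
                write_heavy=0.6, recon_heavy=1.0, low_risk=0.01) -> str:
    # Write-intensive workloads or low recovery risk: RS keeps computation
    # and storage cost down.
    if stats.write_ratio >= write_heavy or stats.data_loss_risk <= low_risk:
        return "RS"
    # Read-intensive or frequently reconstructed data: MSR reduces repair traffic.
    if stats.reconstruction_rate >= recon_heavy:
        return "MSR"
    return "RS"

print(choose_code(WorkloadStats(0.2, 3.0, 0.05)))   # -> "MSR"
```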
  4. Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact (by up to 500×) while maintaining negligible overhead (<3%).
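Below is a minimal sketch of asynchronous, non-blocking checkpointing of the kind described, assuming the expensive part is the write to persistent storage: the compute thread only pays for an in-memory copy and a queue push, while a background thread drains snapshots to disk. This is a generic illustration, not the pipeline's implementation, and the class and file-name pattern are assumptions.

```python
# Asynchronous, non-blocking checkpointing via a background writer thread.
import copy, pickle, queue, threading

class AsyncCheckpointer:
    def __init__(self, path_pattern="ckpt_{:06d}.pkl"):
        self.path_pattern = path_pattern
        self.pending = queue.Queue()
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def _drain(self):
        while True:
            step, snapshot = self.pending.get()
            if step is None:                      # shutdown sentinel
                break
            with open(self.path_pattern.format(step), "wb") as f:
                pickle.dump(snapshot, f)          # slow I/O happens off the
            self.pending.task_done()              # compute thread's critical path

    def checkpoint(self, step, state):
        # The only blocking cost on the compute path is the in-memory copy.
        self.pending.put((step, copy.deepcopy(state)))

    def close(self):
        self.pending.put((None, None))
        self.writer.join()
```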
  5. As the design space for high-performance computing (HPC) systems grows larger and more complex, modeling and simulation (MODSIM) techniques become more important to better optimize systems. Furthermore, recent extreme-scale systems and newer technologies can lead to higher system fault rates, which negatively affect system performance and other metrics. Therefore, it is important for system designers to consider the effects of faults and fault-tolerance (FT) techniques on system design through MODSIM. BE-SST is an existing MODSIM methodology and workflow that facilitates preliminary exploration & reduction of large design spaces, particularly by highlighting areas of the space for detailed study and pruning less optimal areas. This paper presents the overall methodology for adding fault-tolerance awareness (FT-awareness) into BE-SST. We present the process used to extend BE-SST, enabling the creation of models that predict the time needed to perform a checkpoint instance for the given system configuration. Additionally, this paper presents a case study where a full HPC system is simulated using BE-SST, including application, hardware, and checkpointing. We validate the models and simulation against actual system measurements, finding an average percent error of less than 17% for the instance models and about 20% for system simulation, a level of accuracy acceptable for initial exploration and pruning of the design space. Finally, we show how FT-aware simulation results are used for comparing FT levels in the design space.
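As a toy example of a checkpoint-instance time model of the kind described, one can assume a simple linear form t = overhead + size / bandwidth and fit both coefficients from profiled runs. BE-SST's actual instance models are not reproduced here, and the sample measurements below are illustrative only.

```python
# Fit and evaluate a simple checkpoint-time model: t = overhead + size / bandwidth.
def fit_checkpoint_model(samples):
    """samples: list of (state_size_bytes, measured_seconds); least-squares fit."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # seconds per byte
    overhead = (sy - slope * sx) / n
    return overhead, 1.0 / slope                        # (seconds, bytes/second)

def predict_checkpoint_time(state_size_bytes, overhead, bandwidth):
    return overhead + state_size_bytes / bandwidth

# Illustrative profiled measurements (size in bytes, time in seconds).
overhead, bw = fit_checkpoint_model([(1e9, 2.1), (4e9, 7.9), (8e9, 15.6)])
print(predict_checkpoint_time(2e9, overhead, bw))       # ~4 seconds
```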