Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers

Huang, Weichen; Fang, Juntao; Wan, Shenggang; Xie, Changsheng; He, Xubin

doi:10.1109/TPDS.2020.3010048

Citation Details

Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers

Data reliability and availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes' and to ensure that RAFI can work properly under large-scale clusters. We use simulation, emulation, and prototyping implementation to evaluate RAFI from multiple aspects. Simulation and prototype results prove the effectiveness and correctness of RAFI, and the performance improvement of the optimization techniques on RAFI is demonstrated by running the emulator. more »

Award ID(s):: 1717660

PAR ID:: 10175431

Author(s) / Creator(s):: Huang, Weichen; Fang, Juntao; Wan, Shenggang; Xie, Changsheng; He, Xubin

Date Published:: 2020-07-01

Journal Name:: IEEE Transactions on Parallel and Distributed Systems

ISSN:: 1045-9219

Page Range / eLocation ID:: 1 to 1

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
https://doi.org/10.1109/TPDS.2020.3010048

More Like this