skip to main content


Title: RAFI: Risk-Aware Failure Identification to Improve the RAS in Erasure-coded Data Centers
Data reliability and availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. Compared to the recovery phase of the data repair, which is widely studied and well optimized, the failure identification phase of the data repair is less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, the RAS can be improved simultaneously. We use both simulations and prototyping implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI on improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.  more » « less
Award ID(s):
1717660 1702474 1547804
NSF-PAR ID:
10065106
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
The USENIX Annual Technical Conference (ATC)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data reliability and availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes' and to ensure that RAFI can work properly under large-scale clusters. We use simulation, emulation, and prototyping implementation to evaluate RAFI from multiple aspects. Simulation and prototype results prove the effectiveness and correctness of RAFI, and the performance improvement of the optimization techniques on RAFI is demonstrated by running the emulator. 
    more » « less
  2. Storage area networks (SANs) are a widely used and dependable solution for data storage. Nevertheless, the occurrence of cascading failures caused by overloading has emerged as a significant risk to the reliability of SANs, impeding the delivery of the desired quality of service to users. This paper makes contributions by proposing both static and dynamic load-triggered redistribution strategies to alleviate the cascading failure risk during the mission time. Two types of node selection rules, respectively based on the load level and node reliability, are studied and compared. Based on the SAN component reliability evaluation using the accelerated failure-time model under the power law, the SAN reliability is evaluated using binary decision diagrams. A detailed case study of a mesh SAN is conducted to compare the performance of different cascading failure mitigation schemes using criteria of SAN reliability improvement ratio and resulting SAN reliability after the mitigation.

     
    more » « less
  3. Accurate prediction of product failures and the need for repair services become critical for various reasons, including understanding the warranty performance of manufacturers, defining cost-efficient repair strategies, and compliance with safety standards. The purpose of this study is to use machine learning tools to analyze several parameters crucial for achieving a robust repair service system, including the number of repairs, the time of the next repair ticket or product failure, and the time to repair. A large dataset of over 530,000 repairs and maintenance of medical devices has been investigated by employing the Support Vector Machine (SVM) tool. SVM with four kernel functions is used to forecast the timing of the next failure or repair request in the system for two different products and two different failure types, namely random failure and physical damage. A frequency analysis is also conducted to explore the product quality level based on product failure and the time to repair it. Besides, the best probability distributions are fitted for the number of failures, the time between failures, and the time to repair. The results reveal the value of data analytics and machine learning tools in analyzing post-market product performance and the cost of repair and maintenance operations. 
    more » « less
  4. Abstract Accurate prediction of product failures and the need for repair services become critical for various reasons, including understanding the warranty performance of manufacturers, defining cost-efficient repair strategies, and compliance with safety standards. The purpose of this study is to use machine learning tools to analyze several parameters crucial for achieving a robust repair service system, including the number of repairs, the time of the next repair ticket or product failure, and the time to repair. A large data set of over 530,000 repairs and maintenance of medical devices has been investigated by employing the Support Vector Machine (SVM) tool. SVM with four kernel functions is used to forecast the timing of the next failure or repair request in the system for two different products and two different failure types, namely, random failure and physical damage. Frequency analysis is also conducted to explore the product quality level based on product failure and the time to repair it. Besides, the best probability distributions are fitted for the failure count, the time between failures, and the time to repair. The results reveal the value of data analytics and machine learning tools in analyzing post-market product performance and the cost of repair and maintenance operations. 
    more » « less
  5. Triple modular redundancy (TMR) is commonly employed to increase the reliability and mean time to failure (MTTF) of a system. This improvement can be shown by using a continuous time Markov chain. However, typical Markov chain models do not model common cause failures (CCF), which is a singular event that simultaneously causes failure in multiple redundant modules. This paper introduces a new Markov chain to model CCF in TMR with repair systems. This new model is compared to the idealized models of TMR with repair without CCF. The fundamental limitations that CCF imposes on the system are shown and discussed. In a motivating example, it is seen that CCF imposes a limitation of 51× on the reliability improvement in a system with TMR and repair compared to a simplex system, (i.e., without TMR). A case study is also presented where the likelihood of CCF is reduced by a factor of 18× using various mitigation techniques. Reducing the CCF compounds the reliability improvement of TMR with repair and leads to a overall system reliability improvement of 10,000× compared to the simplex system as supported by the proposed model. 
    more » « less