Data reliability, availability, and serviceability (RAS) in erasure-coded data centers are highly affected by the data repair induced by node failures. The recovery phase of data repair has been widely studied and well optimized, but the failure identification phase has received less attention. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, which loses opportunities to further improve RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, reliability, availability, and serviceability can be improved simultaneously. We use both simulations and a prototype implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.
Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers
Data reliability, availability, and serviceability (RAS) in erasure-coded data centers are highly affected by the data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, which loses opportunities to further improve RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, reliability, availability, and serviceability can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI works properly in large-scale clusters. We use simulation, emulation, and a prototype implementation to evaluate RAFI from multiple aspects. Simulation and prototype results demonstrate the effectiveness and correctness of RAFI, and emulator runs demonstrate the performance improvement that the optimization techniques bring to RAFI.
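The core idea in the abstract, choosing the identification time threshold from the number of failed chunks in a stripe, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `RafiPolicy`, `timeout_for`, and `chunks_to_repair`, the two-level risk split, and all numeric thresholds are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RafiPolicy:
    """Hypothetical two-level policy; every value here is an assumption."""
    low_risk_timeout_s: float   # longer wait for stripes with few failed chunks
    high_risk_timeout_s: float  # shorter wait for stripes with many failed chunks
    high_risk_at: int           # failed-chunk count at which a stripe is high risk

    def timeout_for(self, failed_chunks: int) -> float:
        """Pick the identification threshold from the stripe's failed-chunk count."""
        if failed_chunks < 1:
            raise ValueError("a stripe needs at least one failed chunk")
        if failed_chunks >= self.high_risk_at:
            return self.high_risk_timeout_s
        return self.low_risk_timeout_s


def chunks_to_repair(policy: RafiPolicy,
                     unavailable_for_s: Sequence[float]) -> List[int]:
    """Return positions of the chunks to identify as failed.

    `unavailable_for_s` holds, for one stripe, how long each currently
    unavailable chunk has been unreachable (seconds).
    """
    if not unavailable_for_s:
        return []
    timeout = policy.timeout_for(len(unavailable_for_s))
    return [i for i, t in enumerate(unavailable_for_s) if t >= timeout]


# Assumed thresholds: 30 minutes for low-risk stripes, 1 minute for high-risk ones.
policy = RafiPolicy(low_risk_timeout_s=1800.0,
                    high_risk_timeout_s=60.0,
                    high_risk_at=2)
```

With these assumed values, a stripe with a single chunk unreachable for 120 seconds triggers no repair yet, while the same chunk is identified as failed as soon as a second chunk in its stripe also becomes unreachable. A traditional scheme corresponds to setting both timeouts to the same value.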
- Award ID(s): 1717660
- PAR ID: 10175431
- Date Published:
- Journal Name: IEEE Transactions on Parallel and Distributed Systems
- ISSN: 1045-9219
- Page Range / eLocation ID: 1 to 1
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Triple modular redundancy (TMR) is commonly employed to increase the reliability and mean time to failure (MTTF) of a system. This improvement can be shown by using a continuous-time Markov chain. However, typical Markov chain models do not model common cause failures (CCF), which are single events that simultaneously cause failure in multiple redundant modules. This paper introduces a new Markov chain to model CCF in TMR with repair systems. This new model is compared to the idealized models of TMR with repair without CCF. The fundamental limitations that CCF imposes on the system are shown and discussed. In a motivating example, CCF is seen to limit the reliability improvement of a system with TMR and repair to 51× that of a simplex system (i.e., one without TMR). A case study is also presented in which the likelihood of CCF is reduced by a factor of 18× using various mitigation techniques. Reducing the CCF compounds the reliability improvement of TMR with repair and leads to an overall system reliability improvement of 10,000× compared to the simplex system, as supported by the proposed model.
- Reliability enhancement of microgrids is challenged by environmental and operational failures. Centrally controlled microgrids are highly susceptible to failure because of a single point of failure, e.g., the central controller. True decentralization of the microgrid architecture entails eliminating the central controller, attaining a parallel configuration for the system. In this paper, a decentralized microgrid control architecture is proposed as a solution for reliability degradation over time, and the reliability aspects of centralized and decentralized control architectures for microgrids are analyzed. The degree of importance of a single controller in centralized and decentralized architectures is determined and validated by Markov Chain Models (MCM). Results confirm that higher reliability is achieved when true decentralization of the control architecture is adopted. Challenges of implementing a truly decentralized control architecture are discussed. Hardware-In-the-Loop simulation results for microgrid controller failure scenarios for both architectures are presented and discussed.
- Storage area networks (SANs) are a widely used and dependable solution for data storage. Nevertheless, cascading failures caused by overloading have emerged as a significant risk to the reliability of SANs, impeding the delivery of the desired quality of service to users. This paper proposes both static and dynamic load-triggered redistribution strategies to alleviate the cascading failure risk during the mission time. Two types of node selection rules, based respectively on the load level and on node reliability, are studied and compared. Based on the SAN component reliability evaluation using the accelerated failure-time model under the power law, the SAN reliability is evaluated using binary decision diagrams. A detailed case study of a mesh SAN is conducted to compare the performance of different cascading failure mitigation schemes, using the SAN reliability improvement ratio and the resulting SAN reliability after mitigation as criteria.
- Accurate prediction of product failures and of the need for repair services has become critical for various reasons, including understanding the warranty performance of manufacturers, defining cost-efficient repair strategies, and complying with safety standards. The purpose of this study is to use machine learning tools to analyze several parameters crucial for achieving a robust repair service system, including the number of repairs, the time of the next repair ticket or product failure, and the time to repair. A large dataset of over 530,000 repair and maintenance records for medical devices has been investigated by employing the Support Vector Machine (SVM) tool. SVM with four kernel functions is used to forecast the timing of the next failure or repair request in the system for two different products and two different failure types, namely random failure and physical damage. A frequency analysis is also conducted to explore the product quality level based on product failure and the time to repair it. In addition, the best probability distributions are fitted for the number of failures, the time between failures, and the time to repair. The results reveal the value of data analytics and machine learning tools in analyzing post-market product performance and the cost of repair and maintenance operations.