The reliability, availability, and serviceability (RAS) of erasure-coded data centers are highly affected by the data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, which forgoes opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, all aspects of RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI works properly in large-scale clusters. We use simulation, emulation, and a prototype implementation to evaluate RAFI from multiple aspects. Simulation and prototype results demonstrate the effectiveness and correctness of RAFI, and emulator runs demonstrate the performance improvement that the optimization techniques bring to RAFI.
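The risk-aware idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not RAFI's actual policy: the function name, the linear interpolation, and the two timeout values are all hypothetical, since the abstract does not say how the thresholds are chosen.

```python
def identification_threshold(failed_chunks, parity_chunks,
                             long_timeout_s=1800.0, short_timeout_s=60.0):
    """Return a hypothetical failure-identification timeout for one stripe.

    failed_chunks: chunks in the stripe that are currently unresponsive.
    parity_chunks: number of chunk losses the erasure code can tolerate.
    The timeout values and the linear rule are assumptions for illustration.
    """
    if failed_chunks < 1:
        raise ValueError("a stripe needs at least one failed chunk to be timed")
    # Highest risk: one more loss would make the stripe unrecoverable.
    if parity_chunks <= 1 or failed_chunks >= parity_chunks:
        return short_timeout_s
    # Interpolate linearly between the low-risk and high-risk timeouts.
    frac = (failed_chunks - 1) / (parity_chunks - 1)
    return long_timeout_s + frac * (short_timeout_s - long_timeout_s)


if __name__ == "__main__":
    # Example: a code that tolerates three chunk losses per stripe.
    for failed in (1, 2, 3):
        print(failed, identification_threshold(failed, parity_chunks=3))
```

A stripe with a single unresponsive chunk waits longest before repair starts, which avoids repair traffic for transient failures, while a stripe close to data loss is repaired almost immediately.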
Modeling Common Cause Failures in Systems with Triple Modular Redundancy and Repair
Triple modular redundancy (TMR) is commonly employed to increase the reliability and mean time to failure (MTTF) of a system. This improvement can be shown by using a continuous-time Markov chain. However, typical Markov chain models do not account for a common cause failure (CCF), a single event that simultaneously causes failure in multiple redundant modules. This paper introduces a new Markov chain to model CCF in TMR systems with repair. The new model is compared to idealized models of TMR with repair that omit CCF. The fundamental limitations that CCF imposes on the system are shown and discussed. In a motivating example, CCF limits the reliability improvement of a system with TMR and repair to 51× that of a simplex system (i.e., one without TMR). A case study is also presented in which the likelihood of CCF is reduced by a factor of 18× using various mitigation techniques. Reducing the CCF compounds the reliability improvement of TMR with repair and leads to an overall system reliability improvement of 10,000× compared to the simplex system, as supported by the proposed model.
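The limiting effect of CCF can be illustrated with a small continuous-time Markov chain. The sketch below is not the paper's model: it assumes three states (three healthy modules, two healthy modules with one under repair, system failed), a per-module failure rate `lam`, a repair rate `mu`, and a hypothetical CCF rate `ccf` that takes the system directly to the failed state from either working state. It uses MTTF as the figure of merit and assumes the simplex system has MTTF `1/lam`.

```python
def tmr_mttf(lam, mu, ccf=0.0):
    """MTTF of an assumed TMR-with-repair chain, starting with 3 healthy modules.

    Solves the two mean-time-to-absorption equations
        T3 = 1/a + (3*lam/a) * T2
        T2 = 1/b + (mu/b)    * T3
    where a and b are the total exit rates of the two working states.
    """
    a = 3.0 * lam + ccf         # exit rate from the 3-healthy state
    b = 2.0 * lam + mu + ccf    # exit rate from the 2-healthy state
    return (b + 3.0 * lam) / (a * b - 3.0 * lam * mu)


def improvement_over_simplex(lam, mu, ccf=0.0):
    """Ratio of the TMR MTTF to the assumed simplex MTTF of 1/lam."""
    return tmr_mttf(lam, mu, ccf) * lam


if __name__ == "__main__":
    lam, mu = 1e-3, 1e-1  # illustrative rates per hour, not from the paper
    for ccf in (0.0, 1e-6, 1e-5):
        print(ccf, improvement_over_simplex(lam, mu, ccf))
```

With `ccf = 0` the expression reduces to the textbook result `(5*lam + mu) / (6*lam**2)`, which grows without bound as repair gets faster. With any `ccf > 0` the MTTF in this sketch stays below `1/ccf` no matter how large `mu` is, so the improvement over simplex is capped at `lam/ccf`.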
- Award ID(s):
- 1738550
- NSF-PAR ID:
- 10183426
- Date Published:
- Journal Name:
- 2020 Annual Reliability and Maintainability Symposium (RAMS)
- Page Range / eLocation ID:
- 1 to 6
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Reliability enhancement of microgrids is challenged by environmental and operational failures. Centrally controlled microgrids are highly susceptible to failure because of a single point of failure, e.g., the central controller. True decentralization of the microgrid architecture entails eliminating the central controller, giving the system a parallel configuration. In this paper, a decentralized microgrid control architecture is proposed as a solution for reliability degradation over time, and the reliability aspects of centralized and decentralized control architectures for microgrids are analyzed. The degree of importance of a single controller in centralized and decentralized architectures is determined and validated by Markov Chain Models (MCM). Results confirm that higher reliability is achieved when true decentralization of the control architecture is adopted. Challenges of implementing a truly decentralized control architecture are discussed. Hardware-In-the-Loop simulation results for microgrid controller failure scenarios for both architectures are presented and discussed.
-
In this work, we consider the problem of mode clustering in Markov jump models. This model class consists of multiple dynamical modes with a switching sequence that determines how the system switches between them over time. Under different active modes, the observations can have different characteristics. Given only the observations and without knowing the mode sequence, the goal is to cluster the modes based on their transition distributions in the Markov chain to find a reduced-rank Markov matrix that is embedded in the original Markov chain. Our approach involves mode sequence estimation, mode clustering, and reduced-rank model estimation, where mode clustering is achieved by applying the singular value decomposition and k-means. We show that, under certain conditions, the clustering error can be bounded, and the reduced-rank Markov chain is a good approximation to the original Markov chain. Through simulations, we show the efficacy of our approach and its application to real-world scenarios. Index Terms: Switched model, Markov chain, clustering
-
The reliability, availability, and serviceability (RAS) of erasure-coded data centers are highly affected by the data repair induced by node failures. Compared to the recovery phase of data repair, which is widely studied and well optimized, the failure identification phase is less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, which forgoes opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, all aspects of RAS can be improved simultaneously. We use both simulations and a prototype implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.
-
This paper presents a new recursive Hybrid consensus filter for distributed state estimation on a Hidden Markov Model (HMM), which is well suited to multirobot applications and settings. The proposed algorithm is scalable, robust to network failure and capable of handling non-Gaussian transition and observation models and is, therefore, quite general. No global knowledge of the communication network is assumed. Iterative Conservative Fusion (ICF) is used to reach consensus over potentially correlated priors, while consensus over likelihoods is handled using weights based on a Metropolis-Hastings Markov Chain (MHMC). The proposed method is evaluated in a multi-agent tracking problem and a high-dimensional HMM and it is shown that its performance surpasses the competing algorithms.