Title: Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers
Data reliability, availability, and serviceability (RAS) of erasure-coded data centers are strongly affected by the data repairs that node failures induce. In a traditional failure identification scheme, all chunks share the same identification time threshold, which forfeits opportunities to further improve RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe, a shorter identification time is adopted, improving overall data reliability and availability. For chunks in a low-risk stripe, a longer identification time is adopted, reducing repair network traffic. Therefore, all three aspects of RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI works properly in large-scale clusters. We use simulation, emulation, and a prototype implementation to evaluate RAFI from multiple aspects. Simulation and prototype results demonstrate the effectiveness and correctness of RAFI, and emulation demonstrates the performance improvement delivered by the optimization techniques.
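The core idea of risk-aware identification can be illustrated with a short sketch: the timeout used to declare an unresponsive chunk failed shrinks as the number of already-failed chunks in its stripe grows. The following is a minimal sketch of that idea only; the Stripe abstraction, threshold values, and method names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of risk-aware failure identification: the timeout used to
# declare a suspected chunk failed depends on how many chunks in its stripe
# have already failed. All names and threshold values are illustrative.
from dataclasses import dataclass, field

# Illustrative thresholds (seconds), keyed by the number of chunks already
# failed in the stripe; a traditional scheme would use one fixed value.
RISK_THRESHOLDS = {0: 900.0, 1: 300.0, 2: 30.0}

@dataclass
class Stripe:
    chunk_ids: list
    failed: set = field(default_factory=set)                 # identified failures
    unresponsive_since: dict = field(default_factory=dict)   # chunk -> timestamp

    def threshold(self) -> float:
        """Pick the identification timeout from the stripe's current risk."""
        risk = len(self.failed)
        return RISK_THRESHOLDS.get(risk, min(RISK_THRESHOLDS.values()))

    def identify(self, now: float) -> list:
        """Declare failed any suspected chunk whose unresponsive time exceeds
        the risk-dependent threshold; return the newly identified chunks."""
        newly_failed = []
        for chunk, since in list(self.unresponsive_since.items()):
            if now - since >= self.threshold():  # threshold tightens as risk grows
                self.failed.add(chunk)
                del self.unresponsive_since[chunk]
                newly_failed.append(chunk)
        return newly_failed
```

Under these illustrative values, a stripe that has already lost two chunks identifies a third suspected chunk after 30 seconds, while a healthy stripe waits 900 seconds, trading identification latency against unnecessary repair traffic.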
Award ID(s):
1717660
NSF-PAR ID:
10175431
Date Published:
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
ISSN:
1045-9219
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data reliability, availability, and serviceability (RAS) of erasure-coded data centers are strongly affected by the data repairs that node failures induce. Compared to the recovery phase of data repair, which is widely studied and well optimized, the failure identification phase of data repair is far less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, which forfeits opportunities to further improve RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe (one with many failed chunks), a shorter identification time is adopted, improving overall data reliability and availability. For chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, reducing repair network traffic. Therefore, RAS can be improved simultaneously. We use both simulation and a prototype implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.
  2. Optical network failure management (ONFM) is a promising application of machine learning (ML) to optical networking. Typical ML-based ONFM approaches exploit historical monitored data, retrieved in a specific domain (e.g., a link or a network), to train supervised ML models and learn failure characteristics (a signature) that will be helpful upon future failure occurrences in that domain. Unfortunately, in operational networks, data availability often constitutes a practical limitation to the deployment of ML-based ONFM solutions, owing to the scarcity of labeled data comprehensively covering all possible failure types. One could purposely inject failures to collect training data, but this is time-consuming and undesirable for operators. A possible solution is transfer learning (TL), i.e., training ML models on a source domain (SD), e.g., a laboratory testbed, and then deploying the trained models on a target domain (TD), e.g., an operator network, possibly fine-tuning the learned models by re-training with a few TD data points. Moreover, in cases where TL re-training is not successful (e.g., due to intrinsic differences between the SD and TD), another solution is domain adaptation, which consists of combining unlabeled SD and TD data before model training. We investigate domain adaptation and TL for failure detection and failure-cause identification across different lightpaths, leveraging real optical SNR data. We find that, for the considered scenarios, up to 20 percentage points of accuracy increase can be obtained with domain adaptation for failure detection, while for failure-cause identification, only combining domain adaptation with model re-training provides significant benefit, reaching a 4-5 percentage point accuracy increase in the considered cases.
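As a concrete illustration of the two strategies compared above, the sketch below trains a classifier on source-domain data and fine-tunes it with a few target-domain labels (transfer learning), and separately aligns feature statistics using unlabeled data pooled from both domains (one simple form of domain adaptation). The synthetic SNR-like features, the MLP model, and the joint-standardization step are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of TL re-training vs. a simple form of domain adaptation,
# on synthetic stand-ins for monitored optical SNR features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_sd = rng.normal(0.0, 1.0, (500, 8));  y_sd = (X_sd[:, 0] > 0).astype(int)
X_td = rng.normal(0.5, 1.2, (60, 8));   y_td = (X_td[:, 0] > 0.5).astype(int)

# Transfer learning: train on the source domain, then re-train (fine-tune)
# with the few labeled target-domain samples available.
clf_tl = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
clf_tl.fit(X_sd, y_sd)
clf_tl.partial_fit(X_td[:20], y_td[:20])

# Domain adaptation (one simple form): align feature statistics using
# *unlabeled* data pooled from both domains before training on SD labels.
scaler = StandardScaler().fit(np.vstack([X_sd, X_td]))
clf_da = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
clf_da.fit(scaler.transform(X_sd), y_sd)

print("TL accuracy on held-out TD:", clf_tl.score(X_td[20:], y_td[20:]))
print("DA accuracy on held-out TD:",
      clf_da.score(scaler.transform(X_td[20:]), y_td[20:]))
```

Note that the domain-adaptation branch never touches target-domain labels, which is exactly what makes it attractive when labeled TD data are scarce.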

     
  3. Many industrial products consist of multiple components that are necessary for system operation. There is an abundance of literature on modeling the lifetimes of such components through competing risks models. During the life-cycle of a product, it is common for there to be incremental design changes to improve reliability, to reduce costs, or to respond to changes in the availability of certain part numbers. These changes can affect product reliability but are often ignored in system lifetime modeling. By incorporating information about changes in part numbers over time, which is readily available in most production databases, better accuracy can be achieved in predicting time to failure, yielding more accurate field-failure predictions. This paper presents methods for estimating parameters and predictions for this generational model and a comparison with existing methods through the use of simulation. Our results indicate that the generational model has important practical advantages and outperforms existing methods in predicting field failures. Copyright © 2016 John Wiley & Sons, Ltd.
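To make the generational idea concrete, the sketch below simulates component lifetimes whose Weibull scale parameter depends on the part-number generation, then recovers the generation-specific parameters by maximum likelihood. The distributional form, the two-generation setup, and the omission of censoring are simplifying assumptions for illustration, not the paper's estimation procedure.

```python
# Hedged sketch: lifetimes are Weibull with a generation-dependent scale;
# fit shape and per-generation scales by maximizing the log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
gen = rng.integers(0, 2, 400)                 # 0 = old part number, 1 = redesign
scale_true = np.where(gen == 1, 1500.0, 1000.0)
t = scale_true * rng.weibull(2.0, 400)        # failure times, true shape = 2

def neg_loglik(params):
    beta, eta0, eta1 = params                 # shape + one scale per generation
    eta = np.where(gen == 1, eta1, eta0)
    z = t / eta
    # Weibull log-density: log(beta/eta) + (beta-1) log(t/eta) - (t/eta)^beta
    return -np.sum(np.log(beta / eta) + (beta - 1) * np.log(z) - z**beta)

fit = minimize(neg_loglik, x0=[1.0, 800.0, 800.0],
               bounds=[(0.1, 10), (1, 1e5), (1, 1e5)])
print("shape, scale(gen 0), scale(gen 1):", fit.x)
```

A field-failure prediction would then weight each generation's fitted distribution by the mix of part numbers actually shipped, which is the information the production database supplies.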

     
  4. Triple modular redundancy (TMR) is commonly employed to increase the reliability and mean time to failure (MTTF) of a system. This improvement can be shown using a continuous-time Markov chain. However, typical Markov chain models do not capture common cause failures (CCF): single events that simultaneously cause failure in multiple redundant modules. This paper introduces a new Markov chain to model CCF in TMR systems with repair. This new model is compared to the idealized models of TMR with repair but without CCF. The fundamental limitations that CCF imposes on the system are shown and discussed. In a motivating example, CCF caps the reliability improvement of a system with TMR and repair at 51× over a simplex system (i.e., one without TMR). A case study is also presented in which the likelihood of CCF is reduced by a factor of 18× using various mitigation techniques. Reducing CCF compounds the reliability improvement of TMR with repair and leads to an overall system reliability improvement of 10,000× over the simplex system, as supported by the proposed model.
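The effect described above can be reproduced with a small continuous-time Markov chain. The sketch below uses two transient states (all three modules up; one module down and under repair) and an absorbing system-failure state reachable either by a second module failure or directly by CCF, then solves the standard first-step equations Qm = -1 for the mean time to absorption. All rates are illustrative assumptions, not the paper's numbers.

```python
# Minimal CTMC sketch of TMR with repair and a common-cause failure rate.
import numpy as np

lam, mu, lam_ccf = 1e-4, 1e-1, 1e-6  # per-hour module failure, repair, CCF rates

# Transient states: 0 = all three modules up, 1 = one module down (repairing).
# The absorbing failed state is reached by a second module failure or by CCF.
Q = np.array([
    [-(3 * lam + lam_ccf),            3 * lam],
    [mu,                    -(mu + 2 * lam + lam_ccf)],
])

# Mean time to absorption m solves Q m = -1 (first-step analysis).
mttf = np.linalg.solve(Q, -np.ones(2))
print(f"MTTF with CCF:    {mttf[0]:,.0f} hours")

# Idealized TMR with repair, no CCF, for comparison.
Q0 = np.array([[-3 * lam, 3 * lam], [mu, -(mu + 2 * lam)]])
print(f"MTTF without CCF: {np.linalg.solve(Q0, -np.ones(2))[0]:,.0f} hours")
```

Even with a CCF rate two orders of magnitude below the module failure rate, the with-CCF MTTF is pinned near 1/λ_CCF, which is the kind of hard ceiling on TMR's benefit that the paper quantifies.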
  5. Abstract

    In clinical research and practice, landmark models are commonly used to predict the risk of an adverse future event, using patients' longitudinal biomarker data as predictors. However, these data are often observable only at intermittent visits, making their measurement times irregularly spaced and unsynchronized across subjects. This poses challenges to conducting dynamic prediction at any post-baseline time. A simple solution is the last-value-carried-forward method, but this may bias both risk model estimation and prediction. Another option is to jointly model the longitudinal and survival processes with a shared random effects model. However, when dealing with multiple biomarkers, this approach often results in high-dimensional integrals without a closed-form solution, and the resulting computational burden limits its software development and practical use. In this article, we propose to process the longitudinal data with functional principal component analysis techniques and then use the processed information as predictors in a class of flexible linear transformation models to predict the distribution of the residual time to event. The measurement schemes for multiple biomarkers are allowed to differ within and across subjects. Dynamic prediction can be performed in real time. The advantages of our proposed method are demonstrated by simulation studies. We apply our approach to the African American Study of Kidney Disease and Hypertension, predicting patients' risk of kidney failure or death using four important longitudinal biomarkers for renal function.
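The two-stage structure of this approach can be sketched as follows: extract functional principal component (FPC) scores from longitudinal biomarker curves, then use those scores as ordinary covariates in a survival model. The sketch below simulates dense, regularly sampled curves and uses a Cox model (one member of the linear transformation family); the paper's method additionally handles sparse, irregular measurement schemes, so everything here is an illustrative simplification.

```python
# Hedged two-stage sketch: FPCA on biomarker curves, then FPC scores as
# covariates in a survival model. Dense regular curves are simulated for
# simplicity; the paper's FPCA handles sparse, irregular visits.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n, grid = 200, np.linspace(0, 1, 25)

# Stage 0: simulate trajectories = mean curve + a random slope per subject.
slopes = rng.normal(0, 1, n)
curves = 1.0 + np.outer(slopes, grid) + rng.normal(0, 0.1, (n, len(grid)))

# Stage 1: FPCA via SVD of the centered curves; keep the leading FPC scores.
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U[:, :2] * s[:2]                     # subject-level FPC scores

# Stage 2: FPC scores enter the survival model as ordinary covariates.
hazard = np.exp(0.8 * scores[:, 0])           # true effect of the first FPC
time_to_event = rng.exponential(1.0 / hazard)
censor = rng.exponential(2.0, n)
df = pd.DataFrame({"T": np.minimum(time_to_event, censor),
                   "E": (time_to_event <= censor).astype(int),
                   "fpc1": scores[:, 0], "fpc2": scores[:, 1]})
cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
print(cph.summary[["coef", "p"]])
```

Because the biomarker information is compressed into a few scalar scores per subject, refitting or re-predicting at a new landmark time stays cheap, which is what makes real-time dynamic prediction feasible.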

     