Title: RAFI: Risk-Aware Failure Identification to Improve the RAS in Erasure-coded Data Centers
The reliability, availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. Compared to the recovery phase of data repair, which is widely studied and well optimized, the failure identification phase is less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, which loses opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, all three aspects of the RAS can be improved simultaneously. We use both simulations and a prototype implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.
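The per-stripe threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two timeout values, the cutoff of two unavailable chunks for a "high-risk" stripe, and the names `Stripe`, `threshold_for`, and `identify_failed_chunks` are all assumptions made for illustration, since the abstract gives no concrete numbers or interfaces.

```python
from dataclasses import dataclass

# Hypothetical values in seconds; the abstract does not state concrete thresholds.
LOW_RISK_TIMEOUT = 1800.0
HIGH_RISK_TIMEOUT = 300.0
# Assumed cutoff: a stripe with this many unavailable chunks counts as high risk.
HIGH_RISK_MIN_UNAVAILABLE = 2


@dataclass
class Stripe:
    stripe_id: str
    # chunk id -> seconds the chunk has been unreachable (0.0 if reachable)
    unavailable_for: dict


def threshold_for(stripe):
    """Pick the identification time threshold from the stripe's current risk."""
    unavailable = sum(1 for t in stripe.unavailable_for.values() if t > 0)
    if unavailable >= HIGH_RISK_MIN_UNAVAILABLE:
        return HIGH_RISK_TIMEOUT  # high risk: identify failures sooner
    return LOW_RISK_TIMEOUT       # low risk: wait longer, avoid needless repair


def identify_failed_chunks(stripe):
    """Return the chunks whose unavailability has reached the stripe's threshold."""
    limit = threshold_for(stripe)
    return sorted(c for c, t in stripe.unavailable_for.items() if t >= limit)


low = Stripe("s1", {"c0": 600.0, "c1": 0.0, "c2": 0.0})
high = Stripe("s2", {"c0": 600.0, "c1": 400.0, "c2": 0.0})
print(identify_failed_chunks(low))   # one unavailable chunk: long threshold applies
print(identify_failed_chunks(high))  # two unavailable chunks: short threshold applies
```

Under these assumed values, a chunk that has been unreachable for 600 seconds is not yet declared failed in the low-risk stripe, so no repair traffic is triggered, while the same outage in the high-risk stripe is identified immediately.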
Award ID(s):
1717660 1702474 1547804
NSF-PAR ID:
10065106
Journal Name:
The USENIX Annual Technical Conference (ATC)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The reliability, availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, which loses opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, all three aspects of the RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI works properly in large-scale clusters. We use simulation, emulation, and a prototype implementation to evaluate RAFI from multiple aspects. Simulation and prototype results demonstrate the effectiveness and correctness of RAFI, and emulator runs demonstrate the performance improvement that the optimization techniques bring to RAFI.
  2. Optical network failure management (ONFM) is a promising application of machine learning (ML) to optical networking. Typical ML-based ONFM approaches exploit historical monitored data, retrieved in a specific domain (e.g., a link or a network), to train supervised ML models and learn failure characteristics (a signature) that will be helpful upon future failure occurrence in that domain. Unfortunately, in operational networks, data availability often constitutes a practical limitation to the deployment of ML-based ONFM solutions, due to scarce availability of labeled data comprehensively modeling all possible failure types. One could purposely inject failures to collect training data, but this is time consuming and not desirable by operators. A possible solution is transfer learning (TL), i.e., training ML models on a source domain (SD), e.g., a laboratory testbed, and then deploying trained models on a target domain (TD), e.g., an operator network, possibly fine-tuning the learned models by re-training with few TD data. Moreover, in those cases when TL re-training is not successful (e.g., due to the intrinsic difference in SD and TD), another solution is domain adaptation, which consists of combining unlabeled SD and TD data before model training. We investigate domain adaptation and TL for failure detection and failure-cause identification across different lightpaths leveraging real optical SNR data. We find that for the considered scenarios, up to 20% points of accuracy increase can be obtained with domain adaptation for failure detection, while for failure-cause identification, only combining domain adaptation with model re-training provides significant benefit, reaching 4%–5% points of accuracy increase in the considered cases.

     
  3. Millions of Californians access drinking water via domestic wells, which are vulnerable to drought and unsustainable groundwater management. Groundwater overdraft and the possibility of longer drought duration under climate change threatens domestic well reliability, yet we lack tools to assess the impact of such events. Here, we leverage 943 469 well completion reports and 20 years of groundwater elevation data to develop a spatially-explicit domestic well failure model covering California’s Central Valley. Our model successfully reproduces the spatial distribution of observed domestic well failures during the severe 2012–2016 drought (n = 2027). Next, the impact of longer drought duration (5–8 years) on domestic well failure is evaluated, indicating that if the 2012–2016 drought would have continued into a 6 to 8 year long drought, a total of 4037–5460 to 6538–8056 wells would fail. The same drought duration scenarios with an intervening wet winter in 2017 lead to an average of 498 and 738 fewer well failures. Additionally, we map vulnerable wells at high failure risk and find that they align with clusters of predicted well failures. Lastly, we evaluate how the timing and implementation of different projected groundwater management regimes impact groundwater levels and thus domestic well failure. When historic overdraft persists until 2040, domestic well failures range from 5966 to 10 466 (depending on the historic period considered). When sustainability is achieved progressively between 2020 and 2040, well failures range from 3677 to 6943, and from 1516 to 2513 when groundwater is not allowed to decline after 2020.
  4. Smart grids can be vulnerable to attacks and accidents, and any initial failures in smart grids can grow to a large blackout because of cascading failure. Because of the importance of smart grids in modern society, it is crucial to protect them against cascading failures. Simulation of cascading failures can help identify the most vulnerable transmission lines and guide prioritization in protection planning; hence, it is an effective approach to protecting smart grids from cascading failures. However, due to the enormous number of ways that smart grids may fail initially, it is infeasible to simulate cascading failures at a large scale or to identify the most vulnerable lines efficiently. In this paper, we aim at 1) developing a method to run cascading failure simulations at scale and 2) building simplified, diffusion-based cascading failure models to support efficient and theoretically bounded identification of the most vulnerable lines. The goals are achieved by first constructing a novel connection between cascading failures and natural languages, and then adapting the powerful transformer model in NLP to learn from cascading failure data. Our trained transformer models have good accuracy in predicting the total number of failed lines in a cascade and identifying the most vulnerable lines. We also constructed independent cascade (IC) diffusion models based on the attention matrices of the transformer models, to support efficient vulnerability analysis with performance bounds.
  5.
    The objective of International Ocean Discovery Program (IODP) Expedition 384 was to carry out engineering tests with the goal of improving the chances of success in deep (>1 km) drilling and coring in igneous ocean crust. A wide range of tools and technologies for potential testing were proposed by the Deep Crustal Drilling Engineering Working Group in 2017 based on reports from recent crustal drilling expeditions. The JOIDES Resolution Facility Board further prioritized the testing opportunities in 2018. The top priority of all recommendations was an evaluation of drilling and coring bits because rate of penetration and bit wear and tear are the prevalent issue in deep crustal drilling attempts, and bit failures often require an excessive amount of fishing and hole cleaning time. The plan included drilling in basalt with three different types of drill bits: a tungsten carbide insert (TCI) tricone bit, a polycrystalline diamond compact (PDC) bit, and a more novel TCI/PDC hybrid bit. In addition, a TCI bit was to be paired with an underreamer with expanding cutter blocks instead of extending arms. Finally, a type of rotary core barrel (RCB) PDC coring bit that was acquired for the R/V JOIDES Resolution several years ago but never deployed would also be given a test run. A second objective was added when additional operating time became available for Expedition 384 as a result of the latest schedule changes. This objective included the assessment and potential improvement of current procedures for advanced piston corer (APC) core orientation. Expedition 384 began in Kristiansand, Norway, on 20 July 2020. The location for tests was based on various factors, including the JOIDES Resolution's location at the time, our inability to obtain territorial clearance in a short period of time, and a suitable combination of sediment and igneous rock for the drilling and coring operations. 
IODP Expedition 395, which was postponed due to the COVID-19 pandemic, had proposed sites that were suitable for our testing and offered the opportunity to carry out some serendipitous sampling, logging, and casing work for science. We first spent 3 days triple coring the top 70 m of sediment at Site U1554 (Proposed Site REYK-6A) to obtain cores for evaluating potential problems with the magnetic core orientation tools and for assessing other potential sources of errors that might explain prior anomalous core orientation results. Comparison of the observed core orientation from magnetic orientation tools to the expected orientation based on the paleomagnetic directions recorded in the cores revealed a 180° misalignment in the assembly of one of the tools. This misalignment appears to have persisted over several years and could explain most of the problems previously noted. The assembly part was fixed, and this problem was eliminated for future expeditions. We subsequently spent 20 days at Site U1555 (Proposed Site REYK-13A) to test the three types of drill bits, an underreamer, and a coring bit in six holes. The TCI bits were the best performers, the TCI/PDC hybrid bit did not stand up to the harsh formation, and the PDC bit did not get sufficient run time because of a mud motor failure. The cutter block underreamer is not considered able to perform major hole opening in basalt but could be useful for knocking out ledges. The PDC coring bit cut good quality basalt cores at an unacceptably low rate. In the seventh and final hole (U1555G), we used a regular RCB coring bit to recover the entire 130 m basalt section specified in the Expedition 395 Scientific Prospectus and provided the project team with shipboard data and samples. The basalt section was successfully wireline logged before the logging winch motor failed, which precluded further operations for safety reasons. 
Additional operations plans in support of Expedition 395, including coring, logging, and casing at Site U1554, had to be canceled, and Expedition 384 ended prematurely on 24 August in Kristiansand. 