RAFI: Risk-Aware Failure Identification to Improve the RAS in Erasure-coded Data Centers
The reliability, availability, and serviceability (RAS) of erasure-coded data centers are strongly affected by the data repair that node failures induce. The recovery phase of data repair has been widely studied and well optimized, but the failure identification phase has received less attention. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, which forgoes opportunities to improve the RAS further.
To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures are identified using different time thresholds, depending on how many failed chunks their stripe is experiencing. For chunks in a high-risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, improving overall data reliability and availability. For chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, reducing repair network traffic. As a result, all aspects of the RAS can be improved simultaneously.
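The idea of stripe-dependent thresholds can be illustrated with a minimal sketch. It is not the RAFI implementation: the `Stripe` class, the function names, and the threshold values (1800 s, 600 s, 60 s) are hypothetical, and the sketch assumes that a stripe's risk is measured simply by counting its currently unresponsive chunks.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stripe:
    # Hypothetical bookkeeping: chunk index -> time (seconds) at which
    # that chunk's node was last heard from. Only unresponsive chunks appear.
    lost_since: Dict[int, float] = field(default_factory=dict)


def identification_threshold(num_lost: int, thresholds: List[float]) -> float:
    """Pick the timeout for a stripe with `num_lost` unresponsive chunks.

    `thresholds[k - 1]` is the timeout used when k chunks are unresponsive;
    counts beyond the end of the list reuse the last (shortest) timeout.
    """
    index = min(num_lost, len(thresholds)) - 1
    return thresholds[index]


def identify_failures(stripe: Stripe, now: float, thresholds: List[float]) -> List[int]:
    """Return the chunk indices that should be declared failed at time `now`."""
    num_lost = len(stripe.lost_since)
    if num_lost == 0:
        return []
    timeout = identification_threshold(num_lost, thresholds)
    # A chunk is identified as failed once it has been silent for at
    # least the timeout that matches its stripe's current risk level.
    return sorted(
        chunk for chunk, since in stripe.lost_since.items()
        if now - since >= timeout
    )


# Illustrative values only: longer timeout for low-risk stripes,
# shorter timeout as more chunks in the same stripe go silent.
EXAMPLE_THRESHOLDS = [1800.0, 600.0, 60.0]
```

With these illustrative values, a stripe with a single silent chunk waits the full 1800 s before repair is triggered, which avoids repair traffic when the node returns quickly. Once a second chunk in the same stripe goes silent, the timeout drops to 600 s, so the chunk that has already been silent longer than that is identified immediately.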
We evaluate RAFI using both simulations and a prototype implementation. Results from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate …
- Publication Date:
- NSF-PAR ID: 10065106
- Journal Name: The USENIX Annual Technical Conference (ATC)
- Sponsoring Org: National Science Foundation