skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
When a failure occurs in production systems, the highest priority is to quickly mitigate it. Despite its importance, failure mitigation is done in a reactive and ad-hoc way: taking some fixed actions only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides smart mitigation actions. The goal is to avert VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances the adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. It on average reduces VM interruptions by 26% compared to the previous static strategy.  more » « less
Award ID(s):
1942794
PAR ID:
10227105
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Large-scale cloud services deploy hundreds of configuration changes to production systems daily. At such velocity, con- figuration changes have inevitably become prevalent causes of production failures. Existing misconfiguration detection and configuration validation techniques only check configu- ration values. These techniques cannot detect common types of failure-inducing configuration changes, such as those that cause code to fail or those that violate hidden constraints. We present ctests, a new type of tests for detecting failure- inducing configuration changes to prevent production failures. The idea behind ctests is simple—connecting production sys- tem configurations to software tests so that configuration changes can be tested in the context of code affected by the changes. So, ctests can detect configuration changes that ex- pose dormant software bugs and diverse misconfigurations. We show how to generate ctests by transforming the many existing tests in mature systems. The key challenge that we address is the automated identification of test logic and oracles that can be reused in ctests. We generated thousands of ctests from the existing tests in five cloud systems. Our results show that ctests are effective in detecting failure-inducing configuration changes before deployment. We evaluate ctests on real-world failure-inducing configura- tion changes, injected misconfigurations, and deployed con- figuration files from public Docker images. Ctests effectively detect real-world failure-inducing configuration changes and misconfigurations in the deployed files. 
    more » « less
  2. Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs can detect 20 cases with a median detection time of 4.2 seconds, and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper. 
    more » « less
  3. Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by some unusual faulty events. While fault injection testing becomes popular, existing solutions are designed for bug finding. They are ineffective and inefficient to reproduce a specific failure during debugging. We explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, Anduril, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate Anduril on 22 real-world complex fault-induced failures from five large-scale distributed systems. Anduril reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes. 
    more » « less
  4. null (Ed.)
    Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead. 
    more » « less
  5. Storage area networks (SANs) are a widely used and dependable solution for data storage. Nevertheless, the occurrence of cascading failures caused by overloading has emerged as a significant risk to the reliability of SANs, impeding the delivery of the desired quality of service to users. This paper makes contributions by proposing both static and dynamic load-triggered redistribution strategies to alleviate the cascading failure risk during the mission time. Two types of node selection rules, respectively based on the load level and node reliability, are studied and compared. Based on the SAN component reliability evaluation using the accelerated failure-time model under the power law, the SAN reliability is evaluated using binary decision diagrams. A detailed case study of a mesh SAN is conducted to compare the performance of different cascading failure mitigation schemes using criteria of SAN reliability improvement ratio and resulting SAN reliability after the mitigation. 
    more » « less