Cloud infrastructure in production constantly experiences gray failures: degraded states in which failures go undetected by system mechanisms yet adversely affect end users. Addressing the underlying anomalies on host nodes is crucial to resolving gray failures. However, current approaches suffer from two key limitations: first, existing detection relies solely on single-dimension signals from hosts and thus often suffers from biased views due to differential observability; second, existing mitigation actions are often insufficient, consisting primarily of host-level operations such as reboots, which leave most production issues to manual intervention. This paper presents PANACEA, a holistic framework that automatically detects and mitigates host anomalies, addressing gray failures in production cloud infrastructure. PANACEA expands beyond the host-level scope: it aggregates and correlates insights from the VM and application layers to bridge the detection gap, and it orchestrates fine-grained, safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production on millions of hosts.
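The cross-layer detection idea can be illustrated with a small, hypothetical sketch. In the Python fragment below, the layer names, score normalization, and corroboration rule are assumptions made for illustration, not PANACEA's production logic; it only shows how requiring agreement across host, VM, and application signals counters the biased view of any single layer.

```python
# Hypothetical sketch of cross-layer signal correlation; layer names,
# thresholds, and the corroboration rule are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class LayerSignal:
    layer: str            # e.g., "host", "vm", or "app"
    anomaly_score: float  # assumed to be normalized to [0, 1] upstream


def is_host_anomalous(signals: list,
                      per_layer_threshold: float = 0.8,
                      corroborating_layers: int = 2) -> bool:
    """Flag a host only when anomalies are corroborated by multiple layers,
    reducing the bias of any single layer's limited observability."""
    anomalous = {s.layer for s in signals if s.anomaly_score >= per_layer_threshold}
    return len(anomalous) >= corroborating_layers


# A host-level spike alone is not enough, but host + VM agreement is.
signals = [LayerSignal("host", 0.9), LayerSignal("vm", 0.85), LayerSignal("app", 0.3)]
print(is_host_anomalous(signals))  # True
```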
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
When a failure occurs in production systems, the highest priority is to mitigate it quickly. Despite its importance, failure mitigation is done in a reactive and ad hoc way: fixed actions are taken only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated into a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides on smart mitigation actions, with the goal of averting VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action, and it further enhances its adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months and on average reduces VM interruptions by 26% compared to the previous static strategy.
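Narya's online-experimentation idea can be approximated with a bandit-style policy. The sketch below is a minimal epsilon-greedy illustration; the action set, reward signal, and exploration rate are assumed for the example and are not Azure's actual values. It shows how a decision engine can keep exploring mitigation actions while exploiting the one with the best observed outcome.

```python
# Minimal epsilon-greedy sketch of online exploration of mitigation actions;
# the action names, reward definition, and epsilon are illustrative assumptions.
import random
from collections import defaultdict

ACTIONS = ["no_op", "live_migrate", "service_heal", "soft_reboot"]  # assumed action set

reward_sum = defaultdict(float)  # cumulative reward observed per action
pull_count = defaultdict(int)    # number of times each action was tried


def choose_action(epsilon: float = 0.1) -> str:
    """Mostly exploit the action with the best average reward so far,
    but keep exploring with probability epsilon."""
    if random.random() < epsilon or not pull_count:
        return random.choice(ACTIONS)
    return max(ACTIONS,
               key=lambda a: reward_sum[a] / pull_count[a] if pull_count[a] else 0.0)


def record_outcome(action: str, reward: float) -> None:
    """Feed the observed outcome (e.g., VM interruptions averted) back as reward."""
    reward_sum[action] += reward
    pull_count[action] += 1
```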
- Award ID(s): 1942794
- PAR ID: 10227105
- Date Published:
- Journal Name: Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Large-scale cloud services deploy hundreds of configuration changes to production systems daily. At such velocity, configuration changes have inevitably become prevalent causes of production failures. Existing misconfiguration detection and configuration validation techniques only check configuration values; they cannot detect common types of failure-inducing configuration changes, such as those that cause code to fail or those that violate hidden constraints. We present ctests, a new type of test for detecting failure-inducing configuration changes and preventing production failures. The idea behind ctests is simple: connect production system configurations to software tests so that configuration changes can be tested in the context of the code they affect. Ctests can therefore detect configuration changes that expose dormant software bugs as well as diverse misconfigurations. We show how to generate ctests by transforming the many existing tests in mature systems; the key challenge we address is the automated identification of test logic and oracles that can be reused in ctests. We generated thousands of ctests from the existing tests in five cloud systems and evaluated them on real-world failure-inducing configuration changes, injected misconfigurations, and deployed configuration files from public Docker images. The results show that ctests effectively detect failure-inducing configuration changes and misconfigurations before deployment.
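To make the ctest idea concrete, a hypothetical ctest might look like the Python sketch below: an ordinary unit test is re-parameterized to read the configuration value under change instead of a hard-coded constant, so the proposed change is exercised by the code that consumes it. The parameter name and the constraint being checked are illustrative assumptions, not tests generated by the paper's tooling.

```python
# Hypothetical ctest sketch: the test reads the (possibly changed) production
# configuration value instead of a hard-coded constant. The parameter name and
# constraints below are illustrative assumptions.
import os
import unittest


def load_config_under_test() -> dict:
    # In a real deployment this would load the proposed configuration change;
    # an environment variable stands in for it here.
    return {"request.timeout.ms": int(os.environ.get("REQUEST_TIMEOUT_MS", "30000"))}


class TimeoutCtest(unittest.TestCase):
    def test_timeout_value_is_usable(self):
        timeout_ms = load_config_under_test()["request.timeout.ms"]
        # The original test would hard-code a valid timeout; the ctest checks the
        # actual value against the constraints the surrounding code assumes.
        self.assertGreater(timeout_ms, 0)
        self.assertLessEqual(timeout_ms, 10 * 60 * 1000)  # example hidden constraint


if __name__ == "__main__":
    unittest.main()
```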
Partial failures occur frequently in cloud systems and can cause serious damage, including inconsistency and data loss. Unfortunately, these failures are not well understood, nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs detect 20 cases with a median detection time of 4.2 seconds and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper.
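A generated watchdog can be pictured roughly as in the sketch below. This is an assumption-laden illustration rather than OmegaGen's actual output: it periodically executes a reduced replica of a potentially vulnerable operation (here, a scratch-file write) under a timeout, so a silent stall or error in that code path surfaces as an explicit partial-failure signal.

```python
# Hypothetical watchdog sketch: run a reduced replica of a vulnerable operation
# periodically under a timeout. The checked operation, interval, and timeout
# are illustrative assumptions.
import concurrent.futures
import tempfile
import time


def reduced_write_check() -> None:
    """Mimics the main program's write path on a scratch file."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"watchdog-probe")
        f.flush()


def watchdog_loop(interval_s: float = 5.0, timeout_s: float = 2.0) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        while True:  # runs for the lifetime of the monitored process
            future = pool.submit(reduced_write_check)
            try:
                future.result(timeout=timeout_s)
            except Exception as exc:  # timeout or I/O error: report, don't crash
                print(f"partial failure suspected in write path: {exc!r}")
            time.sleep(interval_s)
```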
Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by unusual faulty events. While fault injection testing has become popular, existing solutions are designed for bug finding; they are ineffective and inefficient at reproducing a specific failure during debugging. We explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, Anduril, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate Anduril on 22 real-world complex fault-induced failures from five large-scale distributed systems. Anduril reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes.
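The feedback-driven search can be sketched as a loop over candidate (fault, timing) pairs that re-ranks the remaining candidates according to how closely each injected run's symptom matches the target failure. In the sketch below, the candidate encoding, similarity signal, and re-ranking heuristic are illustrative assumptions, not Anduril's actual algorithm.

```python
# Hypothetical feedback-driven fault-space search; the candidate encoding,
# similarity feedback, and re-ranking heuristic are illustrative assumptions.
from typing import Callable, Iterable, List, Optional, Tuple

Candidate = Tuple[str, int]  # (fault site, injection step), an assumed encoding


def reproduce_failure(candidates: Iterable[Candidate],
                      run_with_injection: Callable[[Candidate], str],
                      similarity_to_target: Callable[[str], float],
                      match_threshold: float = 0.95) -> Optional[Candidate]:
    """Try candidates in priority order; use each run's similarity to the target
    failure as feedback to reorder the remaining candidates."""
    remaining: List[Candidate] = list(candidates)
    while remaining:
        candidate = remaining.pop(0)
        symptom = run_with_injection(candidate)  # run the system with the fault injected
        score = similarity_to_target(symptom)    # feedback signal in [0, 1]
        if score >= match_threshold:
            return candidate                     # failure reproduced
        if score > 0.5:                          # near miss: promote same-site candidates
            remaining.sort(key=lambda c: 0 if c[0] == candidate[0] else 1)
    return None                                  # fault space exhausted
```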
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hangs or crashes) and resource-overload failures (e.g., congestion collapse), that impact systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing them. We present Kaleidoscope, a near-real-time failure detection and diagnosis framework consisting of hierarchical, domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause of the failure, all within one minute of failure occurrence. Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
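The hierarchical, domain-guided structure can be caricatured as a two-stage pipeline: first localize the failing component from telemetry, then classify that component's failure mode. The component names, telemetry fields, and rules in the sketch below are invented for illustration; the deployed system uses learned models over production telemetry rather than fixed thresholds.

```python
# Hypothetical two-stage diagnosis sketch; component names, telemetry fields,
# and thresholds are illustrative assumptions, not the deployed models.
from typing import Optional


def localize_component(telemetry: dict) -> Optional[str]:
    """Stage 1: domain-guided checks narrow the diagnosis to one component."""
    if telemetry.get("io_latency_ms", 0) > 500:
        return "storage"
    if telemetry.get("packet_loss_pct", 0.0) > 1.0:
        return "network"
    if telemetry.get("missed_heartbeats", 0) > 3:
        return "compute_node"
    return None  # no component implicated


def classify_failure_mode(component: str, telemetry: dict) -> str:
    """Stage 2: within the localized component, distinguish failure modes."""
    if component == "storage":
        return "congestion_collapse" if telemetry.get("io_queue_depth", 0) > 1000 else "crash"
    if component == "network":
        return "congestion" if telemetry.get("retransmits", 0) > 100 else "link_down"
    return "hang" if telemetry.get("cpu_util_pct", 100) < 5 else "crash"
```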

