When a failure occurs in production systems, the highest priority is to quickly mitigate it. Despite its importance, failure mitigation is done in a reactive and ad-hoc way: taking some fixed actions only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides smart mitigation actions. The goal is to avert VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances the adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. It on average reduces VM interruptions by 26% compared to the previous static strategy.
more »
« less
This content will become publicly available on May 3, 2026
Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure
Cloud infrastructure in production constantly experiences gray failures: a degraded state in which failures go undetected by system mechanisms, yet adversely affect end-users. Addressing the underlying anomalies on host nodes is crucial to address gray failures. However, current approaches suffer from two key limitations: first, existing detection relies solely on singular-dimension signals from hosts, thus often suffering from biased views due to differential observability; second, existing mitigation actions are often insufficient, primarily consisting of host-level operations such as reboots, which leave most production issues to manual intervention. This paper presents PANACEA, a holistic framework to automatically detect and mitigate host anomalies, addressing gray failures in production cloud infrastructure. PANACEA expands beyond host-level scope: it aggregates and correlates insights from VMs and application layers to bridge the detection gap, and orchestrates fine-grained and safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production at millions of hosts.
more »
« less
- Award ID(s):
- 2441284
- PAR ID:
- 10659230
- Publisher / Repository:
- IEEE
- Date Published:
- Page Range / eLocation ID:
- 10 to 12
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Memory leak is a notorious issue. Despite the extensive efforts, addressing memory leaks in large production cloud systems remains challenging. Existing solutions incur high overhead and/or suffer from high inaccuracies. This paper presents RESIN, a solution designed to holistically address memory leaks in production cloud infrastructure. RESIN takes a divide-and-conquer approach to tackle the challenges. It performs a low-overhead detection first with a robust bucketization-based pivot scheme to identify suspicious leaking entities. It then takes live heap snapshots at appropriate time points in carefully sampled leak entities. RESIN analyzes the collected snapshots for leak diagnosis. Finally, RESIN automatically mitigates detected leaks. RESIN has been running in production in Microsoft Azure for 3 years. It reports on average 24 leak tickets each month with high accuracy and low overhead, and provides effective diagnosis reports. Its results translate into a 41× reduction of VM reboots caused by low memory.more » « less
-
Systems software today is composed of numerous modules and exhibits complex failure modes. Existing failure detectors focus on catching simple, complete failures and treat programs uniformly at the process level. In this paper, we argue that modern software needs intrinsic failure detectors that are tailored to individual systems and can detect anomalies within a process at finer granularity. We particularly advocate a notion of intrinsic software watchdogs and propose an abstraction for it. Among the different styles of watchdogs, we believe watchdogs that imitate the main program can provide the best combination of completeness, accuracy and localization for detecting gray failures. But, manually constructing such mimic-type watchdogs is challenging and time-consuming. To close this gap, we present an early exploration for automatically generating mimic-type watchdogsmore » « less
-
Modern attacks against enterprises often have multiple targets inside the enterprise network. Due to the large size of these networks and increasingly stealthy attacks, attacker activities spanning multiple hosts are extremely difficult to correlate during a threat-hunting effort. In this paper, we present a method for an efficient cross-host attack correlation across multiple hosts. Unlike previous works, our approach does not require lateral movement detection techniques or host-level modifications. Instead, our approach relies on an observation that attackers have a few strategic mission objectives on every host that they infiltrate, and there exist only a handful of techniques for achieving those objectives. The central idea behind our approach involves comparing (OS agnostic) activities on different hosts and correlating the hosts that display the use of similar tactics, techniques, and procedures. We implement our approach in a tool called Ostinato and successfully evaluate it in threat hunting scenarios involving DARPA-led red team engagements spanning 500 hosts and in another multi-host attack scenario. Ostinato successfully detected 21 additional compromised hosts, which the underlying host-based detection system overlooked in activities spanning multiple days of the attack campaign. Additionally, Ostinato successfully reduced alarms generated from the underlying detection system by more than 90%, thus helping to mitigate the threat alert fatigue problem.more » « less
-
Abstract People often modify the shoreline to mitigate erosion and protect property from storm impacts. The 2 main approaches to modification are gray infrastructure (e.g., bulkheads and seawalls) and natural or green infrastructure (NI) (e.g., living shorelines). Gray infrastructure is still more often used for coastal protection than NI, despite having more detrimental effects on ecosystem parameters, such as biodiversity. We assessed the impact of gray infrastructure on biodiversity and whether the adoption of NI can mitigate its loss. We examined the literature to quantify the relationship of gray infrastructure and NI to biodiversity and developed a model with temporal geospatial data on ecosystem distribution and shoreline modification to project future shoreline modification for our study location, coastal Georgia (United States). We applied the literature‐derived empirical relationships of infrastructure effects on biodiversity to the shoreline modification projections to predict change in biodiversity under different NI versus gray infrastructure scenarios. For our study area, which is dominated by marshes and use of gray infrastructure, when just under half of all new coastal infrastructure was to be NI, previous losses of biodiversity from gray infrastructure could be mitigated by 2100 (net change of biodiversity of +0.14%, 95% confidence interval −0.10% to +0.39%). As biodiversity continues to decline from human impacts, it is increasingly imperative to minimize negative impacts when possible. We therefore suggest policy and the permitting process be changed to promote the adoption of NI.more » « less
An official website of the United States government
