NSF PAR Search | NSF Public Access Repository

Demystifying and Checking Silent Semantic Violations in Large Distributed Systems

Lou, Chang; Jing, Yuzhuo; Huang, Peng (July 2022, 16th USENIX Symposium on Operating Systems Design and Implementation)

Distributed systems today offer rich features with numerous semantics that users depend on. Bugs can cause a system to silently violate its semantics without apparent anomalies. Such silent violations cause prolonged damage and are difficult to address. Yet, this problem is under-investigated. In this paper, we first study 109 real-world silent semantic failures from nine widely-used distributed systems to shed some light on this difficult problem. Our study reveals more than a dozen informative findings. For example, it shows that surprisingly the majority of the studied failures were violating semantics that existed since the system’s first stable release. Guided by insights from our study, we design Oathkeeper, a tool that automatically infers semantic rules from past failures and enforces the rules at runtime to detect new failures. Evaluation shows that the inferred rules detect newer violations, and Oathkeeper only incurs 1.27% overhead.
Full Text Available
RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure

Lou, Chang; Chen, Cong; Huang, Peng; Dang, Yingnong; Qin, Si; Yang, Xinsheng; Li, Xukun; Lin, Qingwei; Chintalapati, Murali (July 2022, 16th USENIX Symposium on Operating Systems Design and Implementation)

Memory leak is a notorious issue. Despite the extensive efforts, addressing memory leaks in large production cloud systems remains challenging. Existing solutions incur high overhead and/or suffer from high inaccuracies. This paper presents RESIN, a solution designed to holistically address memory leaks in production cloud infrastructure. RESIN takes a divide-and-conquer approach to tackle the challenges. It performs a low-overhead detection first with a robust bucketization-based pivot scheme to identify suspicious leaking entities. It then takes live heap snapshots at appropriate time points in carefully sampled leak entities. RESIN analyzes the collected snapshots for leak diagnosis. Finally, RESIN automatically mitigates detected leaks. RESIN has been running in production in Microsoft Azure for 3 years. It reports on average 24 leak tickets each month with high accuracy and low overhead, and provides effective diagnosis reports. Its results translate into a 41× reduction of VM reboots caused by low memory.
Full Text Available
Understanding and dealing with hard faults in persistent memory systems

https://doi.org/10.1145/3447786.3456252

Choi, Brian; Burns, Randal; Huang, Peng (April 2021, Proceedings of the Sixteenth European Conference on Computer Systems)
The advent of Persistent Memory (PM) devices enables systems to actively persist information at low costs, including program state traditionally in volatile memory. However, this trend poses a reliability challenge in which multiple classes of soft faults that go away after restart in traditional systems turn into hard (recurring) faults in PM systems. In this paper, we first characterize this rising problem with an empirical study of 28 real-world bugs. We analyze how they cause hard faults in PM systems. We then propose Arthas, a tool to effectively recover PM systems from hard faults. Arthas checkpoints PM states via fine-grained versioning and uses program slicing of fault instructions to revert problematic PM states to good versions. We evaluate Arthas on 12 real-world hard faults from five large PM systems. Arthas successfully recovers the systems for all cases while discarding 10× less data on average compared to state-of-the-art checkpoint-rollback solutions.
Full Text Available
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

Levy, Sebastien; Yao, Randolph; Wu, Youjiang; Dang, Yingnong; Huang, Peng; Mu, Zheng; Zhao, Pu; Ramani, Tarun; Govindraju, Naga; Li, Xukun; et al (November 2020, Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation)

When a failure occurs in production systems, the highest priority is to quickly mitigate it. Despite its importance, failure mitigation is done in a reactive and ad-hoc way: taking some fixed actions only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides smart mitigation actions. The goal is to avert VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances the adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. It on average reduces VM interruptions by 26% compared to the previous static strategy.
Full Text Available
Understanding, Detecting and Localizing Partial Failures in Large System Software

Lou, Chang; Huang, Peng; Smith, Scott (February 2020, Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation)

Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs can detect 20 cases with a median detection time of 4.2 seconds, and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper.
Full Text Available