NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Once Bitten, Still Shy: Can We Prevent Cloud Systems from Repeating Their Mistakes?

https://doi.org/10.1145/3772356.3772385

Parikesit, Dimas; Lou, Chang (November 2025, ACM)

Cloud systems constantly experience changes. Unfortunately, these changes often introduce regression failures, breaking the same features or functionalities repeatedly. Such failures disrupt cloud availability and waste developers' efforts in re-investigating similar incidents. In this position paper, we argue that regression failures can be effectively prevented by enforcing low-level semantics, a new class of intermediate rules empirically inferred from past incidents, yet capable of offering partial correctness guarantees. Our experience shows that such rules are valuable to strengthen system correctness guarantees and expose new bugs.
more » « less
Free, publicly-accessible full text available November 17, 2026
Deriving semantic checkers from tests to detect silent failures in production distributed systems

Lou, Chang; Parikesit, Dimas Shidqi; Huang, Yujin; Yang, Zhewen; Diwangkara, Senapati; Jing, Yuzhuo; Kistijantoro, Achmad Imam; Yuan, Ding; Nath, Suman; Huang, Peng (July 2025, 19th USENIX Symposium on Operating Systems Design and Implementation)

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without explicit errors. Such failures cause serious consequences. Yet, they are extremely challenging to detect, as it requires deep domain knowledge and substantial manual efforts to write good checkers. In this paper, we explore a novel approach that directly derives semantic checkers from system test code. We first present a large-scale study on existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C on four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 out of 20 real-world silent failures we reproduce and incur small runtime overhead.
more » « less
Free, publicly-accessible full text available July 7, 2026
Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure

https://doi.org/10.1109/AIOps66738.2025.00008

Li, Ze; Lou, Chang; Yenugutala, Vignatha; Ramamurthy, Vivek; Blanchard, Eion; Ma, Minghua; Chintalapati, Murali (May 2025, IEEE)

Cloud infrastructure in production constantly experiences gray failures: a degraded state in which failures go undetected by system mechanisms, yet adversely affect end-users. Addressing the underlying anomalies on host nodes is crucial to address gray failures. However, current approaches suffer from two key limitations: first, existing detection relies solely on singular-dimension signals from hosts, thus often suffering from biased views due to differential observability; second, existing mitigation actions are often insufficient, primarily consisting of host-level operations such as reboots, which leave most production issues to manual intervention. This paper presents PANACEA, a holistic framework to automatically detect and mitigate host anomalies, addressing gray failures in production cloud infrastructure. PANACEA expands beyond host-level scope: it aggregates and correlates insights from VMs and application layers to bridge the detection gap, and orchestrates fine-grained and safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production at millions of hosts.
more » « less
Free, publicly-accessible full text available May 3, 2026
Verifying Semantic Equivalence of Large Models with Equality Saturation

https://doi.org/10.1145/3721146.3721943

Zulkifli, Kahfi S; Qian, Wenbo; Zhu, Shaowei; Zhou, Yuan; Zhang, Zhen; Lou, Chang (March 2025, ACM)

Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity in ensuring the correctness of the computation. An incorrect implementation of these techniques might lead to compile-time or runtime errors that can easily be observed and fixed, but it might also lead to silent errors that will result in incorrect computations in training or inference, which do not exhibit any obvious symptom until the model is used later. These subtle errors not only waste computation resources, but involve significant developer effort to detect and diagnose. In this work, we propose Aerify, a framework to automatically expose silent errors by verifying semantic equivalence of models with equality saturation. Aerify constructs equivalence graphs (e-graphs) from intermediate representations of tensor programs, and incrementally applies rewriting rules---derived from generic templates and refined via domain-specific analysis---to prove or disprove equivalence at scale. When discrepancies remain unproven, Aerify pinpoints the corresponding graph segments and maps them back to source code, simplifying debugging and reducing developer overhead. Our preliminary results show strong potentials of Aerify in detecting real-world silent errors.
more » « less
Free, publicly-accessible full text available March 30, 2026

Search for: All records