NSF PAR Search | NSF Public Access Repository

Relational Debugging — Pinpointing Root Causes of Performance Problems

Ren, Xiang; Wang, Sitao; Jin, Zhuqi; Lion, David; Chiu, Adrian; Xu, Tianyin; Yuan, Ding (July 2023, 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI '23))

Full Text Available
Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems

https://doi.org/10.1145/3552326.3587448

Tang, Lilia; Bhandari, Chaitanya; Zhang, Yongle; Karanika, Anna; Ji, Shuyang; Gupta, Indranil; Xu, Tianyin (May 2023, 18th European Conference on Computer Systems (EuroSys '23))

Full Text Available
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker

Chen, Yinfang; Sun, Xudong; Nath, Suman; Yang, Ze; Xu, Tianyin (April 2023, 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI '23))

Full Text Available
Automatic Reliability Testing for Cluster Management Controllers

Sun, Xudong; Luo, Wenqing; Gu, Jiawei Tyler; Ganesan, Aishwarya; Alagappan, Ramnatthan; Gasch, Michael; Suresh, Lalith; Xu, Tianyin (July 2022, Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI '22))

Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers – we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller’s view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state’s evolution with and without perturbations to detect safety and liveness issues. Sieve’s design is powered by a fundamental opportunity in state-reconciliation systems – these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.
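The idea in the abstract above (perturb the controller's view of the cluster state, then compare the state's evolution with and without the perturbation) can be illustrated with a toy sketch. This is not Sieve's implementation or API. The names `ClusterState`, `reconcile`, `run`, and `violates_safety`, the replica-count controller, and the "stale view" perturbation are all hypothetical and chosen only for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical cluster state: a desired replica count and the live pods."""
    desired: int
    pods: list = field(default_factory=list)


def reconcile(view: ClusterState, state: ClusterState) -> None:
    """One reconciliation step: decide from the observed view, act on the real state."""
    if len(view.pods) < view.desired:
        state.pods.append(f"pod-{len(state.pods)}")
    elif len(view.pods) > view.desired and state.pods:
        state.pods.pop()


def run(steps: int, stale_by: int = 0, desired: int = 2) -> list:
    """Run the controller and return the pod count after each step.

    stale_by > 0 is the (assumed) perturbation: the controller observes a
    snapshot that is stale_by steps old instead of the current state.
    """
    state = ClusterState(desired=desired)
    history = [copy.deepcopy(state)]
    for _ in range(steps):
        view = history[max(0, len(history) - 1 - stale_by)]
        reconcile(view, state)
        history.append(copy.deepcopy(state))
    return [len(s.pods) for s in history]


def violates_safety(trace: list, desired: int = 2) -> bool:
    """Toy safety check: the pod count must never exceed the desired count."""
    return any(count > desired for count in trace)


if __name__ == "__main__":
    baseline = run(steps=6)
    perturbed = run(steps=6, stale_by=2)
    print("baseline :", baseline, "violation:", violates_safety(baseline))
    print("perturbed:", perturbed, "violation:", violates_safety(perturbed))
```

In this sketch the unperturbed run converges to the desired count, while the run with a stale view overshoots it. Comparing the two traces is what exposes the toy controller's unsafe reliance on a fresh view.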
Full Text Available
DepFast: Orchestrating Code of Quorum Systems

Luo, Xuhao; Shen, Weihai; Mu, Shuai; Xu, Tianyin (July 2022, Proceedings of the 2022 USENIX Annual Technical Conference)

Quorum systems (e.g., replicated state machines) are critical distributed systems. Building correct, high-performance quorum systems is known to be hard. A major reason is that the protocols in quorum systems lead to non-deterministic state changes and complex branching conditions based on different events (e.g., timeouts). Traditionally, these systems are built in an asynchronous coding style with event-driven callbacks, but this style often leads to “callback hell,” which makes code hard to follow and maintain. Converting to synchronous coding styles (e.g., using coroutines) is challenging because of the complex branching conditions. In this paper, we present Dependably Fast (DepFast), an effective, expressive framework for developing quorum systems. DepFast provides a unique QuorumEvent abstraction to enable building quorum systems in a synchronous style. It also supports composition of multiple events, e.g., timeouts and different quorums. To evaluate DepFast, we use it to implement two quorum systems, Raft and Copilot. We show that complex quorum systems implemented with DepFast are easy to write and have high performance. Specifically, it takes 25%–35% fewer lines of code to implement Raft and Copilot using DepFast, and the DepFast-based implementations achieve performance comparable to that of state-of-the-art systems.
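As a rough illustration of waiting on a quorum in a synchronous, coroutine-based style, composed with a timeout, here is a minimal sketch using Python's `asyncio`. It is not DepFast's QuorumEvent API, which the abstract does not describe in detail. The functions `replica_ack` and `wait_quorum` are hypothetical names, and the replicas are simulated with sleeps.

```python
import asyncio


async def replica_ack(delay: float, ok: bool = True) -> bool:
    """Simulated replica (hypothetical): replies after `delay` seconds."""
    await asyncio.sleep(delay)
    return ok


async def wait_quorum(requests, quorum: int, timeout: float) -> bool:
    """Return True once `quorum` requests acknowledge, False on timeout
    or when all replies have arrived without reaching the quorum."""
    tasks = [asyncio.ensure_future(r) for r in requests]
    acks = 0
    try:
        for next_reply in asyncio.as_completed(tasks, timeout=timeout):
            try:
                if await next_reply:
                    acks += 1
            except asyncio.TimeoutError:
                return False  # the timeout fired before the quorum formed
            if acks >= quorum:
                return True
        return False
    finally:
        # Stop waiting on stragglers once the outcome is decided.
        for task in tasks:
            task.cancel()


async def main() -> None:
    # Two of three replicas answer quickly, so a quorum of 2 is reached.
    fast = await wait_quorum(
        [replica_ack(0.01), replica_ack(0.02), replica_ack(5.0)],
        quorum=2, timeout=1.0)
    # Only one replica answers before the timeout, so the wait fails.
    slow = await wait_quorum(
        [replica_ack(0.01), replica_ack(5.0), replica_ack(5.0)],
        quorum=2, timeout=0.1)
    print("fast quorum:", fast, "slow quorum:", slow)


if __name__ == "__main__":
    asyncio.run(main())
```

The caller reads top to bottom, with the branch on quorum versus timeout expressed as an ordinary return value rather than as separate callbacks.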
Full Text Available
