skip to main content


Title: Automatic Reliability Testing for Cluster Management Controllers
Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers – we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller’s view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state’s evolution with and without perturbations to detect safety and liveness issues. Sieve’s design is powered by a fundamental opportunity in state-reconciliation systems – these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.  more » « less
Award ID(s):
1816615 2130560 2145295
NSF-PAR ID:
10357315
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cloud systems are increasingly being managed by operation programs termed operators, which automate tedious, human-based operations. Operators of modern management platforms like Kubernetes, Twine, and ECS implement declarative interfaces based on the state-reconciliation principle. An operation declares a desired system state and the operator automatically reconciles the system to that declared state. Operator correctness is critical, given the impacts on system operations—bugs in operator code put systems in undesired or error states, with severe consequences. However, validating operator correctness is challenging due to the enormous system-state space and complex operation interface. A correct operator must not only satisfy correctness properties of its own code, but it must also maintain managed systems in desired states. Unfortunately, end-to-end testing of operators significantly falls short. We present Acto, the first automatic end-to-end testing technique for cloud system operators. Acto uses a statecentric approach to test an operator together with a managed system. Acto continuously instructs an operator to reconcile a system to different states and checks if the system successfully reaches those desired states. Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto’s oracles automatically check whether a system’s state is as desired. To date, Acto has helped find 56 serious new bugs (42 were confirmed and 30 have been fixed) in eleven Kubernetes operators with few false alarms. 
    more » « less
  2. Abstract

    Modern distributed data management systems face a new challenge: how can autonomous, mutually distrusting parties cooperate safely and effectively? Addressing this challenge brings up familiar questions from classical distributed systems: how to combine multiple steps into a single atomic action, how to recover from failures, and how to synchronize concurrent access to data. Nevertheless, each of these issues requires rethinking when participants are autonomous and potentially adversarial. We propose the notion of across-chain deal, a new way to structure complex distributed computations that manage assets in an adversarial setting. Deals are inspired by classical atomic transactions, but are necessarily different, in important ways, to accommodate the decentralized and untrusting nature of the exchange. We describe novel safety and liveness properties, along with two alternative protocols for implementing cross-chain deals in a system of independent blockchain ledgers. One protocol, based on synchronous communication, is fully decentralized, while the other, based on semi-synchronous communication, requires a globally shared ledger. We also prove that some degree of centralization is required in the semi-synchronous communication model.

     
    more » « less
  3. Self-driving cars and trucks, autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until they have much higher confidence in their safety and reliability --- which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases involving interactions with multiple independent agents such as pedestrians and human-driven vehicles. While test-driving AVs on streets and highways fails to capture many rare events, existing simulation-based testing methods mainly focus on simple scenarios and do not scale well for complex driving situations that require sophisticated awareness of the surroundings. To address these limitations, we propose a new fuzz testing technique, called AutoFuzz, which can leverage widely-used AV simulators' API grammars to generate semantically and temporally valid complex driving scenarios (sequences of scenes). To efficiently search for traffic violations-inducing scenarios in a large search space, we propose a constrained neural network (NN) evolutionary search method to optimize AutoFuzz. Evaluation of our prototype on one state-of-the-art learning-based controller, two rule-based controllers, and one industrial-grade controller in five scenarios shows that AutoFuzz efficiently finds hundreds of traffic violations in high-fidelity simulation environments. For each scenario, AutoFuzz can find on average 10-39% more unique traffic violations than the best-performing baseline method. Further, fine-tuning the learning-based controller with the traffic violations found by AutoFuzz successfully reduced the traffic violations found in the new version of the AV controller software. 
    more » « less
  4. Programmable Logic Controllers are an established platform used throughout industrial automation, but rather poorly understood among researchers in the control systems community. This paper gives an overview of the state of the practice in industrial control systems while presenting a critical analysis of the dominant programming styles used in today's automation systems. We describe the patterns standardized loosely in IEC 61131-3 and, where there are ambiguities in the standard, realized in concrete vendor implementations. Ultimately, we suggest directions for further research towards enabling increasingly complex industrial control applications subject to the novel requirements of Industry 4.0 settings without compromising the safety and reliability guaranteed by the current industrial automation stack. 
    more » « less
  5. Solid-state-batteries (SSBs) present a promising technology for next-generation batteries due to their superior properties including increased energy density, wider electrochemical window and safer electrolyte design. Commercialization of SSBs, however, will depend on the resolution of a number of critical chemical and mechanical stability issues. The resolution of these issues will in turn depend heavily on our ability to accurately model these systems such that appropriate material selection, microstructure design, and operational parameters may be determined. In this article we review the current state-of-the art modeling tools with a focus on chemo-mechanics. Some of the key chemo-mechanical problems in SSBs involve dendrite growth through the solid-state electrolyte (SSE), interphase formation at the anode/SSE interface, and damage/decohesion of the various phases in the solid-state composite cathode. These mechanical processes in turn lead to capacity fade, impedance increase, and short-circuit of the battery, ultimately compromising safety and reliability. The article is divided into the three natural components of an all-solid-state architecture. First, modeling efforts pertaining to Li-metal anodes and dendrite initiation and growth mechanisms are reviewed, making the transition from traditional liquid electrolyte anodes to next generation all-solid-state anodes. Second, chemo-mechanics modeling of the SSE is reviewed with a particular focus on the formation of a thermodynamically unstable interphase layer at the anode/SSE interface. Finally, we conclude with a review of chemo-mechanics modeling efforts for solid-state composite cathodes. For each of these critical areas in a SSB we conclude by highlighting the key open areas for future research as it pertains to modeling the chemo-mechanical behavior of these systems. 
    more » « less