skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Automatic Reliability Testing for Cluster Management Controllers
Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers – we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks. We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller’s view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state’s evolution with and without perturbations to detect safety and liveness issues. Sieve’s design is powered by a fundamental opportunity in state-reconciliation systems – these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.  more » « less
Award ID(s):
1816615 2130560 2145295
PAR ID:
10357315
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Traffic management systems play a vital role in ensuring safe and efficient transportation on roads. However, the use of advanced technologies in traffic management systems has introduced new safety challenges. Therefore, it is important to ensure the safety of these systems to prevent accidents and minimize their impact on road users. In this survey, we provide a comprehensive review of the literature on safety in traffic management systems. Specifically, we discuss the different safety issues that arise in traffic management systems, the current state of research on safety in these systems, and the techniques and methods proposed to ensure the safety of these systems. We also identify the limitations of the existing research and suggest future research directions. 
    more » « less
  2. Cloud systems are increasingly being managed by operation programs termed operators, which automate tedious, human-based operations. Operators of modern management platforms like Kubernetes, Twine, and ECS implement declarative interfaces based on the state-reconciliation principle. An operation declares a desired system state and the operator automatically reconciles the system to that declared state. Operator correctness is critical, given the impacts on system operations—bugs in operator code put systems in undesired or error states, with severe consequences. However, validating operator correctness is challenging due to the enormous system-state space and complex operation interface. A correct operator must not only satisfy correctness properties of its own code, but it must also maintain managed systems in desired states. Unfortunately, end-to-end testing of operators significantly falls short. We present Acto, the first automatic end-to-end testing technique for cloud system operators. Acto uses a statecentric approach to test an operator together with a managed system. Acto continuously instructs an operator to reconcile a system to different states and checks if the system successfully reaches those desired states. Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto’s oracles automatically check whether a system’s state is as desired. To date, Acto has helped find 56 serious new bugs (42 were confirmed and 30 have been fixed) in eleven Kubernetes operators with few false alarms. 
    more » « less
  3. Byzantine fault-tolerant state machine replication (SMR) protocols, such as PBFT, HotStuff, and Jolteon, are essential for modern blockchain technologies. However, they are challenging to implement correctly because they have to deal with any unexpected message from Byzantine peers and ensure safety and liveness at all times. Many formal frameworks have been developed to verify the safety of SMR implementations, but there is still a gap in the verification of their liveness. Existing liveness proofs are either limited to the network level or do not cover popular partially synchronous protocols. We introduce LiDO, a consensus model that enables the verification of both safety and liveness of implementations through refinement. We observe that current consensus models cannot handle liveness because they do not include a pacemaker state. We show that by adding a pacemaker state to the LiDO model, we can express the liveness properties of SMR protocols as a few safety properties that can be easily verified by refinement proofs. Based on our LiDO model, we provide mechanized safety and liveness proofs for both unpipelined and pipelined Jolteon in Coq. This is the first mechanized liveness proof for a byzantine consensus protocol with non-trivial optimizations such as pipelining. 
    more » « less
  4. Distributed data management systems use state Machine Replication (SMR) to provide fault tolerance. The SMR algorithm enables Byzantine Fault-Tolerant (BFT) protocols to guarantee safety and liveness despite the malicious failure of nodes. However, SMR does not prevent the adversarial manipulation of the order of transactions, where the order assigned by a malicious leader differs from the order in that transactions are received from clients. Whileorder-fairnesshas been recently studied in a few protocols, such protocols rely on synchronized clocks, suffer from liveness issues, or incur significant performance overhead. This paper presentsRashnu, a high-performance fair ordering protocol. Rashnu is motivated by the fact that fair ordering among two transactions is needed only when both transactions access a shared resource. Based on this observation, we define the notion ofdata-dependent order fairnesswhere replicas capture only the order of data-dependent transactions and the leader uses these orders to propose a dependency graph that represents fair ordering among transactions. Replicas then execute transactions using the dependency graph, resulting in the parallel execution of independent transactions. We implemented a prototype of Rashnu where our experimental evaluation reveals the low overhead of providing order-fairness in Rashnu. 
    more » « less
  5. Programmable Logic Controllers are an established platform used throughout industrial automation, but rather poorly understood among researchers in the control systems community. This paper gives an overview of the state of the practice in industrial control systems while presenting a critical analysis of the dominant programming styles used in today's automation systems. We describe the patterns standardized loosely in IEC 61131-3 and, where there are ambiguities in the standard, realized in concrete vendor implementations. Ultimately, we suggest directions for further research towards enabling increasingly complex industrial control applications subject to the novel requirements of Industry 4.0 settings without compromising the safety and reliability guaranteed by the current industrial automation stack. 
    more » « less