Mutiny! How Does Kubernetes Fail, and What Can We Do About It?

Barletta, Marco; Cinque, Marcello; Di_Martino, Catello; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K

doi:10.1109/DSN58291.2024.00016

In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store that preserves the cluster state, and iii) compare the results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/over provisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
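The single bit-flip mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' injection framework: the JSON encoding, the object fields, and the helper name `flip_bit` are assumptions made purely for illustration, and the real cluster data store may serialize objects differently.

```python
import json


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    if not 0 <= bit_index < len(data) * 8:
        raise ValueError("bit_index out of range")
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical stored object; JSON is assumed here only to keep the sketch readable.
stored = json.dumps({"kind": "Deployment", "replicas": 2}).encode("utf-8")

# Flip bit 2 of the byte holding the ASCII digit "2" (0x32 becomes 0x36, i.e. "6").
offset = stored.index(b"2")
corrupted = flip_bit(stored, offset * 8 + 2)

print(json.loads(stored)["replicas"])     # value before the injection
print(json.loads(corrupted)["replicas"])  # value after the injection
```

In this toy case one inverted bit changes the requested replica count while the object still parses, which is the kind of silent corruption that could lead to over-provisioning.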

Award ID(s): 2029049

PAR ID: 10546553

Author(s) / Creator(s): Barletta, Marco; Cinque, Marcello; Di_Martino, Catello; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K

Publisher / Repository: arXiv

Date Published: 2024-04-17

ISBN: 979-8-3503-4105-8

Page Range / eLocation ID: 1 to 14

Subject(s) / Keyword(s): container orchestration, failure, resiliency, mission-critical, fault injection, cloud

Format(s): Medium: X

Location: Brisbane, Australia

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/DSN58291.2024.00016
