Title: ConfExp: Root-Cause Analysis of Service Misconfigurations in Enterprise Systems
Abstract: Misconfiguration is a known and increasingly serious problem in enterprise systems due to frequent code updates and retuning of the configuration parameters. Diagnosing complex, residual misconfiguration problems that lead to inaccessible services or failed transactions often starts with either a user complaint or observation by administrators, followed by a largely manual process of deciding what tests to run and how to proceed with further testing based on the test results. The goal of this paper is to automate this process and thereby make root-cause analysis of accessibility related misconfigurations much speedier and much more effective. We explore a domain-knowledge-driven methodology, called ConfExp, using a network emulator that runs real enterprise networking protocols. Thus, by using commonly used tests, we show that the root-cause can be determined in all cases where discriminative tests exist. The methodology also highlights areas where more discriminative tests are needed to pinpoint the precise configuration variables at fault.
Award ID(s):
2011252
PAR ID:
10568892
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Journal of Network and Systems Management
Volume:
33
Issue:
2
ISSN:
1064-7570
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large-scale cloud services deploy hundreds of configuration changes to production systems daily. At such velocity, configuration changes have inevitably become prevalent causes of production failures. Existing misconfiguration detection and configuration validation techniques only check configuration values. These techniques cannot detect common types of failure-inducing configuration changes, such as those that cause code to fail or those that violate hidden constraints. We present ctests, a new type of tests for detecting failure-inducing configuration changes to prevent production failures. The idea behind ctests is simple: connecting production system configurations to software tests so that configuration changes can be tested in the context of code affected by the changes. So, ctests can detect configuration changes that expose dormant software bugs and diverse misconfigurations. We show how to generate ctests by transforming the many existing tests in mature systems. The key challenge that we address is the automated identification of test logic and oracles that can be reused in ctests. We generated thousands of ctests from the existing tests in five cloud systems. Our results show that ctests are effective in detecting failure-inducing configuration changes before deployment. We evaluate ctests on real-world failure-inducing configuration changes, injected misconfigurations, and deployed configuration files from public Docker images. Ctests effectively detect real-world failure-inducing configuration changes and misconfigurations in the deployed files.
  2. Continuous Integration (CI) allows developers to check whether their code can build successfully and pass tests across various system environments with every commit. To use a CI platform, a developer must provide configuration files within a code repository to specify build conditions. Incorrect configuration settings lead to CI build failures, which can take hours to run, wasting valuable developer time and delaying product release dates. Debugging CI configurations is a slow and error-prone process. The only way to check the correctness of CI configurations is to push a commit and wait for the build result. We present VeriCI, the first system for localizing CI configuration errors at the code level. VeriCI runs as a static analysis tool, before the developer sends the build request to the CI server. Our key insight is that the commit history and the corresponding build histories available in CI environments can be used both for build error prediction and build error localization. We leverage the build history as a labeled dataset to automatically derive customized rules describing correct CI configurations, using supervised machine learning techniques. To more accurately identify root causes, we train a neural network that filters out constraints that are less likely to be connected to the root cause of build failure. We evaluate VeriCI on real-world data from GitHub and achieve 91% accuracy in predicting a build failure and correctly identify the root cause in 75% of cases. We also conducted a between-subjects user study with 20 software developers, showing that VeriCI significantly helps users in identifying and fixing errors in CI.
  3. Misconfiguration is a major cause of system failures. Prior solutions focus on detecting invalid settings that are introduced by user mistakes. But another type of misconfiguration that continues to haunt production services is specious configuration: settings that are valid but lead to unexpectedly poor performance in production. Such misconfigurations are subtle, so even careful administrators may fail to foresee them. We propose a tool called Violet to detect specious configuration. We realize the crux of specious configuration is that it causes some slow code path to be executed, but the bad performance effect cannot always be triggered. Violet thus takes a novel approach that uses selective symbolic execution to systematically reason about the performance effect of configuration parameters, their combination effect, and the relationship with input. Violet outputs a performance impact model for the automatic detection of poor configuration settings. We applied Violet on four large systems. To evaluate the effectiveness of Violet, we collect 17 real-world specious configuration cases. Violet detects 15 of them. Violet also identifies 11 unknown specious configurations.
  4. Misconfiguration is a major cause of system failures. Prior solutions focus on detecting invalid settings that are introduced by user mistakes. But another type of misconfiguration that continues to haunt production services is specious configuration: settings that are valid but lead to unexpectedly poor performance in production. Such misconfigurations are subtle, so even careful administrators may fail to foresee them. We propose a tool called Violet to detect specious configuration. We realize the crux of specious configuration is that it causes some slow code path to be executed, but the bad performance effect cannot always be triggered. Violet thus takes a novel approach that uses selective symbolic execution to systematically reason about the performance effect of configuration parameters, their combination effect, and the relationship with input. Violet outputs a performance impact model for the automatic detection of poor configuration settings. We applied Violet on four large systems. To evaluate the effectiveness of Violet, we collect 17 real-world specious configuration cases. Violet detects 15 of them. Violet also identifies 11 unknown specious configurations.
  5. Bio-loggers are widely used for studying the movement and behavior of animals. However, some sensors provide more data than is practical to store given experiment or bio-logger design constraints. One approach for overcoming this limitation is to utilize data collection strategies, such as non-continuous recording or data summarization, that may record data more efficiently but need to be validated for correctness. In this paper we address two fundamental questions: how can researchers determine suitable parameters and behaviors for bio-logger sensors, and how do they validate their choices? We present a methodology that uses software-based simulation of bio-loggers to validate various data collection strategies using recorded data and synchronized, annotated video. The use of simulation allows for fast and repeatable tests, which facilitates the validation of data collection methods as well as the configuration of bio-loggers in preparation for experiments. We demonstrate this methodology using accelerometer loggers for recording the activity of the small songbird Junco hyemalis hyemalis.