

Title: Integrating Failure Detection and Isolation into a Reference Governor-Based Reconfiguration Strategy for Stuck Actuators
A set-theoretic Failure Mode and Effect Management (FMEM) strategy for stuck/jammed actuators in systems with redundant actuators is considered. This strategy uses a reference governor for command tracking while satisfying state and control constraints and, once the failure mode is known, generates a recovery command sequence during mode transitions triggered by actuator failures. In the paper, this FMEM strategy is enhanced with a scheme that detects and isolates failures within a finite time and handles unmeasured set-bounded disturbance inputs. A numerical example illustrates the offline design process and the online operation of the proposed approach.
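To make the reference-governor ingredient concrete, the Python sketch below applies a textbook scalar reference governor to a small, stable closed-loop model: at each step it bisects on the step size toward the requested reference and keeps the largest step whose constant-command prediction respects the output constraint over a finite horizon. This is only a generic illustration under assumed dynamics (the matrices A, B, C, D, the bound y_max, and the horizon N are made-up values); it does not include the failure-mode handling, recovery sequencing, or disturbance treatment developed in the paper.

```python
import numpy as np

# Scalar reference governor sketch (illustrative only, not the paper's design).
# Closed-loop model: x[k+1] = A x[k] + B v, constrained output y[k] = C x[k] + D v,
# with |y[k]| <= y_max required at every step.  All numbers below are assumptions.

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
y_max = 1.0
N = 50          # prediction horizon used to approximate the admissible set
eps = 1e-6

def admissible(x, v):
    """Check (approximately) that holding the command v constant keeps |y| <= y_max."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    for _ in range(N + 1):
        y = (C @ x + D * v).item()
        if abs(y) > y_max + eps:
            return False
        x = A @ x + B * v
    return True

def reference_governor(x, v_prev, r, tol=1e-4):
    """Move the applied command from v_prev toward the reference r as far as
    constraint admissibility allows (bisection on the step size kappa).
    Assumes (x, v_prev) is already admissible, so kappa = 0 is always safe."""
    if admissible(x, r):
        return r                      # the full step to the reference is safe
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if admissible(x, v_prev + mid * (r - v_prev)):
            lo = mid
        else:
            hi = mid
    return v_prev + lo * (r - v_prev)

# One simulated step: state near the constraint, aggressive reference request.
x = np.array([0.8, 0.0])
v = reference_governor(x, v_prev=0.0, r=5.0)
print("applied command:", v)
```

In a closed loop, reference_governor would be called once per sample and its output applied in place of the raw reference, which is the mechanism that keeps the constraints satisfied during tracking.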
Award ID(s):
1931738
NSF-PAR ID:
10433562
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of 2022 American Control Conference, Atlanta, Georgia, USA, June 8-10, 2022.
Page Range / eLocation ID:
4311 to 4316
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper proposes a Failure Mode and Effect Management (FMEM) strategy for constrained systems with redundant actuators based on the combined use of constraint admissible and recoverable sets. Several approaches to ensure reconfiguration of the system without constraint violation in the event of actuator failures are presented. Numerical simulation results are reported. 
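The constraint-admissible sets mentioned in this abstract are commonly computed by stacking constraints along predicted trajectories until no new inequality is needed. The sketch below shows that generic construction for a small autonomous linear system with a box constraint, using linear programs to test redundancy; the matrices and bounds are placeholders, and the paper's recoverable sets and actuator-redundancy logic are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative computation of a maximal constraint-admissible set for an
# autonomous system x[k+1] = A x[k] with polyhedral constraint H x <= h.
# The result is a stacked description G x <= g: any state satisfying it keeps
# satisfying H x <= h along the whole trajectory.  Generic textbook
# construction with invented numbers, not the sets designed in the paper.

A = np.array([[0.9, 0.3],
              [0.0, 0.7]])
H = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0],
              [ 0.0, -1.0]])
h = np.array([1.0, 1.0, 1.0, 1.0])

def admissible_set(A, H, h, max_iter=100, tol=1e-9):
    G, g = H.copy(), h.copy()
    Ak = A.copy()
    for _ in range(max_iter):
        # Constraints one step further along the trajectory: H A^{k+1} x <= h.
        H_next, finished = H @ Ak, True
        for i in range(H.shape[0]):
            # If max H_next[i] x subject to G x <= g already respects h[i],
            # the new inequality is redundant and adds nothing.
            res = linprog(-H_next[i], A_ub=G, b_ub=g,
                          bounds=[(None, None)] * A.shape[0])
            if not res.success or -res.fun > h[i] + tol:
                finished = False
        if finished:
            return G, g          # finitely determined: no new constraints needed
        G = np.vstack([G, H_next])
        g = np.concatenate([g, h])
        Ak = Ak @ A
    raise RuntimeError("no finite determination within max_iter steps")

G, g = admissible_set(A, H, h)
print(f"{G.shape[0]} inequalities describe the admissible set")
```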
  2. This artifact contains the source code for FlakeRake, a tool for automatically reproducing timing-dependent flaky-test failures. It also includes the raw and processed results produced in the evaluation of FlakeRake.

     

    Contents:

     

    Timing-related APIs at which FlakeRake considers adding sleeps: timing-related-apis

    Anonymized code for FlakeRake (not runnable in its anonymized state, but included for reference; we will publicly release the non-anonymized code under an open source license pending double-blind review): flakerake.tgz

    Failure messages extracted from the FlakeFlagger dataset: 10k_reruns_failures_by_test.csv.gz 

    Output from running isolated reruns on each flaky test in the FlakeFlagger dataset: 10k_isolated_reruns_all_results.csv.gz (all test results summarized into a CSV), 10k_isolated_reruns_failures_by_test.csv.gz (CSV including just test failures, including failure messages), 10k_isolated_reruns_raw_results.tgz (includes all raw results from reruns, including the XML files output by maven)

    Output from running the FlakeFlagger replication study (non-isolated 10k reruns): flakeFlaggerReplResults.csv.gz (all test results summarized into a CSV), 10k_reruns_failures_by_test.csv.gz (CSV including just failures, including failure messages), flakeFlaggerRepl_raw_results.tgz (includes all raw results from reruns, including the XML files output by maven; this file is markedly larger than the 10k isolated reruns results because we ran *all* tests in this experiment, whereas the 10k isolated rerun experiment only re-ran the tests that were known to be flaky from the FlakeFlagger dataset).

    Output from running FlakeRake on each flaky test in the FlakeFlagger dataset:

    For bisection mode: results-bis.tgz

    For one-by-one mode: results-obo.tgz

    Scripts used to execute FlakeRake using an HPC cluster: execution-scripts.tgz
    Scripts used to execute rerun experiments using an HPC cluster: flakeFlaggerReplScripts.tgz
    Scripts used to parse the "raw" maven test result XML files into the CSV files contained in this artifact: parseSurefireXMLs.tgz (a minimal parsing sketch appears after this item's contents)

    Output from running FlakeRake in “reproduction” mode, attempting to reproduce each of the failures that matched the FlakeFlagger dataset (collected for bisection mode only): results-repro-bis.tgz

    Analysis of timing-dependent API calls in the failure inducing configurations that matched FlakeFlagger failures: bis-sleepyline.cause-to-matched-fail-configs-found.csv

     
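To give a concrete sense of the raw-XML-to-CSV step that parseSurefireXMLs.tgz performs, here is a hypothetical Python sketch that walks a directory of Maven Surefire TEST-*.xml reports and writes one CSV row per test case. It is not the artifact's script, and the output column names are assumptions.

```python
import csv
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical re-implementation of a Surefire-report-to-CSV conversion,
# written only to illustrate the data flow; the artifact's own scripts
# (parseSurefireXMLs.tgz) are the authoritative version.

def parse_report(xml_path):
    """Yield one record per <testcase> in a Maven Surefire TEST-*.xml report."""
    root = ET.parse(xml_path).getroot()
    suite = root.get("name", "")
    for case in root.iter("testcase"):
        failure = case.find("failure")
        error = case.find("error")
        skipped = case.find("skipped")
        if failure is not None:
            status, message = "failure", failure.get("message", "")
        elif error is not None:
            status, message = "error", error.get("message", "")
        elif skipped is not None:
            status, message = "skipped", ""
        else:
            status, message = "pass", ""
        yield {
            "suite": suite,
            "class": case.get("classname", ""),
            "test": case.get("name", ""),
            "time": case.get("time", ""),
            "status": status,
            "message": message,
        }

def main(report_dir, out_csv):
    fields = ["suite", "class", "test", "time", "status", "message"]
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        for xml_file in sorted(Path(report_dir).rglob("TEST-*.xml")):
            writer.writerows(parse_report(xml_file))

if __name__ == "__main__":
    # e.g. python parse_reports.py target/surefire-reports out.csv
    main(sys.argv[1], sys.argv[2])
```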
  3. A large number of software reliability growth models have been proposed in the literature. Many of these models have also been the subject of optimization problems, including the optimal release problem, in which a decision-maker seeks to minimize cost by balancing the cost of testing with the cost of field failures. However, the majority of these optimal release formulations are either unused or untested. In many cases, researchers derive expressions and apply them to the complete set of failure data in order to identify the time at which cost was minimized, but this is clearly unusable, since it is not possible to go back in time to make a release decision. The only other strategy implicit in these optimal release formulations is to refit a model every time a failure occurs and to assess whether the optimal release time has passed or additional testing should be performed.
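The cost trade-off behind the optimal release problem is typically written as a testing cost that grows with the release time plus a field-failure cost that shrinks as more faults are detected before release. The Python sketch below evaluates that standard structure under a Goel-Okumoto mean value function m(t) = a(1 - e^(-bt)) and picks the cheapest release time on a grid; the model choice and every parameter value are illustrative, not taken from the paper above.

```python
import numpy as np

# Illustrative optimal-release computation: testing cost grows with the release
# time T, while the expected cost of field failures shrinks as more defects are
# found before release.  Goel-Okumoto mean value function m(t) = a*(1 - exp(-b*t));
# all parameter values below are invented.

a, b = 120.0, 0.05          # expected total faults, fault-detection rate
c_test_per_hour = 10.0      # cost of one hour of testing
c_fix_in_test = 50.0        # cost of fixing a fault found during testing
c_fix_in_field = 500.0      # cost of a fault that escapes to the field

def m(t):
    return a * (1.0 - np.exp(-b * t))

def total_cost(T):
    found, escaped = m(T), a - m(T)
    return c_test_per_hour * T + c_fix_in_test * found + c_fix_in_field * escaped

T_grid = np.linspace(0.0, 400.0, 4001)
costs = total_cost(T_grid)
T_star = T_grid[np.argmin(costs)]
print(f"grid-optimal release time: {T_star:.1f} hours, cost {costs.min():.0f}")
```

The abstract's point is precisely that evaluating such an expression on the complete failure history is only possible after the fact; an implementable policy has to refit and re-evaluate it as failures arrive.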
  4. Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hangs or crashes) and resource-overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause indicative of the failure in near real time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% of 843 real-world production issues and pinpointed the root causes of 95.8% of them, with less than 0.01% runtime overhead.
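The hierarchy this abstract describes, first identify the failing component, then its failure mode, then point at a likely cause, can be pictured with the toy pipeline below. The rules, feature names, and thresholds are entirely invented placeholders; the actual system uses domain-guided machine learning models over production telemetry, not hand-written rules.

```python
from dataclasses import dataclass

# Toy sketch of a *hierarchical* diagnosis flow (component -> failure mode ->
# likely cause).  Every rule, feature name, and threshold below is a made-up
# placeholder used only to show the staged structure.

@dataclass
class Diagnosis:
    component: str
    failure_mode: str
    likely_cause: str

def identify_component(telemetry):
    # Stage 1: which component looks unhealthy?  (placeholder rule)
    return max(telemetry, key=lambda comp: telemetry[comp].get("error_rate", 0.0))

def classify_failure_mode(metrics):
    # Stage 2: reliability failure vs. resource-overload failure (placeholder rule)
    if metrics.get("heartbeat_missed", 0) > 3:
        return "reliability (hang/crash)"
    if metrics.get("queue_depth", 0) > 1000:
        return "resource overload (congestion)"
    return "unknown"

def rank_cause(component, failure_mode):
    # Stage 3: point at the most likely cause for this component/mode pair (placeholder table)
    causes = {
        ("storage", "resource overload (congestion)"): "I/O congestion on shared filesystem",
        ("network", "reliability (hang/crash)"): "link failure",
    }
    return causes.get((component, failure_mode), "needs manual triage")

def diagnose(telemetry):
    comp = identify_component(telemetry)
    mode = classify_failure_mode(telemetry[comp])
    return Diagnosis(comp, mode, rank_cause(comp, mode))

telemetry = {
    "storage": {"error_rate": 0.30, "queue_depth": 4000},
    "network": {"error_rate": 0.02, "heartbeat_missed": 0},
}
print(diagnose(telemetry))
```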
  5. Continuous Integration (CI) practices encourage developers to frequently integrate code into a shared repository. Each integration is validated by an automatic build and test run so that errors are revealed as early as possible. When CI failures or integration errors are reported, existing techniques are insufficient to automatically locate the root causes, for two reasons. First, a CI failure may be triggered by faults in source code and/or build scripts, while current approaches consider only source code. Second, a tentative integration can fail because of build failures and/or test failures, while existing tools focus on test failures only. This paper presents UniLoc, the first unified technique to localize faults in both source code and build scripts given a CI failure log, without assuming the failure's location (source code or build scripts) or nature (a test failure or not). Adopting an information retrieval (IR) strategy, UniLoc locates buggy files by treating source code and build scripts as documents to search and by considering build logs as search queries. However, instead of naïvely applying an off-the-shelf IR technique to these software artifacts, UniLoc applies various domain-specific heuristics to optimize the search queries, search space, and ranking formulas for more accurate fault localization. To evaluate UniLoc, we gathered 700 CI failure fixes from 72 open-source projects built with Gradle. UniLoc effectively located bugs, with an average MRR (Mean Reciprocal Rank) of 0.49, MAP (Mean Average Precision) of 0.36, and NDCG (Normalized Discounted Cumulative Gain) of 0.54, and it outperformed the state-of-the-art IR-based tools BLUiR and Locus. UniLoc has the potential to help developers diagnose root causes for CI failures more accurately and efficiently.
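UniLoc's underlying retrieval setup, treating source files and build scripts as documents and the CI failure log as the query, can be pictured with a plain TF-IDF ranking. The sketch below is essentially the naïve off-the-shelf baseline that the paper improves on with domain-specific heuristics for query construction, search-space pruning, and ranking; the file contents and the log text are toy placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Baseline IR-style fault localization: rank repository files (source code and
# build scripts alike) by textual similarity to a CI failure log.  This is the
# plain off-the-shelf setup, not UniLoc's refined heuristics.  The repository
# snippets and the log below are toy placeholders.

documents = {
    "src/main/java/app/PaymentService.java":
        "class PaymentService { void charge(Card card) { gateway.submit(card); } }",
    "src/test/java/app/PaymentServiceTest.java":
        "class PaymentServiceTest { @Test void chargeDeclinedCard() { ... } }",
    "build.gradle":
        "dependencies { implementation 'com.gateway:client:2.1' } test { useJUnitPlatform() }",
}

ci_failure_log = """
> Task :test FAILED
PaymentServiceTest > chargeDeclinedCard FAILED
    java.lang.NoClassDefFoundError: com/gateway/client/Gateway
"""

paths = list(documents)
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]+")
doc_matrix = vectorizer.fit_transform([documents[p] for p in paths])
query_vec = vectorizer.transform([ci_failure_log])

# Higher cosine similarity to the failure log = more suspicious file.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for path, score in sorted(zip(paths, scores), key=lambda ps: -ps[1]):
    print(f"{score:.3f}  {path}")
```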