Title: Integrating Failure Detection and Isolation into a Reference Governor-Based Reconfiguration Strategy for Stuck Actuators
A set-theoretic Failure Mode and Effect Management (FMEM) strategy for stuck/jammed actuators in systems with redundant actuators is considered. This strategy uses a reference governor for command tracking while satisfying state and control constraints and, once the failure mode is known, generates a recovery command sequence during mode transitions triggered by actuator failures. In the paper, this FMEM strategy is enhanced with a scheme to detect and isolate failures within a finite time, and to handle unmeasured set-bounded disturbance inputs. A numerical example is reported to illustrate the offline design process and the online operation with the proposed approach.
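The abstract relies on a reference governor to track commands without violating constraints. As a rough illustration of that building block only (not the paper's FMEM scheme, and not its failure-handling logic), here is a minimal sketch of a generic scalar reference governor. It assumes a stable discrete-time linear plant x(t+1) = A x(t) + B v(t), output y = C x + D v, and polyhedral constraints H y <= h. At each step it applies v(t) = v(t-1) + kappa (r(t) - v(t-1)) with the largest kappa in [0, 1] for which the output predicted under a constant v stays admissible over a horizon, plus a tightened steady-state check. The plant matrices, the horizon `horizon=100`, the tightening `eps=0.05`, and the helper names `prediction_matrices`, `governor_step`, and `simulate` are all hypothetical.

```python
import numpy as np


def prediction_matrices(A, B, C, D, horizon):
    """Return (Phi, G, G_ss) so that, holding v constant from state x,
    y(k) = Phi[k] @ x + G[k] @ v for k = 0..horizon, and G_ss @ v is the steady-state output."""
    n = A.shape[0]
    Phi, G = [], []
    A_k = np.eye(n)
    S = np.zeros_like(B)  # running sum of A^j @ B for j < k
    for _ in range(horizon + 1):
        Phi.append(C @ A_k)
        G.append(C @ S + D)
        S = S + A_k @ B
        A_k = A @ A_k
    # Steady-state gain for a constant v (assumes A is Schur stable).
    G_ss = C @ np.linalg.solve(np.eye(n) - A, B) + D
    return Phi, G, G_ss


def governor_step(x, v_prev, r, Phi, G, G_ss, H, h, eps=0.05):
    """Scalar reference governor update: largest kappa in [0, 1] such that
    H @ y(k) <= h for k = 0..horizon and H @ y_ss <= (1 - eps) * h."""
    dv = r - v_prev
    rows = [(H @ G_k @ dv, h - H @ (P_k @ x + G_k @ v_prev))
            for P_k, G_k in zip(Phi, G)]
    rows.append((H @ G_ss @ dv, (1.0 - eps) * h - H @ G_ss @ v_prev))
    kappa = 1.0
    for alpha, beta in rows:
        # Each constraint row has the form alpha * kappa <= beta.
        for a_i, b_i in zip(alpha, beta):
            if a_i > 1e-12:
                kappa = min(kappa, b_i / a_i)
    kappa = max(kappa, 0.0)  # kappa = 0 keeps the previously admissible command
    return v_prev + kappa * dv


def simulate(A, B, C, D, H, h, r, steps, horizon=100, use_governor=True):
    """Simulate from x = 0, either through the governor or by applying r directly."""
    Phi, G, G_ss = prediction_matrices(A, B, C, D, horizon)
    x = np.zeros(A.shape[0])
    v = np.zeros(B.shape[1])
    outputs = []
    for _ in range(steps):
        v = governor_step(x, v, r, Phi, G, G_ss, H, h) if use_governor else r
        outputs.append(C @ x + D @ v)
        x = A @ x + B @ v
    return np.array(outputs), v


# Hypothetical lightly damped plant (poles 0.9 +/- 0.2j) with unit DC gain.
A = np.array([[0.9, 0.2], [-0.2, 0.9]])
B = np.array([[0.0], [0.25]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
H = np.array([[1.0]])  # single constraint: y <= 1
h = np.array([1.0])
r = np.array([1.0])

y_open, _ = simulate(A, B, C, D, H, h, r, steps=60, use_governor=False)
y_rg, v_final = simulate(A, B, C, D, H, h, r, steps=60)
print(f"max y without governor: {y_open.max():.3f}")
print(f"max y with governor:    {y_rg.max():.3f}, final v = {v_final[0]:.3f}")
```

Applying r directly makes this hypothetical plant overshoot the bound y <= 1, while the governed command only moves toward r as far as the predicted response allows, so it settles below the tightened steady-state limit. In standard reference governor constructions the horizon is derived from the admissible set; here it is simply chosen long enough for the toy plant's transients to decay.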
Award ID(s): 1931738
PAR ID: 10433562
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of 2022 American Control Conference, Atlanta, Georgia, USA, June 8-10, 2022.
Page Range / eLocation ID: 4311 to 4316
Format(s): Medium: X
Sponsoring Org: National Science Foundation
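The abstract at the top of this record mentions detecting and isolating stuck actuators within a finite time under unmeasured set-bounded disturbances, but does not describe the mechanism. A common generic way to think about this is a set-membership consistency test: keep a set of candidate modes and discard any mode whose one-step prediction cannot explain the measured state for any admissible disturbance. The sketch below shows only that generic idea and is not the paper's scheme. It assumes full state measurement, a known jammed position for each hypothesized failure, and an infinity-norm disturbance bound `w_max`; the matrices, `stuck_value`, and the helpers `effective_input` and `update_candidates` are hypothetical.

```python
import numpy as np


def effective_input(u_cmd, mode, stuck_value):
    """Input actually applied under a hypothesized mode (None means healthy)."""
    u = np.array(u_cmd, dtype=float)
    if mode is not None:
        u[mode] = stuck_value[mode]  # jammed actuator ignores its command
    return u


def update_candidates(candidates, x, u_cmd, x_next, A, B, w_max, stuck_value, tol=1e-12):
    """Keep modes whose prediction matches x_next up to the disturbance bound."""
    kept = []
    for mode in candidates:
        predicted = A @ x + B @ effective_input(u_cmd, mode, stuck_value)
        if np.max(np.abs(x_next - predicted)) <= w_max + tol:
            kept.append(mode)
    return kept


rng = np.random.default_rng(0)
A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[1.0, 0.5], [0.5, 1.0]])  # two redundant actuators
w_max = 0.05
stuck_value = np.array([0.0, 0.0])  # hypothetical: a jammed actuator sits at 0
true_mode = 1  # actuator 1 is actually stuck

candidates = [None, 0, 1]
x = np.zeros(2)
history = []
for u_cmd in ([0.0, 0.0], [1.0, 1.0], [1.0, 1.0]):
    w = rng.uniform(-w_max, w_max, size=2)  # unmeasured bounded disturbance
    x_next = A @ x + B @ effective_input(u_cmd, true_mode, stuck_value) + w
    candidates = update_candidates(candidates, x, u_cmd, x_next, A, B, w_max, stuck_value)
    history.append(list(candidates))
    x = x_next
print(history)
```

In this toy example the zero command in the first step cannot distinguish the hypotheses, so all three modes remain; the next nonzero command separates their predictions by more than the disturbance bound and leaves only actuator 1. This illustrates why finite-time isolation generally depends on the command sequence exciting the differences between failure modes.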
More Like This
  1. This paper proposes a Failure Mode and Effect Management (FMEM) strategy for constrained systems with redundant actuators based on the combined use of constraint admissible and recoverable sets. Several approaches to ensure reconfiguration of the system without constraint violation in the event of actuator failures are presented. Numerical simulation results are reported. 
  2. Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
  3. This artifact contains the source code for FlakeRake, a tool for automatically reproducing timing-dependent flaky-test failures. It also includes raw and processed results produced in the evaluation of FlakeRake. Contents:
     - Timing-related APIs at which FlakeRake considers adding sleeps: timing-related-apis
     - Anonymized code for FlakeRake (not runnable in its anonymized state, but included for reference; we will publicly release the non-anonymized code under an open source license pending double-blind review): flakerake.tgz
     - Failure messages extracted from the FlakeFlagger dataset: 10k_reruns_failures_by_test.csv.gz
     - Output from running isolated reruns on each flaky test in the FlakeFlagger dataset: 10k_isolated_reruns_all_results.csv.gz (all test results summarized into a CSV), 10k_isolated_reruns_failures_by_test.csv.gz (CSV of test failures only, with failure messages), 10k_isolated_reruns_raw_results.tgz (all raw results from reruns, including the XML files output by Maven)
     - Output from running the FlakeFlagger replication study (non-isolated 10k reruns): flakeFlaggerReplResults.csv.gz (all test results summarized into a CSV), 10k_reruns_failures_by_test.csv.gz (CSV of failures only, with failure messages), flakeFlaggerRepl_raw_results.tgz (all raw results from reruns, including the XML files output by Maven; this file is markedly larger than the 10k isolated rerun results because we ran *all* tests in this experiment, whereas the 10k isolated rerun experiment re-ran only the tests known to be flaky from the FlakeFlagger dataset)
     - Output from running FlakeRake on each flaky test in the FlakeFlagger dataset: results-bis.tgz (bisection mode) and results-obo.tgz (one-by-one mode)
     - Scripts used to execute FlakeRake on an HPC cluster: execution-scripts.tgz
     - Scripts used to execute the rerun experiments on an HPC cluster: flakeFlaggerReplScripts.tgz
     - Scripts used to parse the "raw" Maven test result XML files in this artifact into the CSV files it contains: parseSurefireXMLs.tgz
     - Output from running FlakeRake in “reproduction” mode, attempting to reproduce each of the failures that matched the FlakeFlagger dataset (collected for bisection mode only): results-repro-bis.tgz
     - Analysis of timing-dependent API calls in the failure-inducing configurations that matched FlakeFlagger failures: bis-sleepyline.cause-to-matched-fail-configs-found.csv
  4. When a failure occurs in production systems, the highest priority is to quickly mitigate it. Despite its importance, failure mitigation is done in a reactive and ad-hoc way: taking some fixed actions only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated into a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides on smart mitigation actions. The goal is to avert VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances the adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. On average, it reduces VM interruptions by 26% compared with the previous static strategy.
  5. This article presents a study of seismically induced failure of massive steep rock slopes. A dynamic implementation of the bonded particle model (BPM) for rock is used to simulate the dynamic response and initiation of fracture in the slopes. Observation of forces that develop within the model in response to wave transmission and dynamic excitation provides insight into the fundamental mechanisms at work in seismically induced rock slope failure. Five distinct mechanisms of failure initiation are identified using non-destructive simulations and confirmed with destructive simulations. Three distinct modes of rock mass movement enabled by the failure mechanisms are identified. The predominant co-seismic failure mode was a shallow, highly disrupted cliff collapse. Cliff collapse is initiated by relatively low levels of shaking. Shallow failures are also triggered at higher levels of shaking prior to the initiation of deeper, more coherent failures in the same seismic event. The results of the numerical study agree with qualitative historical surveys of seismically induced rock slope failure trends and provide insight into the mechanisms behind observed co-seismic rock slope behavior. The frequently observed shallow failures are triggered by high compressive stresses near the cliff toe combined with shallow subhorizontal ruptures behind the cliff face. These mechanisms are not well captured by simplified analysis methods, which may lead to underprediction of shallow co-seismic events. Deeper failure surfaces from stronger shaking create a base-isolation effect, slowing further disruption in the failure mass. Slope dynamic response and damage accumulation were shown to be interdependent and complex, emphasizing the importance of further research into the interaction between rock mass strength, slope geometry, structure, and ground motion characteristics.