Title: Architecture-based Software Reliability Incorporating Fault Tolerant Machine Learning
With increased interest in incorporating machine learning into software and systems, methods to characterize the impact of machine learning reliability are needed to ensure the reliability of the software and systems in which these algorithms reside. Towards this end, we build upon the architecture-based approach to software reliability modeling, which represents application reliability in terms of the component reliabilities and the probabilistic transitions between the components. Traditional architecture-based software reliability models consider all components to be deterministic software. We therefore extend this modeling approach to the case where some components are learning-enabled components. Here, the reliability of a machine learning component is interpreted as the accuracy of its decisions, which is a common measure for classification algorithms. Moreover, we allow these machine learning components to be fault-tolerant in the sense that multiple diverse classifier algorithms are trained to guide decisions and the majority decision is taken. We demonstrate the utility of the approach to assess the impact of machine learning on software reliability and illustrate the concept of reliability growth in machine learning. Finally, we validate past analytical results for a fault-tolerant system composed of correlated components with real machine learning algorithms and data, demonstrating the analytical expression’s ability to accurately estimate the reliability of the fault-tolerant machine learning component and, subsequently, of the architecture-based software within which it resides.
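The two ideas in the abstract, majority voting among classifiers and composing component reliabilities through probabilistic transitions, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Cheung-style absorbing Markov chain formulation, in which the last component is the terminal one, and it assumes statistically independent classifiers, whereas the paper explicitly treats correlated components. The function names, reliabilities, and transition probabilities are hypothetical and chosen only for illustration.

```python
from math import comb


def majority_vote_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n classifiers is correct.

    Assumption (not from the paper): classifiers are independent and each
    has the same accuracy p. The paper instead handles correlated components.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def system_reliability(R: list[float], P: list[list[float]]) -> float:
    """Reliability of an application that starts in component 0 and ends in
    the last component (assumed Cheung-style model).

    R[i] is the reliability of component i; P[i][j] is the probability that
    control passes from component i to component j. Solves
        r_i = R_i * sum_j P[i][j] * r_j   for non-terminal i,
        r_last = R_last,
    and returns r_0.
    """
    n = len(R)
    # Build the linear system (I - Q) r = b, where Q[i][j] = R[i] * P[i][j].
    A = [
        [(1.0 if i == j else 0.0) - (R[i] * P[i][j] if i < n - 1 else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    b = [0.0] * (n - 1) + [R[n - 1]]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    r = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(A[i][k] * r[k] for k in range(i + 1, n))
        r[i] = (b[i] - tail) / A[i][i]
    return r[0]


# Hypothetical three-component application: component 1 is a fault-tolerant
# machine learning component built from three classifiers of accuracy 0.9.
ml_reliability = majority_vote_reliability(0.9, 3)
R = [0.99, ml_reliability, 0.98]
P = [
    [0.0, 1.0, 0.0],  # component 0 always calls the ML component
    [0.0, 0.0, 1.0],  # the ML component always hands off to component 2
    [0.0, 0.0, 0.0],  # component 2 is terminal
]
print(ml_reliability, system_reliability(R, P))
```

For a purely sequential architecture like this one, the result reduces to the product of the component reliabilities, which gives a quick sanity check. Under the independence assumption, raising classifier accuracy or adding voters raises the ML component's reliability and, through it, the application's.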
Award ID(s):
1749635
PAR ID:
10221174
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Annual Reliability and Maintainability Symposium (RAMS 2020)
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Vision Transformers (ViTs) have advanced computer vision by moving from traditional Convolutional Neural Networks (CNNs) to attention-based architectures that process input images as sequences of patches. ViTs achieve enhanced performance in many tasks, such as image classification and object detection, due to their ability to capture global dependencies within input data. While their software implementations are widely adopted, deploying ViTs on hardware introduces several challenges. These include fault tolerance in the presence of hardware failures, real-time reliability, and high computational requirements. Permanent faults in processing elements, interconnections, or memory subsystems lead to incorrect computations and degrade system performance. This paper proposes a fault-tolerant hardware implementation of ViTs to overcome these challenges. The implementation integrates real-time fault detection and recovery mechanisms. The architecture includes four primary units: patch embedding, encoder, decoder, and Multi-Layer Perceptron (MLP), which are supported by fault-tolerant components such as lightweight recompute units, a centralized Built-In Self-Test (BIST), and a learning-based decision-making system that uses a decision tree model. These units are interconnected through a centralized global buffer for efficient data transfer, ensuring seamless operation even under fault conditions.
  2. Optical network technology is one of the leading candidates for meeting the backhaul transport layer latency and capacity requirements of 5G services. In addition, its physical layer programmability supports the execution of advanced methods that can improve 5G service reliability and SLA compliance in the face of equipment failure. While a number of such methods are addressed in the literature, including Virtual Network Function (VNF) fault-tolerant methods, a full proof of concept is yet to be reported. The study in this paper describes a testbed, along with its Software Defined Networking (SDN) and Network Function Virtualization (NFV) capabilities, which is used to experimentally showcase the key functionalities that are required by VNF fault-tolerant methods. The testbed makes use of OpenROADM compliant Dense Wavelength Division Multiplexing (DWDM) equipment to implement the programmable backhaul of a Next Generation Radio Access Network (NG-RAN) Non-standalone (NSA) architecture running 4G Evolved Packet Core (EPC) with the 5G next-generation NodeB (gNB). Specifically, the testbed is used to showcase the live migration of virtualized EPC components that is required to restore pre-failure VNF.
  3. Advances in deep learning have revolutionized cyber-physical applications, including the development of autonomous vehicles. However, real-world collisions involving autonomous control of vehicles have raised significant safety concerns regarding the use of deep neural networks (DNNs) in safety-critical tasks, particularly perception. The inherent unverifiability of DNNs poses a key challenge in ensuring their safe and reliable operation. In this work, we propose perception simplex, a fault-tolerant application architecture designed for obstacle detection and collision avoidance. We analyse an existing LiDAR-based classical obstacle detection algorithm to establish strict bounds on its capabilities and limitations. Such analysis and verification have not yet been possible for deep learning-based perception systems. By employing verifiable obstacle detection algorithms, perception simplex identifies obstacle existence detection faults in the output of unverifiable DNN-based object detectors. When faults with potential collision risks are detected, appropriate corrective actions are initiated. Through extensive analysis and software-in-the-loop simulations, we demonstrate that perception simplex provides deterministic fault tolerance against obstacle existence detection faults, establishing a robust safety guarantee.
  4. Energy-efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power models. In this article, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and 10 machine learning methods to model and predict performance and power consumption and compare their prediction error rates. We use an algorithm-based fault-tolerant linear algebra code and a multilevel checkpointing fault-tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. By utilizing the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential application optimizations, and we predict theoretical outcomes of the optimizations. Based on two collected datasets, we analyze and compare the prediction accuracy in performance and power consumption using MuMMI and 10 machine learning methods.
  5. The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware and software components that lead to poor system-wide performance. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also calls for programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability to tolerate fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.