
Title: Concurrent Error Detection in Embedded Digital Control of Nonlinear Autonomous Systems Using Adaptive State Space Checks
The advent of pervasive autonomous systems such as self-driving cars and drones has raised questions about their safety and trustworthiness, particularly in the event of on-board subsystem errors or failures. In this research, we show how an encoded Extended Kalman Filter can be used to detect anomalous behavior in critical components of nonlinear autonomous systems: sensors, actuators, state estimation algorithms and control software. As opposed to prior work that is limited to linear systems or requires cumbersome machine-learned checks with fixed detection thresholds, the proposed approach uses time-varying checks with dynamically adaptive thresholds. The method is lightweight in comparison to existing methods (it does not rely on machine learning paradigms) and achieves high coverage as well as low error detection latency. A quadcopter and an automotive steer-by-wire system are used as test vehicles, and simulation and hardware results indicate the overhead, coverage and error detection latency benefits of the proposed approach.
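As a rough illustration of what a time-varying check with an adaptive threshold can look like, the sketch below gates the innovation (measurement residual) of a scalar Extended Kalman Filter against a threshold derived from the filter's own innovation variance. It is not the encoded check described in the abstract. The plant model `f`, the noise levels `q` and `r`, the gate width `gate`, and the injected sensor bias are all hypothetical values chosen for illustration.

```python
import math
import random


def f(x, u, dt=0.01):
    """Hypothetical nonlinear plant: one Euler step of x' = -sin(x) + u."""
    return x + dt * (-math.sin(x) + u)


def f_jac(x, dt=0.01):
    """Derivative of f with respect to x (the scalar EKF Jacobian)."""
    return 1.0 - dt * math.cos(x)


def ekf_check_step(x_est, p_est, u, z, q=1e-6, r=1e-4, gate=4.0):
    """One EKF predict/update step with an innovation check.

    The measurement model is assumed to be z = x + noise, so H = 1.
    Returns (x_new, p_new, flagged, threshold).
    """
    # Predict the state and its variance through the linearized model.
    x_pred = f(x_est, u)
    a = f_jac(x_est)
    p_pred = a * p_est * a + q

    # Innovation and its predicted variance; the threshold follows the
    # variance, so it changes from step to step.
    innov = z - x_pred
    s = p_pred + r
    threshold = gate * math.sqrt(s)
    flagged = abs(innov) > threshold

    if flagged:
        # Do not let a suspect measurement corrupt the estimate.
        return x_pred, p_pred, True, threshold

    k = p_pred / s
    return x_pred + k * innov, (1.0 - k) * p_pred, False, threshold


def run(n_steps=200, fault_step=120, fault_bias=0.5, seed=0):
    """Simulate the plant and inject a sensor bias from fault_step onward."""
    rng = random.Random(seed)
    x_true, x_est, p_est = 0.3, 0.0, 1.0
    flags, thresholds = [], []
    for step in range(n_steps):
        u = 0.1
        x_true = f(x_true, u)
        z = x_true + rng.gauss(0.0, 0.01)
        if step >= fault_step:
            z += fault_bias  # hypothetical sensor fault
        x_est, p_est, flagged, thr = ekf_check_step(x_est, p_est, u, z)
        flags.append(flagged)
        thresholds.append(thr)
    return flags, thresholds


if __name__ == "__main__":
    flags, thresholds = run()
    print("first flagged step:", flags.index(True) if any(flags) else None)
    print("threshold at step 0 and step 50:", thresholds[0], thresholds[50])
```

In this sketch the threshold starts wide while the estimate is uncertain and tightens as the filter converges, which is the basic reason a fixed threshold would either miss small faults later on or raise false alarms early.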
Journal Name:
International Test Conference
Page Range / eLocation ID:
1 to 10
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this research, a low-cost error detection and correction approach is developed for multilayer perceptron networks, where checker neurons are used to encode hidden layer functions using independent training experiments. Error detection and correction are predicated on validating consistency properties of the encoded checks, and results show that high coverage of injected errors can be achieved with extremely low computational overhead. 
  2. The last decade has seen tremendous advances in the application of artificial neural networks to solving problems that mimic human intelligence. Many of these systems are implemented using traditional digital compute engines where errors can occur during memory accesses or during numerical computation. While such networks are inherently error resilient, specific errors can result in incorrect decisions. This work develops a low-overhead error detection and correction approach for multilayer artificial neural networks, where the hidden layer functions are approximated using checker neurons. Experimental results show that high coverage of injected errors can be achieved with extremely low computational overhead using consistency properties of the encoded checks. A key side benefit is that the checks can flag errors when the network is presented with outlier data that do not correspond to the data on which the network was trained to operate. 
  3. The reliability of emerging neuromorphic compute fabrics is of great concern due to their widespread use in critical data-intensive applications. Ensuring such reliability is difficult due to the intensity of underlying computations (billions of parameters), errors induced by low power operation and the complex relationship between errors in computations and their effect on network performance accuracy. We study the problem of designing error-resilient neuromorphic systems where errors can stem from: (a) soft errors in computation of matrix-vector multiplications and neuron activations, (b) malicious trojan and adversarial security attacks and (c) effects of manufacturing process variations on analog crossbar arrays that can affect DNN accuracy. The core principle of error detection relies on embedded predictive neuron checks using invariants derived from the statistics of nominal neuron activation patterns of hidden layers of a neural network. Algorithmic encodings of hidden neuron function are also used to derive invariants for checking. A key contribution is designing checks that are robust to the inherent nonlinearity of neuron computations with minimal impact on error detection coverage. Once errors are detected, they are corrected using probabilistic methods due to the difficulties involved in exact error diagnosis in such complex systems. The technique is scalable across soft errors as well as a range of security attacks. The effects of manufacturing process variations are handled through the use of compact tests from which DNN performance can be assessed using learning techniques. Experimental results on a variety of neuromorphic test systems: DNNs, spiking networks and hyperdimensional computing are presented. 
  4. The last decade has seen tremendous advances in ubiquitous control, computing and communication platforms that are available anytime, anywhere. These platforms allow humans to interact with machines through sensing, control and actuation functions in ways not imaginable a few decades ago. While robust control techniques aim to maintain autonomous system performance in the presence of bounded modeling errors, they are not designed to manage large multiparameter variations and internal component failures that are inevitable during lengthy periods of field deployment. To address the trustworthiness of autonomous systems in the field, we propose a cross-layer error resilience approach in which errors are detected and corrected at appropriate levels of the design (hardware through software) with the objective of minimizing the latency of error recovery while maintaining high failure coverage. At the control processor level, soft errors in the digital control processor are considered. At the system level, sensor and actuator failures are analyzed. These impairments define the health of the system. A methodology for adapting the control procedure of the autonomous system to compensate for degraded system health is proposed. It is shown how this methodology can be applied to simple linear and nonlinear control systems to maintain system performance in the presence of internal component failures. Experimental results demonstrate the feasibility of the proposed methodology. 
  5. Online reinforcement learning (RL) based systems are being increasingly deployed in a variety of safety-critical applications ranging from drone control to medical robotics. These systems typically use RL onboard rather than relying on remote operation from high-performance datacenters. Due to the dynamic nature of the environments they work in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to online error resilience in machine learning systems have relied on the availability of large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that makes use of running statistics collected across the (real-time) RL training process to configure error detection thresholds without the need to access a reference training dataset. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs as erroneous. These erroneous neurons are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms involving the use of multiple concentration bounds on CPU as well as GPU hardware. 
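The running-statistics idea in item 5 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the method from that work: it tracks a per-neuron mean and variance with Welford's online algorithm, uses Chebyshev's inequality as one possible concentration bound, and sets an out-of-bound output to zero. The names `RunningStats`, `check_and_suppress`, the tail probability `delta`, and the `warmup` length are hypothetical.

```python
import math


class RunningStats:
    """Online mean and variance of one neuron's output (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; zero until two samples are available.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def chebyshev_k(delta):
    """Chebyshev: P(|X - mu| >= k * sigma) <= 1 / k**2, so k = 1 / sqrt(delta)."""
    return 1.0 / math.sqrt(delta)


def check_and_suppress(value, stats, delta=0.01, warmup=10):
    """Return (output, flagged); a flagged value is replaced by zero.

    Values seen during the warm-up period are accepted unconditionally,
    and flagged values are not folded into the running statistics.
    """
    if stats.n >= warmup:
        bound = chebyshev_k(delta) * stats.std
        if abs(value - stats.mean) > bound:
            return 0.0, True
    stats.update(value)
    return value, False


if __name__ == "__main__":
    stats = RunningStats()
    for i in range(20):
        check_and_suppress(0.9 if i % 2 == 0 else 1.1, stats)
    print(check_and_suppress(1e6, stats))   # a corrupted output is suppressed
    print(check_and_suppress(1.05, stats))  # a nominal output passes through
```

Because the threshold is derived from statistics gathered while the system runs, no reference training dataset is needed to configure it, which is the property the abstract emphasizes.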