This content will become publicly available on March 23, 2024
- Award ID(s):
- 2128419
- NSF-PAR ID:
- 10453092
- Date Published:
- Journal Name:
- Latin American Test Symposium
- Page Range / eLocation ID:
- 1-2
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Online reinforcement learning (RL) based systems are being increasingly deployed in a variety of safety-critical applications ranging from drone control to medical robotics. These systems typically use RL onboard rather than relying on remote operation from high-performance datacenters. Due to the dynamic nature of the environments they work in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to on-line error resilience in machine learning systems have relied on availability of the large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that makes use of running statistics collected across the (real-time) RL training process to configure error detection thresholds without the need to access a reference training dataset. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs as erroneous. These erroneous neurons are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms involving the use of multiple concentration bounds on CPU as well as GPU hardware.more » « less
-
Transformer networks have achieved remarkable success in Natural Language Processing (NLP) and Computer Vision applications. However, the underlying large volumes of Transformer computations demand high reliability and resilience to soft errors in processor hardware. The objective of this research is to develop efficient techniques for design of error resilient Transformer architectures. To enable this, we first perform a soft error vulnerability analysis of every fully connected layers in Transformer computations. Based on this study, error detection and suppression modules are selectively introduced into datapaths to restore Transformer performance under anticipated error rate conditions. Memory access errors and neuron output errors are detected using checksums of linear Transformer computations. Correction consists of determining output neurons with out-of-range values and suppressing the same to zero. For a Transformer with nominal BLEU score of 52.7, such vulnerability guided selective error suppression can recover language translation performance from a BLEU score of 0 to 50.774 with as much as 0.001 probability of activation error, incurring negligible memory and computation overheads.more » « less
-
Abstract Early attack detection is essential to ensure the security of complex networks, especially those in critical infrastructures. This is particularly crucial in networks with multi-stage attacks, where multiple nodes are connected to external sources, through which attacks could enter and quickly spread to other network elements. Bayesian attack graphs (BAGs) are powerful models for security risk assessment and mitigation in complex networks, which provide the probabilistic model of attackers’ behavior and attack progression in the network. Most attack detection techniques developed for BAGs rely on the assumption that network compromises will be detected through routine monitoring, which is unrealistic given the ever-growing complexity of threats. This paper derives the optimal minimum mean square error (MMSE) attack detection and monitoring policy for the most general form of BAGs. By exploiting the structure of BAGs and their partial and imperfect monitoring capacity, the proposed detection policy achieves the MMSE optimality possible only for linear-Gaussian state space models using Kalman filtering. An adaptive resource monitoring policy is also introduced for monitoring nodes if the expected predictive error exceeds a user-defined value. Exact and efficient matrix-form computations of the proposed policies are provided, and their high performance is demonstrated in terms of the accuracy of attack detection and the most efficient use of available resources using synthetic Bayesian attack graphs with different topologies.
-
Neuromorphic computing systems execute machine learning tasks designed with spiking neural networks. These systems are embracing non-volatile memory to implement high-density and low-energy synaptic storage. Elevated voltages and currents needed to operate non-volatile memories cause aging of CMOS-based transistors in each neuron and synapse circuit in the hardware, drifting the transistor’s parameters from their nominal values. If these circuits are used continuously for too long, the parameter drifts cannot be reversed, resulting in permanent degradation of circuit performance over time, eventually leading to hardware faults. Aggressive device scaling increases power density and temperature, which further accelerates the aging, challenging the reliable operation of neuromorphic systems. Existing reliability-oriented techniques periodically de-stress all neuron and synapse circuits in the hardware at fixed intervals, assuming worst-case operating conditions, without actually tracking their aging at run-time. To de-stress these circuits, normal operation must be interrupted, which introduces latency in spike generation and propagation, impacting the inter-spike interval and hence, performance (e.g., accuracy). We observe that in contrast to long-term aging, which permanently damages the hardware, short-term aging in scaled CMOS transistors is mostly due to bias temperature instability. The latter is heavily workload-dependent and, more importantly, partially reversible. We propose a new architectural technique to mitigate the aging-related reliability problems in neuromorphic systems by designing an intelligent run-time manager (NCRTM), which dynamically de-stresses neuron and synapse circuits in response to the short-term aging in their CMOS transistors during the execution of machine learning workloads, with the objective of meeting a reliability target. NCRTM de-stresses these circuits only when it is absolutely necessary to do so, otherwise reducing the performance impact by scheduling de-stress operations off the critical path. We evaluate NCRTM with state-of-the-art machine learning workloads on a neuromorphic hardware. Our results demonstrate that NCRTM significantly improves the reliability of neuromorphic hardware, with marginal impact on performance.more » « less
-
The last decade has seen tremendous advances in the application of artificial neural networks to solving problems that mimic human intelligence. Many of these systems are implemented using traditional digital compute engines where errors can occur during memory accesses or during numerical computation. While such networks are inherently error resilient, specific errors can result in incorrect decisions. This work develops a low overhead error detection and correction approach for multilayer artificial neural networks, here the hidden layer functions are approximated using checker neurons. Experimental results show that a high coverage of injected errors can be achieved with extremely low computational overhead using consistency properties of the encoded checks. A key side benefit is that the checks can flag errors when the network is presented outlier data that do not correspond to data with which the network is trained to operate.more » « less