

Search for: All records

Award ID contains: 2128419


  1. Online reinforcement learning (RL) based systems are increasingly deployed in safety-critical applications ranging from drone control to medical robotics. These systems typically run RL onboard rather than relying on remote operation from high-performance datacenters. Because of the dynamic environments they work in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to online error resilience in machine learning systems rely on the availability of large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that uses running statistics collected across the (real-time) RL training process to configure error detection thresholds without access to a reference training dataset. In this methodology, statistical concentration bounds computed from the running statistics are used to diagnose neuron outputs as erroneous. These erroneous neurons are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms, using multiple concentration bounds, on both CPU and GPU hardware.
    Free, publicly-accessible full text available July 3, 2024
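The detection-and-suppression idea in item 1 can be illustrated with a minimal sketch. The abstract does not say which concentration bounds are used, so Chebyshev's inequality is assumed here; the names `RunningStats`, `chebyshev_k`, `check_and_suppress` and the `warmup` parameter are hypothetical and do not come from the paper.

```python
import math


class RunningStats:
    """Welford's online mean/variance for a single neuron output."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation of the values seen so far.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def chebyshev_k(false_alarm_rate):
    # Chebyshev: P(|X - mean| >= k * std) <= 1 / k^2 (assumed bound),
    # so choosing k = 1 / sqrt(rate) caps the false-alarm probability.
    return 1.0 / math.sqrt(false_alarm_rate)


def check_and_suppress(stats, value, k, warmup=10):
    """Return (output, flagged).

    After `warmup` samples, a value further than k standard deviations
    from the running mean is treated as erroneous and set to zero.
    Flagged values are not folded into the running statistics.
    """
    if stats.n >= warmup and abs(value - stats.mean) > k * stats.std():
        return 0.0, True
    stats.update(value)
    return value, False
```

No reference dataset is needed in this sketch: the threshold depends only on statistics gathered while the system runs, which is the property the abstract emphasizes.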
  2. Spiking Neural Networks (SNNs) can be implemented with power-efficient digital as well as analog circuitry. However, in Resistive RAM (RRAM) based SNN accelerators, synapse weights programmed into the crossbar can differ from their ideal values due to defects and programming errors, degrading inference accuracy. In addition, circuit nonidealities within analog spiking neurons that alter the neuron spiking rate (modeled by variations in neuron firing threshold) can degrade SNN inference accuracy when the number of inference time steps (ITSteps) of the SNN is set to a critical minimum that maximizes network throughput. We first develop a recursive linearized check to detect synapse weight errors with high sensitivity. This triggers a correction methodology that sets out-of-range synapse values to zero. To correct the effects of firing threshold variations, we develop a test methodology that calibrates the extent of such variations. The calibration result is then used to proportionally increase the number of inference time steps for chips with higher variation. Experiments on a variety of SNNs demonstrate the viability of the proposed resilience methods.
    Free, publicly-accessible full text available May 29, 2024
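The two correction steps in item 2 can be sketched as follows. The abstract gives neither the valid weight range nor the scaling rule, so the bounds `w_min`/`w_max`, the linear rule in `scaled_time_steps` and its `gain` parameter are all assumptions for illustration; the recursive linearized check itself is not reproduced here.

```python
import math


def zero_out_of_range(weights, w_min, w_max):
    """Set synapse weights outside [w_min, w_max] to zero."""
    return [[w if w_min <= w <= w_max else 0.0 for w in row] for row in weights]


def scaled_time_steps(base_steps, variation, gain=1.0):
    """Increase inference time steps in proportion to measured variation.

    `variation` is a hypothetical calibrated measure (0.0 means nominal).
    The small epsilon keeps floating-point noise from adding a spurious step.
    """
    return math.ceil(base_steps * (1.0 + gain * variation) - 1e-9)
```

A chip with no measured variation keeps the minimum step count, so throughput is only reduced for chips that need the extra steps.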
  3. Transformer networks have achieved remarkable success in Natural Language Processing (NLP) and Computer Vision applications. However, the large volume of underlying Transformer computations demands high reliability and resilience to soft errors in processor hardware. The objective of this research is to develop efficient techniques for the design of error-resilient Transformer architectures. To enable this, we first perform a soft error vulnerability analysis of every fully connected layer in Transformer computations. Based on this study, error detection and suppression modules are selectively introduced into datapaths to restore Transformer performance under anticipated error rate conditions. Memory access errors and neuron output errors are detected using checksums of linear Transformer computations. Correction consists of identifying output neurons with out-of-range values and setting them to zero. For a Transformer with a nominal BLEU score of 52.7, such vulnerability-guided selective error suppression can recover language translation performance from a BLEU score of 0 to 50.774 at an activation error probability as high as 0.001, incurring negligible memory and computation overheads.
    Free, publicly-accessible full text available May 22, 2024
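A checksum over a linear layer, as mentioned in item 3, can be illustrated with the standard identity sum(W x) = (column sums of W) · x. This is a generic sketch and not necessarily the checksum the authors use; the function names, the tolerance `tol` and the range bounds `lo`/`hi` are hypothetical.

```python
def matvec(W, x):
    """Plain matrix-vector product for a fully connected layer (no bias)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def column_checksum(W):
    """Precompute c[j] = sum_i W[i][j] once, while the weights are known good."""
    return [sum(col) for col in zip(*W)]


def checksum_ok(y, c, x, tol=1e-6):
    """Compare sum(y) against c . x; a mismatch signals a corrupted output."""
    expected = sum(cj * xj for cj, xj in zip(c, x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))


def suppress_out_of_range(y, lo, hi):
    """Set outputs outside the assumed nominal range [lo, hi] to zero."""
    return [v if lo <= v <= hi else 0.0 for v in y]
```

The check costs one extra dot product per layer, and suppression only needs to run when the checksum fails.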
  4. The reliability of emerging neuromorphic compute fabrics is of great concern due to their widespread use in critical data-intensive applications. Ensuring such reliability is difficult due to the intensity of the underlying computations (billions of parameters), errors induced by low-power operation and the complex relationship between errors in computations and their effect on network accuracy. We study the problem of designing error-resilient neuromorphic systems where errors can stem from: (a) soft errors in the computation of matrix-vector multiplications and neuron activations, (b) malicious trojan and adversarial security attacks and (c) effects of manufacturing process variations on analog crossbar arrays that can affect DNN accuracy. The core principle of error detection relies on embedded predictive neuron checks using invariants derived from the statistics of nominal neuron activation patterns in the hidden layers of a neural network. Algorithmic encodings of hidden neuron function are also used to derive invariants for checking. A key contribution is designing checks that are robust to the inherent nonlinearity of neuron computations with minimal impact on error detection coverage. Once errors are detected, they are corrected using probabilistic methods, because exact error diagnosis is difficult in such complex systems. The technique is scalable across soft errors as well as a range of security attacks. The effects of manufacturing process variations are handled through compact tests from which DNN performance can be assessed using learning techniques. Experimental results are presented for a variety of neuromorphic test systems: DNNs, spiking networks and hyperdimensional computing.
    Free, publicly-accessible full text available March 23, 2024
  5. Analog crossbar arrays have recently attracted significant attention due to their usefulness for deep neural net (DNN) computations with ultra-low power consumption. However, recent studies have shown that DNNs implemented with such crossbar arrays can suffer as much as 30% degradation in performance due to manufacturing process variability, which degrades their functional safety. One way to test these DNNs is to apply an exhaustive set of test images to each device to ascertain its performance. This is expensive and time-consuming. We propose an alternative test scheme in which a small subset of test images is applied to each DNN and the classification accuracy of the DNN is predicted directly from observation of the final layer outputs of the network. This saves test cost while allowing binning of DNNs for performance. Experimental results for a variety of test cases are presented and show test efficiency improvements of 3X over testing with the exhaustive test image set.
    more » « less
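The test scheme in item 5 predicts accuracy from final-layer outputs on a small image subset. The abstract does not describe the predictor, so the sketch below assumes a deliberately simple one: a one-variable least-squares line from a summary feature (mean top-two output margin) to accuracy, fitted on calibration devices whose accuracy is already known. All names and the choice of feature are hypothetical.

```python
def mean_margin(outputs):
    """Average gap between the largest and second-largest final-layer output."""
    gaps = []
    for scores in outputs:
        top, second = sorted(scores, reverse=True)[:2]
        gaps.append(top - second)
    return sum(gaps) / len(gaps)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_accuracy(outputs, slope, intercept):
    """Predict a device's accuracy from its outputs on the small test subset."""
    acc = slope * mean_margin(outputs) + intercept
    return min(1.0, max(0.0, acc))  # clamp to a valid accuracy range
```

In use, `fit_line` would be run once on calibration devices, after which each new device only needs the small subset of images to be binned by predicted accuracy.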
  6. Deep learning techniques have been widely adopted in daily life with applications ranging from face recognition to recommender systems. The substantial overhead of conventional error tolerance techniques precludes their widespread use, while approaches involving median filtering and invariant generation rely on alterations to DNN training that may be difficult to achieve for larger networks on larger datasets. To address this issue, this paper presents a novel approach that uses the statistics of neuron output gradients to identify and suppress erroneous neuron values. By using the statistics of neurons’ gradients with respect to their neighbors, tighter statistical thresholds are obtained than with neuron output values alone. This approach is modular and is combined with accurate, low-overhead error detection methods to ensure it is used only when needed, further reducing its cost. Deep learning models can be trained using standard methods, and our error correction module is then fit to the trained DNN. It achieves performance comparable or superior to baseline error correction methods at comparable hardware overhead, without needing to modify DNN training or use specialized hardware architectures.