skip to main content


Title: Soft Error Resilient Deep Learning Systems Using Neuron Gradient Statistics
Deep learning techniques have been widely adopted in daily life with applications ranging from face recognition to recommender systems. The substantial overhead of conventional error tolerance techniques precludes their widespread use, while approaches involving median filtering and invariant generation rely on alterations to DNN training that may be difficult to achieve for larger networks on larger datasets. To address this issue, this paper presents a novel approach taking advantage of the statistics of neuron output gradients to identify and suppress erroneous neuron values. By using the statistics of neurons’ gradients with respect to their neighbors, tighter statistical thresholds are obtained compared to the use of neuron output values alone. This approach is modular and is combined with accurate, low-overhead error detection methods to ensure it is used only when needed, further reducing its cost. Deep learning models can be trained using standard methods and our error correction module is fit to a trained DNN, achieving comparable or superior performance compared to baseline error correction methods while incurring comparable hardware overhead without needing to modify DNN training or utilize specialized hardware architectures.  more » « less
Award ID(s):
2128419
NSF-PAR ID:
10358565
Author(s) / Creator(s):
Date Published:
Journal Name:
International On-Line Testing Symposium
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep learning techniques have been widely adopted in daily life with applications ranging from face recognition to recommender systems. The substantial overhead of conventional error tolerance techniques precludes their widespread use, while approaches involving median filtering and invariant generation rely on alterations to DNN training that may be difficult to achieve for larger networks on larger datasets. To address this issue, this paper presents a novel approach taking advantage of the statistics of neuron output gradients to identify and suppress erroneous neuron values. By using the statistics of neurons’ gradients with respect to their neighbors, tighter statistical thresholds are obtained compared to the use of neuron output values alone. This approach is modular and is combined with accurate, low-overhead error detection methods to ensure it is used only when needed, further reducing its cost. Deep learning models can be trained using standard methods and our error correction module is fit to a trained DNN, achieving comparable or superior performance compared to baseline error correction methods while incurring comparable hardware overhead without needing to modify DNN training or utilize specialized hardware architectures. 
    more » « less
  2. Abstract

    To improve predictive machine learning-based models limited by sparse data, supplemental physics-related features are introduced into a deep neural network (DNN). While some approaches inject physics through differential equations or numerical simulation, improvements are possible using simplified relationships from engineering references. To evaluate this hypothesis, thin rectangular plates were simulated to generate training datasets. With plate dimensions and material properties as input features and fundamental natural frequency as the output, predictive performance of a data-driven DNN-based model is compared with models using supplemental inputs, such as modulus of rigidity. To evaluate model accuracy improvements, these additional features are injected into various DNN layers, and the network is trained with four different dataset sizes. When evaluated against independent data of similar features to the training sets, supplementation provides no statistically-significant prediction error reduction. However, notable accuracy gains occur when independent test data is of material and dimensions different from the original training set. Furthermore, when physics-enhanced data is injected into multiple DNN layers, reductions in mean error from 33.2% to 19.6%, 34.9% to 19.9%, 35.8% to 22.4%, and 43.0% to 28.4% are achieved for dataset sizes of 261, 117, 60, and 30, respectively, demonstrating potential for generalizability using a data supplementation approach. Additionally, when compared with other methods—such as linear regression and support vector machine (SVM) approaches—the physics-enhanced DNN demonstrates an order of magnitude reduction in percentage error for dataset sizes of 261, 117, and 60 and a 30% reduction for a size of 30 when compared with a cubic SVM model independently tested with data divergent from the training and validation set.

     
    more » « less
  3. Across basic research studies, cell counting requires significant human time and expertise. Trained experts use thin focal plane scanning to count (click) cells in stained biological tissue. This computer-assisted process (optical disector) requires a well-trained human to select a unique best z-plane of focus for counting cells of interest. Though accurate, this approach typically requires an hour per case and is prone to inter-and intra-rater errors. Our group has previously proposed deep learning (DL)-based methods to automate these counts using cell segmentation at high magnification. Here we propose a novel You Only Look Once (YOLO) model that performs cell detection on multi-channel z-plane images (disector stack). This automated Multiple Input Multiple Output (MIMO) version of the optical disector method uses an entire z-stack of microscopy images as its input, and outputs cell detections (counts) with a bounding box of each cell and class corresponding to the z-plane where the cell appears in best focus. Compared to the previous segmentation methods, the proposed method does not require time-and labor-intensive ground truth segmentation masks for training, while producing comparable accuracy to current segmentation-based automatic counts. The MIMO-YOLO method was evaluated on systematic-random samples of NeuN-stained tissue sections through the neocortex of mouse brains (n=7). Using a cross validation scheme, this method showed the ability to correctly count total neuron numbers with accuracy close to human experts and with 100% repeatability (Test-Retest). 
    more » « less
  4. null (Ed.)
    Deep Neural Networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques don't support localizing DNN bugs because of the lack of understanding of model behaviors. The entire DNN model appears as a black box. To address these problems, we propose an approach and a tool that automatically determines whether the model is buggy or not, and identifies the root causes for DNN errors. Our key insight is that historic trends in values propagated between layers can be analyzed to identify faults, and also localize faults. To that end, we first enable dynamic analysis of deep learning applications: by converting it into an imperative representation and alternatively using a callback mechanism. Both mechanisms allows us to insert probes that enable dynamic analysis over the traces produced by the DNN while it is being trained on the training data. We then conduct dynamic analysis over the traces to identify the faulty layer or hyperparameter that causes the error. We propose an algorithm for identifying root causes by capturing any numerical error and monitoring the model during training and finding the relevance of every layer/parameter on the DNN outcome. We have collected a benchmark containing 40 buggy models and patches that contain real errors in deep learning applications from Stack Overflow and GitHub. Our benchmark can be used to evaluate automated debugging tools and repair techniques. We have evaluated our approach using this DNN bug-and-patch benchmark, and the results showed that our approach is much more effective than the existing debugging approach used in the state-of-the-practice Keras library. For 34/40 cases, our approach was able to detect faults whereas the best debugging approach provided by Keras detected 32/40 faults. Our approach was able to localize 21/40 bugs whereas Keras did not localize any faults. 
    more » « less
  5. In this work, we investigate various non-ideal effects (Stuck-At-Fault (SAF), IR-drop, thermal noise, shot noise, and random telegraph noise)of ReRAM crossbar when employing it as a dot-product engine for deep neural network (DNN) acceleration. In order to examine the impacts of those non-ideal effects, we first develop a comprehensive framework called PytorX based on main-stream DNN pytorch framework. PytorX could perform end-to-end training, mapping, and evaluation for crossbar-based neural network accelerator, considering all above discussed non-ideal effects of ReRAM crossbar together. Experiments based on PytorX show that directly mapping the trained large scale DNN into crossbar without considering these non-ideal effects could lead to a complete system malfunction (i.e., equal to random guess) when the neural network goes deeper and wider. In particular, to address SAF side effects, we propose a digital SAF error correction algorithm to compensate for crossbar output errors, which only needs one-time profiling to achieve almost no system accuracy degradation. Then, to overcome IR drop effects, we propose a Noise Injection Adaption (NIA) methodology by incorporating statistics of current shift caused by IR drop in each crossbar as stochastic noise to DNN training algorithm, which could efficiently regularize DNN model to make it intrinsically adaptive to non-ideal ReRAM crossbar. It is a one-time training method without the request of retraining for every specific crossbar. Optimizing system operating frequency could easily take care of rest non-ideal effects. Various experiments on different DNNs using image recognition application are conducted to show the efficacy of our proposed methodology. 
    more » « less