skip to main content


Title: Efficient On-Device Training via Gradient Filtering
Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device CNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple CNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19× speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20× speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.  more » « less
Award ID(s):
2007284
PAR ID:
10468126
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-0129-8
Page Range / eLocation ID:
3811 to 3820
Format(s):
Medium: X
Location:
Vancouver, BC, Canada
Sponsoring Org:
National Science Foundation
More Like this
  1. Nowadays, one practical limitation of deep neural network (DNN) is its high degree of specialization to a single task or domain (e.g., one visual domain). It motivates researchers to develop algorithms that can adapt DNN model to multiple domains sequentially, while still performing well on the past domains, which is known as multi-domain learning. Almost all conventional methods only focus on improving accuracy with minimal parameter update, while ignoring high computing and memory cost during training, which makes it difficult to deploy multi-domain learning into more and more widely used resource-limited edge devices, like mobile phone, IoT, embedded system, etc. During our study in multi-domain training process, we observe that large memory used for activation storage is the bottleneck that largely limits the training time and cost on edge devices. To reduce training memory usage, while keeping the domain adaption accuracy performance, we propose Dynamic Additive Attention Adaption (DA3), a novel memory-efficient on-device multi-domain learning method. DA3 learns a novel additive attention adaptor module, while freezing the weights of the pre-trained backbone model for each domain. Differentiating from prior works, such module not only mitigates activation memory buffering for reducing memory usage during training, but also serves as a dynamic gating mechanism to reduce the computation cost for fast inference. We validate DA3 on multiple datasets against state-of-the-art methods, which shows great improvement in both accuracy and training time. Moreover, we deployed DA3 into the popular NIVDIA Jetson Nano edge GPU, where the measured experimental results show our proposed \mldam reduces the on-device training memory consumption by 19x-37x, and training time by 2x, in comparison to the baseline methods (e.g., standard fine-tuning, Parallel and Series Res. adaptor, and Piggyback). 
    more » « less
  2. This paper addresses the design of accelerators using systolic architectures to train convolutional neural networks using a novel gradient interleaving approach. Training the neural network involves computation and backpropagation of gradients of error with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients on the same configurable systolic array. This results in the reuse of the variables from one computation to the other and eliminates unnecessary memory accesses and energy consumption associated with these memory accesses. The proposed approach leads to 1.4−2.2× savings in terms of the number of cycles and 1.9× savings in terms of memory accesses in the fully-connected layer. Furthermore, the proposed method uses up to 25% fewer cycles and memory accesses, and 16% less energy than baseline implementations for state-of-the-art CNNs. Under iso-area comparisons, for Inception-v4, compared to weight-stationary (WS), Intergrad achieves 12% savings in energy, 17% savings in memory, and 4% savings in cycles. Savings for Densenet-264 are 18% , 26% , and 27% with respect to energy, memory, and cycles, respectively. Thus, the proposed novel accelerator architecture reduces the latency and energy consumption for training deep neural networks. 
    more » « less
  3. In recent years, Convolutional Neural Networks (CNNs) have shown superior capability in visual learning tasks. While accuracy-wise CNNs provide unprecedented performance, they are also known to be computationally intensive and energy demanding for modern computer systems. In this paper, we propose Virtual Pooling (ViP), a model-level approach to improve speed and energy consumption of CNN-based image classification and object detection tasks, with a provable error bound. We show the efficacy of ViP through experiments on four CNN models, three representative datasets, both desktop and mobile platforms, and two visual learning tasks, i.e., image classification and object detection. For example, ViP delivers 2.1x speedup with less than 1.5% accuracy degradation in ImageNet classification on VGG16, and 1.8x speedup with 0.025 mAP degradation in PASCAL VOC object detection with Faster-RCNN. ViP also reduces mobile GPU and CPU energy consumption by up to 55% and 70%, respectively. As a complementary method to existing acceleration approaches, ViP achieves 1.9x speedup on ThiNet leading to a combined speedup of 5.23x on VGG16. Furthermore, ViP provides a knob for machine learning practitioners to generate a set of CNN models with varying trade-offs between system speed/energy consumption and accuracy to better accommodate the requirements of their tasks. Code is available at https://github.com/cmu-enyac/VirtualPooling. 
    more » « less
  4. Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Training wide and deep neural networks require large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger models or to accelerate training. Our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis on the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, respectively, with little or no accuracy loss. 
    more » « less
  5. Convolutional neural networks (CNNs) have been increasingly deployed to Internet of Things (IoT) devices. Hence, many efforts have been made towards efficient CNN inference in resource-constrained platforms. This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training? We strive to reduce the energy cost during training, by dropping unnecessary computations, from three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation, on the algorithm level. Extensive simulations and ablation studies, with real energy measurements from an FPGA board, confirm the superiority of our proposed strategies and demonstrate remarkable energy savings for training. Specifically, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring an accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving comes at the small accuracy costs of 2% (top-1) and 0.1% (top-5). 
    more » « less