

Title: Dynamic Neural Network to Enable Run-Time Trade-off between Accuracy and Latency
To deploy powerful deep neural networks (DNNs) on smart but resource-limited IoT devices, many prior works have proposed compressing DNNs to reduce network size and computational complexity with negligible accuracy degradation, using techniques such as weight quantization, network pruning, and convolution decomposition. However, conventional DNN compression methods generate a smaller, but fixed, network from a relatively large background model for resource-limited hardware acceleration; such optimization lacks the ability to adjust the network structure in real time to adapt to dynamic computing-hardware resource allocation and workloads. In this paper, we mainly review our two prior works [13], [15] that tackle this challenge, discussing how to construct a dynamic DNN through either uniform or non-uniform sub-net generation. To generate multiple non-uniform sub-nets, [15] must fully retrain the background model for each sub-net individually, which we refer to as the multi-path method. To reduce this training cost, we further propose a single-path sub-net generation method that samples multiple sub-nets in different epochs within a single training round. The constructed dynamic DNN, consisting of multiple sub-nets, enables a run-time trade-off between inference accuracy and latency according to available hardware resources and environmental requirements. Finally, we study dynamic DNNs built with the different sub-net generation methods on both the CIFAR-10 and ImageNet datasets, and we present run-time tuning of accuracy and latency on both GPU and CPU.
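As an illustration of the single-path idea, the minimal sketch below samples a different sub-net width in each epoch while all widths share one set of background weights. It is a toy, not the paper's implementation: the width ratios, the `SliceableConv`/`TinyNet` modules, and the `set_width` helper are all assumed for illustration, and uniform (first-k) channel selection is used for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTH_RATIOS = [0.25, 0.50, 0.75, 1.00]  # candidate sub-net widths (assumed)

class SliceableConv(nn.Module):
    """Conv layer that exposes only its first k output channels at run time,
    so every sub-net shares (and updates) the same background weight tensor."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.ratio = 1.0

    def forward(self, x):
        k = max(1, int(self.conv.out_channels * self.ratio))
        return F.conv2d(x, self.conv.weight[:k], self.conv.bias[:k], padding=1)

class TinyNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.feat = SliceableConv(3, 64)
        self.head = nn.Linear(64, num_classes)

    def set_width(self, ratio):
        self.feat.ratio = ratio

    def forward(self, x):
        h = F.relu(self.feat(x))
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)  # (batch, k) active channels
        return F.linear(h, self.head.weight[:, :h.shape[1]], self.head.bias)

model = TinyNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(8):
    # Single-path sampling: one sub-net per epoch within the same training round.
    model.set_width(WIDTH_RATIOS[epoch % len(WIDTH_RATIOS)])
    x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

At deployment, calling `set_width` selects a sub-net on the fly, trading accuracy for latency without reloading weights.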
Award ID(s):
2005209 1931871 2019548
NSF-PAR ID:
10295406
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC)
Page Range / eLocation ID:
587 to 592
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    With the wide deployment of powerful deep neural networks (DNNs) on smart but resource-limited IoT devices, many prior works have proposed compressing DNNs in a hardware-aware manner to reduce computing complexity while maintaining accuracy, using techniques such as weight quantization, pruning, and convolution decomposition. However, typical DNN compression methods generate a smaller, but fixed, network structure from a relatively large background model for deployment on a resource-limited hardware accelerator; such optimization lacks the ability to tune the structure on the fly to best fit dynamic computing-hardware resource allocation and workloads. In this paper, we mainly review two of our prior works [1], [2] that address this issue, discussing how to construct a dynamic DNN structure through either uniform or non-uniform channel-selection-based sub-network sampling. The constructed dynamic DNN can tune its computing path to involve different numbers of channels, providing the ability to trade off speed, power, and accuracy on the fly after model deployment. Correspondingly, an emerging Spin-Orbit Torque Magnetic Random-Access Memory (SOT-MRAM) based Processing-In-Memory (PIM) accelerator for such dynamic neural network structures is also discussed.
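To make the uniform vs. non-uniform distinction concrete, a minimal sketch follows: uniform selection keeps the same leading fraction of channels in a layer, while non-uniform selection keeps an arbitrary, importance-ranked subset. The L1-norm score is one plausible criterion, assumed here for illustration; the papers' exact selection rule may differ.

```python
import torch

w = torch.randn(64, 64, 3, 3)  # stand-in for a trained conv weight (out, in, kh, kw)

# Uniform channel selection: keep the first 32 of 64 output channels.
uniform_idx = torch.arange(32)

# Non-uniform channel selection: keep the 32 channels with the largest
# L1 norm (an assumed importance score, not necessarily the papers' rule).
scores = w.abs().sum(dim=(1, 2, 3))
nonuniform_idx = scores.topk(32).indices

sub_w_uniform = w[uniform_idx]        # (32, 64, 3, 3)
sub_w_nonuniform = w[nonuniform_idx]  # (32, 64, 3, 3)
```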
  2.
    With the success of deep neural networks (DNNs), many recent works have focused on developing hardware accelerators for power- and resource-limited systems via model compression techniques such as quantization, pruning, and low-rank approximation. However, almost all existing compressed DNNs are fixed after deployment, lacking a run-time adaptive structure that can adapt to dynamic hardware resource allocation, power budget, throughput requirements, and workload. As a countermeasure, to construct a run-time dynamic DNN structure, we propose a novel DNN sub-network sampling method based on non-uniform channel selection for sub-net generation. Users can thus trade off power, speed, computing load, and accuracy on the fly after deployment, depending on the dynamic requirements or specifications of the given system. We verify the proposed model on both the CIFAR-10 and ImageNet datasets using ResNets, where it outperforms the same sub-nets trained individually as well as other related works. On ImageNet with ResNet18, our method offers latency trade-offs of 13.4, 24.6, 41.3, and 62.1 ms on GPU (batch size 128) and 30.5, 38.7, 51.0, and 65.4 ms on CPU.
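The reported latencies suggest a simple run-time policy: pick the largest sub-net that fits the current latency budget. The sketch below is hypothetical; in particular, the pairing of width ratios with the abstract's GPU latencies is assumed, since the abstract lists only the four latency points.

```python
# Hypothetical mapping from sub-net width to measured latency (ms); the
# latencies come from the abstract (ResNet18, ImageNet, GPU, batch size 128),
# but their pairing with these widths is assumed.
GPU_LATENCY_MS = {0.25: 13.4, 0.50: 24.6, 0.75: 41.3, 1.00: 62.1}

def pick_width(budget_ms: float) -> float:
    """Return the largest sub-net width whose latency fits the budget."""
    fitting = [w for w, t in GPU_LATENCY_MS.items() if t <= budget_ms]
    return max(fitting) if fitting else min(GPU_LATENCY_MS)  # fall back to smallest

print(pick_width(45.0))  # -> 0.75
```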
  3. Recently, a new trend of exploring sparsity during training has emerged, which removes parameters during training and thereby improves both training and inference efficiency. This line of work primarily aims to obtain a single sparse model under a pre-defined, large sparsity ratio, leading to a static/fixed sparse inference model that cannot adjust or re-configure its computational complexity (i.e., inference structure and latency) after training to match real-world varying and dynamic hardware resource availability. To enable such run-time or post-training network morphing, the concepts of 'dynamic inference' and 'training-once-for-all' have been proposed: a single network consisting of multiple sub-nets is trained once, and each sub-net performs the same inference function at a different computational complexity. However, traditional dynamic inference training requires a joint training scheme with multi-objective optimization, which incurs very large training overhead. In this work, for the first time, we propose a novel alternating sparse training (AST) scheme to train multiple sparse sub-nets for dynamic inference with no extra training cost compared to training a single sparse model from scratch. Furthermore, to mitigate interference among sub-nets' weight updates, we propose gradient correction within the inner-group iterations. We validate the proposed AST on multiple datasets against state-of-the-art sparse training methods, showing that AST achieves similar or better accuracy while needing to train only once to obtain multiple sparse sub-nets with different sparsity ratios. More importantly, compared with the traditional joint-training-based dynamic inference methodology, the large training overhead is completely eliminated without affecting the accuracy of each sub-net.
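A minimal sketch of the alternating idea follows: iterations cycle through per-sparsity magnitude masks so that each sub-net's weights are updated in turn within one training run. The sparsity ratios and masking rule are assumed for illustration, and AST's gradient-correction step is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SPARSITIES = [0.9, 0.7, 0.5]  # target sparsity ratios for the sub-nets (assumed)

def magnitude_mask(weight, sparsity):
    """Keep only the largest-magnitude (1 - sparsity) fraction of weights."""
    k = max(1, int(weight.numel() * (1 - sparsity)))
    thresh = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= thresh).float()

model = nn.Linear(128, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(300):
    # Alternate among the sparse sub-nets instead of jointly optimizing them.
    s = SPARSITIES[step % len(SPARSITIES)]
    mask = magnitude_mask(model.weight.data, s)
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = F.cross_entropy(F.linear(x, model.weight * mask, model.bias), y)
    opt.zero_grad(); loss.backward()
    model.weight.grad *= mask  # only the active sub-net's weights are updated
    opt.step()
```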
  4. A practical limitation of today's deep neural networks (DNNs) is their high degree of specialization to a single task or domain (e.g., one visual domain). This motivates algorithms that can adapt a DNN model to multiple domains sequentially while still performing well on past domains, a setting known as multi-domain learning. Almost all conventional methods focus only on improving accuracy with minimal parameter updates, while ignoring the high computing and memory cost during training, which makes it difficult to deploy multi-domain learning on increasingly widely used resource-limited edge devices such as mobile phones, IoT devices, and embedded systems. In studying the multi-domain training process, we observe that the large memory used for activation storage is the bottleneck that largely limits training time and cost on edge devices. To reduce training memory usage while preserving domain-adaptation accuracy, we propose Dynamic Additive Attention Adaption (DA3), a novel memory-efficient on-device multi-domain learning method. DA3 learns a novel additive attention adaptor module for each domain while freezing the weights of the pre-trained backbone model. Unlike prior works, this module not only mitigates activation memory buffering to reduce memory usage during training, but also serves as a dynamic gating mechanism that reduces computation cost for fast inference. We validate DA3 on multiple datasets against state-of-the-art methods, showing great improvement in both accuracy and training time. Moreover, we deployed DA3 on the popular NVIDIA Jetson Nano edge GPU, where measured experimental results show that DA3 reduces on-device training memory consumption by 19x-37x and training time by 2x compared to baseline methods (e.g., standard fine-tuning, Parallel and Series Res. adaptor, and Piggyback).
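The frozen-backbone-plus-adaptor pattern can be sketched as follows. This is only an assumed squeeze-and-excitation-style stand-in for the additive attention adaptor (DA3's actual module design differs); the point it illustrates is that only the small adaptor receives gradient updates while the pre-trained backbone stays frozen.

```python
import torch
import torch.nn as nn

class AdditiveAttentionAdaptor(nn.Module):
    """Assumed stand-in: a lightweight gating branch whose output is added
    onto a frozen backbone feature map (not DA3's exact architecture)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x + x * self.gate(x)  # additive, attention-style correction

backbone = nn.Conv2d(3, 64, 3, padding=1)  # toy stand-in for a pre-trained block
for p in backbone.parameters():
    p.requires_grad_(False)                # backbone weights stay frozen

adaptor = AdditiveAttentionAdaptor(64)     # one adaptor per domain
opt = torch.optim.Adam(adaptor.parameters(), lr=1e-3)

x = torch.randn(4, 3, 32, 32)
out = adaptor(backbone(x))                 # only adaptor params get gradients
out.mean().backward()
```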
  5. With the success of deep neural networks (DNNs), many recent works have focused on developing hardware accelerators for power- and resource-limited embedded systems via model compression techniques such as quantization, pruning, and low-rank approximation. However, almost all existing DNN structures are fixed after deployment, lacking a runtime-adaptive structure that can adapt to dynamic hardware resources, power budget, throughput requirements, and workload; correspondingly, there is no runtime-adaptive hardware platform to support dynamic DNN structures. To address this problem, we first propose a dynamic channel-adaptive deep neural network (CA-DNN) that can adjust the number of convolution channels involved (i.e., model size and computing load) at run time (i.e., at the inference stage, without retraining) to dynamically trade off power, speed, computing load, and accuracy. Further, we use knowledge distillation to optimize the model and quantize it to 8 bits and 16 bits, respectively, for hardware-friendly mapping. We test the proposed model on the CIFAR-10 and ImageNet datasets using ResNet; compared with individually trained models of the same size, CA-DNN achieves better accuracy. Moreover, to the best of our knowledge, we are the first to propose a Processing-in-Memory accelerator for such adaptive neural network structures, based on Spin-Orbit Torque Magnetic Random-Access Memory (SOT-MRAM) computational adaptive sub-arrays. We then comprehensively analyze the trade-offs between accuracy and hardware parameters (e.g., energy, memory, and area overhead) for models with different channel widths.
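The knowledge-distillation step mentioned above can be illustrated with a standard Hinton-style loss, where the full-width model's soft outputs guide each narrower sub-net. The temperature, weighting, and the assumption that the full-width network serves as teacher are illustrative; the paper's exact recipe may differ.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD loss (assumed form): soften both logits with temperature T,
    then mix the KL term with the ordinary cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: the full-width output acts as teacher for a narrow sub-net.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
distill_loss(student_logits, teacher_logits, labels).backward()
```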