skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: DA3: Dynamic Additive Attention Adaption for Memory-Efficient On-Device Multi-Domain Learning
Nowadays, one practical limitation of deep neural network (DNN) is its high degree of specialization to a single task or domain (e.g., one visual domain). It motivates researchers to develop algorithms that can adapt DNN model to multiple domains sequentially, while still performing well on the past domains, which is known as multi-domain learning. Almost all conventional methods only focus on improving accuracy with minimal parameter update, while ignoring high computing and memory cost during training, which makes it difficult to deploy multi-domain learning into more and more widely used resource-limited edge devices, like mobile phone, IoT, embedded system, etc. During our study in multi-domain training process, we observe that large memory used for activation storage is the bottleneck that largely limits the training time and cost on edge devices. To reduce training memory usage, while keeping the domain adaption accuracy performance, we propose Dynamic Additive Attention Adaption (DA3), a novel memory-efficient on-device multi-domain learning method. DA3 learns a novel additive attention adaptor module, while freezing the weights of the pre-trained backbone model for each domain. Differentiating from prior works, such module not only mitigates activation memory buffering for reducing memory usage during training, but also serves as a dynamic gating mechanism to reduce the computation cost for fast inference. We validate DA3 on multiple datasets against state-of-the-art methods, which shows great improvement in both accuracy and training time. Moreover, we deployed DA3 into the popular NIVDIA Jetson Nano edge GPU, where the measured experimental results show our proposed \mldam reduces the on-device training memory consumption by 19x-37x, and training time by 2x, in comparison to the baseline methods (e.g., standard fine-tuning, Parallel and Series Res. adaptor, and Piggyback).  more » « less
Award ID(s):
1931871 2144751
PAR ID:
10348284
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
Page Range / eLocation ID:
2619-2627
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transfer learning on edge is challenging due to on-device limited resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacking layers, e.g., convolution, normalization, and activation layers. Though they are efficient for inference, IRBs require that additional activation maps are stored in memory for training weights for convolution layers and scales for normalization layers. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices, and making them unsuitable in the context of transfer learning. To address this issue, we present MobileTL, a memory and computationally efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass. Also, MobileTL approximates the backward computation of the activation layer (e.g., Hard-Swish and ReLU6) as a signed function which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to output) rather than propagating the gradient through the whole network to reduce the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while only incurring a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices. 
    more » « less
  2. Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward and backward propagation and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09 GFLOPS/W in terms of throughput and energy efficiency, respectively. 
    more » « less
  3. The non-volatile Resistive RAM (ReRAM) crossbar has shown great potential in accelerating inference in various machine learning models However, it suffers from high reprogramming energy, hindering its usage for on-device adaption to new tasks. Recently, parameter-efficient fine-tuning methods, such as Low-Rank Adaption (LoRA), have been proposed to train few parameters while matching full fine-tuning performance. However, in ReRAM crossbar, the reprogramming cost of LoRA is non-trivial and will increase significantly when adapting to multi-tasks on the device. To address this issue, we are the first to propose LoRAFusion, a parameter-efficient multi-task on-device learning framework for ReRAM crossbar via fusion of pre-trained LoRA modules. LoRAFusion is a group of LoRA modules that are one-time learned based on diverse domain-specific tasks and deployed to the crossbar, acting as the pool of background knowledge. Then given a new unseen task, those LoRA modules are frozen (i.e., no energy-hungry ReRAM cells reprograming), only the proposed learnable layer-wise LoRA fusion coefficient and magnitude vector parameters are trained on-device to weighted-combine pre-trained LoRA modules, which significantly reduces the training parameter size. Our comprehensive experiments show LoRAFusion only uses 3% of the number of trainable parameters in LoRA (148K vs. 4700K), with 0.19% accuracy drop. Codes are available at https://github.com/ASU-ESIC-FAN-Lab/LoRAFusion 
    more » « less
  4. The ever increasing size of deep neural network (DNN) models once implied that they were only limited to cloud data centers for runtime inference. Nonetheless, the recent plethora of DNN model compression techniques have successfully overcome this limit, turning into a reality that DNN-based inference can be run on numerous resource-constrained edge devices including mobile phones, drones, robots, medical devices, wearables, Internet of Things devices, among many others. Naturally, edge devices are highly heterogeneous in terms of hardware specification and usage scenarios. On the other hand, compressed DNN models are so diverse that they exhibit different tradeoffs in a multi-dimension space, and not a single model can achieve optimality in terms of all important metrics such as accuracy, latency and energy consumption. Consequently, how to automatically select a compressed DNN model for an edge device to run inference with optimal quality of experience (QoE) arises as a new challenge. The state-of-the-art approaches either choose a common model for all/most devices, which is optimal for a small fraction of edge devices at best, or apply device-specific DNN model compression, which is not scalable. In this paper, by leveraging the predictive power of machine learning and keeping end users in the loop, we envision an automated device-level DNN model selection engine for QoE-optimal edge inference. To concretize our vision, we formulate the DNN model selection problem into a contextual multi-armed bandit framework, where features of edge devices and DNN models are contexts and pre-trained DNN models are arms selected online based on the history of actions and users' QoE feedback. We develop an efficient online learning algorithm to balance exploration and exploitation. Our preliminary simulation results validate our algorithm and highlight the potential of machine learning for automating DNN model selection to achieve QoE-optimal edge inference. 
    more » « less
  5. Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the multiplication of two or multiple low-rank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce low-rank through implicit approximations or via costly singular value decomposition (SVD) process on every training step. The former approach usually induces a high accuracy loss while the latter has a low efficiency. In this work, we propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its full-rank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensure the valid form of SVD and avoid gradient vanishing/exploding. Low-rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a low-rank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy, comparing to not only previous factorization methods but also state-of-the-art filter pruning methods. 
    more » « less