Because memory and computation units are separate in the traditional von Neumann architecture, massive data transfer dominates the overall computing system's power and latency, a problem known as the 'memory wall'. With the ever-increasing size and computational complexity of deep learning-based AI models, this has become the bottleneck for state-of-the-art AI computing systems. To address this challenge, in-memory computing (IMC)-based neural network accelerators have been widely investigated to support AI computation within memory. However, most of these works focus only on inference; on-device training and continual learning have not been well explored. In this work, for the first time, we introduce on-device continual learning with an STT-assisted-SOT (SAS) magnetic random access memory (MRAM)-based IMC system. On the hardware side, we fabricated a SAS-MRAM device prototype with four magnetic tunnel junctions (MTJs, each 100 nm × 50 nm) sharing a common heavy-metal layer, achieving significantly improved memory-write and area efficiency compared to traditional SOT-MRAM. We then designed fully digital IMC circuits with our SAS-MRAM to support both neural network inference and on-device learning. To enable efficient on-device continual learning on new task data, we present an 8-bit integer (INT8) continual learning algorithm that uses the SAS-MRAM IMC's bit-serial digital in-memory convolution operations to train a small parallel Reprogramming Network (Rep-Net) while freezing the main backbone model. Extensive studies are presented based on our fabricated SAS-MRAM device prototype, cross-layer device-circuit benchmarking and simulation, and evaluation of the on-device continual learning system.
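To make the bit-serial in-memory convolution concrete: the functional model below is our own illustration in plain Python/NumPy (the actual SAS-MRAM datapath and bit ordering are not specified here). It shows how a signed INT8 dot product decomposes into bitwise AND-plus-popcount operations over bit-planes, which is the primitive a fully digital IMC array evaluates one bit at a time.

```python
import numpy as np

def bit_serial_dot(x, w, bits=8):
    """Functional model of a signed INT8 bit-serial dot product.

    A digital IMC array holds one bit-plane of the stored operand per
    row/cycle; each step is a bitwise AND across the array followed by
    a popcount (modeled here as a 0/1 dot product), shifted and
    accumulated. Two's complement makes the MSB plane carry weight
    -2^(bits-1), hence the sign flips below.
    """
    x = np.asarray(x, dtype=np.int32) & 0xFF  # two's-complement bit patterns
    w = np.asarray(w, dtype=np.int32) & 0xFF
    acc = 0
    for i in range(bits):                     # activation bit-plane
        xi = (x >> i) & 1
        for j in range(bits):                 # weight bit-plane
            wj = (w >> j) & 1
            partial = int(np.dot(xi, wj))     # AND + popcount in hardware
            sign = -1 if (i == bits - 1) != (j == bits - 1) else 1
            acc += sign * (partial << (i + j))
    return acc

# Sanity check against an ordinary integer dot product.
rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64, dtype=np.int8)
w = rng.integers(-128, 128, size=64, dtype=np.int8)
assert bit_serial_dot(x, w) == int(np.dot(x.astype(np.int32), w.astype(np.int32)))
```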
Rep-Net: Efficient On-Device Learning via Feature Reprogramming
Transfer learning, where the goal is to transfer a well-trained deep learning model from a primary source task to a new task, is a crucial learning scheme for on-device machine learning, because IoT/edge devices collect and then process massive amounts of data in our daily lives. However, because of the tight memory constraints of IoT/edge devices, such on-device learning requires an ultra-small training memory footprint, bringing new challenges for memory-efficient learning. Many existing works address this problem by reducing the number of trainable parameters. However, this does not directly translate into memory savings, since the major bottleneck is the activations, not the parameters. To develop memory-efficient on-device transfer learning, in this work we are the first to approach transfer learning from the new perspective of intermediate feature reprogramming of a pre-trained model (i.e., the backbone). To perform this lightweight and memory-efficient reprogramming, we propose to train a tiny Reprogramming Network (Rep-Net) directly on the new task's input data while freezing the backbone model. The proposed Rep-Net exchanges features with the backbone model through an activation connector at regular intervals, so that the backbone and Rep-Net features mutually benefit each other. Through extensive experiments, we validate each design choice of the proposed Rep-Net model in achieving highly memory-efficient on-device reprogramming. Our experiments establish the superior performance (i.e., low training memory and high accuracy) of Rep-Net compared to state-of-the-art on-device transfer learning schemes across multiple benchmarks.
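A minimal PyTorch sketch of the training pattern described above: a frozen backbone, a tiny trainable branch, and an 'activation connector' fusing features between them. All module names, shapes, and the 1x1-convolution connector are our own illustrative assumptions rather than the paper's exact architecture, and only backbone-to-Rep-Net feature injection is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActivationConnector(nn.Module):
    """Hypothetical connector: a cheap 1x1 conv that injects frozen
    backbone features into the trainable Rep-Net branch."""
    def __init__(self, backbone_ch, rep_ch):
        super().__init__()
        self.proj = nn.Conv2d(backbone_ch, rep_ch, kernel_size=1)

    def forward(self, rep_feat, backbone_feat):
        # Match spatial sizes, then fuse the two feature streams.
        backbone_feat = F.adaptive_avg_pool2d(backbone_feat, rep_feat.shape[-2:])
        return rep_feat + self.proj(backbone_feat)

class RepNet(nn.Module):
    """Tiny trainable branch running alongside a frozen backbone."""
    def __init__(self, backbone_stages, stage_channels, rep_ch=32, n_classes=10):
        super().__init__()
        self.backbone_stages = backbone_stages      # nn.ModuleList, frozen
        for p in self.backbone_stages.parameters():
            p.requires_grad_(False)
        self.stem = nn.Conv2d(3, rep_ch, 3, stride=2, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(rep_ch, rep_ch, 3, padding=1),
                          nn.BatchNorm2d(rep_ch), nn.ReLU())
            for _ in stage_channels)
        self.connectors = nn.ModuleList(
            ActivationConnector(c, rep_ch) for c in stage_channels)
        self.head = nn.Linear(rep_ch, n_classes)

    def forward(self, x):
        rep, feat = self.stem(x), x
        for stage, block, conn in zip(self.backbone_stages,
                                      self.blocks, self.connectors):
            with torch.no_grad():    # frozen: no activations kept for backward
                feat = stage(feat)
            rep = conn(block(rep), feat)
        return self.head(rep.mean(dim=(-2, -1)))

# Toy usage: a frozen two-stage backbone with 16- and 32-channel outputs.
backbone = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())])
model = RepNet(backbone, stage_channels=[16, 32])
logits = model(torch.randn(2, 3, 64, 64))       # only Rep-Net params train
```

Because only the Rep-Net branch and connectors require gradients, the large backbone activations never need to be buffered for the backward pass, which is where the memory savings come from.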
- PAR ID: 10348280
- Date Published:
- Journal Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- Page Range / eLocation ID: 12277-12286
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- A practical limitation of today's deep neural networks (DNNs) is their high degree of specialization to a single task or domain (e.g., one visual domain). This motivates researchers to develop algorithms that can adapt a DNN model to multiple domains sequentially while still performing well on past domains, which is known as multi-domain learning. Almost all conventional methods focus only on improving accuracy with minimal parameter updates, while ignoring the high computing and memory cost during training, which makes it difficult to deploy multi-domain learning on increasingly widely used resource-limited edge devices such as mobile phones, IoT devices, and embedded systems. In studying the multi-domain training process, we observe that the large memory used for activation storage is the bottleneck that largely limits training time and cost on edge devices. To reduce training memory usage while preserving domain-adaptation accuracy, we propose Dynamic Additive Attention Adaption (DA3), a novel memory-efficient on-device multi-domain learning method. DA3 learns a novel additive attention adaptor module for each domain while freezing the weights of the pre-trained backbone model. Unlike prior works, this module not only avoids buffering activations to reduce memory usage during training, but also serves as a dynamic gating mechanism that reduces computation cost for fast inference. We validate DA3 on multiple datasets against state-of-the-art methods, showing great improvement in both accuracy and training time. Moreover, we deployed DA3 on the popular NVIDIA Jetson Nano edge GPU, where measured experimental results show that DA3 reduces on-device training memory consumption by 19x-37x and training time by 2x in comparison to baseline methods (e.g., standard fine-tuning, Parallel and Series Res. adaptor, and Piggyback).
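The additive attention adaptor pattern can be sketched as follows; this is a simplified stand-in in which the gate design, skip threshold, and adaptor shape are our own assumptions rather than DA3's exact modules. A frozen pre-trained layer is paired with a lightweight trainable path and a learned gate that scales the additive correction and can skip it at inference.

```python
import torch
import torch.nn as nn

class AdditiveAttentionAdaptor(nn.Module):
    """Simplified stand-in for a per-domain adaptor: a frozen
    pre-trained layer, a lightweight additive path, and a learned
    channel-attention gate that scales the correction and can skip
    it entirely at inference time (dynamic gating)."""
    def __init__(self, frozen_layer, channels, reduction=8):
        super().__init__()
        self.frozen = frozen_layer
        for p in self.frozen.parameters():
            p.requires_grad_(False)              # backbone weights fixed
        self.adaptor = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        with torch.no_grad():                    # frozen path buffers nothing
            y = self.frozen(x)
        g = self.gate(y)
        if not self.training and g.mean() < 0.05:
            return y                             # skip adaptor compute
        return y + g * self.adaptor(y)

# Toy usage: adapt one frozen 32-channel conv layer to a new domain.
layer = nn.Conv2d(32, 32, 3, padding=1)          # pretend pre-trained
out = AdditiveAttentionAdaptor(layer, channels=32)(torch.randn(1, 32, 16, 16))
```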
- The non-volatile resistive RAM (ReRAM) crossbar has shown great potential for accelerating inference in various machine learning models. However, it suffers from high reprogramming energy, hindering its use for on-device adaptation to new tasks. Recently, parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) have been proposed to train few parameters while matching full fine-tuning performance. However, on a ReRAM crossbar, the reprogramming cost of LoRA is non-trivial and increases significantly when adapting to multiple tasks on the device. To address this issue, we are the first to propose LoRAFusion, a parameter-efficient multi-task on-device learning framework for the ReRAM crossbar based on fusing pre-trained LoRA modules. LoRAFusion is a group of LoRA modules that are learned once on diverse domain-specific tasks and deployed to the crossbar, acting as a pool of background knowledge. Given a new unseen task, these LoRA modules are kept frozen (i.e., no energy-hungry reprogramming of ReRAM cells); only the proposed learnable layer-wise LoRA fusion coefficients and magnitude vector parameters are trained on-device to combine the pre-trained LoRA modules in a weighted manner, which significantly reduces the number of training parameters. Our comprehensive experiments show that LoRAFusion uses only 3% of the trainable parameters of LoRA (148K vs. 4700K), with a 0.19% accuracy drop. Code is available at https://github.com/ASU-ESIC-FAN-Lab/LoRAFusion
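A hedged functional sketch of the fusion idea follows; where exactly the magnitude vector applies, and the ranks and shapes, are our assumptions. K frozen LoRA deltas are combined through K learnable scalar coefficients plus a per-output magnitude vector, so only K + out_features values are trained per layer.

```python
import torch
import torch.nn as nn

class LoRAFusionLinear(nn.Module):
    """Sketch: fuse K frozen pre-trained LoRA modules. Only `alpha`
    (K fusion coefficients) and `magnitude` (one value per output)
    are trained for a new task; the base weight and all LoRA factors
    stay fixed, i.e. no crossbar cells are reprogrammed."""
    def __init__(self, base, lora_As, lora_Bs):
        super().__init__()
        self.base = base                          # frozen nn.Linear
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.As = nn.ParameterList(               # A_k: (rank, in_features)
            nn.Parameter(A, requires_grad=False) for A in lora_As)
        self.Bs = nn.ParameterList(               # B_k: (out_features, rank)
            nn.Parameter(B, requires_grad=False) for B in lora_Bs)
        self.alpha = nn.Parameter(torch.zeros(len(lora_As)))
        self.magnitude = nn.Parameter(torch.ones(base.out_features))

    def forward(self, x):
        y = self.base(x)
        for a, A, B in zip(self.alpha, self.As, self.Bs):
            y = y + a * (x @ A.t() @ B.t())       # delta W_k = B_k @ A_k
        return self.magnitude * y

# Toy usage: a pool of K=3 rank-4 LoRA modules over a 64-d linear layer.
As = [torch.randn(4, 64) * 0.01 for _ in range(3)]
Bs = [torch.randn(64, 4) * 0.01 for _ in range(3)]
layer = LoRAFusionLinear(nn.Linear(64, 64), As, Bs)
y = layer(torch.randn(2, 64))   # trainable params per layer: 3 + 64
```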
- With the rise of tiny IoT devices powered by machine learning (ML), many researchers have directed their focus toward compressing models to fit on tiny edge devices. Recent works have achieved remarkable success in compressing ML models for object detection and image classification on microcontrollers with small memory, e.g., 512kB SRAM. However, there remain many challenges prohibiting the deployment of ML systems that require high-resolution images. Due to fundamental limits in memory capacity for tiny IoT devices, it may be physically impossible to store large images without external hardware. To this end, we propose a high-resolution image scaling system for edge ML, called HiRISE, which is equipped with selective region-of-interest (ROI) capability leveraging analog in-sensor image scaling. Our methodology not only significantly reduces the peak memory requirements, but also achieves up to 17.7× reduction in data transfer and energy consumption.
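The in-sensor scaling itself happens in the analog domain and cannot be reproduced in software, but the data-volume argument can be. The toy model below is entirely our own construction with made-up parameters: it downscales the background aggressively while keeping a full-resolution ROI, and counts the pixels that would cross the sensor-to-processor link.

```python
import numpy as np

def roi_scale(frame, roi, bg_factor=8):
    """Functional model of selective region-of-interest scaling.

    The full frame is downscaled aggressively (the part HiRISE would
    perform in the analog domain, in-sensor) while the ROI is kept at
    full resolution, so only a small fraction of pixels crosses the
    sensor-to-processor link.

    frame : (H, W) array;  roi : (top, left, height, width)
    """
    t, l, h, w = roi
    H, W = frame.shape
    # Cheap box-filter downscale of the background (average pooling).
    bg = frame[:H - H % bg_factor, :W - W % bg_factor]
    bg = bg.reshape(H // bg_factor, bg_factor,
                    W // bg_factor, bg_factor).mean(axis=(1, 3))
    roi_crop = frame[t:t + h, l:l + w]          # full-resolution ROI
    transferred = bg.size + roi_crop.size
    print(f"pixels transferred: {transferred} / {frame.size} "
          f"({frame.size / transferred:.1f}x reduction)")
    return bg, roi_crop

frame = np.random.rand(1080, 1920)
bg, roi = roi_scale(frame, roi=(400, 800, 128, 128))
```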
- Transfer learning on edge devices is challenging due to limited on-device resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though efficient for inference, IRBs require additional activation maps to be stored in memory when training the convolution weights and normalization scales. This high memory cost prohibits training IRBs on resource-limited edge devices and makes them unsuitable in the context of transfer learning. To address this issue, we present MobileTL, a memory- and computationally efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts of the internal normalization layers to avoid storing activation maps for the backward pass. In addition, MobileTL approximates the backward computation of the activation layers (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to the output) rather than propagating gradients through the whole network, reducing computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning five blocks, while incurring only a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work on transfer learning for edge devices.
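The binary-mask trick for the activation backward pass is easy to show directly. The sketch below is our own rendering of the idea, not MobileTL's code: a ReLU6 whose backward pass is computed from a saved 1-bit mask instead of the full activation map. For ReLU6 the mask is exact; for Hard-Swish the same recipe is an approximation, as the abstract notes.

```python
import torch

class ReLU6BinaryMask(torch.autograd.Function):
    """ReLU6 whose backward pass needs only a 1-bit mask per element
    (PyTorch bools occupy a byte; a real implementation packs them),
    instead of the full-precision activation map."""

    @staticmethod
    def forward(ctx, x):
        mask = (x > 0) & (x < 6)        # where the gradient passes
        ctx.save_for_backward(mask)
        return torch.clamp(x, 0.0, 6.0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask          # binary gate, no float map stored

x = torch.randn(4, 8, requires_grad=True)
ReLU6BinaryMask.apply(x).sum().backward()
# x.grad is 1 where 0 < x < 6 and 0 elsewhere, recovered from the mask.
```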