NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

LoRAFusion: A Crossbar-aware Multi-task Adaption Framework via Efficient Fusion of Pretrained LoRA Modules

https://doi.org/10.1145/3716368.3735213

Guo, Jingkai; Ali, Asmer; Yang, Li; Fan, Deliang (June 2025, ACM)

The non-volatile Resistive RAM (ReRAM) crossbar has shown great potential in accelerating inference in various machine learning models However, it suffers from high reprogramming energy, hindering its usage for on-device adaption to new tasks. Recently, parameter-efficient fine-tuning methods, such as Low-Rank Adaption (LoRA), have been proposed to train few parameters while matching full fine-tuning performance. However, in ReRAM crossbar, the reprogramming cost of LoRA is non-trivial and will increase significantly when adapting to multi-tasks on the device. To address this issue, we are the first to propose LoRAFusion, a parameter-efficient multi-task on-device learning framework for ReRAM crossbar via fusion of pre-trained LoRA modules. LoRAFusion is a group of LoRA modules that are one-time learned based on diverse domain-specific tasks and deployed to the crossbar, acting as the pool of background knowledge. Then given a new unseen task, those LoRA modules are frozen (i.e., no energy-hungry ReRAM cells reprograming), only the proposed learnable layer-wise LoRA fusion coefficient and magnitude vector parameters are trained on-device to weighted-combine pre-trained LoRA modules, which significantly reduces the training parameter size. Our comprehensive experiments show LoRAFusion only uses 3% of the number of trainable parameters in LoRA (148K vs. 4700K), with 0.19% accuracy drop. Codes are available at https://github.com/ASU-ESIC-FAN-Lab/LoRAFusion
more » « less
Full Text Available
FP-SMR: A Fully Digital Floating-Point Processing-in-SAS-MRAM for Session-based Recommender System

https://doi.org/10.1145/3716368.3735206

Ali, Asmer Hamid; Sridharan, Amitesh; Guo, Cheng; Hwang, William; Tsai, Wilman; Zhang, Jeff; Chen, Yiran; X_Wang, Shan; Fan, Deliang (June 2025, ACM)

With the rapid advancement of DNNs, numerous Process-in-Memory (PIM) architectures based on various memory technologies (Non-Volatile (NVM)/Volatile Memory) have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM), have inherent limitations. STT-MRAM faces high write current requirements, while SOT-MRAM introduces significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor for multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference with an application in on-device session-based recommender system. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems for the same popular YooChoose dataset, it demonstrates a 86 ×, 1.8 ×, and 1.12 × higher energy efficiency than that of GPU, SRAM-PIM, and ReRAM-PIM, respectively.
more » « less
Full Text Available
Phantom: Privacy-Preserving Deep Neural Network Model Obfuscation in Heterogeneous TEE and GPU System

Bai, Juyang; Chowdhuryy, Md Hafizul_Islam; Li, Jingtao; Yao, Fan; Chakrabarti, Chaitali; Fan, Deliang (May 2025, 34th USENIX Security Symposium)

In this work, we present Phantom, a novel privacy-preserving framework for obfuscating deep neural network (DNN) model deployed in heterogeneous TEE/GPU systems. Phantom employs reinforcement learning to add lightweight obfuscation layers, degrading model performance for adversaries while maintaining functionality for authorized user. To reduce the off-chip data communication between TEE and GPU, we propose a Top-K layer-wise obfuscation sensitivity analysis method. Extensive experiments demonstrate Phantom's superiority over state-of-the-art (SoTA) defense methods against model stealing and fine-tuning attacks across various architectures and datasets. It reduces unauthorized accuracy to near-random guessing (e.g., 10% for CIFAR-10 tasks, 1% for CIFAR-100 tasks) and achieves a 6.99% average attack success rate for model stealing, significantly outperforming SoTA competing methods. System implementation on Intel SGX2 and NVIDIA GPU heterogeneous system achieves 35% end-to-end latency reduction compared with most recent SoTA work.
more » « less
Full Text Available
Learning to Prune and Low-Rank Adaptation for Compact Language Model Deployment

https://doi.org/10.1145/3658617.3697648

Ali, Asmer Hamid; Zhang, Fan; Yang, Li; Fan, Deliang (January 2025, ACM)

Nowadays, parameter-efficient fine-tuning (PEFT) large pre-trained models (LPMs) for downstream task have gained significant popularity, since it could significantly minimize the training computational overhead. The representative work, LoRA [1], learns a low-rank adaptor for a new downstream task, rather than fine-tuning the whole backbone model. However, for inference, the large size of the learned model remains unchanged, leading to in-efficient inference computation. To mitigate this, in this work, we are the first to propose a learning-to-prune methodology specially designed for fine-tuning downstream tasks based on LPMs with low-rank adaptation. Unlike prior low-rank adaptation approaches that only learn the low-rank adaptors for downstream tasks, our method further leverages the Gumbel-Sigmoid tricks to learn a set of trainable binary channel-wise masks that automatically prune the backbone LPMs. Therefore, our method could leverage the benefits of low-rank adaptation to reduce the training parameters size and smaller pruned backbone LPM size for efficient inference computation. Extensive experiments show that the Pruned-RoBbase model with our method achieves an average channel-wise structured pruning ratio of 24.5% across the popular GLUE Benchmark, coupled with an average of 18% inference time speed-up in real NVIDIA A5000 GPU. The Pruned-DistilBERT shows an average of 13% inference time improvement with 17% sparsity. The Pruned-LLaMA-7B model achieves up to 18.2% inference time improvement with 24.5% sparsity, demonstrating the effectiveness of our learnable pruning approach across different models and tasks.
more » « less
Full Text Available
RA-BNN: Constructing a Robust & Accurate Binary Neural Network Using a Novel Network Growth Mechanism to Defend Against BFA

https://doi.org/10.1109/CCWC62904.2025.10903977

Rakin, Adnan Siraj; Yang, Li; Li, Jingtao; Yao, Fan; Chakrabarti, Chaitali; Cao, Yu; Seo, Jae-sun; Fan, Deliang (January 2025, IEEE)

Adversarial bit-flip attack (BFA), a type of powerful adversarial weight attack demonstrated in real computer systems has shown enormous success in compromising Deep Neural Network (DNN) performance with a minimal amount of model parameter perturbation through rowhammer-based computer main memory bit-flip. For the first time in this work, we demonstrate to defeat adversarial bit-flip attacks by developing a Robust and Accurate Binary Neural Network (RA-BNN) that adopts a complete BNN (i.e., weights and activations are both in binary). Prior works have demonstrated that binary or clustered weights could intrinsically improve a network's robustness against BFA, while in this work, we further reveal that binary activation could improve such robustness even better. However, with both aggressive binary weight and activation representations, the complete BNN suffers from poor clean (i.e., no attack) inference accuracy. To counter this, we propose an efficient two-stage complete BNN growing method for constructing simultaneously robust and accurate BNN, named RA-Growth. It selectively grows the channel size of each BNN layer based on trainable channel-wise binary mask learning with a Gumbel-Sigmoid function. The wider binary network (i.e., RA-BNN) has dual benefits: it can recover clean inference accuracy and significantly higher resistance against BFA. Our evaluation of the CIFAR-10 dataset shows that the proposed RA-BNN can improve the resistance to BFA by up to 100 x. On ImageNet, with a sufficiently large (e.g., 5,000) number of bit-flips, the baseline BNN accuracy drops to 4.3 % from 51.9 %, while our RA-BNN accuracy only drops to 37.1 % from 60.9 %, making it the best defense performance.
more » « less
Full Text Available

Search for: All records