skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on January 20, 2026

Title: Learning to Prune and Low-Rank Adaptation for Compact Language Model Deployment
Nowadays, parameter-efficient fine-tuning (PEFT) large pre-trained models (LPMs) for downstream task have gained significant popularity, since it could significantly minimize the training computational overhead. The representative work, LoRA [1], learns a low-rank adaptor for a new downstream task, rather than fine-tuning the whole backbone model. However, for inference, the large size of the learned model remains unchanged, leading to in-efficient inference computation. To mitigate this, in this work, we are the first to propose a learning-to-prune methodology specially designed for fine-tuning downstream tasks based on LPMs with low-rank adaptation. Unlike prior low-rank adaptation approaches that only learn the low-rank adaptors for downstream tasks, our method further leverages the Gumbel-Sigmoid tricks to learn a set of trainable binary channel-wise masks that automatically prune the backbone LPMs. Therefore, our method could leverage the benefits of low-rank adaptation to reduce the training parameters size and smaller pruned backbone LPM size for efficient inference computation. Extensive experiments show that the Pruned-RoBbase model with our method achieves an average channel-wise structured pruning ratio of 24.5% across the popular GLUE Benchmark, coupled with an average of 18% inference time speed-up in real NVIDIA A5000 GPU. The Pruned-DistilBERT shows an average of 13% inference time improvement with 17% sparsity. The Pruned-LLaMA-7B model achieves up to 18.2% inference time improvement with 24.5% sparsity, demonstrating the effectiveness of our learnable pruning approach across different models and tasks.  more » « less
Award ID(s):
2314591 2505326 2503906 2505209
PAR ID:
10616373
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400706356
Page Range / eLocation ID:
36 to 42
Format(s):
Medium: X
Location:
Tokyo Japan
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The invention of Transformer model structure boosts the performance of Neural Machine Translation (NMT) tasks to an unprecedented level. Many previous works have been done to make the Transformer model more execution-friendly on resource-constrained platforms. These researches can be categorized into three key fields: Model Pruning, Transfer Learning, and Efficient Transformer Variants. The family of model pruning methods are popular for their simplicity in practice and promising compression rate and have achieved great success in the field of convolution neural networks (CNNs) for many vision tasks. Nonetheless, previous Transformer pruning works did not perform a thorough model analysis and evaluation on each Transformer component on off-the-shelf mobile devices. In this work, we analyze and prune transformer models at the line-wise granularity and also implement our pruning method on real mobile platforms. We explore the properties of all Transformer components as well as their sparsity features, which are leveraged to guide Transformer model pruning. We name our whole Transformer analysis and pruning pipeline as TPrune. In TPrune, we first propose Block-wise Structured Sparsity Learning (BSSL) to analyze Transformer model property. Then, based on the characters derived from BSSL, we apply Structured Hoyer Square (SHS) to derive the final pruned models. Comparing with the state-of-the-art Transformer pruning methods, TPrune is able to achieve a higher model compression rate with less performance degradation. Experimental results show that our pruned models achieve 1.16×–1.92× speedup on mobile devices with 0%–8% BLEU score degradation compared with the original Transformer model. 
    more » « less
  2. There have been many recent attempts to extend the successes of convolutional neural networks (CNNs) from 2-dimensional (2D) image classification to 3-dimensional (3D) video recognition by exploring 3D CNNs. Considering the emerging growth of mobile or Internet of Things (IoT) market, it is essential to investigate the deployment of 3D CNNs on edge devices. Previous works have implemented standard 3D CNNs (C3D) on hardware platforms, however, they have not exploited model compression for acceleration of inference. This work proposes a hardware-aware pruning approach that can fully adapt to the loop tiling technique of FPGA design and is applied onto a novel 3D network called R(2+1)D. Leveraging the powerful ADMM, the proposed pruning method achieves simultaneous high accuracy and significant acceleration of computation on FPGA. With layer-wise pruning rates up to 10× and negligible accuracy loss, the pruned model is implemented on a Xilinx ZCU102 FPGA board, where the pruned model achieves 2.6× speedup compared with the unpruned version, and 2.3× speedup and 2.3× power efficiency improvement compared with state-of-the-art FPGA implementation of C3D. 
    more » « less
  3. We propose SLOPE, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model, to overcome this, prior work uses dense models during fine-tuning. SLOPE improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% iterations of pretraining without adding significant overheads to the model pretraining and inference. In addition, SLOPE uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLOPE accelerates the training and inference of models with billions of parameters up to 1.25→ and 1.54→ respectively (OPT-33B and OPT-66B) while reducing their memory usage by up to 0.63→ and 0.61→ for training and inference respectively. 
    more » « less
  4. Neural network pruning is an essential technique for reducing the size and complexity of deep neural networks, enabling large-scale models on devices with limited resources. However, existing pruning approaches heavily rely on training data for guiding the pruning strategies, making them ineffective for federated learning over distributed and confidential datasets. Additionally, the memory- and computation-intensive pruning process becomes infeasible for recourse-constrained devices in federated learning. To address these challenges, we propose FedTiny, a distributed pruning framework for federated learning that generates specialized tiny models for memory-and computing-constrained devices. We introduce two key modules in FedTiny to adaptively search coarse- and finer-pruned specialized models to fit deployment scenarios with sparse and cheap local computation. First, an adaptive batch normalization selection module is designed to mitigate biases in pruning caused by the heterogeneity of local data. Second, a lightweight progressive pruning module aims to finer prune the models under strict memory and computational budgets, allowing the pruning policy for each layer to be gradually determined rather than evaluating the overall model structure. The experimental results demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art approaches, particularly when compressing deep models to extremely sparse tiny models. FedTiny achieves an accuracy improvement of 2.61% while significantly reducing the computational cost by 95.91% and the memory footprint by 94.01% compared to state-of-the-art methods. 
    more » « less
  5. Neural architecture search (NAS) and network pruning are widely studied efficient AI techniques, but not yet perfect.NAS performs exhaustive candidate architecture search, incurring tremendous search cost.Though (structured) pruning can simply shrink model dimension, it remains unclear how to decide the per-layer sparsity automatically and optimally.In this work, we revisit the problem of layer-width optimization and propose Pruning-as-Search (PaS), an end-to-end channel pruning method to search out desired sub-network automatically and efficiently.Specifically, we add a depth-wise binary convolution to learn pruning policies directly through gradient descent.By combining the structural reparameterization and PaS, we successfully searched out a new family of VGG-like and lightweight networks, which enable the flexibility of arbitrary width with respect to each layer instead of each stage.Experimental results show that our proposed architecture outperforms prior arts by around 1.0% top-1 accuracy under similar inference speed on ImageNet-1000 classification task.Furthermore, we demonstrate the effectiveness of our width search on complex tasks including instance segmentation and image translation.Code and models are released. 
    more » « less