The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, ie, full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
more »
« less
Parameter-Efficient Tuning with Special Token Adaptation
Parameter-efficient tuning aims at updating only a small subset of parameters when adapting a pretrained model to downstream tasks. In this work, we introduce PASTA, in which we only modify the special token representations (e.g., [SEP] and [CLS] in BERT) before the self-attention module at each layer in Transformer-based models. PASTA achieves comparable performance to fine-tuning in natural language understanding tasks including text classification and NER with up to only 0.029% of total parameters trained. Our work not only provides a simple yet effective way of parameter-efficient tuning, which has a wide range of practical applications when deploying finetuned models for multiple tasks, but also demonstrates the pivotal role of special tokens in pretrained language models.
more »
« less
- Award ID(s):
- 2105329
- PAR ID:
- 10418841
- Editor(s):
- Vlachos, Andreas; Augenstein, Isabelle
- Date Published:
- Journal Name:
- Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL)
- Page Range / eLocation ID:
- 865–872
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The non-volatile Resistive RAM (ReRAM) crossbar has shown great potential in accelerating inference in various machine learning models However, it suffers from high reprogramming energy, hindering its usage for on-device adaption to new tasks. Recently, parameter-efficient fine-tuning methods, such as Low-Rank Adaption (LoRA), have been proposed to train few parameters while matching full fine-tuning performance. However, in ReRAM crossbar, the reprogramming cost of LoRA is non-trivial and will increase significantly when adapting to multi-tasks on the device. To address this issue, we are the first to propose LoRAFusion, a parameter-efficient multi-task on-device learning framework for ReRAM crossbar via fusion of pre-trained LoRA modules. LoRAFusion is a group of LoRA modules that are one-time learned based on diverse domain-specific tasks and deployed to the crossbar, acting as the pool of background knowledge. Then given a new unseen task, those LoRA modules are frozen (i.e., no energy-hungry ReRAM cells reprograming), only the proposed learnable layer-wise LoRA fusion coefficient and magnitude vector parameters are trained on-device to weighted-combine pre-trained LoRA modules, which significantly reduces the training parameter size. Our comprehensive experiments show LoRAFusion only uses 3% of the number of trainable parameters in LoRA (148K vs. 4700K), with 0.19% accuracy drop. Codes are available at https://github.com/ASU-ESIC-FAN-Lab/LoRAFusionmore » « less
-
In-Context Learning (ICL) ability has been found efficient across a wide range of applications, where the Large Language Models (LLM) learn to complete the tasks from the examples in the prompt without tuning the parameters. In this work, we conduct a comprehensive study to understand ICL from a statistical perspective. First, we show that the perfectly pretrained LLMs perform Bayesian Model Averaging (BMA) for ICL under a dynamic model of examples in the prompt. The average error analysis for ICL is then built for the perfectly pretrained LLMs with the analysis of BMA. Second, we demonstrate how the attention structure boosts the BMA implementation. With sufficient examples in the prompt, attention is proven to perform BMA under the Gaussian linear ICL model, which also motivates the explicit construction of the hidden concepts from the attention heads' values. Finally, we analyze the pretraining behavior of LLMs. The pretraining error is decomposed as the generalization error and the approximation error. The generalization error is upper bounded via the PAC-Bayes framework. Then the ICL average error of the pretrained LLMs is shown to be the sum of O(T^{-1}) and the pretraining error. In addition, we analyze the ICL performance of the pretrained LLMs with misspecified examples.more » « less
-
null (Ed.)NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these different measures of model ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that these LMs require only about 10M to 100M words to learn to reliably encode most syntactic and semantic features we test. They need a much larger quantity of data in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other, unidentified, forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.more » « less
-
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)—which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization—describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.more » « less