Title: Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification
Modern machine learning models require a large amount of labeled data for training to perform well. A recently emerging paradigm for reducing this reliance on massive labeled data is to use abundantly available labeled data from a related source task to boost the performance of the model on a desired target task where little data may be available. This approach, called transfer learning, has been applied successfully in many application domains. However, although many transfer learning algorithms have been developed, the fundamental understanding of "when" and "to what extent" transfer learning can reduce sample complexity is still limited. In this work, we take a step towards a foundational understanding of transfer learning by focusing on binary classification with linear models and Gaussian features, and we develop statistical minimax lower bounds in terms of the number of source and target samples and an appropriate notion of similarity between the source and target tasks. To derive this bound, we reduce the transfer learning problem to hypothesis testing by constructing a packing set of source and target parameters using the Gilbert-Varshamov bound, which in turn leads to a lower bound on sample complexity. We also evaluate our theoretical results with experiments on real data sets.
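The reduction described in the abstract follows a well-known template, which can be sketched as below. This is the generic packing-plus-Fano argument from the minimax literature, not the paper's exact theorem. The notation is assumed for illustration: $\Theta$ is the parameter space, $\rho$ a metric on it, $P_\theta$ the joint law of all source and target samples under parameter $\theta$, and $\delta$ a separation radius. The constants $2^{d/8}$ and $d/8$ are those of one common textbook statement of the Gilbert-Varshamov bound.

```latex
% Step 1 (Gilbert-Varshamov): for d >= 8 the hypercube contains a large,
% well-separated subset in Hamming distance d_H.
\[
\exists\, \omega^{(1)},\dots,\omega^{(M)} \in \{0,1\}^d :\quad
M \ge 2^{d/8}, \qquad
d_H\bigl(\omega^{(j)},\omega^{(k)}\bigr) \ge \frac{d}{8} \quad (j \neq k).
\]

% Step 2 (estimation to testing): map each \omega^{(j)} to a parameter \theta_j
% so that \rho(\theta_j,\theta_k) >= 2\delta for j != k. Then any estimator
% induces a test \psi, and
\[
\inf_{\hat\theta}\,\sup_{\theta \in \Theta}\,
\mathbb{E}_\theta\bigl[\rho(\hat\theta,\theta)\bigr]
\;\ge\;
\delta \,\inf_{\psi}\,\max_{1 \le j \le M} P_{\theta_j}\bigl(\psi \neq j\bigr).
\]

% Step 3 (Fano's inequality): the testing error is bounded below in terms of
% the pairwise Kullback-Leibler divergences between the hypotheses.
\[
\inf_{\psi}\,\frac{1}{M}\sum_{j=1}^{M} P_{\theta_j}\bigl(\psi \neq j\bigr)
\;\ge\;
1 - \frac{\frac{1}{M^{2}}\sum_{j,k=1}^{M}
\mathrm{KL}\bigl(P_{\theta_j}\,\|\,P_{\theta_k}\bigr) + \log 2}{\log M}.
\]
```

In a transfer learning setting the divergence term typically depends on both sample sizes and on how far apart the source and target parameters are allowed to be, which is how a notion of task similarity can enter such a bound.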
M. M. Kalan, Z. Fabian
(Neural Information Processing Systems (NeurIPS 2020))
Transfer learning has emerged as a powerful technique for improving the performance of machine learning models on new domains where labeled training data may be scarce. In this approach, a model trained for a source task, where plenty of labeled training data is available, is used as a starting point for training a model on a related target task with only a few labeled training examples. Despite the recent empirical success of transfer learning approaches, the benefits and fundamental limits of transfer learning are poorly understood. In this paper we develop a statistical minimax framework to characterize the fundamental limits of transfer learning in the context of regression with linear and one-hidden-layer neural network models. Specifically, we derive a lower bound for the target generalization error achievable by any algorithm as a function of the number of labeled source and target data as well as appropriate notions of similarity between the source and target tasks. Our lower bound provides new insights into the benefits and limitations of transfer learning. We further corroborate our theoretical findings with various experiments.
In this work, we consider a setting where the goal is to achieve adversarial robustness on a target task, given only unlabeled training data from the task distribution, by leveraging labeled training data from a different yet related source task distribution. The absence of labels on the training data for the target task poses a unique challenge, as conventional adversarial robustness defenses cannot be directly applied. To address this challenge, we first bound the adversarial population 0-1 robust loss on the target task in terms of (i) empirical 0-1 loss on the source task, (ii) joint loss on source and target tasks of an ideal classifier, and (iii) a measure of worst-case domain divergence. Motivated by this bound, we develop a novel unified defense framework called Divergence-Aware adveRsarial Training (DART), which can be used in conjunction with a variety of standard UDA methods, e.g., DANN. DART is applicable to general threat models, including the popular ℓ_p-norm model, and does not require heuristic regularizers or architectural changes. We also release DomainRobust, a testbed for evaluating robustness of UDA models to adversarial attacks. DomainRobust consists of 4 multidomain benchmark datasets (with 46 source-target pairs) and 7 meta-algorithms with a total of 11 variants. Our large-scale experiments demonstrate that, on average, DART significantly enhances model robustness on all benchmarks compared to the state of the art, while maintaining competitive standard accuracy. The relative improvement in robustness from DART reaches up to 29.2% on the source-target domain pairs considered.
Watkins, Austin; Ullah, Enayat; Nguyen-Tang, Thanh; Arora, Raman
(38th Conference on Neural Information Processing Systems (NeurIPS 2024))
We study adversarially robust transfer learning, wherein, given labeled data on multiple (source) tasks, the goal is to train a model with small robust error on a previously unseen (target) task. In particular, we consider a multi-task representation learning (MTRL) setting, i.e., we assume that the source and target tasks admit a simple (linear) predictor on top of a shared representation (e.g., the final hidden layer of a deep neural network). In this general setting, we provide rates on the excess adversarial (transfer) risk for Lipschitz losses and smooth nonnegative losses. These rates show that learning a representation using adversarial training on diverse tasks helps protect against inference-time attacks in data-scarce environments. Additionally, we provide novel rates for the single-task setting.
Zhang, Xuhui; Blanchet, Jose H.; Ghosh, Soumyadip; Squillante, Mark S.
(Proceedings of the International Workshop on Artificial Intelligence and Statistics)
We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.
Aakur, Sathyanarayanan N.; Sarkar, Sudeep
(IEEE Transactions on Pattern Analysis and Machine Intelligence)
The commonsense natural language inference (CNLI) tasks aim to select the most likely follow-up statement to a contextual description of ordinary, everyday events and facts. Current approaches to transfer learning of CNLI models across tasks require large amounts of labeled data from the new task. This paper presents a way to reduce this need for additional annotated training data from the new task by leveraging symbolic knowledge bases, such as ConceptNet. We formulate a teacher-student framework for mixed symbolic-neural reasoning, with the large-scale symbolic knowledge base serving as the teacher and a trained CNLI model as the student. This hybrid distillation process involves two steps. The first step is a symbolic reasoning process. Given a collection of unlabeled data, we use an abductive reasoning framework based on Grenander's pattern theory to create weakly labeled data. Pattern theory is an energy-based graphical probabilistic framework for reasoning among random variables with varying dependency structures. In the second step, the weakly labeled data, along with a fraction of the labeled data, is used to transfer-learn the CNLI model to the new task. The goal is to reduce the fraction of labeled data required. We demonstrate the efficacy of our approach by using three publicly available datasets (OpenBookQA, SWAG, and HellaSWAG) and evaluating three CNLI models (BERT, LSTM, and ESIM) that represent different tasks. We show that, on average, we achieve 63% of the top performance of a fully supervised BERT model with no labeled data. With only 1000 labeled samples, we can improve this performance to 72%. Interestingly, without training, the teacher mechanism itself has significant inference power. The pattern theory framework achieves 32.7% accuracy on OpenBookQA, outperforming transformer-based models such as GPT (26.6%), GPT-2 (30.2%), and BERT (27.1%) by a significant margin.
We demonstrate that the framework can be generalized to successfully train neural CNLI models using knowledge distillation under unsupervised and semi-supervised learning settings. Our results show that it outperforms all unsupervised and weakly supervised baselines and some early supervised approaches, while offering competitive performance with fully supervised baselines. Additionally, we show that the abductive learning framework can be adapted for other downstream tasks, such as unsupervised semantic textual similarity, unsupervised sentiment classification, and zero-shot text classification, without significant modification to the framework. Finally, user studies show that the generated interpretations enhance its explainability by providing key insights into its reasoning mechanism.
Mousavi Kalan, Seyed Mohammadreza, Soltanolkotabi, Mahdi, and Avestimehr, A. Salman. Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification. Retrieved from https://par.nsf.gov/biblio/10483603. Web. doi:10.1109/ISIT50566.2022.9834760.
Mousavi Kalan, Seyed Mohammadreza, Soltanolkotabi, Mahdi, & Avestimehr, A. Salman. Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification. Retrieved from https://par.nsf.gov/biblio/10483603. https://doi.org/10.1109/ISIT50566.2022.9834760
@article{osti_10483603,
place = {Country unknown/Code not available},
title = {Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification},
url = {https://par.nsf.gov/biblio/10483603},
DOI = {10.1109/ISIT50566.2022.9834760},
abstractNote = {Modern machine learning models require a large amount of labeled data for training to perform well. A recently emerging paradigm for reducing the reliance of large model training on massive labeled data is to take advantage of abundantly available labeled data from a related source task to boost the performance of the model in a desired target task where there may not be a lot of data available. This approach, which is called transfer learning, has been applied successfully in many application domains. However, despite the fact that many transfer learning algorithms have been developed, the fundamental understanding of "when" and "to what extent" transfer learning can reduce sample complexity is still limited. In this work, we take a step towards foundational understanding of transfer learning by focusing on binary classification with linear models and Gaussian features and develop statistical minimax lower bounds in terms of the number of source and target samples and an appropriate notion of similarity between source and target tasks. To derive this bound, we reduce the transfer learning problem to hypothesis testing via constructing a packing set of source and target parameters by exploiting Gilbert-Varshamov bound, which in turn leads to a lower bound on sample complexity. We also evaluate our theoretical results by experiments on real data sets.},
journal = {},
publisher = {IEEE},
author = {Mousavi Kalan, Seyed Mohammadreza and Soltanolkotabi, Mahdi and Avestimehr, A. Salman},
}