

Title: Transfer Learning with Kernel Methods
Abstract

Transfer learning refers to the process of adapting a model trained on a source task to a target task. While kernel methods are conceptually and computationally simple models that are competitive on a variety of tasks, it has been unclear how to develop scalable kernel-based transfer learning methods across general source and target tasks with possibly differing label dimensions. In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. For both applications, we identify simple scaling laws that characterize the performance of transfer-learned kernels as a function of the number of target examples. We explain this phenomenon in a simplified linear setting, where we are able to derive the exact scaling laws.
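The abstract describes the method only at a high level. As a rough illustration of the "project and translate" idea, the sketch below uses kernel ridge regression with an RBF kernel as the base learner; the kernel choice, the regularization, the synthetic data shapes, and the exact way the two steps are combined are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_krr(X, Y, gamma=1.0, ridge=1e-3):
    """Kernel ridge regression; returns a predict(X_new) -> Y_hat function."""
    K = rbf_kernel(X, X, gamma) + ridge * np.eye(len(X))
    alpha = np.linalg.solve(K, Y)          # dual coefficients
    return lambda X_new: rbf_kernel(X_new, X, gamma) @ alpha

# --- Source task: plenty of labeled data (hypothetical shapes) ---
rng = np.random.default_rng(0)
Xs, Ys = rng.normal(size=(500, 10)), rng.normal(size=(500, 3))   # 3 source outputs
f_source = fit_krr(Xs, Ys)

# --- Target task: few labels, different label dimension (here 2) ---
Xt, Yt = rng.normal(size=(40, 10)), rng.normal(size=(40, 2))

# Projection: treat the source model's outputs as features for the target task
# and fit a second kernel model on top of them.
Zt = f_source(Xt)                          # n_target x 3 projected features
g_project = fit_krr(Zt, Yt)

# Translation: additively correct the projected predictor with a kernel model
# trained on the residuals in the original input space.
residual = Yt - g_project(f_source(Xt))
h_translate = fit_krr(Xt, residual)

def f_target(X_new):
    """Transferred predictor: projected source model plus learned correction."""
    return g_project(f_source(X_new)) + h_translate(X_new)

print(f_target(Xt[:5]).shape)              # (5, 2)
```

In this toy setup the source model's outputs act as features for the target task (projection), and a second kernel model fitted to the residuals corrects the composite predictor in input space (translation); the target label dimension is allowed to differ from the source's, matching the setting of possibly differing label dimensions described in the abstract.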

 
Award ID(s): 1651995
NSF-PAR ID: 10460192
Author(s) / Creator(s):
Publisher / Repository: Nature Publishing Group
Date Published:
Journal Name: Nature Communications
Volume: 14
Issue: 1
ISSN: 2041-1723
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Dynamic transfer learning refers to the knowledge transfer from a static source task with adequate label information to a dynamic target task with little or no label information. However, most existing theoretical studies and practical algorithms of dynamic transfer learning assume that the target task is continuously evolving over time. This strong assumption is often violated in real-world applications, e.g., the target distribution suddenly changes at some time stamp. To solve this problem, in this paper, we propose a novel meta-learning framework L2S based on a progressive meta-task scheduler for dynamic transfer learning. The crucial idea of L2S is to incrementally learn to schedule the meta-pairs of tasks and then learn the optimal model initialization from those meta-pairs of tasks for fast adaptation to the newest target task. The effectiveness of our L2S framework is verified both theoretically and empirically.
  2. We address the challenge of getting efficient yet accurate recognition systems with limited labels. While recognition models improve with model size and amount of data, many specialized applications of computer vision have severe resource constraints both during training and inference. Transfer learning is an effective solution for training with few labels, though often at the expense of a computationally costly fine-tuning of large base models. We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation from a set of diverse source models. Initially, we show how to use task similarity metrics to select a single suitable source model to distill from, and that a good selection process is imperative for good downstream performance of a target model. We dub this approach DISTILLNEAREST. Though effective, DISTILLNEAREST assumes a single source model matches the target task, which is not always the case. To alleviate this, we propose a weighted multi-source distillation method to distill multiple source models trained on different domains, weighted by their relevance for the target task, into a single efficient model (named DISTILLWEIGHTED). Our methods need no access to source data, and merely need features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both the DISTILLNEAREST and DISTILLWEIGHTED approaches outperform both transfer learning from strong ImageNet initializations and state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks, our multi-source method outperforms the baselines by 5.6 and 4.5 percentage points, respectively.
  3. Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets, demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.
  4. The advent of deep learning algorithms for mobile devices and sensors has led to a dramatic expansion in the availability and number of systems trained on a wide range of machine learning tasks, creating a host of opportunities and challenges in the realm of transfer learning. Currently, most transfer learning methods require some kind of control over the systems learned, either by enforcing constraints during the source training, or through the use of a joint optimization objective between tasks that requires all data be co-located for training. However, for practical, privacy, or other reasons, in a variety of applications we may have no control over the individual source task training, nor access to source training samples. Instead we only have access to features pre-trained on such data as the output of "black-boxes." For such scenarios, we consider the multi-source learning problem of training a classifier using an ensemble of pre-trained neural networks for a set of classes that have not been observed by any of the source networks, and for which we have very few training samples. We show that by using these distributed networks as feature extractors, we can train an effective classifier in a computationally efficient manner using tools from (nonlinear) maximal correlation analysis. In particular, we develop a method we refer to as maximal correlation weighting (MCW) to build the required target classifier from an appropriate weighting of the feature functions from the source networks. We illustrate the effectiveness of the resulting classifier on datasets derived from the CIFAR-100, Stanford Dogs, and Tiny ImageNet datasets, and, in addition, use the methodology to characterize the relative value of different source tasks in learning a target task.
  5. Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e., of the form $f^\star(x) = g(Ux)$ where $U : \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n \asymp d^2 r + d r^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation $U$ but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
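A toy illustration of the kind of source weighting described in item 4 above: frozen "black-box" feature extractors are combined through weights derived from how strongly each source's features correlate with the few-shot target labels. The random feature maps, the correlation-based relevance score, and the nearest-class-mean classifier below are simplified stand-ins chosen for brevity, not the maximal correlation weighting computation from that paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical few-shot target set: 5 classes, 10 examples each, raw dimension 64.
n_class, n_per, d_raw = 5, 10, 64
X = rng.normal(size=(n_class * n_per, d_raw))
y = np.repeat(np.arange(n_class), n_per)
Y = np.eye(n_class)[y]                       # one-hot labels

# Stand-ins for frozen "black-box" source networks: random linear feature maps.
source_extractors = [rng.normal(size=(d_raw, 16)) for _ in range(3)]
feats = [X @ W for W in source_extractors]   # list of (n, 16) feature blocks

def relevance(F, Y):
    """Crude relevance score: mean squared cross-covariance between centered
    features and one-hot labels (a simplified proxy, not maximal correlation)."""
    Fc, Yc = F - F.mean(0), Y - Y.mean(0)
    C = Fc.T @ Yc / len(F)
    return float((C ** 2).mean())

w = np.array([relevance(F, Y) for F in feats])
w /= w.sum()                                  # normalized source weights

# Nearest-class-mean classifier on the concatenated, weighted source features.
Z = np.concatenate([wi * F for wi, F in zip(w, feats)], axis=1)
class_means = np.stack([Z[y == c].mean(0) for c in range(n_class)])

def predict(X_new):
    Zn = np.concatenate([wi * (X_new @ W) for wi, W in zip(w, source_extractors)], axis=1)
    d2 = ((Zn[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

print((predict(X) == y).mean())               # accuracy of the toy classifier on its own few-shot set
```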
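For readability, the two sample-complexity regimes quoted in item 5 above can be written side by side, using the abstract's notation (a degree-$p$ polynomial $f^\star(x) = g(Ux)$ depending on $r \ll d$ relevant directions):

```latex
\[
\underbrace{\,n \asymp d^{\,p}\,}_{\text{kernel regime}}
\qquad\text{vs.}\qquad
\underbrace{\,n \asymp d^{2}r + d\,r^{p}\,}_{\text{gradient descent on a two-layer network}}
\]
```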