skip to main content

Title: A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality
We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the International Workshop on Artificial Intelligence and Statistics
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii) in overcoming a spurious correlation present in the source domain of the Waterbirds dataset.

    more » « less
  2. null (Ed.)
    Transfer learning has emerged as a powerful technique for improving the performance of machine learning models on new domains where labeled training data may be scarce. In this approach a model trained for a source task, where plenty of labeled training data is available, is used as a starting point for training a model on a related target task with only few labeled training data. Despite recent empirical success of transfer learning approaches, the benefits and fundamental limits of transfer learning are poorly understood. In this paper we develop a statistical minimax framework to characterize the fundamental limits of transfer learning in the context of regression with linear and one-hidden layer neural network models. Specifically, we derive a lower-bound for the target generalization error achievable by any algorithm as a function of the number of labeled source and target data as well as appropriate notions of similarity between the source and target tasks. Our lower bound provides new insights into the benefits and limitations of transfer learning. We further corroborate our theoretical finding with various experiments. 
    more » « less
  3. Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample exhibits conditional shift: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift.Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of any domain adaptation method that attempts to learn invariant representations.Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms. 
    more » « less
  4. Abstract Background

    Advances in imagery at atomic and near-atomic resolution, such as cryogenic electron microscopy (cryo-EM), have led to an influx of high resolution images of proteins and other macromolecular structures to data banks worldwide. Producing a protein structure from the discrete voxel grid data of cryo-EM maps involves interpolation into the continuous spatial domain. We present a novel data format called the neural cryo-EM map, which is formed from a set of neural networks that accurately parameterize cryo-EM maps and provide native, spatially continuous data for density and gradient. As a case study of this data format, we create graph-based interpretations of high resolution experimental cryo-EM maps.


    Normalized cryo-EM map values interpolated using the non-linear neural cryo-EM format are more accurate, consistently scoring less than 0.01 mean absolute error, than a conventional tri-linear interpolation, which scores up to 0.12 mean absolute error. Our graph-based interpretations of 115 experimental cryo-EM maps from 1.15 to 4.0 Å resolution provide high coverage of the underlying amino acid residue locations, while accuracy of nodes is correlated with resolution. The nodes of graphs created from atomic resolution maps (higher than 1.6 Å) provide greater than 99% residue coverage as well as 85% full atomic coverage with a mean of 0.19 Å root mean squared deviation. Other graphs have a mean 84% residue coverage with less specificity of the nodes due to experimental noise and differences of density context at lower resolutions.


    The fully continuous and differentiable nature of the neural cryo-EM map enables the adaptation of the voxel data to alternative data formats, such as a graph that characterizes the atomic locations of the underlying protein or macromolecular structure. Graphs created from atomic resolution maps are superior in finding atom locations and may serve as input to predictive residue classification and structure segmentation methods. This work may be generalized to transform any 3D grid-based data format into non-linear, continuous, and differentiable format for downstream geometric deep learning applications.

    more » « less
  5. Massoulie, Laurent (Ed.)
    Spreading processes on graphs arise in a host of application domains, from the study of online social networks to viral marketing to epidemiology. Various discrete-time probabilistic models for spreading processes have been proposed. These are used for downstream statistical estimation and prediction problems, often involving messages or other information that is transmitted along with infections caused by the process. These models generally model cascade behavior at a small time scale but are insufficiently flexible to model cascades that exhibit intermittent behavior governed by multiple scales. We argue that the presence of such time scales that are unaccounted for by a cascade model can result in degradation of performance of models on downstream statistical and time-sensitive optimization tasks. To address these issues, we formulate a model that incorporates multiple temporal scales of cascade behavior. This model is parameterized by a \emph{clock}, which encodes the times at which sessions of cascade activity start. These sessions are themselves governed by a small-scale cascade model, such as the discretized independent cascade (IC) model. Estimation of the multiscale cascade model parameters leads to the problem of \emph{clock estimation} in terms of a natural distortion measure that we formulate. Our framework is inspired by the optimization problem posed by DiTursi et al, 2017, which can be seen as providing one possible estimator (a maximum-proxy-likelihood estimator) for the parameters of our generative model. We give a clock estimation algorithm, which we call FastClock, that runs in linear time in the size of its input and is provably statistically accurate for a broad range of model parameters when cascades are generated from any spreading process model with well-concentrated session infection set sizes and when the underlying graph is at least in the semi-sparse regime. We exemplify our algorithm for the case where the small-scale model is the discretized independent cascade process and extend substantially to processes whose infection set sizes satisfy a general martingale difference property. We further evaluate the performance of FastClock empirically in comparison to the state of the art estimator from DiTursi et al, 2017. We find that in a broad parameter range on synthetic networks and on a real network, our algorithm substantially outperforms that algorithm in terms of both running time and accuracy. In all cases, our algorithm's running time is asymptotically lower than that of the baseline. 
    more » « less