skip to main content

Title: Meta-learning for Mixed Linear Regression
n modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labelled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: When can abundant tasks with small data compensate for lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of k linear regressions, and identify sufficient conditions for such a graceful exchange to hold; there is little loss in sample complexity even when we only have access to small data tasks. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of Omega(k^3/2) medium data tasks each with Omega(k^1/2) examples.
Authors:
; ; ; ;
Award ID(s):
1637360 1740551 1703574
Publication Date:
NSF-PAR ID:
10203962
Journal Name:
International Conference on Machine Learning
Sponsoring Org:
National Science Foundation
More Like this
  1. A common challenge faced in practical supervised learning, such as medical image processing and robotic interactions, is that there are plenty of tasks but each task cannot afford to collect enough labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of k linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing second moment based approaches show that such a trade-off is efficiently achievable, with the help of medium-sized tasks with k^1/2 examples each. However, this algorithm is brittle in two important scenarios. The predictions can be arbitrarily bad even with only a few outliers in the dataset; or even if the medium-sized tasks are slightly smaller with. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves an optimal accuracy. This is followed by a sum-of-squares algorithm to exploit the information from higher order moments. Together, this approach is robust against outliers and achieves a graceful statisticalmore »trade-off.« less
  2. In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labelled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: When can abundant tasks with small data compensate for lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of 𝑘 linear regressions, and identify sufficient conditions for such a graceful exchange to hold; there is little loss in sample complexity even when we only have access to small data tasks. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of ΊĖƒ (𝑘3/2) medium data tasks each with ΊĖƒ (𝑘1/2) examples.
  3. Feature representations from pre-trained deep neural networks have been known to exhibit excellent generalization and utility across a variety of related tasks. Fine-tuning is by far the simplest and most widely used approach that seeks to exploit and adapt these feature representations to novel tasks with limited data. Despite the effectiveness of fine-tuning, itis often sub-optimal and requires very careful optimization to prevent severe over-fitting to small datasets. The problem of sub-optimality and over-fitting, is due in part to the large number of parameters used in a typical deep convolutional neural network. To address these problems, we propose a simple yet effective regularization method for fine-tuning pre-trained deep networks for the task of k-shot learning. To prevent overfitting, our key strategy is to cluster the model parameters while ensuring intra-cluster similarity and inter-cluster diversity of the parameters, effectively regularizing the dimensionality of the parameter search space. In particular, we identify groups of neurons within each layer of a deep network that shares similar activation patterns. When the network is to be fine-tuned for a classification task using only k examples, we propagate a single gradient to all of the neuron parameters that belong to the same group. The grouping ofmore »neurons is non-trivial as neuron activations depend on the distribution of the input data. To efficiently search for optimal groupings conditioned on the input data, we propose a reinforcement learning search strategy using recurrent networks to learn the optimal group assignments for each network layer. Experimental results show that our method can be easily applied to several popular convolutional neural networks and improve upon other state-of-the-art fine-tuning based k-shot learning strategies by more than10%« less
  4. Braverman, Mark (Ed.)
    We present a framework for speeding up the time it takes to sample from discrete distributions $\mu$ defined over subsets of size $k$ of a ground set of $n$ elements, in the regime where $k$ is much smaller than $n$. We show that if one has access to estimates of marginals $\mathbb{P}_{S\sim \mu}[i\in S]$, then the task of sampling from $\mu$ can be reduced to sampling from related distributions $\nu$ supported on size $k$ subsets of a ground set of only $n^{1-\alpha}\cdot \operatorname{poly}(k)$ elements. Here, $1/\alpha\in [1, k]$ is the parameter of entropic independence for $\mu$. Further, our algorithm only requires sparsified distributions $\nu$ that are obtained by applying a sparse (mostly $0$) external field to $\mu$, an operation that for many distributions $\mu$ of interest, retains algorithmic tractability of sampling from $\nu$. This phenomenon, which we dub domain sparsification, allows us to pay a one-time cost of estimating the marginals of $\mu$, and in return reduce the amortized cost needed to produce many samples from the distribution $\mu$, as is often needed in upstream tasks such as counting and inference. For a wide range of distributions where $\alpha=\Omega(1)$, our result reduces the domain size, and as a corollary, themore »cost-per-sample, by a $\operatorname{poly}(n)$ factor. Examples include monomers in a monomer-dimer system, non-symmetric determinantal point processes, and partition-constrained Strongly Rayleigh measures. Our work significantly extends the reach of prior work of Anari and Derezi\'nski who obtained domain sparsification for distributions with a log-concave generating polynomial (corresponding to $\alpha=1$). As a corollary of our new analysis techniques, we also obtain a less stringent requirement on the accuracy of marginal estimates even for the case of log-concave polynomials; roughly speaking, we show that constant-factor approximation is enough for domain sparsification, improving over $O(1/k)$ relative error established in prior work.« less
  5. Abstract Background

    Natural language processing (NLP) tasks in the health domain often deal with limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to a task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have been proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at such a large scale that is unlikely to obtain in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.

    Method

    We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse andmore »dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the amount of training examples per disease from small to large, and measured the classification performance in macro-averaged$$F_{1}$$F1score.

    Results

    On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features.

    Conclusion

    As long as the amount of training documents is not extremely few, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features shall be considered.

    « less