Title: k-Mixup Regularization for Deep Learning via Optimal Transport
Mixup is a popular regularization technique for training deep neural networks that improves generalization and increases robustness to certain distribution shifts. It perturbs input training data in the direction of other randomly-chosen instances in the training set. To better leverage the structure of the data, we extend mixup in a simple, broadly applicable way to k-mixup, which perturbs k-batches of training points in the direction of other k-batches. The perturbation is done with displacement interpolation, i.e. interpolation under the Wasserstein metric. We demonstrate theoretically and in simulations that k-mixup preserves cluster and manifold structures, and we extend theory studying the efficacy of standard mixup to the k-mixup case. Our empirical results show that training with k-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of k-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, k-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM.
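To make the construction concrete, below is a minimal sketch of the k-mixup step described above (not the authors' implementation; the function name and the Beta-distributed mixing coefficient are illustrative). Two k-batches are matched by solving an optimal assignment under squared Euclidean cost, which for two uniform empirical measures of equal size realizes displacement interpolation, and the matched pairs are then convexly combined. Setting k = 1 recovers standard mixup with a random pairing.

```python
# Minimal k-mixup sketch (illustrative; not the authors' code).
import numpy as np
from scipy.optimize import linear_sum_assignment

def k_mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Mix one k-batch toward another via displacement interpolation.

    x_a, x_b: (k, d) inputs; y_a, y_b: (k, c) one-hot labels.
    """
    rng = rng or np.random.default_rng()
    # Squared Euclidean transport cost between the two k-batches.
    cost = ((x_a[:, None, :] - x_b[None, :, :]) ** 2).sum(axis=-1)
    # With uniform weights and equal batch sizes, optimal transport under
    # this cost reduces to an optimal assignment (a permutation).
    rows, cols = linear_sum_assignment(cost)
    lam = rng.beta(alpha, alpha)
    # Displacement interpolation: move each point part-way toward its match.
    x_mix = lam * x_a[rows] + (1.0 - lam) * x_b[cols]
    y_mix = lam * y_a[rows] + (1.0 - lam) * y_b[cols]
    return x_mix, y_mix
```

In a training loop, each minibatch would be split into k-point sub-batches and mixed against other sub-batches this way before the usual forward and backward pass.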
Award ID(s):
1838071
NSF-PAR ID:
10483956
Publisher / Repository:
OpenReview
Journal Name:
Transactions on Machine Learning Research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixup-optimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions under which Mixup training minimizes the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.
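For reference, here is a minimal sketch of the convex-combination construction this abstract refers to (names are illustrative; one common instantiation draws the mixing coefficient from a Beta distribution and pairs each example with a randomly permuted partner):

```python
# Minimal mixup sketch: convex combinations of inputs and one-hot labels.
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """x: (n, d) inputs; y: (n, c) one-hot labels. Returns a mixed batch."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient
    perm = rng.permutation(len(x))          # random partner for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```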
  2. Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regard to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.
  3. Mixup is a popular data augmentation technique based on taking convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the robustness and the generalization of the trained model. However, it is not well understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately minimizing an upper bound of the adversarial loss. This explains why models obtained by Mixup training exhibit robustness to several kinds of adversarial attacks such as the Fast Gradient Sign Method (FGSM). For generalization, we prove that Mixup augmentation corresponds to a specific type of data-adaptive regularization which reduces overfitting. Our analysis provides new insights and a framework to understand Mixup.
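For context, here is a brief PyTorch sketch of the FGSM attack referenced above (the function name and the epsilon value are illustrative): the input is perturbed by epsilon times the sign of the input gradient of the loss.

```python
# Minimal FGSM sketch (illustrative): one-step sign-gradient perturbation.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """Return adversarial examples x + epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```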
  4. Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. 
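As one way to read "mixup on self-augmented data" (an assumption-laden illustration, not the DP-Mix_Self algorithm itself): mixed inputs can be formed from augmented copies of a single training example, so each mixed sample still depends on only one record, which is the kind of bounded per-example contribution that differentially private training assumes.

```python
# Assumption-laden sketch of mixing self-augmented copies of ONE example,
# so every mixed input is a function of a single training record.
# This is NOT the DP-Mix_Self algorithm, only an illustration of the idea.
import numpy as np

def self_mix(example, augment, n_copies=4, alpha=1.0, rng=None):
    """example: (d,) input; augment: callable producing a random augmentation."""
    rng = rng or np.random.default_rng()
    copies = np.stack([augment(example) for _ in range(n_copies)])
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n_copies)
    # Mix each augmented copy with another copy of the same example;
    # the label is unchanged because all copies share one label.
    return lam * copies + (1.0 - lam) * copies[perm]
```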
  5. Purpose

    To develop a scan‐specific model that estimates and corrects k‐space errors made when reconstructing accelerated MRI data.

    Methods

    Scan‐specific artifact reduction in k‐space (SPARK) trains a convolutional‐neural‐network to estimate and correct k‐space errors made by an input reconstruction technique by back‐propagating from the mean‐squared‐error loss between an auto‐calibration signal (ACS) and the input technique’s reconstructed ACS. First, SPARK is applied to generalized autocalibrating partially parallel acquisitions (GRAPPA) and demonstrates improved robustness over other scan‐specific models, such as robust artificial‐neural‐networks for k‐space interpolation (RAKI) and residual‐RAKI. Subsequent experiments demonstrate that SPARK synergizes with residual‐RAKI to improve reconstruction performance. SPARK also improves reconstruction quality when applied to advanced acquisition and reconstruction techniques like 2D virtual coil (VC‐) GRAPPA, 2D LORAKS, 3D GRAPPA without an integrated ACS region, and 2D/3D wave‐encoded imaging. A minimal code sketch of this scan‐specific error‐correction idea is given after the Conclusion below.

    Results

    SPARK yields SSIM improvement and a 1.5–2× root mean squared error (RMSE) reduction when applied to GRAPPA, and improves robustness to ACS size across acceleration rates in comparison to other scan‐specific techniques. When applied to advanced reconstruction techniques such as residual‐RAKI, 2D VC‐GRAPPA, and LORAKS, SPARK achieves up to 20% RMSE improvement. Applied to 3D GRAPPA without a fully sampled ACS region, SPARK likewise reduces RMSE by ~2× and improves SSIM and perceived image quality. Finally, SPARK synergizes with non‐Cartesian 2D and 3D wave‐encoded imaging, reducing RMSE by 20–25% and providing qualitative improvements.

    Conclusion

    SPARK synergizes with physics‐based acquisition and reconstruction techniques to improve accelerated MRI by training scan‐specific models to estimate and correct reconstruction errors in k‐space.

     
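Below is a minimal, illustrative sketch of the scan‐specific training loop described in the Methods above. It is not the SPARK implementation: the tiny network, the real/imaginary channel layout, and the final correction step are assumptions made for illustration. A small CNN is fit to reproduce the k‐space error on the ACS region of an input reconstruction, and the learned correction is then added back to the full reconstructed k‐space.

```python
# Illustrative scan-specific k-space error-correction sketch (not SPARK itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KSpaceErrorNet(nn.Module):
    """Tiny CNN mapping reconstructed k-space (real/imag channels) to an error estimate."""
    def __init__(self, channels: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, k):
        return self.net(k)

def fit_scan_specific(recon_acs, measured_acs, steps=500, lr=1e-3):
    """recon_acs, measured_acs: (1, 2, h, w) tensors holding the ACS region of
    the input reconstruction and the acquired calibration data, respectively."""
    model = KSpaceErrorNet()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    target_error = measured_acs - recon_acs   # the error the CNN should reproduce
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(recon_acs), target_error)
        loss.backward()
        opt.step()
    return model

# Correction step: add the predicted error to the full reconstructed k-space.
# model = fit_scan_specific(recon_acs, measured_acs)
# corrected_k = recon_k + model(recon_k)
```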