Structured prediction of treeshaped objects is heavily studied under the name of
syntactic dependency parsing. Current practice based on maximum likelihood or
margin is either agnostic to or inconsistent with the evaluation loss. Risk minimization alleviates the discrepancy between training and test objectives but typically
induces a nonconvex problem. These approaches adopt explicit regularization
to combat overfitting without probabilistic interpretation. We propose a momentbased distributionally robust optimization approach for tree structured prediction,
where the worstcase expected loss over a set of distributions within bounded
moment divergence from the empirical distribution is minimized. We develop
efficient algorithms for arborescences and other variants of trees. We derive Fisher
consistency, convergence rates and generalization bounds for our proposed method.
We evaluate its empirical effectiveness on dependency parsing benchmarks.
more »
« less
Measuring Generalization with Optimal Transport
Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop marginbased generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at https://github.com/chingyaoc/kVMargin.
more »
« less
 Award ID(s):
 1741341
 NSFPAR ID:
 10380225
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixupoptimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions for Mixup training to also minimize the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.more » « less

Various approaches have been developed to upper bound the generalization error of a supervised learning algorithm. However, existing bounds are often loose and lack of guarantees. As a result, they may fail to characterize the exact generalization ability of a learning algorithm.Our main contribution is an exact characterization of the expected generalization error of the wellknown Gibbs algorithm (a.k.a. Gibbs posterior) using symmetrized KL information between the input training samples and the output hypothesis. Our result can be applied to tighten existing expected generalization error and PACBayesian bounds. Our approach is versatile, as it also characterizes the generalization error of the Gibbs algorithm with datadependent regularizer and that of the Gibbs algorithm in the asymptotic regime, where it converges to the empirical risk minimization algorithm. Of particular relevance, our results highlight the role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.more » « less

We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss ℓ can control the test error under all Moreau envelopes of the loss ℓ . We use our finitesample bound to directly recover the “optimistic rate” of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal ℓ2norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multiindex model. The same argument can analyze noisy interpolation of maxmargin classifiers through the squared hinge loss, and establishes consistency results in spikedcovariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s wellknown contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, nonnegative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a nonasymptotic Moreau envelope theorymore » « less

Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness. DA artificially expands the training set by applying random noise, rotations, crops, or even adversarial perturbations to the input data. Although DA is widely used, its capacity to provably improve robustness is not fully understood. In this work, we analyze the robustness that DA begets by quantifying the margin that DA enforces on empirical risk minimizers. We first focus on linear separators, and then a class of nonlinear models whose labeling is constant within small convex hulls of data points. We present lower bounds on the number of augmented data points required for nonzero margin, and show that commonly used DA techniques may only introduce significant margin after adding exponentially many points to the data set.more » « less