Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

VapnikChervonenkis (VC) theory has so far been unable to explain the small generalization error of overparametrized neural networks. Indeed, existing applications of VC theory to large networks obtain upper bounds on VC dimension that are proportional to the number of weights, and for a large class of networks, these upper bound are known to be tight. In this work, we focus on a subclass of partially quantized networks that we refer to as hyperplane arrangement neural networks (HANNs). Using a sample compression analysis, we show that HANNs can have VC dimension significantly smaller than the number of weights, while being highly expressive. In particular, empirical risk minimization over HANNs in the overparametrized regime achieves the minimax rate for classification with Lipschitz posterior class probability. We further demonstrate the expressivity of HANNs empirically. On a panel of 121 UCI datasets, overparametrized HANNs match the performance of stateoftheart fullprecision models.

Recent empirical evidence suggests that the WestonWatkins support vector machine is among the best performing multiclass extensions of the binary SVM. Current stateoftheart solvers repeatedly solve a particular subproblem approximately using an iterative strategy. In this work, we propose an algorithm that solves the subproblem exactly using a novel reparametrization of the WestonWatkins dual problem. For linear WWSVMs, our solver shows significant speedup over the stateoftheart solver when the number of classes is large. Our exact subproblem solver also allows us to prove linear convergence of the overall solver.

Recent empirical evidence suggests that the WestonWatkins support vector machine is among the best performing multiclass extensions of the binary SVM. Current stateoftheart solvers repeatedly solve a particular subproblem approximately using an iterative strategy. In this work, we propose an algorithm that solves the subproblem exactly using a novel reparametrization of the WestonWatkins dual problem. For linear WWSVMs, our solver shows significant speedup over the stateoftheart solver when the number of classes is large. Our exact subproblem solver also allows us to prove linear convergence of the overall solver.

In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known to the learner. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG, and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analysis of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning theoretic foundations of domain generalization, building on our earlier conference paper where the problem of DG was introduced (Blanchard et al., 2011). We present two formal models of data generation, corresponding notions of risk, and distributionfree generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three realworld data sets.more »

In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known to the learner. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG, and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analysis of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning theoretic foundations of domain generalization, building on our earlier conference paper where the problem of DG was introduced (Blanchard et al., 2011). We present two formal models of data generation, corresponding notions of risk, and distributionfree generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three realworld data sets.

Learning from label proportions (LLP) is a weakly supervised setting for classification in which unlabeled training instances are grouped into bags, and each bag is annotated with the proportion of each class occurring in that bag. Prior work on LLP has yet to establish a consistent learning procedure, nor does there exist a theoretically justified, general purpose training criterion. In this work we address these two issues by posing LLP in terms of mutual contamination models (MCMs), which have recently been applied successfully to study various other weak supervision settings. In the process, we establish several novel technical results for MCMs, including unbiased losses and generalization error bounds under noniid sampling plans. We also point out the limitations of a common experimental setting for LLP, and propose a new one based on our MCM framework.

Learning from label proportions (LLP) is a weakly supervised setting for classification in whichunlabeled training instances are grouped into bags, and each bag is annotated with the proportion ofeach class occurring in that bag. Prior work on LLP has yet to establish a consistent learning procedure,nor does there exist a theoretically justified, general purpose training criterion. In this work we addressthese two issues by posing LLP in terms of mutual contamination models (MCMs), which have recentlybeen applied successfully to study various other weak supervision settings. In the process, we establishseveral novel technical results for MCMs, including unbiased losses and generalization error bounds undernoniid sampling plans. We also point out the limitations ofa common experimental setting for LLP,and propose a new one based on our MCM framework.

Multiclass extensions of the support vector machine (SVM) have been formulated in a variety of ways.A recent empirical comparison of nine such formulations [1]recommends the variant proposed by Westonand Watkins (WW), despite the fact that the WWhinge loss is not calibrated with respect to the 01 loss.In this work we introduce a novel discrete loss function for multiclass classification, theordered partitionloss, and prove that the WWhinge lossiscalibrated with respect to this loss. We also argue that theordered partition loss is maximally informative among discrete losses satisfying this property. Finally,we apply our theory to justify the empirical observation made by Doˇgan et al. [1] that the WWSVMcan work well even under massive label noise, a challenging setting for multiclass SVMs.

Multiclass extensions of the support vector machine (SVM) have been formulated in a variety of ways. A recent empirical comparison of nine such formulations [1] recommends the variant proposed by Weston and Watkins (WW), despite the fact that the WWhinge loss is not calibrated with respect to the 01 loss. In this work we introduce a novel discrete loss function for multiclass classification, the ordered partition loss, and prove that the WWhinge loss is calibrated with respect to this loss. We also argue that the ordered partition loss is minimally emblematic among discrete losses satisfying this property. Finally, we apply our theory to justify the empirical observation made by Doˇgan et al. [1] that the WWSVM can work well even under massive label noise, a challenging setting for multiclass SVMs.

Adversarially robust classification seeks a classifier that is insensitive to adversarial perturbations of test patterns. This problem is often formulated via a minimax objective, where the target loss is the worstcase value of the 01 loss subject to a bound on the size of perturbation. Recent work has proposed convex surrogates for the adversarial 01 loss, in an effort to make optimization more tractable. In this work, we consider the question of which surrogate losses are calibrated with respect to the adversarial 01 loss, meaning that minimization of the former implies minimization of the latter. We show that no convex surrogate loss is calibrated with respect to the adversarial 01 loss when restricted to the class of linear models. We further introduce a class of nonconvex losses and offer necessary and sufficient conditions for losses in this class to be calibrated. Keywords: surrogate loss, classification calibration, adversarial robustness