Title: Calibrated Surrogate Losses for Adversarially Robust Classification
Adversarially robust classification seeks a classifier that is insensitive to adversarial perturbations of test patterns. This problem is often formulated via a minimax objective, where the target loss is the worst-case value of the 0-1 loss subject to a bound on the size of the perturbation. Recent work has proposed convex surrogates for the adversarial 0-1 loss, in an effort to make optimization more tractable. In this work, we consider the question of which surrogate losses are calibrated with respect to the adversarial 0-1 loss, meaning that minimization of the former implies minimization of the latter. We show that no convex surrogate loss is calibrated with respect to the adversarial 0-1 loss when restricted to the class of linear models. We further introduce a class of nonconvex losses and offer necessary and sufficient conditions for losses in this class to be calibrated.
Keywords: surrogate loss, classification calibration, adversarial robustness
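To fix ideas, the adversarial 0-1 loss and the calibration requirement described above can be sketched as follows (the notation is ours and may differ from the paper's): for a pattern $$x$$, label $$y \in \{-1,+1\}$$, score function $$f$$, and perturbation budget $$\epsilon$$,

$$\ell_\epsilon(f, x, y) = \sup_{\|x' - x\| \le \epsilon} \mathbf{1}\{\, y\, f(x') \le 0 \,\},$$

and a surrogate loss $$\phi$$ is calibrated with respect to $$\ell_\epsilon$$ over a model class if driving the $$\phi$$-risk to its infimum over that class forces the $$\ell_\epsilon$$-risk to its infimum as well.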
Award ID(s): 1838179
NSF-PAR ID: 10206120
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: Conference on Learning Theory, 2020
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Multiclass extensions of the support vector machine (SVM) have been formulated in a variety of ways. A recent empirical comparison of nine such formulations [1] recommends the variant proposed by Weston and Watkins (WW), despite the fact that the WW-hinge loss is not calibrated with respect to the 0-1 loss. In this work we introduce a novel discrete loss function for multiclass classification, the ordered partition loss, and prove that the WW-hinge loss is calibrated with respect to this loss. We also argue that the ordered partition loss is maximally informative among discrete losses satisfying this property. Finally, we apply our theory to justify the empirical observation made by Doğan et al. [1] that the WW-SVM can work well even under massive label noise, a challenging setting for multiclass SVMs.
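For reference, a commonly stated form of the Weston-Watkins (WW) hinge loss mentioned above (a sketch in our own notation, not a quotation from the paper): for $$k$$ classes, score vector $$f = (f_1, \dots, f_k)$$, and label $$y$$,

$$\ell_{\mathrm{WW}}(f, y) = \sum_{j \ne y} \max\big(0,\ 1 - (f_y - f_j)\big),$$

that is, a separate hinge penalty on the margin between the true-class score and every other class score.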
  2. This paper studies several solution paths of sparse quadratic minimization problems as a function of the weighting parameter in the bi-objective of estimation loss versus solution sparsity. Three such paths are considered: the “$$\ell_0$$-path”, where the discontinuous $$\ell_0$$-function provides the exact sparsity count; the “$$\ell_1$$-path”, where the $$\ell_1$$-function provides a convex surrogate of the sparsity count; and the “capped $$\ell_1$$-path”, where the nonconvex, nondifferentiable capped $$\ell_1$$-function aims to enhance the $$\ell_1$$-approximation. Serving different purposes, these three formulations differ from one another both analytically and computationally. Our results deepen the understanding of (old and new) properties of the associated paths, highlight the pros, cons, and tradeoffs of these sparse optimization models, and provide numerical evidence to support the practical superiority of the capped $$\ell_1$$-path. Our study of the capped $$\ell_1$$-path is interesting in its own right, as the path pertains to computable directionally stationary (that is, strongly locally minimizing in this context, as opposed to globally optimal) solutions of a parametric nonconvex nondifferentiable optimization problem. Motivated by classical parametric quadratic programming theory and reinforced by modern statistical learning studies, both of which cast an exponential perspective on fully describing such solution paths, we also address the question of whether some of them can be traced in strongly polynomial time in the problem dimensions. A major conclusion of this paper is that a path of directional stationary solutions of the capped $$\ell_1$$-regularized problem offers interesting theoretical properties and a practical compromise between the $$\ell_0$$-path and the $$\ell_1$$-path. Indeed, while the $$\ell_0$$-path is computationally prohibitive, being handicapped by the repeated solution of mixed-integer nonlinear programs, the quality of the $$\ell_1$$-path, in terms of the two criteria (loss and sparsity) in the estimation objective, is inferior to the capped $$\ell_1$$-path; the latter can be obtained efficiently by a parametric pivoting-like scheme supplemented by an algorithm that takes advantage of the Z-matrix structure of the loss function.
  3. We consider the fundamental problem of ReLU regression, where the goal is to output the best-fitting ReLU with respect to the square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem, assuming the underlying distribution satisfies some weak concentration and anti-concentration conditions (a class that includes, for example, all log-concave distributions). This solves the main open problem of Goel et al., who proved hardness results for any exact algorithm for ReLU regression (up to an additive ϵ). Using more sophisticated techniques, we can improve our results and obtain a polynomial-time approximation scheme for any subgaussian distribution. Given the aforementioned hardness results, these guarantees cannot be substantially improved. Our main insight is a new characterization of surrogate losses for nonconvex activations. While prior work had established the existence of convex surrogates for monotone activations, we show that properties of the underlying distribution actually induce strong convexity for the loss, allowing us to relate the global minimum to the activation's Chow parameters.
  4. Many modern machine learning tasks require models with high tail performance, i.e., high performance over the worst-off samples in the dataset. This problem has been widely studied in fields such as algorithmic fairness, class imbalance, and risk-sensitive decision making. A popular approach to maximizing a model's tail performance is to minimize the CVaR (Conditional Value at Risk) loss, which computes the average risk over the tail of the loss. However, for classification tasks where models are evaluated by the 0/1 loss, we show that if the classifiers are deterministic, then the minimizer of the average 0/1 loss also minimizes the CVaR 0/1 loss, suggesting that CVaR loss minimization is not helpful without additional assumptions. We circumvent this negative result by minimizing the CVaR loss over randomized classifiers, for which the minimizers of the average 0/1 loss and the CVaR 0/1 loss are no longer the same, so minimizing the latter can lead to better tail performance. To learn such randomized classifiers, we propose the Boosted CVaR Classification framework, which is motivated by a direct relationship between CVaR and a classical boosting algorithm called LPBoost. Based on this framework, we design an algorithm called alpha-AdaLPBoost. We empirically evaluate our proposed algorithm on four benchmark datasets and show that it achieves higher tail performance than deterministic model training methods.
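For context, the CVaR loss referred to in the last entry is typically written via the Rockafellar-Uryasev variational formula (a sketch under standard conventions; the paper's exact notation may differ): for a tail fraction $$\alpha \in (0, 1]$$ and a per-sample loss $$\ell$$,

$$\mathrm{CVaR}_\alpha(\ell) = \inf_{\rho \in \mathbb{R}} \Big\{ \rho + \tfrac{1}{\alpha}\, \mathbb{E}\big[(\ell - \rho)_+\big] \Big\},$$

which equals the average of the worst $$\alpha$$-fraction of losses and reduces to the ordinary average risk when $$\alpha = 1$$.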