Title: Revisiting Outer Optimization in Adversarial Training
Abstract: Despite the fundamental distinction between adversarial and natural training (AT and NT), AT methods generally adopt momentum SGD (MSGD) for the outer optimization. This paper analyzes this choice by investigating the overlooked role of outer optimization in AT. Our exploratory evaluations reveal that AT induces higher gradient norm and variance than NT. This phenomenon hinders the outer optimization in AT, since the convergence rate of MSGD is highly dependent on the variance of the gradients. To this end, we propose an optimization method called ENGM, which regularizes the contribution of each input example to the average mini-batch gradient. We prove that the convergence rate of ENGM is independent of the variance of the gradients, and thus it is suitable for AT. We introduce a trick that reduces the computational cost of ENGM, based on empirical observations of the correlation between the norm of the gradients w.r.t. the network parameters and w.r.t. the input examples. Our extensive evaluations and ablation studies on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that ENGM and its variants consistently improve the performance of a wide range of AT methods. Furthermore, ENGM alleviates major shortcomings of AT, including robust overfitting and high sensitivity to hyperparameter settings.
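A minimal sketch of the core idea, in PyTorch: cap each example's contribution to the averaged mini-batch gradient, using input-gradient norms as a cheap proxy for parameter-gradient norms, per the correlation noted in the abstract. The function name, the cap c, and the proxy choice are illustrative assumptions, not the authors' released ENGM implementation.

    import torch
    import torch.nn.functional as F

    def engm_style_step(model, optimizer, x_adv, y, c=1.0):
        # Sketch of ENGM-style outer optimization: bound each example's
        # contribution to the averaged gradient. The cap c is hypothetical.
        x_adv = x_adv.detach().requires_grad_(True)
        per_example_loss = F.cross_entropy(model(x_adv), y, reduction="none")

        # Input-gradient norms as a cheap stand-in for per-example
        # parameter-gradient norms (the correlation the paper reports).
        (g_x,) = torch.autograd.grad(per_example_loss.sum(), x_adv,
                                     retain_graph=True)
        norms = g_x.flatten(1).norm(dim=1)

        # Down-weight examples whose proxy norm exceeds c, so that no
        # single example dominates the mini-batch average.
        weights = torch.clamp(c / (norms + 1e-12), max=1.0).detach()

        optimizer.zero_grad()
        (weights * per_example_loss).mean().backward()
        optimizer.step()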
Award ID(s): 1650474
PAR ID: 10401298
Journal Name: 17th European Conference on Computer Vision (ECCV 2022)
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal in terms of finding a near-stationary point with respect to the l2-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the l1-norm of the gradient as the stationarity measure, as opposed to the standard l2-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the l1-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of d, where d is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions. 
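For reference, the coordinate-wise AdaGrad update this analysis concerns is the standard one; a minimal NumPy version follows (grad_fn and the hyperparameters are placeholders for illustration):

    import numpy as np

    def adagrad(grad_fn, x0, lr=0.1, eps=1e-8, steps=1000):
        # Standard AdaGrad: each coordinate gets its own effective step
        # size, scaled by the accumulated squared gradients.
        x = np.asarray(x0, dtype=float).copy()
        accum = np.zeros_like(x)          # running sum of g_i^2 per coordinate
        for _ in range(steps):
            g = grad_fn(x)
            accum += g * g
            x -= lr * g / (np.sqrt(accum) + eps)   # coordinate-wise update
        return x

This per-coordinate scaling is exactly the structure that the refined coordinate-wise smoothness and noise assumptions, together with the l1-norm stationarity measure, are designed to capture.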
  2. We develop new adaptive algorithms for variational inequalities with monotone operators, which capture many problems of interest, notably convex optimization and convex-concave saddle point problems. Our algorithms automatically adapt to unknown problem parameters such as the smoothness and the norm of the operator, and the variance of the stochastic evaluation oracle. We show that our algorithms are universal and simultaneously achieve the optimal convergence rates in the non-smooth, smooth, and stochastic settings. The convergence guarantees of our algorithms improve over existing adaptive methods and match the optimal non-adaptive algorithms. Additionally, prior works require that the optimization domain be bounded; in this work, we remove this restriction and give algorithms for unbounded domains that are adaptive and universal. Our general proof techniques can be used for many variants of the algorithm using one or two operator evaluations per iteration. The classical methods based on the ExtraGradient/MirrorProx algorithm require two operator evaluations per iteration, which is the dominant factor in the running time in many settings.
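For context, the classical two-evaluation ExtraGradient step mentioned above looks as follows in the Euclidean, fixed-step case (a sketch; the paper's adaptive methods replace the fixed eta with data-dependent step sizes):

    import numpy as np

    def extragradient(F_op, z0, eta=0.1, steps=1000):
        # Classical ExtraGradient/MirrorProx: two operator evaluations per
        # iteration. F_op is the monotone operator; eta is fixed here,
        # whereas the adaptive variants tune it on the fly.
        z = np.asarray(z0, dtype=float).copy()
        for _ in range(steps):
            z_half = z - eta * F_op(z)    # extrapolation (first evaluation)
            z = z - eta * F_op(z_half)    # update (second evaluation)
        return z

For a convex-concave saddle point min_x max_y f(x, y), the operator is F(z) = (grad_x f, -grad_y f), which is monotone.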
  3. In this work, we propose a method to construct a uniform error bound for the SK predictor. In investigating the asymptotic properties of the proposed uniform error bound, we examine the convergence rate of SK’s predictive variance under the supremum norm in both fixed and random design settings. Our analyses reveal that the large-sample properties of SK prediction depend on the design-point sampling scheme and the budget allocation scheme adopted. Appropriately controlling the order of noise variances through budget allocation is crucial for achieving a desirable convergence rate of SK’s approximation error, as quantified by the uniform error bound, and for maintaining SK’s numerical stability. Moreover, we investigate the impact of noise variance estimation on the uniform error bound’s performance theoretically and numerically. We demonstrate the superiority of the proposed uniform bound to the Bonferroni correction-based simultaneous confidence interval under various experimental settings through numerical evaluations. 
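Assuming SK here denotes stochastic kriging (as the predictive-variance and budget-allocation language suggests), the predictor in question has the standard form sketched below; the Gaussian kernel and the fixed hyperparameters are illustrative, not the paper's estimated values.

    import numpy as np

    def sk_predict(x0, X, ybar, noise_var, tau2=1.0, theta=1.0, beta0=0.0):
        # Textbook stochastic-kriging predictor with a Gaussian kernel.
        # noise_var[i] is the intrinsic noise variance at design point i
        # divided by its replication count -- the quantity that the
        # budget-allocation scheme controls.
        def kern(a, b):
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return tau2 * np.exp(-theta * d2)

        K = kern(X, X) + np.diag(noise_var)        # extrinsic + intrinsic covariance
        kx = kern(x0[None, :], X).ravel()
        mean = beta0 + kx @ np.linalg.solve(K, ybar - beta0)
        var = tau2 - kx @ np.linalg.solve(K, kx)   # predictive variance at x0
        return mean, var

A uniform error bound then controls the prediction error simultaneously over all prediction points x0, in contrast to the pointwise or Bonferroni-corrected intervals it is compared against.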
  4. In this paper, we address the challenges of asynchronous gradient descent in distributed learning environments, particularly stale gradients and the need for extensive communication resources. We develop a novel communication-efficient framework that incorporates a gradient evaluation algorithm to assess and utilize delayed gradients based on their quality, ensuring efficient and effective model updates while significantly reducing communication overhead. Our proposed algorithm requires agents to send only the norm of the gradient rather than the computed gradient itself. The server then accepts the gradient if the ratio between the norm of the gradient and the distance between the global and local model parameters exceeds a certain threshold. With a proper choice of threshold, we show that the convergence rate achieves the same order as synchronous stochastic gradient descent and, unlike most existing works, does not depend on the staleness value. Given the computational complexity of the initial algorithm, we introduce a simplified variant that prioritizes practical applicability without compromising the convergence rates. Our simulations demonstrate that our proposed algorithms outperform existing state-of-the-art methods, offering improved convergence rates, stability, accuracy, and resource consumption.
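Read literally, the server-side acceptance test is simple; a sketch follows (the names and the comparison direction track the abstract's wording and may differ from the paper's exact rule):

    import numpy as np

    def accept_delayed_gradient(grad_norm, theta_global, theta_local, threshold):
        # Accept a stale gradient when the ratio of its norm to the
        # parameter drift since it was computed exceeds the threshold.
        # Only grad_norm travels over the network, not the full gradient.
        drift = np.linalg.norm(theta_global - theta_local)
        if drift == 0.0:        # local copy is current: nothing is stale
            return True
        return grad_norm / drift > threshold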