Title: Distributionally Robust Optimization with Bias and Variance Reduction
We consider the distributionally robust optimization (DRO) problem with a spectral risk-based uncertainty set and an f-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-k loss. We present Prospect, a stochastic gradient-based algorithm that requires tuning only a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3× faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
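As a concrete illustration of the risk-sensitive objectives named in the abstract, below is a minimal NumPy sketch of the empirical average top-k loss and conditional value-at-risk; the function names and toy data are our own, and this is a sketch of the objectives, not the Prospect algorithm itself.

```python
import numpy as np

def average_top_k_loss(losses, k):
    """Empirical average top-k loss: the mean of the k largest
    per-example losses."""
    return np.mean(np.sort(np.asarray(losses))[-k:])

def cvar(losses, alpha):
    """Empirical conditional value-at-risk at level alpha: the mean
    of the worst alpha-fraction of the losses."""
    sorted_desc = np.sort(np.asarray(losses))[::-1]
    k = max(1, int(np.ceil(alpha * len(sorted_desc))))
    return np.mean(sorted_desc[:k])

# Both objectives up-weight the hardest examples relative to the mean.
losses = np.array([0.1, 0.5, 2.0, 0.3, 1.2])
print(average_top_k_loss(losses, k=2))  # (2.0 + 1.2) / 2 = 1.6
print(cvar(losses, alpha=0.4))          # worst 40% of 5 = 2 losses -> 1.6
```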
Award ID(s):
2023166
PAR ID:
10534217
Author(s) / Creator(s):
; ; ;
Corporate Creator(s):
Editor(s):
OpenReview
Publisher / Repository:
OpenReview
Date Published:
Edition / Version:
NA
Volume:
NA
Issue:
NA
Page Range / eLocation ID:
NA
Subject(s) / Keyword(s):
NA
Format(s):
Medium: X Size: NA Other: NA
Size(s):
NA
Sponsoring Org:
National Science Foundation
More Like this
  1. Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated when the objective function is strongly convex. In (Bottou et al., 2016), a new convergence analysis of SGD is performed under the assumption that the stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such a bound always holds. Moreover, we propose an alternative convergence analysis of SGD with a diminishing learning rate, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method with a diminishing learning rate.
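To make the diminishing learning-rate regime concrete, here is a minimal sketch of SGD with the schedule $$\eta_t = \eta_0 / (1 + \gamma t)$$ on a toy strongly convex problem (ridge-regularized least squares); the problem, schedule constants, and names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex problem: ridge-regularized least squares
# (illustrative choice; any smooth strongly convex loss would do).
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def stochastic_grad(w, i):
    """Gradient of 0.5*(x_i.w - y_i)^2 + 0.5*lam*||w||^2 at sample i."""
    return (X[i] @ w - y[i]) * X[i] + lam * w

w = np.zeros(d)
eta0, gamma = 0.1, 0.1
for t in range(5000):
    i = rng.integers(n)
    eta_t = eta0 / (1.0 + gamma * t)  # diminishing learning rate
    w -= eta_t * stochastic_grad(w, i)
```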
  2. In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $$\epsilon$$-local minimizer matches the corresponding deterministic rate of $$\tilde{O}(1/\epsilon^2)$$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under the same interpolation-like conditions and show that its oracle complexity to reach an $$\epsilon$$-local minimizer is $$\tilde{O}(1/\epsilon^{2.5})$$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $$\tilde{O}(1/\epsilon^{1.5})$$ of the deterministic Cubic-Regularized Newton method. It seems that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
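A common form of the perturbation used by PSGD-style methods is a small isotropic noise term injected into the update, which is what allows escape from strict saddle points. The sketch below uses a per-step perturbation variant on a toy saddle; all constants and names are our illustrative choices, not the analyzed algorithm's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def psgd_step(x, stochastic_grad, lr=0.01, noise_radius=0.01):
    """One perturbed stochastic gradient step: isotropic noise of a
    fixed radius is added to the update so the iterate can leave
    strict saddle points. Per-step noise is one common variant."""
    noise = rng.normal(size=x.shape)
    noise *= noise_radius / np.linalg.norm(noise)  # uniform direction
    return x - lr * (stochastic_grad(x) + noise)

# Toy saddle: f(x) = 0.5*(x0^2 - x1^2) has a strict saddle at the origin.
grad = lambda x: np.array([x[0], -x[1]]) + 0.1 * rng.normal(size=2)
x = np.zeros(2)
for _ in range(1000):
    x = psgd_step(x, grad)
print(x)  # |x[1]| has grown: the iterate escaped the saddle
```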
  3. We focus on developing a theoretical understanding of meta-learning. Given multiple tasks drawn i.i.d. from some (unknown) task distribution, the goal is to find a good pre-trained model that can be adapted to a new, previously unseen, task with little computational and statistical overhead. We introduce a novel notion of stability for meta-learning algorithms, namely uniform meta-stability. We instantiate two uniformly meta-stable learning algorithms based on regularized empirical risk minimization and gradient descent and give explicit generalization bounds for convex learning problems with smooth losses and for weakly convex learning problems with non-smooth losses. Finally, we extend our results to stochastic and adversarially robust variants of our meta-learning algorithm. 
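As one concrete instance of the gradient-descent and regularized-ERM meta-learners described above, here is a Reptile-style sketch: the base learner solves a least-squares problem penalized toward the pre-trained model, and the outer loop moves the pre-trained model toward each task-adapted solution. The task distribution, penalty, and outer update are our illustrative assumptions, not necessarily the authors' exact algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5

def sample_task():
    """Draw a toy regression task from an (unknown) task distribution."""
    w_star = rng.normal(size=d)
    X = rng.normal(size=(20, d))
    return X, X @ w_star + 0.1 * rng.normal(size=20)

def adapt(w_pre, X, y, steps=20, lr=0.1, reg=0.5):
    """Regularized ERM base learner: least squares penalized toward
    the pre-trained model, so adaptation stays close to w_pre."""
    w = w_pre.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + reg * (w - w_pre)
        w -= lr * grad
    return w

# Reptile-style outer loop: move the pre-trained model toward each
# task-adapted solution (one simple gradient-based meta-learner).
w_pre, meta_lr = np.zeros(d), 0.1
for _ in range(500):
    X, y = sample_task()
    w_pre += meta_lr * (adapt(w_pre, X, y) - w_pre)
```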
  4. Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R. (Ed.)
    Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $$F$$, STORM finds a point $$\boldsymbol{x}$$ with $$E[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$$ in $$T$$ iterations with $$\sigma^2$$ variance in the gradients, matching the optimal rate and without requiring knowledge of $$\sigma$$.
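The momentum variant at the heart of STORM is the recursive estimator $$d_t = \nabla f(x_t;\xi_t) + (1-a)(d_{t-1} - \nabla f(x_{t-1};\xi_t))$$, which evaluates one fresh sample at two consecutive iterates. Below is a minimal sketch with a constant learning rate and momentum weight; the actual algorithm adapts both online, and the seed-based oracle interface is our own convention.

```python
import numpy as np

def storm(grad, x0, steps=1000, lr=0.01, a=0.1, seed=3):
    """Sketch of STORM's recursive estimator
    d_t = g(x_t, xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}, xi_t)),
    where both gradient calls reuse the same fresh sample xi_t.
    Constant lr and momentum weight a are simplifications."""
    rng = np.random.default_rng(seed)
    x_prev, x = x0.copy(), x0.copy()
    d = grad(x, int(rng.integers(1 << 30)))  # initial estimate
    for _ in range(steps):
        x_prev, x = x, x - lr * d
        xi = int(rng.integers(1 << 30))      # one sample, used twice
        d = grad(x, xi) + (1 - a) * (d - grad(x_prev, xi))
    return x

# Toy oracle: noisy quadratic; seeding by xi lets the same sample be
# evaluated at two different points, which is the key STORM trick.
def noisy_grad(x, xi):
    noise = np.random.default_rng(xi).normal(size=x.shape)
    return x + 0.5 * noise

print(storm(noisy_grad, np.ones(3)))  # approaches the minimizer at 0
```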
  5. This paper unifies commonly used accelerated stochastic gradient methods (Polyak's Heavy Ball, Nesterov's Accelerated Gradient, and Adaptive Moment Estimation (Adam)) as specific cases of a general low-pass regularized learning framework, the Automatic Stochastic Gradient Method (AutoSGM). For AutoSGM, we derive an optimal iteration-dependent learning-rate function and realize an approximation of it. Adam is also an approximation of this optimal approach, one that replaces the iteration-dependent learning rate with a constant. Empirical results on deep neural networks show that AutoSGM equipped with the iteration-dependent learning-rate algorithm learns quickly, is robust to the initial choice of learning rate, and can tune an initial constant learning rate in applications where a good constant learning-rate approximation is unknown.
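The unifying low-pass view treats the stochastic gradient as a noisy signal to be filtered before the parameter update. The sketch below shows that filter with a constant learning rate standing in for AutoSGM's iteration-dependent learning-rate function, which the abstract does not spell out; the normalize flag switches between a heavy-ball-like and an Adam-like instance, and all names and constants are our own.

```python
import numpy as np

def lowpass_step(x, g, state, beta=0.9, lr=0.01, eps=1e-8,
                 normalize=False):
    """One step of a low-pass-filtered gradient method. The gradient
    g is smoothed by an exponential filter; normalize=False gives a
    heavy-ball-like update, normalize=True an Adam-like one (filtered
    second moment, no bias correction). The constant lr is a stand-in
    for an iteration-dependent learning-rate function."""
    state["m"] = beta * state["m"] + (1 - beta) * g  # low-pass filter
    if normalize:
        state["v"] = 0.999 * state["v"] + 0.001 * g * g
        return x - lr * state["m"] / (np.sqrt(state["v"]) + eps)
    return x - lr * state["m"]

# Usage: minimize f(x) = 0.5*||x||^2 with the Adam-like variant.
x = np.ones(3)
state = {"m": np.zeros(3), "v": np.zeros(3)}
for _ in range(500):
    x = lowpass_step(x, x, state, normalize=True)  # grad of f is x
```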