skip to main content


Title: The importance of better models in stochastic optimization
Standard stochastic optimization methods are brittle, sensitive to stepsize choice and other algorithmic parameters, and they exhibit instability outside of well-behaved families of objectives. To address these challenges, we investigate models for stochastic optimization and learning problems that exhibit better robustness to problem families and algorithmic parameters. With appropriately accurate models—which we call the aprox family—stochastic methods can be made stable, provably convergent, and asymptotically optimal; even modeling that the objective is nonnegative is sufficient for this stability. We extend these results beyond convexity to weakly convex objectives, which include compositions of convex losses with smooth functions common in modern machine learning. We highlight the importance of robustness and accurate modeling with experimental evaluation of convergence time and algorithm sensitivity.  more » « less
Award ID(s):
1553086
NSF-PAR ID:
10383072
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
116
Issue:
46
ISSN:
0027-8424
Page Range / eLocation ID:
22924 to 22930
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy-tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we coin as \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks. 
    more » « less
  2. Stochastic Gradient Descent (SGD) based methods have been widely used for training large-scale machine learning models that also generalize well in practice. Several explanations have been offered for this generalization performance, a prominent one being algorithmic stability [18]. However, there are no known examples of smooth loss functions for which the analysis can be shown to be tight. Furthermore, apart from the properties of the loss function, data distribution has also been shown to be an important factor in generalization performance. This raises the question: is the stability analysis of [18] tight for smooth functions, and if not, for what kind of loss functions and data distributions can the stability analysis be improved? In this paper we first settle open questions regarding tightness of bounds in the data-independent setting: we show that for general datasets, the existing analysis for convex and strongly-convex loss functions is tight, but it can be improved for non-convex loss functions. Next, we give a novel and improved data-dependent bounds: we show stability upper bounds for a large class of convex regularized loss functions, with negligible regularization parameters, and improve existing data-dependent bounds in the non-convex setting. We hope that our results will initiate further efforts to better understand the data-dependent setting under non-convex loss functions, leading to an improved understanding of the generalization abilities of deep networks. 
    more » « less
  3. Stochastic Gradient Descent (SGD) based methods have been widely used for training large-scale machine learning models that also generalize well in practice. Several explanations have been offered for this generalization performance, a prominent one being algorithmic stability Hardt et al [2016]. However, there are no known examples of smooth loss functions for which the analysis can be shown to be tight. Furthermore, apart from properties of the loss function, data distribution has also been shown to be an important factor in generalization performance. This raises the question: is the stability analysis of Hardt et al [2016] tight for smooth functions, and if not, for what kind of loss functions and data distributions can the stability analysis be improved? In this paper we first settle open questions regarding tightness of bounds in the data-independent setting: we show that for general datasets, the existing analysis for convex and strongly-convex loss functions is tight, but it can be improved for non-convex loss functions. Next, we give novel and improved data-dependent bounds: we show stability upper bounds for a large class of convex regularized loss functions, with negligible regularization parameters, and improve existing data-dependent bounds in the non-convex setting. We hope that our results will initiate further efforts to better understand the data-dependent setting under non-convex loss functions, leading to an improved understanding of the generalization abilities of deep networks. 
    more » « less
  4. Linear discriminant analysis (LDA) is widely used for dimensionality reduction under supervised learning settings. Traditional LDA objective aims to minimize the ratio of squared Euclidean distances that may not perform optimally on noisy data sets. Multiple robust LDA objectives have been proposed to address this problem, but their implementations have two major limitations. One is that their mean calculations use the squared l2-norm distance to center the data, which is not valid when the objective does not use the Euclidean distance. The second problem is that there is no generalized optimization algorithm to solve different robust LDA objectives. In addition, most existing algorithms can only guarantee the solution to be locally optimal, rather than globally optimal. In this paper, we review multiple robust loss functions and propose a new and generalized robust objective for LDA. Besides, to better remove the mean value within data, our objective uses an optimal way to center the data through learning. As one important algorithmic contribution, we derive an efficient iterative algorithm to optimize the resulting non-smooth and non-convex objective function. We theoretically prove that our solution algorithm guarantees that both the objective and the solution sequences converge to globally optimal solutions at a sub-linear convergence rate. The experimental results demonstrate the effectiveness of our new method, achieving significant improvements compared to the other competing methods. 
    more » « less
  5. In many areas of constrained optimization, representing all possible constraints that give rise to an accurate feasible region can be difficult and computationally prohibitive for online use. Satisfying feasibility constraints becomes more challenging in high-dimensional, non-convex regimes which are common in engineering applications. A prominent example that is explored in the manuscript is the security-constrained optimal power flow (SCOPF) problem, which minimizes power generation costs, while enforcing system feasibility under contingency failures in the transmission network. In its full form, this problem has been modeled as a nonlinear two-stage stochastic programming problem. In this work, we propose a hybrid structure that incorporates and takes advantage of both a high-fidelity physical model and fast machine learning surrogates. Neural network (NN) models have been shown to classify highly non-linear functions and can be trained offline but require large training sets. In this work, we present how model-guided sampling can efficiently create datasets that are highly informative to a NN classifier for non-convex functions. We show how the resultant NN surrogates can be integrated into a non-linear program as smooth, continuous functions to simultaneously optimize the objective function and enforce feasibility using existing non-linear solvers. Overall, this allows us to optimize instances of the SCOPF problem with an order of magnitude CPU improvement over existing methods.

     
    more » « less