 Award ID(s):
 1818977
 NSFPAR ID:
 10230560
 Date Published:
 Journal Name:
 Journal of machine learning research
 Volume:
 22
 Issue:
 17
 ISSN:
 15337928
 Page Range / eLocation ID:
 140
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

null (Ed.)We present and analyze a momentumbased gradient method for training linear classifiers with an exponentiallytailed loss (eg, the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of O (1/t^ 2). This contrasts with a rate of O (1/log (t)) for standard gradient descent, and O (1/t) for normalized gradient descent. The momentumbased method is derived via the convex dual of the maximummargin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive nonuniform sampling via the dual variables.more » « less

Iterative firstorder methods such as gradient descent and its variants are widely used for solving optimization and machine learning problems. There has been recent interest in analytic or numerically efficient methods for computing worstcase performance bounds for such algorithms, for example over the class of strongly convex loss functions. A popular approach is to assume the algorithm has a fixed size (fixed dimension, or memory) and that its structure is parameterized by one or two hyperparameters, for example a learning rate and a momentum parameter. Then, a Lyapunov function is sought to certify robust stability and subsequent optimization can be performed to find optimal hyperparameter tunings. In the present work, we instead fix the constraints that characterize the loss function and apply techniques from robust control synthesis to directly search over algorithms. This approach yields stronger results than those previously available, since the bounds produced hold over algorithms with an arbitrary, but finite, amount of memory rather than just holding for algorithms with a prescribed structure.more » « less

Bach, Francis ; Blei, David ; Scholkopf, Bernhard (Ed.)This paper investigates the asymptotic behaviors of gradient descent algorithms (particularly accelerated gradient descent and stochastic gradient descent) in the context of stochastic optimization arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be computationally modeled by continuoustime ordinary or stochastic differential equations. We establish gradient flow central limit theorems to describe the limiting dynamic behaviors of these computational algorithms and the largesample performances of the related statistical procedures, as the number of algorithm iterations and data size both go to infinity, where the gradient flow central limit theorems are governed by some linear ordinary or stochastic differential equations, like timedependent OrnsteinUhlenbeck processes. We illustrate that our study can provide a novel unified framework for a joint computational and statistical asymptotic analysis, where the computational asymptotic analysis studies the dynamic behaviors of these algorithms with time (or the number of iterations in the algorithms), the statistical asymptotic analysis investigates the largesample behaviors of the statistical procedures (like estimators and classifiers) that are computed by applying the algorithms; in fact, the statistical procedures are equal to the limits of the random sequences generated from these iterative algorithms, as the number of iterations goes to infinity. The joint analysis results based on the obtained gradient flow central limit theorems lead to the identification of four factors—learning rate, batch size, gradient covariance, and Hessian—to derive new theories regarding the local minima found by stochastic gradient descent for solving nonconvex optimization problems.more » « less

Bach, Francis ; Blei, David ; Scholkopf, Bernhard (Ed.)This paper investigates the asymptotic behaviors of gradient descent algorithms (particularly accelerated gradient descent and stochastic gradient descent) in the context of stochastic optimization arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be computationally modeled by continuoustime ordinary or stochastic differential equations. We establish gradient flow central limit theorems to describe the limiting dynamic behaviors of these computational algorithms and the largesample performances of the related statistical procedures, as the number of algorithm iterations and data size both go to infinity, where the gradient flow central limit theorems are governed by some linear ordinary or stochastic differential equations, like timedependent OrnsteinUhlenbeck processes. We illustrate that our study can provide a novel unified framework for a joint computational and statistical asymptotic analysis, where the computational asymptotic analysis studies the dynamic behaviors of these algorithms with time (or the number of iterations in the algorithms), the statistical asymptotic analysis investigates the largesample behaviors of the statistical procedures (like estimators and classifiers) that are computed by applying the algorithms; in fact, the statistical procedures are equal to the limits of the random sequences generated from these iterative algorithms, as the number of iterations goes to infinity. The joint analysis results based on the obtained The joint analysis results based on the obtained gradient flow central limit theorems lead to the identification of four factorslearning rate, batch size, gradient covariance, and Hessianto derive new theories regarding the local minima found by stochastic gradient descent for solving nonconvex optimization problems.more » « less

Deep learning architectures are usually proposed with millions of parameters, resulting in a memory issue when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low quality solution due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural network with small batches/noisy gradient. During optimization, our method iteratively applies a proximal type regularizer to make loss function strongly convex. Such regularizer stablizes the gradient, leading to better training performance. We prove that our algorithm achieves comparable convergence rate as vanilla SGD even with small batch size. Our framework is simple to implement and can be potentially combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.