Title: Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization
In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is $$\ell_2$$-regularized with parameter $$\lambda$$ and individual machines are each given a sketch of size $$m$$, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by $$\lambda' = \lambda\,(1 - d_\lambda/m)$$, where $$d_\lambda$$ is the $$\lambda$$-effective dimension of the Hessian (or, for quadratic problems, the data matrix).
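A minimal NumPy sketch of the scaled-regularization rule is given below. It computes the $$\lambda$$-effective dimension $$d_\lambda = \mathrm{tr}\,(A^\top A (A^\top A + \lambda I)^{-1})$$ and the shrunk parameter $$\lambda' = \lambda\,(1 - d_\lambda/m)$$, then averages local ridge estimates on a toy problem. Plain row subsampling stands in for the DPP-based surrogate sketch, and the local estimator (each worker solving a rescaled subsampled ridge problem) is an illustrative choice rather than the paper's exact construction, so the bias correction here is only approximate.

```python
import numpy as np

def effective_dimension(A, lam):
    """lambda-effective dimension d_lam = tr(A^T A (A^T A + lam*I)^{-1})."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

def shrunk_regularization(A, lam, m):
    """Scaled local parameter lam' = lam * (1 - d_lam / m) for sketch size m."""
    d_lam = effective_dimension(A, lam)
    assert m > d_lam, "sketch size must exceed the effective dimension"
    return lam * (1.0 - d_lam / m)

# Toy distributed ridge regression: q workers, each solving a subsampled problem.
rng = np.random.default_rng(0)
n, d, m, q = 2000, 50, 100, 100
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + rng.standard_normal(n)

lam = float(n)                                    # comparable to the spectrum of A^T A
lam_local = shrunk_regularization(A, lam, m)      # = lam * (1 - d_lam / m)
x_exact = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def local_estimate(reg):
    # Plain row subsampling stands in for the paper's DPP-based surrogate sketch.
    idx = rng.choice(n, size=m, replace=False)
    As, bs = np.sqrt(n / m) * A[idx], np.sqrt(n / m) * b[idx]
    return np.linalg.solve(As.T @ As + reg * np.eye(d), As.T @ bs)

for name, reg in [("global lambda ", lam), ("scaled lambda'", lam_local)]:
    avg = np.mean([local_estimate(reg) for _ in range(q)], axis=0)
    err = np.linalg.norm(avg - x_exact) / np.linalg.norm(x_exact)
    print(f"{name}: relative error of averaged estimate = {err:.3f}")
```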
Award ID(s):
1838179
PAR ID:
10206897
Author(s) / Creator(s):
Date Published:
Journal Name:
Conference on Neural Information Processing Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this work, we address the problem of Hessian inversion bias in distributed second-order optimization algorithms. We introduce a novel shrinkage-based estimator for the resolvent of Gram matrices that is asymptotically unbiased, and characterize its non-asymptotic convergence rate in the isotropic case. We apply this estimator to bias correction of Newton steps in distributed second-order optimization algorithms, as well as randomized sketching-based methods. We examine the bias present in the naive averaging-based distributed Newton's method using analytical expressions and contrast it with our proposed bias-free approach. Our approach leads to significant improvements in convergence rate compared to standard baselines and recent proposals, as shown through experiments on both real and synthetic datasets.
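The inversion bias discussed in this abstract is easy to reproduce. The toy NumPy sketch below is not the paper's shrinkage estimator; it simply splits a regularized quadratic across workers, averages the local Newton steps, and reports how far the average lands from the exact step.

```python
import numpy as np

# Illustration of the Hessian-inversion bias in naive averaged Newton steps:
# E[H_local^{-1}] != H^{-1}, so the averaged step systematically overshoots.
rng = np.random.default_rng(1)
n, d, lam, q = 5000, 40, 0.1, 50          # q workers, each holding n // q rows
A = rng.standard_normal((n, d))
g = rng.standard_normal(d)                # a fixed gradient direction, for illustration

H = A.T @ A / n + lam * np.eye(d)         # full (regularized) Hessian of a quadratic
exact_step = np.linalg.solve(H, g)

parts = np.array_split(rng.permutation(n), q)
local_steps = []
for idx in parts:
    H_loc = A[idx].T @ A[idx] / len(idx) + lam * np.eye(d)
    local_steps.append(np.linalg.solve(H_loc, g))
avg_step = np.mean(local_steps, axis=0)

print("relative error of averaged Newton step:",
      np.linalg.norm(avg_step - exact_step) / np.linalg.norm(exact_step))
# Averaging over more workers reduces variance, not this systematic bias,
# which is what the shrinkage-based resolvent estimator above is designed to remove.
```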
  2. We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an $$(\epsilon,\sqrt{\epsilon})$$-approximate local minimum within $$\tilde{O}(n^{4/5}/\epsilon^{3/2})$$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction techniques to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.
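For context, the cubic-regularized model solved at every iteration of such methods is $$\min_h\; g^\top h + \tfrac12 h^\top H h + \tfrac{\sigma}{3}\|h\|^3$$. The sketch below is a bare-bones NumPy solver for this subproblem in the positive-definite case only (bisection on the step norm); the semi-stochastic gradient and Hessian constructions that are the paper's actual contribution are not shown.

```python
import numpy as np

def cubic_subproblem(g, H, sigma, tol=1e-10):
    """Minimize g^T h + 0.5 h^T H h + (sigma/3)*||h||^3 for positive-definite H.

    At the minimizer, h = -(H + sigma*r*I)^{-1} g with r = ||h||, so we bisect
    on the scalar r.  Production cubic-regularization solvers also handle
    indefinite H; this toy version does not.
    """
    d = H.shape[0]
    def step_norm(r):
        return np.linalg.norm(np.linalg.solve(H + sigma * r * np.eye(d), g))
    lo, hi = 0.0, 1.0
    while step_norm(hi) > hi:              # bracket the fixed point r = ||h(r)||
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > mid:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    return -np.linalg.solve(H + sigma * r * np.eye(d), g)

# Toy usage with a random positive-definite Hessian and gradient.
rng = np.random.default_rng(2)
d = 10
M = rng.standard_normal((d, d))
H = M @ M.T / d
g = rng.standard_normal(d)
h = cubic_subproblem(g, H, sigma=1.0)
print("cubic model value at the computed step:",
      g @ h + 0.5 * h @ H @ h + np.linalg.norm(h) ** 3 / 3)
```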
  3. In this article, Momentum Iterative Hessian Sketch (M-IHS) techniques, a group of solvers for large-scale linear Least Squares (LS) problems, are proposed and analyzed in detail. The proposed techniques are obtained by incorporating Heavy Ball acceleration into the Iterative Hessian Sketch algorithm, and they provide significant improvements over randomized preconditioning techniques. Through the error analyses of the M-IHS variants, lower bounds on the sketch size for various randomized distributions to converge at a pre-determined rate with a constant probability are established. The bounds present the best results in the current literature for obtaining a solution approximation, and they suggest that the sketch size can be chosen proportional to the statistical dimension of the regularized problem regardless of the size of the coefficient matrix. The statistical dimension is always smaller than the rank and it gets smaller as the regularization parameter increases. By using approximate solvers along with the iterations, the M-IHS variants are capable of avoiding all matrix decompositions and inversions, which is one of the main advantages over alternative solvers such as Blendenpik and LSRN. Similar to Chebyshev semi-iterations, the M-IHS variants do not use any inner products and eliminate the corresponding synchronization steps in hierarchical or distributed memory systems, yet the M-IHS converges faster than Chebyshev semi-iteration based solvers.
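A rough NumPy sketch of the core M-IHS idea described in this abstract follows: one sketched, regularized Hessian serves as a fixed preconditioner, the exact gradient drives each iteration, and a heavy-ball momentum term accelerates convergence. The step and momentum constants here are arbitrary placeholders; the paper derives specific values (and its sketch-size bounds) from the statistical dimension of the regularized problem.

```python
import numpy as np

def m_ihs(A, b, lam, m, iters=60, alpha=0.7, beta=0.3, seed=0):
    """Heavy-ball accelerated Iterative Hessian Sketch for ridge regression.

    A rough sketch: alpha and beta are placeholder constants, not the tuned
    values analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch, drawn once
    As = S @ A
    H_s = As.T @ As + lam * np.eye(d)              # sketched regularized Hessian
    x_prev = x = np.zeros(d)
    for _ in range(iters):
        grad = A.T @ (A @ x - b) + lam * x         # exact gradient, sketched curvature
        x_next = x - alpha * np.linalg.solve(H_s, grad) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Compare against the exact ridge solution on a toy problem.
rng = np.random.default_rng(3)
n, d, lam, m = 4000, 60, 1.0, 600
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
x_hat = m_ihs(A, b, lam, m)
print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```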
  4. For any given neural network architecture, a permutation of weights and biases results in the same functional network. This implies that optimization algorithms used to 'train' or 'learn' the network are faced with a very large number (in the millions even for small networks) of equivalent optimal solutions in the parameter space. To the best of our knowledge, this observation is absent in the literature. In order to narrow down the parameter search space, a novel technique is introduced which fixes the bias vector configurations to be monotonically increasing. This is achieved by augmenting a typical learning problem with inequality constraints on the bias vectors in each layer. A Moreau-Yosida regularization based algorithm is proposed to handle these inequality constraints, and theoretical convergence of this algorithm is established. Applications of the proposed approach to standard trigonometric functions and more challenging stiff ordinary differential equations arising in chemically reacting flows clearly illustrate its benefits. Further application of the approach to the MNIST dataset within TensorFlow illustrates that the presented approach can be incorporated into any of the existing machine learning libraries.
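To make the constrained formulation concrete, the sketch below (plain NumPy, not the paper's algorithm or its TensorFlow implementation) trains a single ReLU layer on random data while a quadratic penalty of Moreau-Yosida type, (1/2γ)·Σ max(0, b_i − b_{i+1})², pushes the bias vector toward a monotonically increasing configuration. The penalty weight, learning rate, and toy data are arbitrary choices; shrinking γ enforces the ordering more tightly.

```python
import numpy as np

def monotonicity_penalty(b, gamma=0.1):
    """Quadratic (Moreau-Yosida-style) penalty and its gradient enforcing
    b[0] <= b[1] <= ...; a stand-in for the paper's constrained formulation."""
    viol = np.maximum(b[:-1] - b[1:], 0.0)            # positive where order is violated
    grad = np.zeros_like(b)
    grad[:-1] += viol / gamma
    grad[1:] -= viol / gamma
    return np.sum(viol ** 2) / (2 * gamma), grad

# Toy layer: fit relu(W x + b) to random targets while keeping the biases ordered.
rng = np.random.default_rng(4)
d_in, d_out, n = 5, 8, 200
X = rng.standard_normal((n, d_in))
Y = rng.standard_normal((n, d_out))
W = 0.1 * rng.standard_normal((d_out, d_in))
b = rng.standard_normal(d_out)

lr = 1e-2
for _ in range(500):
    Z = X @ W.T + b
    H = np.maximum(Z, 0.0)                            # ReLU activations
    dZ = (H - Y) * (Z > 0)                            # backprop through the ReLU
    gW = dZ.T @ X / n
    gb = dZ.mean(axis=0)
    _, gpen = monotonicity_penalty(b)
    W -= lr * gW
    b -= lr * (gb + gpen)                             # data gradient + penalty gradient

print("largest remaining monotonicity violation in b:",
      float(np.maximum(b[:-1] - b[1:], 0.0).max()))
```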
  5. In geometry processing, smoothness energies are commonly used to model scattered data interpolation, dense data denoising, and regularization during shape optimization. The squared Laplacian energy is a popular choice of energy and has a corresponding standard implementation: squaring the discrete Laplacian matrix. For compact domains, when values along the boundary are not known in advance, this construction bakes in low-order boundary conditions. This causes the geometric shape of the boundary to strongly bias the solution, which is undesirable for many applications. Instead, we propose using a different smoothness energy: the squared Frobenius norm of the Hessian, whose natural boundary conditions avoid this bias. Unlike the squared Laplacian energy, this energy's natural boundary conditions (those that best minimize the energy) correspond to meaningful high-order boundary conditions. These boundary conditions model free boundaries, where the shape of the boundary should not bias the solution locally. Our analysis begins in the smooth setting and concludes with discretizations using finite differences on 2D grids or mixed finite elements for triangle meshes. We demonstrate the core behavior of the squared Hessian as a smoothness energy for various tasks.
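As a concrete comparison of the two energies on a 2D grid, the finite-difference sketch below (NumPy; a simplified stand-in for the paper's discretizations) assembles the squared discrete Laplacian energy and the squared Frobenius-norm Hessian energy $$u_{xx}^2 + 2u_{xy}^2 + u_{yy}^2$$, then uses each to denoise samples of a smooth function. Grid size, noise level, and the smoothing weight are arbitrary choices.

```python
import numpy as np

def diff1(n):
    """Forward-difference operator, (n-1) x n, unit grid spacing."""
    return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

n = 20                                    # n x n grid, row-major flattening
I = np.eye(n)
D1 = diff1(n)                             # first differences
D2 = diff1(n - 1) @ D1                    # second differences, (n-2) x n
R = np.eye(n - 2, n, k=1)                 # restriction to interior indices

# Squared Laplacian energy: square the discrete Laplacian at interior nodes.
L = np.kron(D2, R) + np.kron(R, D2)       # 5-point Laplacian on interior nodes
Q_lap = L.T @ L

# Squared Frobenius-norm Hessian energy: u_xx^2 + 2*u_xy^2 + u_yy^2.
Dxx, Dyy, Dxy = np.kron(D2, I), np.kron(I, D2), np.kron(D1, D1)
Q_hess = Dxx.T @ Dxx + 2.0 * Dxy.T @ Dxy + Dyy.T @ Dyy

# Denoise the same noisy samples of a smooth function with each energy.
rng = np.random.default_rng(5)
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
u_true = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
data = (u_true + 0.2 * rng.standard_normal((n, n))).ravel()

alpha = 0.5
for name, Q in [("squared Laplacian", Q_lap), ("Hessian", Q_hess)]:
    u = np.linalg.solve(np.eye(n * n) + alpha * Q, data)
    err = np.linalg.norm(u - u_true.ravel()) / np.linalg.norm(u_true)
    print(f"{name:>17s} energy, relative error: {err:.3f}")
```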