skip to main content

Title: Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization
In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods. Our technique has two novel components: (1) modifying standard sketching techniques to obtain what we call a surrogate sketch; and (2) carefully scaling the global regularization parameter for local computations. Our surrogate sketches are based on determinantal point processes, a family of distributions for which the bias of an estimate of the inverse Hessian can be computed exactly. Based on this computation, we show that when the objective being minimized is l2-regularized with parameter ! and individual machines are each given a sketch of size m, then to eliminate the bias, local estimates should be computed using a shrunk regularization parameter given by (See PDF), where d(See PDF) is the (See PDF)-effective dimension of the Hessian (or, for quadratic problems, the data matrix).  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Conference on Neural Information Processing Systems
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. For any given neural network architecture a permutation of weights and biases results in the same functional network. This implies that optimization algorithms used to 'train' or 'learn' the network are faced with a very large number (in the millions even for small networks) of equivalent optimal solutions in the parameter space. To the best of our knowledge, this observation is absent in the literature. In order to narrow down the parameter search space, a novel technique is introduced in order to fix the bias vector configurations to be monotonically increasing. This is achieved by augmenting a typical learning problem with inequality constraints on the bias vectors in each layer. A Moreau-Yosida regularization based algorithm is proposed to handle these inequality constraints and a theoretical convergence of this algorithm is established. Applications of the proposed approach to standard trigonometric functions and more challenging stiff ordinary differential equations arising in chemically reacting flows clearly illustrate the benefits of the proposed approach. Further application of the approach on the MNIST dataset within TensorFlow, illustrate that the presented approach can be incorporated in any of the existing machine learning libraries.

    more » « less
  2. Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. Partially, this is due to symmetries inherent within the NN parameterization, allowing multiple different parameter settings to result in an identical output function, resulting in both an unclear relationship and redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work, showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight scale symmetry enables us to prove these results more simply and enables us to prove new results and gain new insights while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks. Videos of learning dynamics using a spline-based visualization are available at . 
    more » « less
  3. Abstract

    Recent advances in magnetic microscopy have enabled studies of geological samples whose weak and spatially nonuniform magnetizations were previously inaccessible to standard magnetometry techniques. A quantity of central importance is the net magnetic moment, which reflects the mean direction and the intensity of the magnetization states of numerous ferromagnetic crystals within a certain volume. The planar arrangement of typical magnetic microscopy measurements, which originates from measuring the field immediately above the polished surface of a sample to maximize sensitivity and spatial resolution, makes estimating net moments considerably more challenging than with spherically distributed data. In particular, spatially extended and nonuniform magnetization distributions often cannot be adequately approximated by a single magnetic dipole. To address this limitation, we developed a multipole fitting technique that can accurately estimate net moment using spherical harmonic multipole expansions computed from planar data. Given that the optimal location for the origin of such expansions is unknown beforehand and generally unconstrained, regularization of this inverse problem is critical for obtaining accurate moment estimates from noisy experimental magnetic data. We characterized the performance of the technique using synthetic sources under different conditions (noiseless data, data corrupted with simulated white noise, and data corrupted with measured instrument noise). We then validated and demonstrated the technique using superconducting quantum interference device microscopy measurements of impact melt spherules from Lonar crater, India and dusty olivine chondrules from the CO chondrite meteorite Dominion Range 08006.

    more » « less
  4. null (Ed.)
    In this article, Momentum Iterative Hessian Sketch (M-IHS) techniques, a group of solvers for large scale linear Least Squares (LS) problems, are proposed and analyzed in detail. The proposed techniques are obtained by incorporating the Heavy Ball Acceleration into the Iterative Hessian Sketch algorithm and they provide significant improvements over the randomized preconditioning techniques. Through the error analyses of the M-IHS variants, lower bounds on the sketch size for various randomized distributions to converge at a pre-determined rate with a constant probability are established. The bounds present the best results in the current literature for obtaining a solution approximation and they suggest that the sketch size can be chosen proportional to the statistical dimension of the regularized problem regardless of the size of the coefficient matrix. The statistical dimension is always smaller than the rank and it gets smaller as the regularization parameter increases. By using approximate solvers along with the iterations, the M-IHS variants are capable of avoiding all matrix decompositions and inversions, which is one of the main advantages over the alternative solvers such as the Blendenpik and the LSRN. Similar to the Chebyshev Semi-iterations, the M-IHS variants do not use any inner products and eliminate the corresponding synchronizations steps in hierarchical or distributed memory systems, yet the M-IHS converges faster than the Chebyshev Semi-iteration based solvers 
    more » « less
  5. The data provided here accompany the publication "Drought Characterization with GPS: Insights into Groundwater and Reservoir Storage in California" [Young et al., (2024)] which is currently under review with Water Resources Research. (as of 28 May 2024)

    Please refer to the manuscript and its supplemental materials for full details. (A link will be appended following publication)

    File formatting information is listed below, followed by a sub-section of the text describing the Geodetic Drought Index Calculation.

    The longitude, latitude, and label for grid points are provided in the file "loading_grid_lon_lat".

    Time series for each Geodetic Drought Index (GDI) time scale are provided within "".

    The included time scales are for 00- (daily), 1-, 3-, 6-, 12- 18- 24-, 36-, and 48-month GDI solutions.

    Files are formatted following...

    Title: "grid point label L****"_"time scale"_month

    File Format: ["decimal date" "GDI value"]

    Gridded, epoch-by-epoch, solutions for each time scale are provided within "".

    Files are formatted following...

    Title: GDI_"decimal date"_"time scale"_month

    File Format: ["longitude" "latitude" "GDI value" "grid point label L****"]


    We develop the GDI following Vicente-Serrano et al. (2010) and Tang et al. (2023), such that the GDI mimics the derivation of the SPEI, and utilize the log-logistic distribution (further details below). While we apply hydrologic load estimates derived from GPS displacements as the input for this GDI (Figure 1a-d), we note that alternate geodetic drought indices could be derived using other types of geodetic observations, such as InSAR, gravity, strain, or a combination thereof. Therefore, the GDI is a generalizable drought index framework.

    A key benefit of the SPEI is that it is a multi-scale index, allowing the identification of droughts which occur across different time scales. For example, flash droughts (Otkin et al., 2018), which may develop over the period of a few weeks, and persistent droughts (>18 months), may not be observed or fully quantified in a uni-scale drought index framework. However, by adopting a multi-scale approach these signals can be better identified (Vicente-Serrano et al., 2010). Similarly, in the case of this GPS-based GDI, hydrologic drought signals are expected to develop at time scales that are both characteristic to the drought, as well as the source of the load variation (i.e., groundwater versus surface water and their respective drainage basin/aquifer characteristics). Thus, to test a range of time scales, the TWS time series are summarized with a retrospective rolling average window of D (daily with no averaging), 1, 3, 6, 12, 18, 24, 36, and 48-months width (where one month equals 30.44 days).

    From these time-scale averaged time series, representative compilation window load distributions are identified for each epoch. The compilation window distributions include all dates that range ±15 days from the epoch in question per year. This allows a characterization of the estimated loads for each day relative to all past/future loads near that day, in order to bolster the sample size and provide more robust parametric estimates [similar to Ford et al., (2016)]; this is a key difference between our GDI derivation and that presented by Tang et al. (2023). Figure 1d illustrates the representative distribution for 01 December of each year at the grid cell co-located with GPS station P349 for the daily TWS solution. Here all epochs between between 16 November and 16 December of each year (red dots), are compiled to form the distribution presented in Figure 1e.

    This approach allows inter-annual variability in the phase and amplitude of the signal to be retained (which is largely driven by variation in the hydrologic cycle), while removing the primary annual and semi-annual signals. Solutions converge for compilation windows >±5 days, and show a minor increase in scatter of the GDI time series for windows of ±3-4 days (below which instability becomes more prevalent). To ensure robust characterization of drought characteristics, we opt for an extended ±15-day compilation window. While Tang et al. (2023) found the log-logistic distribution to be unstable and opted for a normal distribution, we find that, by using the extended compiled distribution, the solutions are stable with negligible differences compared to the use of a normal distribution. Thus, to remain aligned with the SPEI solution, we retain the three-parameter log-logistic distribution to characterize the anomalies. Probability weighted moments for the log-logistic distribution are calculated following Singh et al., (1993) and Vicente-Serrano et al., (2010). The individual moments are calculated following Equation 3.

    These are then used to calculate the L-moments for shape (), scale (), and location () of the three-parameter log-logistic distribution (Equations 4 – 6).

    The probability density function (PDF) and the cumulative distribution function (CDF) are then calculated following Equations 7 and 8, respectively.

    The inverse Gaussian function is used to transform the CDF from estimates of the parametric sample quantiles to standard normal index values that represent the magnitude of the standardized anomaly. Here, positive/negative values represent greater/lower than normal hydrologic storage. Thus, an index value of -1 indicates that the estimated load is approximately one standard deviation dryer than the expected average load on that epoch.

    *Equations can be found in the main text.

    more » « less