skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 10:00 PM to 12:00 AM ET on Tuesday, March 25 due to maintenance. We apologize for the inconvenience.


Title: Is local SGD better than minibatch SGD?
Award ID(s):
1718970
PAR ID:
10218521
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
International Conference on Machine Learning
Page Range / eLocation ID:
10334-10343
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form x → max(0,⟨w,x⟩) with w ∈ Rd denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quanitzed Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments including distributed implementations over Amazon EC2. 
    more » « less
  2. null (Ed.)
    In this paper, we study communication-efficient decentralized training of large-scale machine learning models over a network. We propose and analyze SQuARM-SGD, a decentralized training algorithm, employing momentum and compressed communication between nodes regulated by a locally computable triggering rule. In SQuARM-SGD, each node performs a fixed number of local SGD (stochastic gradient descent) steps using Nesterov's momentum and then sends sparisified and quantized updates to its neighbors only when there is a significant change in its model parameters since the last time communication occurred. We provide convergence guarantees of our algorithm for strongly-convex and non-convex smooth objectives. We believe that ours is the first theoretical analysis for compressed decentralized SGD with momentum updates. We show that SQuARM-SGD converges at rate O(1/nT) for strongly-convex objectives, while for non-convex objectives it converges at rate O(1/√nT), thus matching the convergence rate of \emphvanilla distributed SGD in both these settings. We corroborate our theoretical understanding with experiments and compare the performance of our algorithm with the state-of-the-art, showing that without sacrificing much on the accuracy, SQuARM-SGD converges at a similar rate while saving significantly in total communicated bits. 
    more » « less