Deep learning architectures are usually proposed with millions of parameters, resulting in a memory issue when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low quality solution due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural network with small batches/noisy gradient. During optimization, our method iteratively applies a proximal type regularizer to make loss function strongly convex. Such regularizer stablizes the gradient, leading to better training performance. We prove that our algorithm achieves comparable convergence rate as vanilla SGD even with small batch size. Our framework is simple to implement and can be potentially combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
more » « less- Award ID(s):
- 1855759
- PAR ID:
- 10108626
- Date Published:
- Journal Name:
- The Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI)
- Page Range / eLocation ID:
- 4348 to 4354
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We analyze deep ReLU neural networks trained with mini-batch stochastic gradient decent and weight decay. We prove that the source of the SGD noise is an implicit low rank constraint across all of the weight matrices within the network. Furthermore, we show, both theoretically and empirically, that when training a neural network using Stochastic Gradient Descent (SGD) with a small batch size, the resulting weight matrices are expected to be of small rank. Our analysis relies on a minimal set of assumptions and the neural networks may include convolutional layers, residual connections, as well as batch normalization layers.more » « less
-
In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.more » « less
-
Machine learning models such as deep neural networks have been shown to be successful in solving a wide range of problems. Training such a model typically requires stochastic gradient descent, and the process is time-consuming and expensive in terms of computing resources. In this paper, we propose a distributed framework that supports the prioritized execution of the gradient computation. Our proposed distributed framework identifies important data points through computing or estimating the priority for each data point. We evaluate the proposed distributed framework with several machine learning models including multi-layer perceptron (MLP) and convolutional neural networks (CNN). Our experimental results show that prioritized SGD accelerates the training of machine learning models by as much as 1.6X over that of the mini-batch SGD. Further, the distributed framework scales linearly with the number of workers.more » « less
-
This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work presents a sharp analysis of: (1) mini- batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD’s final iterate. This work presents sharp finite sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate versus batch size when considering the excess risk of the final iterate of an SGD procedure. Next, this mini-batching characterization is utilized in providing a highly parallelizable SGD method that achieves the min- imax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound for model averaging (which is a communication efficient parallelization scheme) is provided. Finally, this work sheds light on fundamental differences in SGD’s behavior when dealing with mis-specified models in the non-realizable least squares problem. This paper shows that maximal stepsizes ensuring minimax risk for the mis-specified case must depend on the noise properties. The analysis tools used by this paper generalize the operator view of averaged SGD (De ́fossez and Bach, 2015) followed by developing a novel analysis in bounding these operators to char- acterize the generalization error. These techniques are of broader interest in analyzing various computational aspects of stochastic approximation.more » « less
-
Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical analysis of convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al.,2016) a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such bound always holds. Moreover, we propose an alternative convergence analysis of SGD with diminishing learning rate regime, which is results in more relaxed conditions that those in (Bottou et al.,2016). We then move on the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of diminished learning rate.more » « less