- Award ID(s): 1929955
- PAR ID: 10105883
- Date Published:
- Journal Name: Proceedings of Machine Learning Research
- Volume: 89
- ISSN: 2640-3498
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Deep neural networks have been shown to be effective adaptive beamformers for ultrasound imaging. However, when training with traditional Lp norm loss functions, model selection is difficult because lower loss values are not always associated with higher image quality. This ultimately limits the maximum achievable image quality with this approach and raises concerns about the optimization objective. To align the optimization objective with the image quality metrics of interest, we implemented a novel ultrasound-specific loss function based on the spatial lag-one coherence and signal-to-noise ratio of the delayed channel data in the short-time Fourier domain. We employed the R-Adam optimizer with Lookahead and a cyclical learning rate to make training more robust to initialization and local minima, leading to better model performance and more reliable convergence. With our custom loss function and optimization scheme, we achieved a higher contrast-to-noise ratio, a higher speckle signal-to-noise ratio, and more accurate contrast-ratio reconstruction than with previous deep learning and delay-and-sum beamforming approaches.
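As a rough illustration of the kind of coherence-based objective described above: the spatial lag-one coherence of delayed channel data is the average normalized correlation between adjacent array channels. The PyTorch sketch below computes that quantity and a simple penalty built from it; the tensor layout, function names, and the way the coherence and SNR terms enter the training loss are assumptions for illustration, not the authors' implementation.

```python
import torch

def lag_one_coherence(channel_data: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean lag-one spatial coherence of delayed channel data.

    channel_data: complex tensor of shape (batch, channels, samples), e.g. one
    frequency bin of the short-time Fourier transform of the delayed channels.
    (Illustrative layout, not the paper's.)
    """
    x = channel_data[:, :-1, :]   # channels 1 .. N-1
    y = channel_data[:, 1:, :]    # channels 2 .. N (spatial lag of one element)
    num = (x * y.conj()).sum(dim=-1).real
    den = torch.sqrt((x.abs() ** 2).sum(dim=-1) * (y.abs() ** 2).sum(dim=-1)) + eps
    return (num / den).mean()     # average over adjacent-channel pairs and batch

def coherence_penalty(pred_channels: torch.Tensor, target_loc: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the predicted data's lag-one coherence from a target value
    (hypothetical way of turning the coherence into a loss term)."""
    return (lag_one_coherence(pred_channels) - target_loc).abs()
```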
-
Beam search optimization (Wiseman and Rush, 2016) resolves many issues in neural machine translation. However, this method lacks a principled stopping criterion and does not learn how to stop during training, and in practice the model prefers longer hypotheses at test time because it uses raw scores instead of probability-based scores. We propose a novel ranking method that enables an optimal beam search stopping criterion. We further introduce a structured prediction loss function that penalizes suboptimal finished candidates produced by beam search during training. Experiments on neural machine translation with both synthetic data and real languages (German→English and Chinese→English) demonstrate that our proposed methods lead to better lengths and BLEU scores.
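For context, a minimal sketch of the scoring contrast at issue: ranking by raw cumulative score versus a probability-based (length-normalized) score, together with a bound-based stopping check that is exact only when hypotheses are ranked by raw cumulative log-probability. The paper's specific ranking method and stopping criterion are not reproduced here; the names and types below are illustrative.

```python
from typing import List, Tuple

# A hypothesis is represented here as (sum of per-token log-probabilities, length).
Hyp = Tuple[float, int]

def raw_score(logprob_sum: float, length: int) -> float:
    # Raw cumulative score; unnormalized scoring can reward sheer hypothesis length.
    return logprob_sum

def normalized_score(logprob_sum: float, length: int) -> float:
    # Probability-based ranking: average per-token log-probability.
    return logprob_sum / max(length, 1)

def can_stop(best_finished: float, beam: List[Hyp]) -> bool:
    """Bound-based stopping check: with per-token log-probabilities <= 0, a
    hypothesis' cumulative score only decreases as it grows, so its current raw
    score upper-bounds anything it can still become."""
    return all(raw_score(s, l) <= best_finished for s, l in beam)
```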
-
Existing gradient-based optimization methods update parameters locally, in a direction that minimizes the loss function. We study a different approach, symmetry teleportation, that allows parameters to travel a large distance on the loss level set in order to improve the convergence speed of subsequent steps. Teleportation exploits symmetries in the loss landscape of optimization problems. We derive loss-invariant group actions for test functions in optimization and for multi-layer neural networks, and prove a necessary condition for teleportation to improve the convergence rate. We also show that our algorithm is closely related to second-order methods. Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad for several optimization problems, including test functions, multi-layer regressions, and MNIST classification.
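A minimal sketch of a loss-invariant group action for a two-layer linear model, the simplest instance of the symmetries teleportation exploits (the paper derives analogous actions for multi-layer and nonlinear networks). The dimensions and matrices below are arbitrary and purely illustrative.

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 5)            # inputs
Y = torch.randn(32, 3)            # targets
W1 = torch.randn(8, 5)            # first-layer weights
W2 = torch.randn(3, 8)            # second-layer weights

def loss(W1: torch.Tensor, W2: torch.Tensor) -> torch.Tensor:
    return ((W2 @ W1 @ X.T - Y.T) ** 2).mean()

# Group action (W1, W2) -> (G @ W1, W2 @ G^{-1}) leaves the product W2 @ W1,
# and therefore the loss, unchanged: the parameters move along a level set.
G = torch.randn(8, 8) + 8 * torch.eye(8)      # a generically invertible matrix
W1_t = G @ W1
W2_t = W2 @ torch.linalg.inv(G)

print(loss(W1, W2).item(), loss(W1_t, W2_t).item())   # equal up to numerical error
# Teleportation would search over G for the point on this level set from which
# the next gradient step makes the most progress.
```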
-
In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with the Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD for optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analyses of smooth functions is gradient Lipschitzness, which ensures that gradients on the local models do not drift far from the gradient on the averaged model. However, this desirable property does not hold in networks with the non-smooth ReLU activation function. We show that, even though a ReLU network does not admit gradient Lipschitzness, the difference between gradients on the local models and the averaged model does not change too much under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and sheds light on the optimization theory of federated training of deep neural networks.
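For readers unfamiliar with the algorithm being analyzed, here is a minimal sketch of one Local SGD (FedAvg) communication round in PyTorch, assuming a plain ReLU network without normalization buffers; the loader handling, step count, and learning rate are placeholders, not the paper's experimental setup.

```python
import copy
import torch
import torch.nn as nn

def local_sgd_round(global_model: nn.Module, worker_loaders, local_steps=10, lr=0.01):
    """One communication round of Local SGD / FedAvg: every worker takes a few
    SGD steps starting from the shared parameters, then the results are averaged."""
    loss_fn = nn.CrossEntropyLoss()
    worker_states = []
    for loader in worker_loaders:
        model = copy.deepcopy(global_model)          # start from the averaged model
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for step, (x, y) in enumerate(loader):
            if step >= local_steps:
                break
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        worker_states.append(model.state_dict())
    # Parameter averaging is the only communication between workers.
    averaged = {k: torch.stack([s[k] for s in worker_states]).mean(dim=0)
                for k in worker_states[0]}
    global_model.load_state_dict(averaged)
    return global_model
```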
-
Deep learning architectures are usually proposed with millions of parameters, resulting in memory issues when training deep neural networks with stochastic gradient descent type methods using large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. This regularizer stabilizes the gradient, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to vanilla SGD even with small batch sizes. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
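A minimal sketch of the general idea, namely adding a proximal term to the mini-batch loss around a periodically refreshed anchor point; the refresh schedule, proximal weight, and function names are illustrative assumptions. The authors' actual algorithm is in the linked repository.

```python
import torch

def proximal_loss(model, anchor_params, base_loss, prox_weight=0.1):
    """Augment the mini-batch loss with (prox_weight / 2) * ||w - w_anchor||^2,
    which makes the local objective strongly convex and damps gradient noise."""
    reg = sum(0.5 * prox_weight * (p - a).pow(2).sum()
              for p, a in zip(model.parameters(), anchor_params))
    return base_loss + reg

def train(model, loader, loss_fn, epochs=1, lr=1e-3, refresh_every=10, prox_weight=0.1):
    """Small-batch training loop: the anchor tracks the current iterate and is
    refreshed every few steps (schedule and weights are placeholders)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    anchor = [p.detach().clone() for p in model.parameters()]
    for _ in range(epochs):
        for step, (x, y) in enumerate(loader):
            if step % refresh_every == 0:
                anchor = [p.detach().clone() for p in model.parameters()]
            opt.zero_grad()
            loss = proximal_loss(model, anchor, loss_fn(model(x), y), prox_weight)
            loss.backward()
            opt.step()
```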