Search for: All records

Creators/Authors contains: "Osher, S"

A deterministic gradient-based approach to avoid saddle points

https://doi.org/10.1017/S0956792522000316

Kreusser, L. M.; Osher, S. J.; Wang, B. (December 2022, European Journal of Applied Mathematics)

Loss functions with a large number of saddle points are one of the major obstacles for training modern machine learning (ML) models efficiently. First-order methods such as gradient descent (GD) are usually the methods of choice for training ML models. However, these methods converge to saddle points for certain choices of initial guesses. In this paper, we propose a modification of the recently proposed Laplacian smoothing gradient descent (LSGD) [Osher et al., arXiv:1806.06317], called modified LSGD (mLSGD), and demonstrate its potential to avoid saddle points without sacrificing the convergence rate. Our analysis is based on the attraction region, formed by all starting points for which the considered numerical scheme converges to a saddle point. We investigate the attraction region’s dimension both analytically and numerically. For a canonical class of quadratic functions, we show that the dimension of the attraction region for mLSGD is $\lfloor (n-1)/2\rfloor$, and hence it is significantly smaller than that of GD, whose dimension is $n-1$.
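The statement that GD converges to a saddle point for certain initial guesses can be illustrated with a minimal sketch. The toy quadratic f(x, y) = (x^2 - y^2)/2, the step size, the iteration count, and all function names are assumptions chosen for illustration and are not taken from the paper; the sketch covers plain GD only, not LSGD or mLSGD.

```python
def grad(point):
    """Gradient of the assumed toy quadratic f(x, y) = (x**2 - y**2) / 2."""
    x, y = point
    return (x, -y)


def gradient_descent(start, step=0.1, iters=200):
    """Plain gradient descent with a fixed step size."""
    x, y = start
    for _ in range(iters):
        gx, gy = grad((x, y))
        x, y = x - step * gx, y - step * gy
    return x, y


# Starting exactly on the x-axis, the iterates are drawn into the saddle at the origin.
on_axis = gradient_descent((1.0, 0.0))

# A tiny offset in y is amplified at every step, so the iterates leave the saddle.
off_axis = gradient_descent((1.0, 1e-6))
```

For this two-dimensional example the starting points that end at the saddle form the x-axis, a set of dimension 1, which matches the $n-1$ count quoted for GD with $n = 2$.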
Full Text Available
A Neural Network Approach Applied to Multi-Agent Optimal Control

Onken, D; Nurbekyan, L; Li, Xingjian; Wu Fung, S; Osher, S; Ruthotto, L (January 2021, European Control Conference)
We propose a neural network approach for solving high-dimensional optimal control problems. In particular, we focus on multi-agent control problems with obstacle and collision avoidance. These problems immediately become high-dimensional, even for moderate phase-space dimensions per agent. Our approach fuses the Pontryagin Maximum Principle and Hamilton-Jacobi-Bellman (HJB) approaches and parameterizes the value function with a neural network. Our approach yields controls in a feedback form for quick calculation and robustness to moderate disturbances to the system. We train our model using the objective function and optimality conditions of the control problem. Therefore, our training algorithm neither involves a data generation phase nor solutions from another algorithm. Our model uses empirically effective HJB penalizers for efficient training. By training on a distribution of initial states, we ensure the controls' optimality is achieved on a large portion of the state-space. Our approach is grid-free and scales efficiently to dimensions where grids become impractical or infeasible. We demonstrate our approach's effectiveness on a 150-dimensional multi-agent problem with obstacles.
Full Text Available
Understanding straight-through estimator in training activation quantized neural nets

Yin, P; Lyu, J; Zhang, S; Osher, S; Qi, Y-Y; Xin, J (April 2019, International Conference on Learning Representations)

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the “gradient” through the modified chain rule becomes non-trivial. Since this unusual “gradient” is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual “gradient” given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
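The idea of using an STE in the backward pass only can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and taking the derivative of the ordinary ReLU as the surrogate is one possible choice of STE assumed here.

```python
def binarized_relu(x):
    """Forward pass: step activation, 1 where the input is positive, else 0."""
    return [1.0 if v > 0 else 0.0 for v in x]


def true_backward(x, upstream):
    """Exact chain rule: the step function is flat almost everywhere,
    so every upstream gradient is multiplied by zero."""
    return [0.0 for _ in x]


def ste_backward(x, upstream):
    """STE backward pass (assumed surrogate): replace the zero derivative
    of the step function by the derivative of ReLU, i.e. 1 where x > 0."""
    return [g * (1.0 if v > 0 else 0.0) for v, g in zip(x, upstream)]


inputs = [-1.0, 0.5]
upstream_grad = [2.0, 3.0]
activations = binarized_relu(inputs)                # forward value
exact = true_backward(inputs, upstream_grad)        # vanishes everywhere
coarse = ste_backward(inputs, upstream_grad)        # non-trivial "coarse gradient"
```

The forward value is unchanged by the STE; only the backward signal differs, which is what makes the resulting "gradient" a coarse gradient rather than the gradient of the loss.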
Full Text Available
Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks

https://doi.org/10.1007/s40687-018-0177-6

Yin, P; Zhang, S; Lyu, L; Osher, S; Qi, Y-Y; Xin, J (January 2019, Research in the Mathematical Sciences)

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piece-wise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-width such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we present a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
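The blended update described in this abstract (a coarse gradient correction applied to a weighted average of the float weights and their quantization) might look roughly like the sketch below. The scaled-sign binarizer, the blending weight `rho`, the learning rate `lr`, the choice to evaluate the coarse gradient at the quantized weights, and all function names are assumptions for illustration; the coarse gradient itself is supplied by the caller.

```python
def binarize(w):
    """Assumed quantizer: sign of each weight, scaled by the mean magnitude."""
    scale = sum(abs(v) for v in w) / len(w)
    return [scale * (1.0 if v >= 0 else -1.0) for v in w]


def bcgd_step(w, coarse_grad, rho=0.1, lr=0.01):
    """One blended step: average float and quantized weights, then correct
    with a coarse gradient evaluated at the quantized weights (assumption)."""
    q = binarize(w)
    g = coarse_grad(q)
    return [(1.0 - rho) * wi + rho * qi - lr * gi
            for wi, qi, gi in zip(w, q, g)]


# With a zero coarse gradient, the step reduces to pure blending.
weights = [0.5, -1.5]
blended = bcgd_step(weights, lambda q: [0.0, 0.0], rho=0.5)
```

Setting `rho` to zero in this sketch removes the blending and leaves a plain coarse gradient step on the float weights.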
Full Text Available
Modeling Environmental Crime in Protected Areas Using the Level Set Method

https://doi.org/10.1137/18M1205339

Arnold, D. J.; Fernandez, D.; Jia, R.; Parkinson, C.; Tonne, D.; Yaniv, Y.; Bertozzi, A. L.; Osher, S. J. (January 2019, SIAM Journal on Applied Mathematics)

Full Text Available
A parallel method for earth mover’s distance

Gangbo, W.; Li, W.; Osher, S.; Ryu, E.; Yin, W. (January 2018, Journal of Scientific Computing)

We propose a new algorithm to approximate the Earth Mover's distance (EMD). Our main idea is motivated by the theory of optimal transport, in which EMD can be reformulated as a familiar L1 type minimization. We use a regularization which gives us a unique solution for this L1 type problem. The new regularized minimization is very similar to problems which have been solved in the fields of compressed sensing and image processing, where several fast methods are available. In this paper, we adopt a primal-dual algorithm designed there, which uses very simple updates at each iteration and is shown to converge very rapidly. Several numerical examples are provided.
Full Text Available
BinaryRelax: A Relaxation Approach for Training Deep Neural Networks with Quantized Weights

https://doi.org/10.1137/18M1166134

Yin, P; Zhang, S; Lyu, J; Osher, S; Qi, Y-Y; Xin, J (January 2018, SIAM Journal on Imaging Sciences)

We propose BinaryRelax, a simple two-phase algorithm for training deep neural networks with quantized weights. The set constraint that characterizes the quantization of weights is not imposed until the late stage of training, and a sequence of pseudo quantized weights is maintained. Specifically, we relax the hard constraint into a continuous regularizer via the Moreau envelope, which turns out to be the squared Euclidean distance to the set of quantized weights. The pseudo quantized weights are obtained by linearly interpolating between the float weights and their quantizations. A continuation strategy is adopted to push the weights towards the quantized state by gradually increasing the regularization parameter. In the second phase, an exact quantization scheme with a small learning rate is invoked to guarantee fully quantized weights. We test BinaryRelax on the benchmark CIFAR and ImageNet color image datasets to demonstrate the superiority of the relaxed quantization approach and the improved accuracy over the state-of-the-art training methods. Finally, we prove the convergence of BinaryRelax under an approximate orthogonality condition.
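The interpolation and continuation steps in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the scaled-sign binarizer, the interpolation formula (w + lam * Q(w)) / (1 + lam), the growth schedule for `lam`, and the function names are all chosen here for illustration, and the gradient updates of the float weights are omitted.

```python
def binarize(w):
    """Assumed quantizer: sign of each weight, scaled by the mean magnitude."""
    scale = sum(abs(v) for v in w) / len(w)
    return [scale * (1.0 if v >= 0 else -1.0) for v in w]


def pseudo_quantize(w, lam):
    """Linear interpolation between float weights and their quantization.
    lam = 0 returns the float weights; large lam approaches binarize(w)."""
    q = binarize(w)
    return [(wi + lam * qi) / (1.0 + lam) for wi, qi in zip(w, q)]


# Continuation: increase the regularization parameter step by step so the
# pseudo quantized weights drift towards the quantized state.
weights = [0.5, -1.5]
lam = 1.0
trajectory = []
for _ in range(5):
    trajectory.append(pseudo_quantize(weights, lam))
    lam *= 10.0  # assumed growth factor
```

In this sketch the final, exact quantization phase would correspond to replacing `pseudo_quantize` by `binarize` once `lam` has grown large.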
Full Text Available