In this work, we study the convergence in high probability
of clipped gradient methods when the noise distribution has heavy
tails, i.e., with bounded $p$th moments, for some $1< p \leq 2$. Prior
works in this setting follow the same recipe: concentration
inequalities combined with an inductive argument and a union bound to control the
iterates across all iterations. This approach inflates
the failure probability by a factor of $T$, where $T$ is the
number of iterations. We instead propose a new analysis based
on bounding the moment generating function of a well-chosen supermartingale
sequence. We improve the dependency on $T$ in the convergence guarantee
for a wide range of algorithms with clipped gradients, including stochastic
(accelerated) mirror descent for convex objectives and stochastic
gradient descent for nonconvex objectives. Our high probability bounds
achieve the optimal convergence rates and match the best currently
known in-expectation bounds. Our approach naturally allows the algorithms
to use time-varying step sizes and clipping parameters when the time
horizon is unknown, which appears difficult, or even impossible, with
existing techniques. Furthermore, we show that in
the case of clipped stochastic mirror descent, several problem constants,
including the initial distance to the optimum, are not required when
setting step sizes and clipping parameters.
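To make the setup concrete, here is a minimal sketch of clipped SGD with time-varying step sizes and clipping levels. The noisy-gradient oracle `stochastic_grad` and the power-law schedules are illustrative assumptions, not the tuned choices analyzed in the paper.

```python
import numpy as np

def clipped_sgd(stochastic_grad, x0, T, a=1.0, c=1.0):
    """Sketch of SGD with gradient clipping and time-varying parameters.

    `stochastic_grad(x)` is an assumed oracle returning a noisy gradient;
    the sqrt(t) schedules below are placeholders for illustration only.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = stochastic_grad(x)
        step = a / np.sqrt(t)           # decaying step size
        clip = c * np.sqrt(t)           # growing clipping level
        g_norm = np.linalg.norm(g)
        if g_norm > clip:
            g = g * (clip / g_norm)     # rescale so that ||g|| <= clip
        x = x - step * g
    return x
```

Because both schedules depend only on the current iteration counter $t$, the loop runs without knowing the horizon $T$ in advance, which is the regime the time-varying analysis targets.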
Plug-and-Play (PnP) methods are efficient iterative algorithms for solving ill-posed image inverse problems. PnP methods are obtained by replacing the proximal operator or the gradient-descent step within proximal algorithms with deep Gaussian denoisers. Current PnP schemes rely on data-fidelity terms that have either Lipschitz gradients or closed-form proximal operators, neither of which holds for Poisson inverse problems. Based on the observation that Gaussian noise is not an adequate noise model in this setting, we propose to generalize PnP using the Bregman Proximal Gradient (BPG) method. BPG replaces the Euclidean distance with a Bregman divergence that can better capture the smoothness properties of the problem. We introduce the Bregman Score Denoiser, specifically parametrized and trained for the new Bregman geometry, and prove that it corresponds to the proximal operator of a nonconvex potential. We propose two PnP algorithms based on the Bregman Score Denoiser for solving Poisson inverse problems. Extending the convergence results of BPG to the nonconvex setting, we show that the proposed methods converge to stationary points of an explicit global functional. Experimental evaluations on various Poisson inverse problems validate the convergence results and showcase effective restoration performance.
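For intuition, the sketch below implements a generic PnP iteration built on a Bregman (mirror) gradient step with the Burg entropy $h(x) = -\sum_i \log x_i$, a common potential for Poisson likelihoods. Here `grad_f` and `denoiser` are assumed placeholders (the latter standing in for the trained Bregman Score Denoiser); this is not the paper's exact algorithm.

```python
import numpy as np

def pnp_bpg(grad_f, denoiser, x0, tau=0.5, n_iters=100, eps=1e-8):
    """Sketch of a Plug-and-Play Bregman gradient iteration.

    With the Burg entropy h(x) = -sum(log x), we have grad_h(x) = -1/x
    and its inverse map y -> -1/y, so iterates stay in the positive
    orthant, as required by a Poisson likelihood. `denoiser` is a
    placeholder for the Bregman Score Denoiser.
    """
    x = np.clip(np.asarray(x0, dtype=float), eps, None)
    for _ in range(n_iters):
        y = -1.0 / x - tau * grad_f(x)     # mirror step: grad_h(x) - tau * grad_f(x)
        y = np.minimum(y, -eps)            # keep y negative so the inverse map is valid
        x = -1.0 / y                       # map back: x = (grad_h)^{-1}(y) > 0
        x = denoiser(x)                    # plug-and-play denoising step
    return x
```

The design point this illustrates is that the Bregman geometry, rather than a Lipschitz-gradient assumption on the data-fidelity term, keeps the update well defined for Poisson-type objectives.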
We present a stochastic descent algorithm for unconstrained optimization that is
particularly efficient when the objective function is slow to evaluate and gradients
are not easily obtained, as in some PDE-constrained optimization and machine
learning problems. The algorithm maps the gradient onto a low-dimensional random
subspace at each iteration, similar to coordinate descent but
without restricting directional derivatives to be along the axes. Without requiring
a full gradient, this mapping can be performed by computing directional derivatives
(e.g., via forward-mode automatic differentiation). We give proofs of convergence
in expectation under various convexity assumptions, as well as probabilistic
convergence results under strong convexity. Our method extends
the well-known Gaussian smoothing technique to descent in subspaces of dimension
greater than one, opening the door to new analyses of Gaussian smoothing
when more than one directional derivative is used at each iteration. We also provide
a finite-dimensional variant of a special case of the Johnson–Lindenstrauss lemma.
Experimentally, we show that our method compares favorably to coordinate descent,
Gaussian smoothing, gradient descent, and BFGS (when gradients are calculated via
forward-mode automatic differentiation) on problems from the machine learning
and shape optimization literature.
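A minimal sketch of one such iteration follows, with finite differences standing in for forward-mode automatic differentiation. The subspace dimension `ell` and all names are illustrative assumptions; the $1/\sqrt{\ell}$ Gaussian scaling makes $P P^\top \nabla f(x)$ an unbiased estimate of the gradient.

```python
import numpy as np

def subspace_descent(f, x0, n_iters=500, ell=5, alpha=0.1, h=1e-6):
    """Sketch of stochastic descent in a random subspace.

    Each iteration draws an n x ell Gaussian matrix P and estimates the
    projection of the gradient onto its column span from ell directional
    derivatives. Finite differences stand in here for forward-mode AD.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(n_iters):
        P = np.random.randn(n, ell) / np.sqrt(ell)  # random directions, E[P P^T] = I
        fx = f(x)
        # directional derivatives d_j ~ grad_f(x)^T p_j, one per column of P
        d = np.array([(f(x + h * P[:, j]) - fx) / h for j in range(ell)])
        g = P @ d                                   # subspace gradient estimate
        x = x - alpha * g
    return x
```

With `ell = 1` this reduces to a Gaussian-smoothing-style update along a single random direction, which is the special case the abstract says the method generalizes.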
Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications, such as regularized empirical risk minimization and training deep neural networks. The classical analysis of the convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al., 2016), a new analysis of the convergence of SGD is performed under the assumption that the stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such a bound always holds. Moreover, we propose an alternative convergence analysis of SGD in the diminishing learning rate regime, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of a diminishing learning rate.
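As a point of reference, here is a minimal sketch of SGD with a diminishing learning rate. The per-sample gradient oracle `grad_i` and the $O(1/t)$ schedule are generic assumptions for illustration, not the specific step-size conditions derived in the analysis.

```python
import numpy as np

def sgd_diminishing(grad_i, n_samples, x0, n_epochs=10, gamma0=0.5, c=1.0):
    """Sketch of SGD with a diminishing learning rate gamma_t = gamma0 / (1 + c*t).

    `grad_i(x, i)` is an assumed oracle for the gradient of the i-th
    sample's loss; a single-threaded stand-in for the setting studied
    (the lock-free Hogwild! variant runs this loop in parallel).
    """
    x = np.asarray(x0, dtype=float)
    t = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(n_samples):
            gamma = gamma0 / (1.0 + c * t)   # diminishing step size
            x = x - gamma * grad_i(x, i)
            t += 1
    return x
```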