Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the first-order stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds an ϵ-approximate local minimum of bilevel optimization in $$\tilde O(\epsilon^{-2})$$ iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems. 
                        more » 
                        « less   
                    This content will become publicly available on January 1, 2026
                            
                            Efficiently Escaping Saddle Points in Bilevel Optimization
                        
                    
    
            Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the firstorder stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds an -approximate local minimum of bilevel optimization in ̃O(−2) iterations with high probability. Moreover, we propose an inexact NEgativecurvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10592193
- Publisher / Repository:
- JMLR
- Date Published:
- Journal Name:
- Journal of machine learning research
- Volume:
- 26
- Issue:
- 1
- ISSN:
- 1533-7928
- Page Range / eLocation ID:
- 1-61
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks, variational Bayesian inference, etc.). Despite its empirical success, there is still a lack of theoretical understanding of convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that the momentum helps escape from saddle points but hurts the convergence within the neighborhood of optima (if without the step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.more » « less
- 
            In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an $$\mathcal{O}(\epsilon^{-1.5})$$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires $$\mathcal{O}(\epsilon^{-1.5})$$ iterations (each using $$\mathcal{O}(1)$$ samples and only first-order gradient information) to find an $$\epsilon$$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an $$\mathcal{O}(\epsilon^{-1.5})$$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.more » « less
- 
            Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called geometric step decay and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of sharp convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp weakly convex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear (or nearly linear) rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks—phase retrieval and blind deconvolution—and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.more » « less
- 
            Abstract We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems, and identify the existence of ’generalized quasi-gradients’. Whenever these quasi-gradients exist, a large family of no-regret algorithms are guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any first-order stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $$\epsilon < 1/3$$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With carefully designed initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasi-gradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $$O(\sqrt{\epsilon })$$ to the optimal rate of $$O(\epsilon )$$ for small $$\epsilon $$ assuming certified hypercontractivity. For mean estimation with near-identity covariance, we show that a simple gradient descent algorithm achieves breakdown point $1/3$ and iteration complexity $$\tilde{O}(d/\epsilon ^2)$$.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
