skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Achieving O(\epsilon^{-1.5}) Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization
In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an $$\mathcal{O}(\epsilon^{-1.5})$$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires $$\mathcal{O}(\epsilon^{-1.5})$$ iterations (each using $$\mathcal{O}(1)$$ samples and only first-order gradient information) to find an $$\epsilon$$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an $$\mathcal{O}(\epsilon^{-1.5})$$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.  more » « less
Award ID(s):
2311274
PAR ID:
10505562
Author(s) / Creator(s):
; ;
Publisher / Repository:
Advances in Neural Information Processing Systems
Date Published:
Format(s):
Medium: X
Location:
Advances in Neural Information Processing Systems
Sponsoring Org:
National Science Foundation
More Like this
  1. This work proposes a new algorithm – the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN) –for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly-convex and the upper level objective function is smooth. Unlike prior works which rely on two-timescale or double loop techniques, we design a stochastic momentum-assisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solution to both subproblems. If the upper objective function is smooth but possibly non-convex, we show that SUSTAIN requires $${O}(\epsilon^{-3/2})$$ iterations (each using $O(1)$ samples) to find an $$\epsilon$$-stationary solution. The $$\epsilon$$-stationary solution is defined as the point whose squared norm of the gradient of the outer function is less than or equal to $$\epsilon$$. The total number of stochastic gradient samples required for the upper and lower level objective functions match the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper level objective function is strongly-convex. 
    more » « less
  2. Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the first-order stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds an Ο΅-approximate local minimum of bilevel optimization in $$\tilde O(\epsilon^{-2})$$ iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems. 
    more » « less
  3. We study the problem of finding an πœ€-first-order stationary point (FOSP) of a smooth function, given access only to gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O(πœ€βˆ’7/4). In this work, we propose a method with a gradient complexity of O(𝑑1/4πœ€βˆ’13/8), where 𝑑 is the problem dimension, leading to an improved complexity when 𝑑 = O(πœ€βˆ’1/2). To achieve this result, we design an optimization algorithm that, underneath, involves solving two online learning problems. Specifically, we first reformulate the task of finding a stationary point for a nonconvex problem as minimizing the regret in an online convex optimization problem, where the loss is determined by the gradient of the objective function. Then, we introduce a novel optimistic quasi-Newton method to solve this online learning problem, with the Hessian approximation update itself framed as an online learning problem in the space of matrices. Beyond improving the complexity bound for achieving an πœ€-FOSP using a gradient oracle, our result provides the first guarantee suggesting that quasi-Newton methods can potentially outperform gradient descent-type methods in nonconvex settings. 
    more » « less
  4. We consider the problem of finding stationary points in Bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lower-level solutions $y^*(x)$ in response to the changes in the upper-level variables $$x$$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lower-level solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$-aware, that returns an $$O(\epsilon)$$-estimate of the lower-level solution, in addition to first-order gradient estimators {\it locally unbiased} within the $$\Theta(\epsilon)$$-ball around $y^*(x)$. We study the complexity of finding stationary points with such an $y^*$-aware oracle: we propose a simple first-order method that converges to an $$\epsilon$$ stationary point using $$O(\epsilon^{-6}), O(\epsilon^{-4})$$ access to first-order $y^*$-aware oracles. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by $$O(\epsilon)$$ with minimal assumptions. We then provide the matching $$\Omega(\epsilon^{-6})$$, $$\Omega(\epsilon^{-4})$$ lower bounds without and with an additional smoothness assumption on $y^*$-aware oracles, respectively. Our results imply that any approach that simulates an algorithm with an $y^*$-aware oracle must suffer the same lower bounds. 
    more » « less
  5. We consider the problem of finding stationary points in Bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lower-level solutions $y^*(x)$ in response to the changes in the upper-level variables $$x$$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lower-level solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$-aware, that returns an $$O(\epsilon)$$-estimate of the lower-level solution, in addition to first-order gradient estimators locally unbiased within the $$\Theta(\epsilon)$$-ball around $y^*(x)$. We study the complexity of finding stationary points with such an $y^*$-aware oracle: we propose a simple first-order method that converges to an $$\epsilon$$ stationary point using $$O(\epsilon^{-6}), O(\epsilon^{-4})$$ access to first-order $y^*$-aware oracles. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by $$O(\epsilon)$$ with minimal assumptions. We then provide the matching $$\Omega(\epsilon^{-6})$$, $$\Omega(\epsilon^{-4})$$ lower bounds without and with an additional smoothness assumption, respectively. Our results imply that any approach that simulates an algorithm with an $y^*$-aware oracle must suffer the same lower bounds. 
    more » « less