We consider the problem of finding stationary points in Bilevel optimization when the lowerlevel problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lowerlevel solutions $y^*(x)$ in response to the changes in the upperlevel variables $x$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lowerlevel solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$aware, that returns an $O(\epsilon)$estimate of the lowerlevel solution, in addition to firstorder gradient estimators {\it locally unbiased} within the $\Theta(\epsilon)$ball around $y^*(x)$. We study the complexity of finding stationary points with such an $y^*$aware oracle: we propose a simple firstorder method that converges to an $\epsilon$ stationary point using $O(\epsilon^{6}), O(\epsilon^{4})$ access to firstorder $y^*$aware oracles. Our upper bounds also apply to standard unbiased firstorder oracles, improving the bestknown complexity of firstorder methods by $O(\epsilon)$ with minimal assumptions. We then provide the matching $\Omega(\epsilon^{6})$, $\Omega(\epsilon^{4})$ lower bounds without and with an additional smoothness assumption on $y^*$aware oracles, respectively. Our results imply that any approach that simulates an algorithm with an $y^*$aware oracle must suffer the same lower bounds.
more »
« less
This content will become publicly available on January 16, 2025
On Penalty Methods for Nonconvex Bilevel Optimization and FirstOrder Stochastic Approximation
In this work, we study firstorder algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper and lowerlevel objectives are combined in a weighted sum with penalty parameter
. In particular, we establish a strong connection between the penalty function and the hyperobjective by explicitly characterizing the conditions under which the values and derivatives of the two must be
close. A byproduct of our analysis is the explicit formula for the gradient of hyperobjective when the lowerlevel problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as
approximation of the original BO, we propose firstorder algorithms that find an
stationary solution by optimizing the penalty formulation with
. When the perturbed lowerlevel problem uniformly satisfies the {\it smallerror} proximal errorbound (EB) condition, we propose a firstorder algorithm that converges to an
stationary point of the penalty function using in total
accesses to firstorder stochastic gradient oracles. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully {\it singleloop} manner, {\it i.e.,} with
samples per iteration, and achieves the improved oraclecomplexity of
.
more »
« less
 Award ID(s):
 2023239
 NSFPAR ID:
 10533142
 Publisher / Repository:
 International Conference on Learning Representations
 Date Published:
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We consider the problem of finding stationary points in Bilevel optimization when the lowerlevel problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lowerlevel solutions $y^*(x)$ in response to the changes in the upperlevel variables $x$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lowerlevel solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$aware, that returns an $O(\epsilon)$estimate of the lowerlevel solution, in addition to firstorder gradient estimators locally unbiased within the $\Theta(\epsilon)$ball around $y^*(x)$. We study the complexity of finding stationary points with such an $y^*$aware oracle: we propose a simple firstorder method that converges to an $\epsilon$ stationary point using $O(\epsilon^{6}), O(\epsilon^{4})$ access to firstorder $y^*$aware oracles. Our upper bounds also apply to standard unbiased firstorder oracles, improving the bestknown complexity of firstorder methods by $O(\epsilon)$ with minimal assumptions. We then provide the matching $\Omega(\epsilon^{6})$, $\Omega(\epsilon^{4})$ lower bounds without and with an additional smoothness assumption, respectively. Our results imply that any approach that simulates an algorithm with an $y^*$aware oracle must suffer the same lower bounds.more » « less

This work proposes a new algorithm – the Singletimescale Doublemomentum Stochastic Approximation (SUSTAIN) –for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is stronglyconvex and the upper level objective function is smooth. Unlike prior works which rely on twotimescale or double loop techniques, we design a stochastic momentumassisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solution to both subproblems. If the upper objective function is smooth but possibly nonconvex, we show that SUSTAIN requires ${O}(\epsilon^{3/2})$ iterations (each using $O(1)$ samples) to find an $\epsilon$stationary solution. The $\epsilon$stationary solution is defined as the point whose squared norm of the gradient of the outer function is less than or equal to $\epsilon$. The total number of stochastic gradient samples required for the upper and lower level objective functions match the bestknown complexity for singlelevel stochastic gradient algorithms. We also analyze the case when the upper level objective function is stronglyconvex.more » « less

In this paper, we revisit the bilevel optimization problem, in which the upperlevel objective function is generally nonconvex and the lowerlevel objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an $\mathcal{O}(\epsilon^{1.5})$ sample complexity in Hessian/Jacobianfree stochastic bilevel optimization without any secondorder derivative computation. To fill this gap, we propose a novel Hessian/Jacobianfree bilevel optimizer named FdeHBO, which features a simple fully singleloop structure, a projectionaided finitedifference Hessian/Jacobianvector approximation, and momentumbased updates. Theoretically, we show that FdeHBO requires $\mathcal{O}(\epsilon^{1.5})$ iterations (each using $\mathcal{O}(1)$ samples and only firstorder gradient information) to find an $\epsilon$accurate stationary point. As far as we know, this is the first Hessian/Jacobianfree method with an $\mathcal{O}(\epsilon^{1.5})$ sample complexity for nonconvexstronglyconvex stochastic bilevel optimization.more » « less

We consider the problem of subspace clustering with data that is potentially corrupted by both dense noise and sparse gross errors. In particular, we study a recently proposed low rank subspace clustering approach based on a nonconvex modeling formulation. This formulation includes a nonconvex spectral function in the objective function that makes the optimization task challenging, e.g., it is unknown whether the alternating direction method of multipliers (ADMM) framework proposed to solve the nonconvex model formulation is provably convergent. In this paper, we establish that the spectral function is differentiable and give a formula for computing the derivative. Moreover, we show that the derivative of the spectral function is Lipschitz continuous and provide an explicit value for the Lipschitz constant. These facts are then used to provide a lower bound for how the penalty parameter in the ADMM method should be chosen. As long as the penalty parameter is chosen according to this bound, we show that the ADMM algorithm computes iterates that have a limit point satisfying firstorder optimality conditions. We also present a second strategy for solving the nonconvex problem that is based on proximal gradient calculations. The convergence and performance of the algorithms is verified through experiments on real data from face and digit clustering and motion segmentation.more » « less