skip to main content


Title: Linearly Constrained Bilevel Optimization: A Smoothed Implicit Gradient Approach
This work develops analysis and algorithms for solving a class of bilevel optimization problems where the lower-level (LL) problems have linear constraints. Most of the existing approaches for constrained bilevel problems rely on value function-based approximate reformulations, which suffer from issues such as non-convex and non-differentiable constraints. In contrast, in this work, we develop an implicit gradient-based approach, which is easy to implement, and is suitable for machine learning applications. We first provide an in-depth understanding of the problem, by showing that the implicit objective for such problems is in general non-differentiable. However, if we add some small (linear) perturbation to the LL objective, the resulting implicit objective becomes differentiable almost surely. This key observation opens the door for developing (deterministic and stochastic) gradient-based algorithms similar to the state-of-the-art ones for unconstrained bi-level problems. We show that when the implicit function is assumed to be stronglyconvex, convex, and weakly-convex, the resulting algorithms converge with guaranteed rate. Finally, we experimentally corroborate the theoretical findings and evaluate the performance of the proposed framework on numerical and adversarial learning problems.  more » « less
Award ID(s):
1910385
NSF-PAR ID:
10466725
Author(s) / Creator(s):
Publisher / Repository:
International Conference on Machine Learning
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This work proposes a new algorithm – the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN) –for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly-convex and the upper level objective function is smooth. Unlike prior works which rely on two-timescale or double loop techniques, we design a stochastic momentum-assisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solution to both subproblems. If the upper objective function is smooth but possibly non-convex, we show that SUSTAIN requires ${O}(\epsilon^{-3/2})$ iterations (each using $O(1)$ samples) to find an $\epsilon$-stationary solution. The $\epsilon$-stationary solution is defined as the point whose squared norm of the gradient of the outer function is less than or equal to $\epsilon$. The total number of stochastic gradient samples required for the upper and lower level objective functions match the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper level objective function is strongly-convex. 
    more » « less
  2. Bilevel optimization (BO) is useful for solving a variety of important machine learning problems including but not limited to hyperparameter optimization, meta- learning, continual learning, and reinforcement learning. Conventional BO methods need to differentiate through the low-level optimization process with implicit dif- ferentiation, which requires expensive calculations related to the Hessian matrix. There has been a recent quest for first-order methods for BO, but the methods pro- posed to date tend to be complicated and impractical for large-scale deep learning applications. In this work, we propose a simple first-order BO algorithm that de- pends only on first-order gradient information, requires no implicit differentiation, and is practical and efficient for large-scale non-convex functions in deep learning. We provide a non-asymptotic convergence analysis of the proposed method to stationary points for non-convex objectives and present empirical results that show its superior practical performance. 
    more » « less
  3. Ruiz, Francisco ; Dy, Jennifer ; van de Meent, Jan-Willem (Ed.)
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${O}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${O}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the Holderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems. 
    more » « less
  4. null (Ed.)
    Abstract In this paper we consider large-scale smooth optimization problems with multiple linear coupled constraints. Due to the non-separability of the constraints, arbitrary random sketching would not be guaranteed to work. Thus, we first investigate necessary and sufficient conditions for the sketch sampling to have well-defined algorithms. Based on these sampling conditions we develop new sketch descent methods for solving general smooth linearly constrained problems, in particular, random sketch descent (RSD) and accelerated random sketch descent (A-RSD) methods. To our knowledge, this is the first convergence analysis of RSD algorithms for optimization problems with multiple non-separable linear constraints. For the general case, when the objective function is smooth and non-convex, we prove for the non-accelerated variant sublinear rate in expectation for an appropriate optimality measure. In the smooth convex case, we derive for both algorithms, non-accelerated and A-RSD, sublinear convergence rates in the expected values of the objective function. Additionally, if the objective function satisfies a strong convexity type condition, both algorithms converge linearly in expectation. In special cases, where complexity bounds are known for some particular sketching algorithms, such as coordinate descent methods for optimization problems with a single linear coupled constraint, our theory recovers the best known bounds. Finally, we present several numerical examples to illustrate the performances of our new algorithms. 
    more » « less
  5. This work concerns the local convergence theory of Newton and quasi-Newton methods for convex-composite optimization: where one minimizes an objective that can be written as the composition of a convex function with one that is continuiously differentiable. We focus on the case in which the convex function is a potentially infinite-valued piecewise linear-quadratic function. Such problems include nonlinear programming, mini-max optimization, and estimation of nonlinear dynamics with non-Gaussian noise as well as many modern approaches to large-scale data analysis and machine learning. Our approach embeds the optimality conditions for convex-composite optimization problems into a generalized equation. We establish conditions for strong metric subregularity and strong metric regularity of the corresponding set-valued mappings. This allows us to extend classical convergence of Newton and quasi-Newton methods to the broader class of nonfinite valued piecewise linear-quadratic convex-composite optimization problems. In particular, we establish local quadratic convergence of the Newton method under conditions that parallel those in nonlinear programming. 
    more » « less