 Award ID(s):
 1910385
 NSFPAR ID:
 10341964
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an $\mathcal{O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires $\mathcal{O}(\epsilon^{-1.5})$ iterations (each using $\mathcal{O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an $\mathcal{O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
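The finite-difference Hessian/Jacobian-vector approximation that makes such methods first-order-only can be sketched on a toy quadratic lower-level objective. The function `grad_g` below is a hypothetical stand-in, not the paper's setup; the point is that a central difference of gradients recovers a Hessian-vector product without forming any second-order derivative:

```python
import numpy as np

# Illustrative strongly convex lower-level objective (hypothetical stand-in):
#   g(x, y) = 0.5 * ||y||^2 + x @ y,  so  grad_y g(x, y) = y + x  and the
#   Hessian of g in y is the identity.
def grad_g(x, y):
    return y + x

def fd_hvp(x, y, v, delta=1e-5):
    """Hessian-vector product (d^2 g / dy^2) v from first-order information
    only, via a central finite difference of the gradient along direction v."""
    return (grad_g(x, y + delta * v) - grad_g(x, y - delta * v)) / (2 * delta)

x = np.array([1.0, -2.0])
y = np.array([0.5, 0.3])
v = np.array([1.0, 0.0])
approx = fd_hvp(x, y, v)   # for this quadratic g, the exact product is v itself
```

Because the toy `g` is quadratic, the central difference is exact up to rounding; in the nonquadratic case the error is O(delta^2), which is why the step `delta` must be kept small but bounded away from floating-point noise.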

Ruiz, Francisco ; Dy, Jennifer ; van de Meent, Jan-Willem (Ed.) In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and suboptimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${O}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is nonconvex, our method requires ${O}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.
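The cutting-plane-plus-conditional-gradient idea can be sketched on a toy simple-bilevel instance. The cut below is a plain linearization of the lower-level objective (using the fact that its minimum value is 0 for this instance), and the linear oracle is a brute-force grid search over a box; both are illustrative simplifications, not the authors' exact construction:

```python
import numpy as np

# Toy simple-bilevel instance (hypothetical): over the box X = [-1, 1]^2,
# the lower level g(x) = (x0 + x1 - 1)^2 has solution set {x in X : x0 + x1 = 1};
# the upper level f(x) = x0^2 + (x1 - 2)^2 selects the point of that set
# closest to (0, 2), namely x* = (0, 1).

def f_grad(x):
    return np.array([2 * x[0], 2 * (x[1] - 2)])

def g(x):
    return (x[0] + x[1] - 1) ** 2

def g_grad(x):
    s = 2 * (x[0] + x[1] - 1)
    return np.array([s, s])

def linear_oracle(c, cut_a, cut_b):
    """Minimize c @ p over the box intersected with the halfspace
    cut_a @ p <= cut_b, by grid search (illustration only)."""
    grid = np.linspace(-1.0, 1.0, 101)
    U, W = np.meshgrid(grid, grid, indexing="ij")
    P = np.stack([U.ravel(), W.ravel()], axis=1)
    vals = np.where(P @ cut_a <= cut_b + 1e-9, P @ c, np.inf)
    return P[np.argmin(vals)]

x = np.zeros(2)
for k in range(1, 201):
    # Cutting plane from linearizing g at x (min g = 0 is known here):
    #   g(x) + g_grad(x) @ (p - x) <= 0
    a = g_grad(x)
    b = a @ x - g(x)
    s = linear_oracle(f_grad(x), a, b)        # conditional gradient direction
    x = x + (2.0 / (k + 2)) * (s - x)         # standard Frank-Wolfe step size
```

Each iteration stays inside the box (a convex combination of box points), and the cut progressively pulls the iterate toward the lower-level solution set while the conditional gradient step decreases the upper-level objective.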

We present a new feasible proximal gradient method for constrained optimization where both the objective and constraint functions are given by the sum of a smooth, possibly nonconvex function and a convex simple function. The algorithm converts the original problem into a sequence of convex subproblems. Formulating those subproblems requires evaluating at most one gradient value of the original objective and constraint functions; either exact or approximate subproblem solutions can be computed efficiently in many cases. An important feature of the algorithm is the constraint level parameter. By carefully increasing this level for each subproblem, we provide a simple solution to the challenge of bounding the Lagrange multipliers and show that the algorithm follows a strictly feasible solution path until convergence to a stationary point. We develop a simple, proximal-gradient-descent-type analysis showing that the complexity bound of this new algorithm is comparable to that of gradient descent in the unconstrained setting, which is new in the literature. Exploiting this new design and analysis technique, we extend our algorithms to more challenging constrained optimization problems where (1) the objective is a stochastic or finite-sum function, and (2) structured nonsmooth functions replace the smooth components of both the objective and constraint functions. Complexity results for these problems also appear to be new in the literature. Finally, our method can also be applied to convex function-constrained problems, where we show complexities similar to those of the proximal gradient method.
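As background, the unconstrained proximal gradient building block that this analysis generalizes can be sketched as follows. The instance (a smooth least-squares part plus an l1 "convex simple function") and all names are hypothetical, chosen only to show the smooth-gradient-plus-prox structure the abstract refers to:

```python
import numpy as np

def soft_threshold(z, tau):
    """Prox of tau * ||.||_1, the 'convex simple function' part of each step."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_gradient(grad_smooth, x0, eta, lam, iters=500):
    """Proximal gradient descent for F(x) = f(x) + lam * ||x||_1 with f smooth:
       x <- prox_{eta * lam * ||.||_1}(x - eta * grad f(x))."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - eta * grad_smooth(x), eta * lam)
    return x

# Hypothetical smooth part f(x) = 0.5 * ||A x - b||^2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 0.1])
grad_f = lambda x: A.T @ (A @ x - b)
x_star = prox_gradient(grad_f, np.zeros(2), eta=0.2, lam=0.5)
```

The paper's contribution is to wrap steps of this kind in strictly feasible convex subproblems with an increasing constraint level; this sketch shows only the unconstrained template whose complexity the new bound is compared against.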

This work develops analysis and algorithms for solving a class of bilevel optimization problems where the lower-level (LL) problems have linear constraints. Most existing approaches for constrained bilevel problems rely on value-function-based approximate reformulations, which suffer from issues such as nonconvex and nondifferentiable constraints. In contrast, in this work we develop an implicit-gradient-based approach, which is easy to implement and is suitable for machine learning applications. We first provide an in-depth understanding of the problem by showing that the implicit objective for such problems is in general nondifferentiable. However, if we add a small (linear) perturbation to the LL objective, the resulting implicit objective becomes differentiable almost surely. This key observation opens the door for developing (deterministic and stochastic) gradient-based algorithms similar to the state-of-the-art ones for unconstrained bilevel problems. We show that when the implicit function is assumed to be strongly convex, convex, or weakly convex, the resulting algorithms converge at a guaranteed rate. Finally, we experimentally corroborate the theoretical findings and evaluate the performance of the proposed framework on numerical and adversarial learning problems.
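The implicit-gradient template underlying this line of work can be sketched for an unconstrained, strongly convex lower-level problem. This is a quadratic toy instance of the implicit function theorem computation; the paper's actual contribution, handling linearly constrained LL problems via perturbation, is omitted from the sketch:

```python
import numpy as np

# Hypothetical quadratic instance with an unconstrained, strongly convex LL:
#   g(x, y) = 0.5 * y @ H @ y - x @ y   =>  y*(x) = H^{-1} x
#   f(y)    = 0.5 * ||y - c||^2         =>  implicit objective F(x) = f(y*(x))
H = np.array([[3.0, 1.0], [1.0, 2.0]])   # LL Hessian in y (positive definite)
c = np.array([1.0, -1.0])

def hypergradient(x):
    """Implicit gradient via the implicit function theorem:
       dF/dx = grad_x f - (d^2g/dxdy) (d^2g/dy^2)^{-1} grad_y f,
    with grad_x f = 0, d^2g/dxdy = -I, and d^2g/dy^2 = H for this instance."""
    y_star = np.linalg.solve(H, x)       # exact LL solution
    grad_y_f = y_star - c
    grad_xy_g = -np.eye(2)
    return -grad_xy_g @ np.linalg.solve(H, grad_y_f)

# Gradient descent on the implicit objective drives y*(x) toward c.
x = np.zeros(2)
for _ in range(2000):
    x -= 0.5 * hypergradient(x)
```

For this instance the implicit objective is smooth everywhere; the nondifferentiability discussed in the abstract arises precisely when LL constraints make the solution map y*(x) nonsmooth, which the (linear) perturbation repairs almost surely.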

In this paper, we study communication-efficient decentralized training of large-scale machine learning models over a network. We propose and analyze SQuARM-SGD, a decentralized training algorithm employing momentum and compressed communication between nodes, regulated by a locally computable triggering rule. In SQuARM-SGD, each node performs a fixed number of local SGD (stochastic gradient descent) steps using Nesterov's momentum and then sends sparsified and quantized updates to its neighbors only when there is a significant change in its model parameters since the last communication. We provide convergence guarantees of our algorithm for strongly convex and nonconvex smooth objectives. We believe ours is the first theoretical analysis of compressed decentralized SGD with momentum updates. We show that SQuARM-SGD converges at rate O(1/nT) for strongly convex objectives and at rate O(1/√(nT)) for nonconvex objectives, thus matching the convergence rate of vanilla distributed SGD in both settings. We corroborate our theoretical understanding with experiments and compare the performance of our algorithm with the state of the art, showing that without sacrificing much accuracy, SQuARM-SGD converges at a similar rate while saving significantly in total communicated bits.
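The ingredients of such an event-triggered compressed update can be sketched as follows: top-k sparsification, scaled-sign quantization, and a norm-based trigger. These are generic stand-ins for illustration, not SQuARM-SGD's exact compressor or triggering rule:

```python
import numpy as np

def top_k_sparsify(v, k):
    """Keep the k largest-magnitude entries of v, zeroing the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def scaled_sign_quantize(v):
    """One common quantizer: (||v||_1 / d) * sign(v)."""
    return (np.linalg.norm(v, 1) / v.size) * np.sign(v)

def maybe_communicate(x, x_last_sent, threshold, k):
    """Event-triggered update (sketch): transmit a sparsified, quantized
    difference only when the model has drifted enough since the last send."""
    drift = x - x_last_sent
    if np.linalg.norm(drift) <= threshold:
        return None                      # triggering rule: stay silent
    return scaled_sign_quantize(top_k_sparsify(drift, k))

x_prev = np.zeros(4)
x_curr = np.array([1.0, 0.1, -2.0, 0.05])
update = maybe_communicate(x_curr, x_prev, threshold=0.5, k=2)
```

The trigger is locally computable, since each node compares its current parameters only against the copy it last transmitted; no information from neighbors is needed to decide whether to communicate.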