<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Level constrained first order methods for function constrained optimization</title></titleStmt>
			<publicationStmt>
				<publisher>Springer</publisher>
				<date>03/06/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10573816</idno>
					<idno type="doi">10.1007/s10107-024-02057-4</idno>
					<title level='j'>Mathematical Programming</title>
<idno>0025-5610</idno>
<biblScope unit="volume">209</biblScope>
<biblScope unit="issue">1-2</biblScope>					

					<author>Digvijay Boob</author><author>Qi Deng</author><author>Guanghui Lan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>We present a new feasible proximal gradient method for constrained optimization where both the objective and constraint functions are given by summation of a smooth, possibly nonconvex function and a convex simple function. The algorithm converts the original problem into a sequence of convex subproblems. Formulating those subproblems requires the evaluation of at most one gradient-value of the original objective and constraint functions. Either exact or approximate subproblems solutions can be computed efficiently in many cases. An important feature of the algorithm is the constraint level parameter. By carefully increasing this level for each subproblem, we provide a simple solution to overcome the challenge of bounding the Lagrangian multipliers and show that the algorithm follows a strictly feasible solution path till convergence to the stationary point. We develop a simple, proximal gradient descent type analysis, showing that the complexity bound of this new algorithm is comparable to gradient descent for the unconstrained setting which is new in the literature. Exploiting this new design and analysis technique, we extend our algorithms to some more challenging constrained optimization problems where (1) the objective is a stochastic or finite-sum function, and (2) structured nonsmooth functions replace smooth components of both objective and constraint functions. Complexity results for these problems also seem to be new in the literature. Finally, our method can also be applied to convex function constrained problems where we show complexities similar to the proximal gradient method.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In this paper, we study the following constrained optimization problem min</p><p>where &#968; i (x) is a composite function that sums up functions f i (x) and &#967; i (x). Here, f i , i = 0, 1, . . . , m, are smooth functions, &#967; 0 (x) is a proper, convex, lowersemicontinuous (lsc) function and &#967; i (x), i = 1, . . . , m, are convex continuous functions over the domain of &#967; 0 (i.e. dom &#967; 0 ). We consider that &#967; i , i = 0, . . . , m are 'simple' functions, namely, a feasible optimization problem of the form below min x&#8712;R d</p><p>xa 0 2 + &#967; 0 (x) : xa i 2 + &#967; i (x) &#8804; b i , i = 1, . . . , m.</p><p>(1.2) can be solved to efficiently obtain either an exact solution or an inexact solution of desired accuracy. Note that if &#967; i = 0, i = 1, . . . , m, then (1.2) becomes a proximal operator for function &#967; 0 on the intersection of balls. If we further assume &#967; 0 = 0, then (1.2) is a special type of quadratically constrained quadratic programming (QCQP) that can be solved efficiently because all the Hessians are identity matrices. In addition, we consider the case where constraints f i , i = 1, . . . , m, are structured nonsmooth functions which can be approximated by smooth functions (also called smoothable functions). Note that problem (1.1) covers a variety of convex and nonconvex function constrained optimization depending on the assumptions of f i , i = 0, . . . , m. Nonlinear optimization with function constraints is a classical topic in continuous optimization. While the earlier study focused on the asymptotic performance, recent work has put more emphasis on the complexity analysis of algorithms, mainly driven by the growing interest in large-scale optimization and machine learning. For most of our discussion on the complexity analysis, we generally require convergence to an -approximate KKT point (c.f. Definition 3). Penalty methods <ref type="bibr">[9,</ref><ref type="bibr">25,</ref><ref type="bibr">33]</ref>, including augmented Lagrangian methods <ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">34]</ref>, is one popular approach for constrained optimization. In <ref type="bibr">[8]</ref>, Cartis et al. presented an exact penalty method by minimizing a sequence of convex composition functions. When the penalty weight is bounded, this method solves O(1/ ) trust region subproblems. If the penalty weight is unbounded, the complexity is of O(1/ 2.5 ) to reach an -KKT point. In a subsequent work <ref type="bibr">[9]</ref>, the authors provided a target following method that achieves the complexity of O(1/ ), regardless of the growth of the penalty parameter. In <ref type="bibr">[33]</ref>, Wang et al. extended the penalty method for solving constrained problems where the objective takes the expectation form. Sequential quadratic programming (SQP) is another important approach for constrained optimization. Typically, SQP involves linearization of the constraints, quadratic approximation of the objective, and possibly some trust region constraint for the convergence guarantee <ref type="bibr">[6,</ref><ref type="bibr">7]</ref>. The recent work <ref type="bibr">[12]</ref> established a unified convergence analysis of SQP (GSQP) in more general settings where the feasibility and constraint qualification may or may not hold. Different from the standard SQP, the Moving Balls Approximation (MBA) method <ref type="bibr">[1]</ref> follows a feasible solution path and transforms the initial problem into a diagonal QCQP. A subsequent work <ref type="bibr">[3]</ref> presented a unified analysis of MBA and other variants of SQP methods. Under the assumption of Kurdyka-&#321;ojasiewicz (KL) property, they establish the global convergence rates which depend on the &#321;ojasiewicz exponent.</p><p>Despite much progress in prior works, there are some significant remaining issues. Specifically, most of the analysis is carried out for only smooth optimization and requires that the exact optimal solution of the convex subproblem is readily available. Unfortunately, both assumptions can be unrealistic in many large-scale applications. To overcome these issues, <ref type="bibr">[4,</ref><ref type="bibr">25,</ref><ref type="bibr">26]</ref> presented some new proximal point algorithms that iteratively solve strongly convex proximal subproblems inexactly using first-order methods. A significant computational advantage is that first-order methods only need to compute a relatively easy proximal gradient mapping in each iteration. In particular, <ref type="bibr">[4]</ref> proposed to solve the proximal point subproblem by a new first-order primal-dual method called ConEx. Under some strict feasibility assumption, they derived the total complexities of the overall algorithm for which the objective and constraints can be either stochastic or deterministic, and either nonsmooth or smooth. Notably, for nonconvex and smooth constrained problems, inexact proximal point <ref type="bibr">[4]</ref> requires O(1/ 1.5 ) function/gradient evaluations. A similar complexity bound is obtained by the proximal point penalty method <ref type="bibr">[25]</ref> when a feasible point is available. Nevertheless, at this point, it may be difficult to directly compare the efficiency of the proximal point with the earlier approach, given that very different oracles are employed in each method. The inexact proximal point method appears to be less efficient in terms of gradient and function value computations since first-order penalty method <ref type="bibr">[9]</ref> and a variant of SQP <ref type="bibr">[12]</ref> (where the surrogate is formed by first-order approximation) has an O(1/ ) complexity bound. Nevertheless, it might be more efficient if the corresponding proximal mapping is much easier to solve than the subproblems in penalty or SQP methods.</p><p>In this paper, we attempt to alleviate some of the aforementioned issues in solving nonconvex constrained optimization. Our main contribution is the development of a novel Level Constrained Proximal Gradient (LCPG) method for constrained optimization, based on the following key ideas.</p><p>First, we convert the original problem (1.1) into a sequence of simple convex subproblems of the form (1.2) for which an exact or an approximate solution can be computed efficiently. In particular, solving the subproblem requires at most one gradient and function value computation for f i , i = 0, . . . , m. This phenomenon is quite similar to simple single-loop methods even though LCPG method can be multi-loop since we allow for an inexact solution of (1.2) using some kind of iterative scheme.</p><p>Second, starting from a strictly feasible initial point and carefully controlling the feasibility levels of the subproblem constraints, we ensure that LCPG follows a strictly feasible solution path. This also allows us to deal with nonsmooth constraints where &#967; i is not necessarily 0 and further extends LCPG to the inexact case where the subproblem admits an approximate solution. Even though subtle, the level-control design is crucial in bounding the Lagrange multipliers under the well-known Mangasarian-Fromovitz constraint qualification (MFCQ) <ref type="bibr">[4,</ref><ref type="bibr">27]</ref>. Subsequently, we also show asymptotic convergence of the LCPG method.</p><p>Third, we offer a new insight into the complexity analysis of LCPG as a gradient descent type method, which could be of independent interest. When the objective and constraints are nonconvex composite, we aim to find a first-order -KKT point (c.f. Definition 3) under the aforementioned MFCQ assumption. We can show that LCPG method converges in O(1/ ) iterations. Furthermore, each subproblem requires at most one function-value and gradient computation. The net outcome is that gradient complexity of our method is of O(1/ ). Notice that the number of iterations required by the proximal point method under MFCQ is also O(1/ ) (see <ref type="bibr">[4,</ref><ref type="bibr">Theorem 5]</ref>). However, each iteration of this method requires O(1/ 0.5 ) gradient computation, and hence its total gradient complexity can be bounded by O(1/ 1.5 ). This is much worse than LCPG method. We compare with some significant lines of work in Table <ref type="table">1</ref>.</p><p>Exploiting the intrinsic connection between LCPG and proximal gradient (without function constraints), we extend LCPG to a variety of cases. <ref type="bibr">(1)</ref> We can show a similar O(1/ ) gradient complexity for an inexact LCPG method for which the subproblem is solved to a pre-specified accuracy. If we assume &#967; i = 0, then the corresponding subproblem (1.2) (i.e. diagonal QCQP) can be efficiently solved by a customized interior point method in logarithmic time. In the more general setting when &#967; i = 0, we propose to solve (1.2) by the first-order method ConEx, which has very cheap iterations. <ref type="bibr">(2)</ref> We also extend LCPG method to stochastic (LCSPG) and variance-reduced (LCSVRG) variants when f 0 is either stochastic or finite-sum function, respectively. LCSPG and LCSVRG require O(1/ 2 ) (similar to SGD <ref type="bibr">[15]</ref>) and O( &#8730; n/ ) (similar to SVRG <ref type="bibr">[18]</ref>) stochastic gradients, respectively, where n is the number of components in the finite-sum objective. The complexity of variants of LCPG method for stochastic cases can also be seen in Table <ref type="table">2</ref>. <ref type="bibr">(3)</ref> We consider the case when function f i , i = 0, 1, . . . , m are nondifferentiable but contain a smooth saddle structure (referred to as structured nonsmooth). We extend LCPG method for such nonsmooth nonconvex function constrained problem using Nesterov's smoothing scheme <ref type="bibr">[29]</ref>. In this case, LCPG method requires O(1/ 2 ) gradients.</p><p>We show that the GD-type analysis of the LCPG method can be extended to the convex case. In particular, when the objective and constraint functions are convex, we show that LCPG method requires O(1/ ) gradient computations for smooth and composite constrained problems, and this complexity improves to O(log (1/ )) when the objective is smooth and strongly-convex. Furthermore, we develop the complexity of inexact variants of LCPG method by leveraging the analysis of gradient descent with inexact projection oracles <ref type="bibr">[31]</ref>. Inexact LCPG method maintains the gradient complexity of O(1/ ) and O(log(1/ )) for convex and strongly convex problems, respectively.  <ref type="bibr">[12,</ref><ref type="bibr">25]</ref> requires an O( &#8730; ) error on the complementary slackness and feasibility. Our measure requires 0feasibility error and O( )-complementary slackness error. * IQRC does not explicitly discuss the composite case. Their subproblem oracle can be upgraded to handle proximal cases relatively easily &#8224; Different methods have different costs for solving the subproblem. Some methods require explicit gradient computations for solving the subproblems and hence, are expected to be quite computationally costly. Some methods (including ours) have simple subproblems. See Remark 11 for a detailed discussion. Hence, comparing total computational complexity is not possible. We instead compare gradient complexities to provide a realistic estimate of the computational effort of each of these methods</p><p>Throughout our analysis, we require that the Lagrange multipliers for the convex subproblems of type (1.2) are bounded. This problem is addressed in different ways in arguably all works in the literature. In this paper, we show that under the assumption of MFCQ, Lagrange multipliers associated with the sequence of subproblems remain bounded by a quantity specified as B. Even then, the value of B cannot be estimated a priori. Fortunately, this bound is not needed in the implementation of our methods. However, it plays a role in complexity analysis. Hence, our comparison with the existing complexity literature (e.g., proximal point method of <ref type="bibr">[4]</ref>) is valid when bound B on the sequence of Lagrange multipliers largely depends on the problem itself and not on the sequence of subproblems. One can easily see that such uniform bounds on Lagrange multipliers hold under the strong feasibility constraint qualification <ref type="bibr">[4]</ref>, a similar uniform Slater's condition <ref type="bibr">[26]</ref> or for nonsmooth nonconvex relaxation in the application of sparsity constrained optimization <ref type="bibr">[5]</ref>. The problem of comparing bounds B on Lagrange multipliers requires getting into specific applications, which is not the purpose of this paper. Hence, throughout our comparison with existing literature, we assume that bound B for different methods is of a similar order.</p><p>Comparison with MBA method Auslender et al. <ref type="bibr">[1]</ref> provided a Moving Balls Approximation (MBA) method for smooth constrained problems, i.e. &#967; i (x), i = 0, . . . , m, are not present. They use Lipschitz continuity of constraint gradients along with MFCQ to</p><p>Table 2 Total number of stochastic gradient evaluations to obtain O &#949;, &#949; randomized KKT points in the finite sum problem</p><p>ensure that the subproblems satisfy Slater's conditions (see [1, Proposition 2.1(iii)]).</p><p>A similar result is also used in <ref type="bibr">[35]</ref> where they provide a line-search version of MBA for functions satisfying certain KL properties. The MBA method was studied for semialgebraic functions in <ref type="bibr">[3]</ref> where they used the KL-property of semi-algebraic functions.</p><p>The work <ref type="bibr">[1]</ref> also provides the complexity guarantee for constrained programs with a smooth and strongly convex objective. Our results differ from the past studies in the following several aspects. First, we do not assume any KL property on the class of functions, hence making the method applicable to a wider class of problems. Second, we show complexity analysis for a variety of cases, e.g., stochastic, finite-sum, or structured nonsmooth cases. Note that complexity results are not known for the MBA type method even for the purely smooth problem. Third, we show complexity results for both convex and strongly convex cases which strictly subsumes the results in <ref type="bibr">[1]</ref>. Fourth, it should be noted that <ref type="bibr">[1]</ref> also considered the efficiency of solving subproblems. They proposed an accelerated gradient method that obtains O(1/ &#8730; ) complexity for solving the dual of the QCQP subproblem. However, it is unclear what accuracy is enough for ensuring asymptotic convergence of the whole algorithm.</p><p>Comparison with generalized SQP The work <ref type="bibr">[12]</ref> developed the first complexity analysis of the generalized SQP (GSQP) method by using a novel ghost penalty approach. Different from our feasible method, they consider a general setting where the feasibility and constraint qualification may or may not hold. They show that SQP-type methods have an O(1/ 2 ) complexity for reaching some -approximate generalized stationary point. Under an extended MFCQ condition, they established an improved complexity O(1/ ) for reaching the scaled-KKT point, which matches our complexity result under a similar MFCQ assumption. Notably, both their analysis and ours rely on MFCQ to show that the global upper bound (constant B) on the multipliers of the subproblems exists. However, to obtain the best O(1/ ) complexity, GSQP explicitly relies on the value of such unknown upper bound to determine the stepsize, which appears to be rather challenging in practical use. In contrast, our algorithm does not involve the constant B in the algorithm implementation; we only require the Lipschitz constants of the gradients, which is standard for gradient descent methods.</p><p>Outline This paper is organized as follows: Sect. 2 describes notations and assumptions.</p><p>It also provides various definitions used throughout the paper. Section 3 presents the LCPG method which uses exact solutions of subproblems. It also establishes asymptotic convergence and convergence rate results. Section 4.1 and Sect. 4.2 provides the LCSPG and LCSVRG method for stochastic and finite-sum problems, respectively. Section 5 shows the extension of LCPG for nonsmooth nonconvex function constraints. Section 6 introduces the inexact LCPG method and provides its complexity analysis when the subproblems are inexactly solved by an interior point method or first-order method. Finally, Sect. 7 extends LCPG method for convex optimization problems and establishes its complexity for both strongly convex and convex problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Notations and assumptions</head><p>Notations. R n + stands for the non-negative orthant in R n . We use &#8226; to express the Euclidean norm. For a set X , we define X -= dist(0, X ) = inf x , x &#8712; X . If X is a convex set then we denote its normal cone at x as N X (x). Furthermore, we denote the dual cone of the normal cone at x as N * X (x). Let e be a vector full of elements one. For simplicity, we denote</p><p>Assumption 1 (General) We assume that the optimal value of problem (1.1) is finite:</p><p>Furthermore, the objective and constraint functions have the following properties.</p><p>1: &#967; 0 is a proper, convex, and lower semi-continuous (lsc) function. Moreover, we assume that for all i = 1, . . . , m, the function</p><p>The Lagrangian function of problem (1.1) is denoted by</p><p>For functions &#968; i , we denote its subdifferential as</p><p>where the sum is in the Minkowski sense. Note that this definition of the subdifferential for nonconvex functions was first proposed in <ref type="bibr">[4]</ref>. Moreover, &#8706;&#968; i = {&#8711; f i } when &#968; i is a "purely" smooth nonconvex function and &#8706;&#968; i = &#8706;&#967; i when &#968; i is a nonsmooth convex function. Hence, it is a valid definition for the subdifferential of a nonconvex function. Below, we define the KKT condition using the above subdifferential.</p><p>x is feasible and there exists a vector</p><p>The values {&#955; i } are called Lagrange multipliers.</p><p>It is known that the KKT condition is necessary for optimality under the assumption of certain constraint qualifications (c.f. <ref type="bibr">[2]</ref>). Our result will be based on a variant of the Mangasarian-Fromovitz constraint qualification, which is formally given below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2 (MFCQ)</head><p>We say that a point x satisfies the Mangasarian-Fromovitz constraint qualification for (1.1) if there exists a vector z &#8712; -N *</p><p>where</p><p>Proposition 1 (Necessary condition) Let x be a local optimal solution of Problem <ref type="bibr">(1.1)</ref>. If it satisfies MFCQ (2.3), then there is a vector &#955; &#8712; R m + such that the KKT condition (1) holds.</p><p>Next, we introduce some optimality measures before formally presenting any algorithms. It is natural to characterize algorithm performance by measuring the error of satisfying the KKT condition. Towards this goal, we have the following definition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3</head><p>We say that x is an type-I (approximate) KKT point if it is feasible (i.e. &#968;(x) &#8804; &#951;), and there exists a vector &#955; &#8712; R m + satisfying the following conditions:</p><p>Moreover, x is a randomized type-I KKT point if both x and &#955; are feasible randomized primal-dual solutions that satisfy</p><p>where the expectation is taken over the randomness of x and &#955;.</p><p>Besides the above definition, we invoke a second optimality measure which will help analyze the performance of a proximal algorithm (see, for example, <ref type="bibr">[4]</ref>). Therein, it is arguably more convenient to approach the proximity of some nearly stationary points.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4</head><p>We say that x is a ( , &#957;) type-II KKT point if there exists an type-I KKT point x and x -x 2 &#8804; &#957;. Similarly, x is a randomized ( , &#957;) type-II KKT point if x is a random vector and E[ x -x 2 ] &#8804; &#957;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A proximal gradient method</head><p>We present the level constrained proximal gradient (LCPG) method in Algorithm 1. The main idea of this algorithm is to turn the original nonconvex problem into a sequence of relatively easier subproblems that involve some convex surrogate functions &#968; k i (x) (0 &#8804; i &#8804; m) and variable constraint levels &#951; k : min</p><p>Above, we take the surrogate function &#968; k i (x) (0 &#8804; i &#8804; m) by partially linearizing &#968; i (x) at x k and adding the proximal term L i 2 xx k 2 :</p><p>It should be noted that our algorithm may not be well-defined if it were to be initialized by an infeasible solution x 0 . Furthermore, we require the initial point to be strictly feasible with respect to the nonlinear constraints &#968;(x) &#8804; &#951;. Therefore, we explicitly state this assumption below and assume it holds throughout the paper.</p><p>Assumption 2 (Strict feasibility) There exist a point x 0 &#8712; dom &#967; 0 and a vector &#951; 0 &#8712; R m satisfying</p><p>With a strictly feasible solution, we assume that the constraint levels {&#951; k } are incrementally updated and converge to the constraint levels for the original problem:</p><p>Algorithm 1 Level constrained proximal gradient method (LCPG)</p><p>1: Input: x 0 , &#951; 0 ; 2: for k=0,1,2,...,K do 3: For i = 0, 1, . . . , m, set &#968; k i (x) by (3.2); 4: Update x k+1 by solving (3.1); 5:</p><p>Choose &#948; k &gt; 0 and update &#951; k+1 by</p><p>2, . . . , m. (3.3) 6: end for 7: Output: x k+1 where k is a random index sampled from {0, 1, 2, ..., K }.</p><p>The following Lemma will be used many times throughout the rest of the paper.</p><p>123</p><p>Lemma 1 [Three-point inequality] Let g : R d &#8594; (-&#8734;, &#8734;] be a proper lsc convex function and</p><p>then for any x &#8712; R d , we have</p><p>Next, we present some important properties of the generated solutions in the following theorem.</p><p>Proposition 2 Suppose that Assumption 2 holds, then Algorithm 1 has the following properties.</p><p>1. The sequence x k is well-defined and is feasible for problem <ref type="bibr">(1.1)</ref>. {x k } satisfies the sufficient descent property:</p><p>Moreover, the sequence of objective values &#968; 0 (x k ) is monotonically decreasing and lim k&#8594;&#8734; &#968; 0 (x k ) exists. 2. There exists a vector &#955; k+1 &#8712; R m + such that the KKT condition holds:</p><p>Proof Part 1). First, we show that {x k } is a well-defined sequence, namely, X k &#8745;dom &#967; 0 is a nonempty set where X k = x : &#968; k i (x) &#8804; &#951; k i . This result clearly holds for k = 0 by Assumption 2. We show the general case (k &gt; 0) by induction. Suppose that x k is well-defined, i.e., X k-1 &#8745; dom &#967; 0 is nonempty. Then, by the definition of &#968; k i , &#968; k-1 i and the definition x k , we have x k &#8712; dom &#967; 0 and</p><p>Here, the first inequality follows due to the smoothness of f i (x) which ensures for all x,</p><p>fact that &#951; k i &lt; &#951; i , we have x k &#8712; dom &#967; 0 &#8745;{x : &#968; i (x) &#8804; &#951; i , i = 1, . . . , m}. Hence, the whole sequence {x k } remains feasible for the original problem. Now, let us apply Lemma 1 with g(x) = &#8711; f 0 (x k ), x + &#967; 0 (x) + 1 X k (x), y = x k and &#947; = L 0 . Then, for any x &#8712; dom &#967; 0 &#8745;X k , we have</p><p>Placing x = x k in the above relation, we have</p><p>(3.8)</p><p>Moreover, since f 0 (&#8226;) is Lipschitz smooth, we have that</p><p>Summing up the above two inequalities and using the definition &#968; 0 = f 0 + &#967; 0 , we conclude (3.5). Hence, the sequence &#968; 0 (x k ) is monotonically decreasing. The convergence of &#968; 0 (x k ) follows from the lower boundedness assumption. Part 2). Note that (3.7) ensures the strict feasibility of x k w.r.t. the constraint set X k for the kth subproblem. Therefore, Slater's condition for (3.1) and the optimality of x k+1 imply that there must exist a vector &#955; k+1 &#8712; R m + satisfying KKT condition (3.6). Hence, we complete the proof.</p><p>In order to show convergence to the KKT solutions, we need the following constraint qualifications.</p><p>Assumption 3 [Uniform MFCQ] All the feasible points of problem (1.1) satisfy MFCQ.</p><p>We state the main asymptotic convergence property of LCPG in the following theorem.</p><p>Theorem 1 Suppose that Assumption 3 holds, then we have the following conclusions.</p><p>1. The dual solutions {&#955; k+1 } are bounded from above. Namely, there exists a constant B &gt; 0 such that sup 0&#8804;k&#8804;&#8734; &#955; k+1 &lt; B.</p><p>(3.9)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Every limit point of Algorithm 1 is a KKT point.</head><p>Proof <ref type="bibr">Part 1)</ref>. First, we can immediately see that {x k } is a bounded sequence and hence the limit point exists. This result is from Assumption 1.3: and the feasibility of x k for problem (1.1) (c.f. Proposition 2, Part 1). Without loss of generality, we can assume lim k&#8594;&#8734; x k = x. For the sake of contradiction, suppose that &#955; k+1 is unbounded, then passing to a subsequence if necessary, we can assume &#955; k+1 &#8594; &#8734; 123 for simplicity. Note that we also have lim k&#8594;0 x k+1x k 2 = 0 due to the sufficient descent property <ref type="bibr">(3.5)</ref>. From the KKT condition (3.6), we have</p><p>(3.10) Let X := i&#8712;[m] {x : &#968; i (x) &#8804; &#951; i } &#8745; dom &#967; 0 be the feasible domain for problem <ref type="bibr">(1.1)</ref>. Due to the fact that x k &#8712; X (Proposition 2), boundedness of X (Assumption 1.3:) and strong convexity of &#968; k 0 , there exists l 0 &#8712; R such that X &#8834; {x : &#968; k 0 (x) &lt; l 0 } for all k. Then, using (3.10) for all x &#8712; dom &#967; 0 &#8745;{&#968; k 0 (x) &#8804; l 0 } and dividing both sides by &#955; k+1 , we have</p><p>Let us take k &#8594; &#8734; on both sides of <ref type="bibr">(3.11)</ref>. Note that for all x &#8712; dom &#967; 0 &#8745;{&#968; k 0 (x) &#8804; l 0 }, we have</p><p>where (3.12) is due to boundedness of &#968; k 0 (x) on dom &#967; 0 &#8745;{&#968; k 0 (x) &#8804; l 0 } and boundedness of &#968; k 0 (x k+1 ) since x k+1 &#8712; X which is a bounded set. Moreover, (3.13) and (3.14) use the continuity of f i (x) and &#967; i (x), i &#8712; [m]. Next, we consider the sequence {u k = &#955; k+1 / &#955; k+1 }. Since u k is a bounded sequence, it has a convergent subsequence. Let &#363; be a limit point of {u k } and the subsequence { j k } &#8838; {1, 2, ..., } such that lim k&#8594;&#8734; u j k = &#363;. Since the subsequence of a convergent sequence is also convergent, we pass to the subsequence j k in <ref type="bibr">(3.11)</ref> and apply (3.12), (3.13) and <ref type="bibr">(3.14)</ref>, yielding</p><p>x) &lt; l 0 } and using the stationarity condition for optimality of x, we have <ref type="bibr">(3.16)</ref> where we dropped the constraint &#968; k 0 (x) &#8804; l 0 due to complementary slackness and the fact that &#968; k 0 ( x) &lt; l 0 .</p><p>Let B = {i : &#363;i &gt; 0}, then we must have lim k&#8594;&#8734; &#955;</p><p>Based on complementary slackness, we have &#968;</p><p>i for any i &#8712; B for large enough k. Due to (3.14), we have the limit: &#968; i ( x) = &#951; i . Therefore, the ith constraint is active at x, i.e. i &#8712; A( x). In view of (3.16), there exists subgradients</p><p>(3.17)</p><p>However, equation (3.17) contradicts the MFCQ assumption. This is because MFCQ guarantees the existence of z&#8712; -N *</p><p>where the first inequality follows since z &#8712; -N dom &#967; 0 ( x) and v 0 &#8712; N dom &#967; 0 ( x) implying that z, v 0 &#8804; 0, the second inequality follow since &#363;i &#8805; 0 and v i &#8712; &#8706;&#968; i ( x), and the last strict inequality follows due to uniform MFCQ (c.f. Assumption 3 and relation (2.3)) and &#363;i &gt; 0 for at least one i &#8712; B. This clearly leads to a contradiction. Hence, we conclude that {&#955; k+1 } is a bounded sequence and conclude the proof.</p><p>Part 2). Without loss of generality, we assume that x is the only limit point. Since {&#955; k+1 } is a bounded sequence, there exists a limit point &#955;. Passing to a subsequence if necessary, we have &#955; k+1 &#8594; &#955;.</p><p>From the optimality condition 0 &#8712; &#8706; x L k (x k+1 , &#955; k+1 ), for any x, we have</p><p>Let us take k &#8594; &#8734; on both sides of (3.18). We note that lim k&#8594;&#8734; x k -</p><p>), and &#967; 0 ( x) &#8804; lim inf k&#8594;&#8734; &#967; 0 (x k ) due to the lower semi-continuity of &#967; 0 (&#8226;). It then follows that</p><p>In other words, x is the minimizer of convex optimization problem min</p><p>In addition, using the complementary slackness, we have</p><p>due to the convergence lim k&#8594;&#8734; &#955; k+1 = &#955;, lim k&#8594;&#8734; &#951; k = &#951;, lim k&#8594;&#8734; &#967;(x k+1 ) = &#967;( x) and lim k&#8594;&#8734; x k+1x k = 0. Putting (3.20) and (3.21) together, we conclude that ( x, &#955;) satisfies the KKT condition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 1</head><p>To show the existence of a limit point x, we use Assumption 1.3: to ensure that the sequence x k remains in a bounded domain. For the sake of conciseness, we henceforth assume the existence of a limit point x and do not delve into the technical assumption used to ensure this condition. Moreover, it should be noted that the boundedness property can be obtained under other assumptions, e.g., assuming the compactness of sublevel set {x : &#968; 0 (x) &#8804; &#968; 0 (x 0 )} and using the sufficient descent condition (3.5), we can immediately show the existence of x. However, it appears to be more challenging to show convergence via this approach when sufficient descent condition fails (e.g., in the forthcoming stochastic optimization).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dependence of B on the constraint qualification.</head><p>In Theorem 1, we proved existence of a bound B on the dual multiplier. However, the size of that bound still remains unknown. Through Example 1 below, we observe that the limiting behaviour of the sequence &#955; k (which largely governs the size of B) is closely tied to the magnitude of the number c( x), where</p><p>Here, the inner max follows from the relation (2.3) and outer min tries to find the best possible z that ensures MFCQ. It is observed that if c( x) is a large positive number, then MFCQ is strongly satisfied and B is a reasonable bound. In contrast, if c( x) is close to 0, then B can get quite large.</p><p>Example 1 Consider a two dimensional optimization problem with SCAD constraint:</p><p>This function fits our framework with the smoothness parameter 1 &#952;-1 . Lets consider &#946; = 1, &#952; = 5, the level &#951; 1 = 3 and limit point x = (5, 0). Clearly, the constraint is active at x and h is 1  4 -smooth function. Then, &#8706;&#968;( x) = {</p><p>} as per the definition. Then, one can see that uniform MFCQ is violated at the limit point x. Indeed,</p><p>implying c( x) = 0. Furthermore, no &#955; can be found satisfying the KKT condition</p><p>. Hence, as we get close to this limit point, bound on &#955; k will get arbitrarily large. Easy way to see this fact is to construct a subproblem at the limit point itself. After observing the feasible region for the subproblem at (5, 0), it is clear that it has only one feasible solution (5, 0) which gives rise to degeneracy. See Fig. <ref type="figure">1</ref> for more details. Figure <ref type="figure">1a</ref> shows the well-behaved subproblem at an interior point while Fig. <ref type="figure">1b</ref> show the degeneracy at the limit. However, as we change level &#951; 1 to any value either above or below 3, we do not get any violation of MFCQ. It also gives nondegenerate feasible sets at limit point and &#955; k remains bounded for all k. See Fig. <ref type="figure">2</ref> below for more details. In particular, if &#951; 1 = 2.5 &lt; 3, then x = (3, 0) is the limit point. At this point, we have &#8706;&#968;</p><p>Choosing the unit vector z = (z 1 , z 2 ) = (-1, 0), we obtain that point x satisfies MFCQ with c( x) = 0.5. Hence, even when the search point reaches the limit point x (i.e., &#8594; 0), the &#955; k still exists. (See, in particular, Fig. <ref type="figure">2b</ref> whose subproblem at x has a nonempty interior).</p><p>The view of the above example, we see that the limiting behavior of &#955; k (and by implication the order of B) is closely related to the "strength" of the constraint qualification MFCQ at the limit point. In order to get an apriori bound B, we use a somewhat stronger yet verifiable constraint qualification called as strong feasibility which is a slight modification of the CQ proposed in [4, <ref type="bibr">Assumption 3]</ref>. Fig. 2 The nonconvex constraint &#968; 1 (x) &#8804; &#951; 1 where &#951; 1 = 2.5. The dotted blue curves are subproblem constraint for two different points. Since the MFCQ is satisfied, the limiting subproblem constraint at (3, 0) is still a full dimensional set with nonempty interior Assumption 4 [Strong feasibility CQ] There exists x &#8712; X := i&#8712;[m] {x :</p><p>where</p><p>In view of Assumption 1.3:, we note that X is a bounded set. Hence, D X and (consequently) Assumption 4 are well-defined. See <ref type="bibr">[4]</ref> for a connection between Assumption 4 and Assumption 3. Below, we show that strong feasibility CQ leads to a fixed apriori bound on &#955; k .</p><p>Lemma 2 Suppose Assumption 4 is satisfied. Then, &#955; k 1 &#8804; B :=</p><p>where first inequality uses</p><p>x -x k 2 (follows due to L i -smoothness of f i ), and second inequality follows by Assumption 4 along with the fact that x k &#8712; X (see <ref type="bibr">Proposition 2)</ref>.</p><p>In view of (3.24), we have strict feasibility of subproblem (3.1) for all k implying that &#955; k+1 exists. Hence, we have</p><p>Then, for all x &#8712; dom &#967; 0 , we have</p><p>where equality follows from the complementary slackness of the KKT condition, and inequality is due to optimality of x k+1 . Using x = x in the above relation and combining with (3.24), we obtain</p><p>Finally, note that</p><p>where first inequality follows by <ref type="bibr">(3.24)</ref> for i = 0 and &#968; k 0 (x) &#8805; &#968; 0 (x), and second inequality follows by the definition of &#968; * 0 and D X . Combining the above relation with (3.25), we get the result. Hence, we conclude the proof.</p><p>The discussion above implies that the value of B is intricately related to the constraint qualification. While uniform MFCQ is unverifiable and does not allow for a priori bounds on B, it is widely used in nonlinear programming to ensure the existence of such a bound <ref type="bibr">[2]</ref>. As observed in Fig. <ref type="figure">1b</ref> and Fig. <ref type="figure">2b</ref>, the actual value of B depends on the closeness of the MFCQ violation at the limit point. This situation is rare, but the current assumptions do not eliminate that possibility. Problems of this nature are ill-conditioned, and to our knowledge, no algorithm can ensure bounds on the dual in such a situation. The existing literature deals with this issue in two ways: One track assumes existence of B (similar to Theorem 1) and performs the complexity or convergence analysis; A second track assumes a stronger constraint qualification that removes the ill-conditioned problems and shows more explicit bound on the dual (similar to Lemma 2. We perform our analysis for both cases. To conclude, we henceforth assume that the bound B is a constant and do not delve into the discussion on 123 related constraint qualification. To substantiate that the bound B is small, we perform detailed numerical experiments in Sect. 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Convergence rate analysis of LCPG method</head><p>Our main goal now is to develop some non-asymptotic convergence rates for Algorithm 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 3</head><p>In Algorithm 1, for k = 1, 2, ..., we have</p><p>Proof Let L k be the Lagrangian function of subproblem (3.1):</p><p>Using (2.1) and (3.27), we have</p><p>Using the smoothness of f i (x), the optimality condition 0 &#8712; &#8706; x L k (x k+1 , &#955; k+1 ) and the triangle inequality, we obtain</p><p>Hence we conclude the proof.</p><p>In view of Lemma 3, we derive the complexity of LCPG to attain approximate KKT solutions in the following theorem.</p><p>Theorem 2 Let &#945; k &gt; 0 (k = 0, 1, .., K ) be a non-decreasing sequence and suppose that Assumption 3 holds, then there exists a constant B &gt; 0 such that</p><p>where D =</p><p>Proof From the sufficient descent property (3.5), we have</p><p>where the second inequality uses the monotonicity of sequence &#968; 0 (x k ). In view of Theorem 1 and Cauchy-Schwarz inequality, we have &#955; k+1 , L &#8804; &#955; k+1 L &#8804; B L . This relation and (3.26) implies</p><p>Combining the above inequality with (3.33) immediately yields <ref type="bibr">(3.30)</ref>.</p><p>Next, we bound the error of complementary slackness. We have</p><p>where the first inequality uses the triangle inequality, the second inequality uses complementary slackness and the Lipschitz smoothness of f i (&#8226;), and the last inequality follows from Cauchy-Schwartz inequality and boundedness of &#955; k+1 . Summing up (3.34) weighted by &#945; k for k = 0, ..., K , we have</p><p>Combining the above result with (3.33) gives <ref type="bibr">(3.31)</ref>. Finally, the fact that x k+1 is a randomized K type-I KKT point for K defined in (3.32) is immediately follows from (3.30), (3.31) and Definition 3.</p><p>The following corollary shows that the output of Algorithm 1 is a randomized O(1/K ) KKT point under more specific parameter selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 1</head><p>In Algorithm 1, suppose that all the assumptions of Theorem 2 hold. Set</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2</head><p>. Moreover, for i &#8712; [m] and k &#8805; 0, we have</p><p>Plugging these values into (3.32) gives us the desired conclusion.</p><p>Remark 2 Corollary 1 shows that the gradient complexity of LCPG for smooth composite constrained problems is on a par with that of gradient descent for unconstrained optimization problems. To the best of our knowledge, this is the first complexity result for a constrained problem where the constraint functions can be nonsmooth and nonconvex. Note that the convergence rate involves the unknown bound B on the Lagrangian multipliers. The presence of such a constant is not new in nonlinear programming literature <ref type="bibr">[1,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">12,</ref><ref type="bibr">13]</ref>. Fortunately, we can safely implement LCPG method since the step-size scheme does not rely on B. On the other hand, the bound B is often a problem-dependent quantity. E.g., in <ref type="bibr">[4]</ref> authors show a class of problems for which an a priori bound B can be established, or <ref type="bibr">[5]</ref> shows the exact value of B for a class of nonconvex relaxations of sparse optimization problems. In such cases, our comparisons are arguably fair. Hence, throughout the paper, we make comparative statements under the assumption that B largely depends on the problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Stochastic optimization</head><p>The goal of this section is to extend our proposed framework to stochastic constrained optimization where the objective f 0 is an expectation function:</p><p>Here, F(x, &#958;) is differentiable and &#958; denotes a random variable following a certain distribution &#926; . Directly evaluating either the objective f 0 or its gradient can be computationally challenging due to the stochastic nature of the problem. To address this, we introduce the following additional assumptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 5</head><p>The information of f 0 is available via a stochastic first-order oracle (SFO). Given any input x and a random sample &#958; , SFO outputs a stochastic gradient</p><p>for some &#963; &#8712; (0, &#8734;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Level constrained stochastic proximal gradient</head><p>In Algorithm 2, we present a stochastic variant of LCPG for solving problem 1.1 with f 0 defined by (4.1). As observed in (4.2) and (4.3), the LCSPG method uses a minibatch of random samples to estimate the true gradient in each iteration. It should be noted that the value f 0 (x k ) is presented in (4.3) only for the ease of description, it is not required when solving (3.1).</p><p>Algorithm 2 Level constrained stochastic proximal gradient (LCSPG)</p><p>by (3.2); 6: Update x k+1 by (3.1) and update &#951; k+1 by (3.3); 7: end forOutput: x k+1 where k is a random index sampled from {0, 1, 2, ..., K }.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>123</head><p>Note that the proximal point method of <ref type="bibr">[4]</ref> does not need to account for a stochastic nonconvex problem separately since they solve corresponding stochastic convex subproblems using a ConEx method developed in their work. On the contrary, LCSPG directly applies to stochastic nonconvex function constrained problems and convex subproblems are deterministic in nature. Hence, we need to develop asymptotic convergence and convergence rates for the LCSPG method separately.</p><p>Let</p><p>) denote the error of gradient estimation. We have</p><p>The following proposition summarizes some important properties of the generated solutions of LCSPG.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proposition 3 In Algorithm 2, for any</head><p>Moreover, there exists a vector &#955; k+1 &#8712; R m + such that the KKT condition (3.6) (with &#968; k 0 defined in (4.3)) holds. Proof By the KKT condition, x k+1 is the minimizer of L k (&#8226;, &#955; k+1 ). Therefore, we have</p><p>(4.5)</p><p>Placing x = x k in (4.5) and using (3.27), we have</p><p>where the second inequality is due to the complementary slackness &#955; k+1 i</p><p>) and Lipschitz smoothness of f 0 , we have</p><p>Above, the last inequality uses the fact -a 2 x 2 + bx &#8804; b 2 2a for any x, b &#8712; R, a &gt; 0. Showing the existence of the KKT condition follows a similar argument of proving part 2, Proposition 2.</p><p>We prove a technical result in the following lemma which plays a crucial role in proving dual boundedness.</p><p>Proof We prove this result by contradiction. If the result does not hold then there exists &gt; 0 and c &gt; 0 such that</p><p>However, due to Chebyshev's inequality, we have P(</p><p>The above relation contradicts <ref type="bibr">(4.8)</ref>. Hence, we have lim k&#8594;&#8734; X k = 0 a.s.</p><p>In the following theorem, we present the main asymptotic property of LCSPG.</p><p>Theorem 3 Suppose that &#8734; k=0 b -1 k &lt; &#8734;, then we have that lim k&#8594;&#8734; x k+1x k = 0, a.s. Moreover, suppose that Assumption 3 holds, &#947; k &lt; &#8734;, &#946; k is lower bounded and 2&#947; k -&#946; k -L 0 &gt; 0, then we have that (1) sup k &#955; k &lt; &#8734; a.s., and (2) all the limit points of Algorithm 2 satisfy the KKT condition, a.s.</p><p>Proof First, we fix notations. Let (&#937;, F, P) be the probability space defined over the sampling minibatches B 0 , B 1 , . . . ,. Let E k [&#8226;] be the expectation conditioned on the sub &#963; -algebra generating B 0 , B 1 , . . . , B k-1 . Applying it to (4.4) gives</p><p>In view of the super-martingale convergence theorem <ref type="bibr">[30]</ref>, we have that lim k&#8594;&#8734; &#968; 0 (x k ) exists and is finite a.s., when</p><p>x s+1x s 2 for k &#8805; 0 and C 0 = 0. We have</p><p>Applying the super-martingale convergence theorem <ref type="bibr">[30]</ref> again we can show that the limit of &#968; 0 (x k ) + C k exists a.s. Together with (4.9) and lower-boundedness of &#946; k and 2&#947; k -&#946; k -L 0 , we have that lim k&#8594;&#8734; x k+1x k 2 = 0, a.s. Next, we prove the boundedness of &#955; k . Let us consider the events</p><p>We just argued P(B) = 1. It is easy to see that if both conditions (i)</p><p>Hence, using Lemma 4, we have that lim k&#8594;&#8734; &#950; k = 0 a.s. Due to the boundedness of &#8711; f (x k ), we have that</p><p>We prove condition (ii) by contradiction. Suppose that our claim fails. We take an element &#969; &#8712; U &#8745; (A &#8745; B) and then pass to a subsequence { j k } such that lim k&#8594;&#8734; &#955; j k (&#969;) = &#8734;. In the rest of the proof, we skip &#969; for brevity. Passing to another subsequence if necessary, let x be a limit point of {x j k }. By our presumption, x satisfies MFCQ. Moreover, the KKT condition implies that</p><p>Dividing both sides by &#955; j k +1 gives</p><p>where we denote</p><p>converges to 0. Therefore, taking k &#8594; &#8734; on both sides of (4.11), we have</p><p>Analogous to the proof of Theorem 1, it is easy to show that x violates MFCQ, which however, contradicts Assumption 3. As a consequence of this argument, we have U &#8838; A c &#8746; B c . Hence, we claim that the event sup k &#955; k &lt; &#8734; will happen a.s. and complete our proof of the boundedness condition. Next, we prove asymptotic convergence to KKT solutions. For any random element &#969;, let x(&#969;) be any limit point of {x k }. Passing to some subsequence if necessary, we assume that lim k&#8594;&#8734; x k = x and lim k&#8594;&#8734; &#955; k+1 = &#955;.</p><p>Moreover, we have</p><p>Combining the above two results, we have</p><p>Taking k &#8594; &#8734; in the above relation and noting that almost surely we have lim k&#8594;&#8734; &#950; k = 0 and lim k&#8594;&#8734; x kx k+1 = 0, then</p><p>Using an argument similar to the one in Theorem 1, we can show that x is almost surely a KKT point.</p><p>Our next goal is to develop the iteration complexity of Algorithm 2. To achieve this goal, we need to assume that the dual is uniformly bounded, namely, condition (3.9) holds for all the random events. While this condition is stronger than the almost sure boundedness of &#955; k+1 shown by Theorem 3, it is indeed satisfied in many scenarios, e.g., when strong feasibility (Assumption 4) holds or other scenarios described in <ref type="bibr">[4,</ref><ref type="bibr">5]</ref>. We present the main complexity result in the following theorem.</p><p>Theorem 4 Suppose that condition (3.9) holds. Then, the sequence {(x k+1 , &#955; k+1 )} satisfies that</p><p>Proof First, appealing to (4.3), (2.1) and (3.27), we have</p><p>It follows that</p><p>In view of the above result and basic inequality (a + b) 2 &#8804; 2a 2 + 2b 2 , we have</p><p>Let us denote an auxiliary sequence</p><p>Putting this relation and (4.15) together, we have</p><p>Summing up (4.17) over k = 0, 1, . . . , K weighted by &#945; k leads to</p><p>Moreover, since {C k } is monotonically decreasing, we have</p><p>Combining the above two relations leads to our first result (4.13).</p><p>For the second part, note that (3.34) remains valid in the stochastic setting. Putting (3.34) and (4.16) together, we obtain</p><p>Multiplying both ends by &#945; k and then summing up the resulting terms over k = 0, . . . , K gives (4.14).</p><p>We next obtain more specific convergence rate by choosing the parameters properly.</p><p>. Then x k+1 is a randomized type-I KKT point with</p><p>Proof Plugging in the value of &#947; k , &#945; k , &#946; k in the relation (4.13) and taking expectation over all the randomness, we have</p><p>Moreover, due to the random sampling of k, we have</p><p>Combining the above two results, we have</p><p>Second, plugging in the values of &#947; k , &#946; k and &#948; k in (4.14), we have</p><p>It then follows from (4.18) and the definition of k that</p><p>This completes our proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 3</head><p>In order to obtain some -error in satisfying the type I KKT condition, LCSPG requires a number of O(&#949; -2 ) calls to the SFO, which matches the complexity bound of stochastic gradient descent for unconstrained nonconvex optimization <ref type="bibr">[15]</ref>. Moreover, due to minibatching, LCSPG obtains an even better O(&#949; -1 ) complexity in the number of evaluations of f i (x) and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Level constrained stochastic variance reduced gradient descent</head><p>We consider the finite sum problem:</p><p>where each F(x, &#958; i ) is Lipschitz smooth with the parameter L 0 , i = 1, 2, . . . , n.</p><p>To further improve the convergence performance in this setting, we present a new variant of the stochastic gradient method by extending the stochastic variance reduced gradient descent to the constrained setting. We present the level constrained stochastic variance-reduced gradient descent (LCSVRG) in Algorithm 3, which extends the nonconvex variance reduced mirror descent (see <ref type="bibr">[20]</ref>) to handle nonlinear constraint. Algorithm 3 can be viewed as a double-loop algorithm in which the outer loop computes the full gradient &#8711; f (x k ) Algorithm 3 Level constrained stochastic variance-reduced gradient descent (LCSVRG) 1: Input: x 0 , x -1 , &#951; 0 &lt; &#951;, T ; 2: for k = 0, 1, . . . , K do 3:</p><p>if k%T == 0 then 4:</p><p>Sample a mini-batch B k of size b uniformly at random and compute</p><p>(4.20)</p><p>7: end if 8: Set &#968; k 0 (x) by (4.3) and set &#968; k i (x) by (3.2) for i &#8712; [m]; 9: Update x k+1 by (3.1) and update &#951; k+1 by (3.3); 10: end for once every T iterations and the nested loop performs stochastic proximal gradient updates based on an unbiased estimator of the true gradient. In this view, we let k indicate the tth iteration at the r th epoch, for some values t and r . Then we use k and (r , t) interchangeably throughout the rest of this section. We keep the notation &#950; k (or &#950; (r , j) ) for expressing G k -&#8711; f (x k ) and note that &#950; (r ,0) = 0.</p><p>Our next goal is to develop some iteration complexity results of LCSVRG. We skip the asymptotic analysis since it is similar to that of LCSPG. The following Lemma (see <ref type="bibr">[20,</ref><ref type="bibr">Lemma 6.10]</ref>) presents a key insight of Algorithm 3 that the variance is controlled by the point distances. We provide proof for completeness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 5</head><p>In Algorithm 3, G k is an unbiased estimator of &#8711; f 0 (x k ). Moreover, Let (r , t) correspond to k. If t &gt; 0, then we have</p><p>Proof We prove the first part by induction. When k = 0, we have</p><p>Next, we estimate the variance of the stochastic gradient. Appealing to (4.20), we have</p><p>where the third equality uses the independence of B k and &#950; k-1 , the first inequality uses the bound Var(x) &#8804; E x 2 , and the second inequality uses the Lipschitz smoothness of F(&#8226;, &#958;). Taking expectation over all the randomness generating B (r ,1) , B (r ,2) , . . . , B (r ,t) , we have</p><p>The next Lemma shows that the generated solutions satisfy a property of sufficient descent on expectation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 6</head><p>Assume that &#947; k = &#947; and (r ,0) ) -E &#968; 0 (x (r ,t+1) ) , 0 &#8804; t &lt; T .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(4.21)</head><p>Proof In view of (3), at the jth iteration of the r th epoch, we have</p><p>. Summing up the above result over j = 0, 1, 2, ..., t (t &lt; T ) and using Lemma 5, we have</p><p>Here we use -1 j=0 &#8226; = 0.</p><p>We present the main convergence property of Algorithm 3 in the next theorem.</p><p>Theorem 5 Suppose that condition (3.9) and assumptions of Lemma 6 hold, b &#8805; 2T and K = r 0 T + j 0 for some r 0 , j 0 &#8805; 0. Let {&#945; k } be a non-decreasing sequence and {&#945; (r , j) } be its equivalent form in (r , j) notations. Suppose that &#945; (r , j) = &#945; (r ,0) for j = 1, 2, ..., T -1. Then we have</p><p>Moreover, if we take T = &#8730; n , b = 8T , &#947; = L 0 , &#946; = L 0 /2, and &#945; k = T k/T +1, and set &#948; k = &#951;-&#951; 0 (k+1)(k+2) . Then x k+1 is a randomized Type-I -KKT point with</p><p>Proof First, using Lemma 5 and the assumption that b &#8805; 2T , for any t &#8804; T -1 we have</p><p>Note that (4.15) still holds. Therefore, combining (4.15) and (4.25) leads to</p><p>It then follows from Lemma 6 that (r ,t+1) ) . (4.26)</p><p>Let K = r 0 T + j 0 . Summing up the above inequality weighted by &#945; k and exchanging the notation &#945; k &#8596; &#945; (r , j) , then we have</p><p>Above, the first inequality applies (4.26) and uses x (r ,T ) = x (r +1,0) while the second inequality uses the monotonicity of {&#968; 0 (x k )} and an argument similar to <ref type="bibr">(3.33)</ref>.</p><p>The second part is similar to the argument of Theorem 4. Particularly, combining (3.34) and (4.21) gives</p><p>Consequently, using the above relation and an argument similar to show (3.33), we deduce</p><p>Therefore, we complete the proof of (4.22) and (4.23).</p><p>Using the provided parameter setting, we have</p><p>Remark 4 It is interesting to compare the performance of LCSVRG with the other two level constrained first-order methods in the finite sum setting <ref type="bibr">(4.19)</ref>. Similar to LCPG, LCSVRG runs for O(&#949; -1 ) iterations to compute Type-I -KKT point.</p><p>LCSVRG has an appealing feature that the number of stochastic gradient &#8711; F(x, &#958;) computed can be significantly reduced for a large value of n. Specifically, Algorithm 3 requires a full gradient &#8711; f 0 (x) every T iterations, which contributes </p><p>&#8730; nK . This is better than the O(nK ) stochastic gradients needed by LCPG. Moreover, it is better than the bound O(K 2 ) of LCSPG when K is at an order larger than &#937;( &#8730; n), which corresponds to a higher accuracy regime of 1 &#8730; n . The complexities of all the proposed algorithms for getting some &#949;-KKT solutions are listed in Table <ref type="table">2</ref>.</p><p>Remark 5 While we mainly discuss the finite-sum objective (4. <ref type="bibr">19)</ref>, it is possible to extend the variance reduction technique to handle the expectation-based objective (4.1) and improve the O(&#949; -2 ) bound of LCSPG to O(&#949; -3/2 ). To achieve this goal, we impose an additional assumption that F(x, &#958;) is L 0 -Lipschitz smooth for each &#958; in the support set. We choose to omit a detailed discussion on this particular extension, as the technical development for this can be readily derived from the arguments in Sec. 6.5.2. <ref type="bibr">[21]</ref> and our previous analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Smooth optimization of nonsmooth constrained problems</head><p>In this section, we consider the constrained problem (1.1) with nonsmooth objective and nonsmooth constraint functions. We assume that f i (i &#8712; {0, 1, ..., m}) exhibits a difference-of-convex (DC) structure f i (x) := g i (x)-h i (x): 1) h i is an L h i -Lipschitzsmooth convex function and 2) g i is a structured nonsmooth convex function:</p><p>where A i &#8712; R a i &#215;n is a linear mapping, Y i &#8834; R a i is a convex compact set and p i : Y i &#8594; R is a convex continuous function. In view of such a nonsmooth structure, we can not simply apply the LCPG method, as the crucial quadratic upper-bound on f i (x) does not hold in the nonsmooth cases. However, as pointed out by Nesterov <ref type="bibr">[29]</ref>, the nonsmooth convex function g i can be closely approximated by a smooth convex function. Let us denote y i := argmin y i &#8712;Y i y i , D Y i := max y i &#8712;Y i y iy i and define the approximation function</p><p>where &#946; i &gt; 0.</p><p>Given some properly chosen smoothing parameter &#946; i , we propose to apply LCPG to solve the following smooth approximation problem:</p><p>Prior to the analysis of our algorithm, we need to develop some properties of the smooth function f</p><p>We first present a key Lemma which builds some important connection between the quadratic approximation of smooth function and Lipschitz smoothness. The proof is left in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 7 Suppose p(&#8226;) is continuously differentiable function satisfying</head><p>for all x, y. Then, p(&#8226;) satisfies</p><p>In smooth approximation, it is shown in <ref type="bibr">[29]</ref> that g</p><p>i is a Lipschitz smooth function and it approximates the function value of g i with some O(&#946; i )-error:</p><p>&#8711;g</p><p>Similar properties of f</p><p>i are developed in the following proposition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proposition 4</head><p>We have the following properties about the approximation function f</p><p>1. Let &#946;i &#8712; [0, &#946; i ], then we have f</p><p>(5.6)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">f</head><p>)</p><p>(5.8)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">f</head><p>Namely, for any x, y, we have &#8711; f</p><p>(5.9)</p><p>On the other hand, using the boundedness of Y i , we have</p><p>Combining the above two results gives the desired inequality. Part 2. Since g</p><p>i and h i are both convex and smooth functions, we have</p><p>Summing up the above two inequalities and noting the definition of f</p><p>Similarly, using convexity of g &#946; i i and smoothness of h i , we obtain that f &#946; i i has a negative lower curvature -L h i . Part 3. The Lipschitz continuity (5.9) is an immediate consequence of part 2) and Lemma 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 6</head><p>When &#946;i = 0, Relation (5.6) reads f</p><p>Together with Assumption 2, it can be seen that x 0 is also strictly feasible for problem <ref type="bibr">(5.1)</ref>. This justifies that LCPG is well-defined for problem (5.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 7</head><p>The Lipschitz constant of &#8711; f</p><p>be derived in a different way. Since &#8711;g &#946; i i and &#8711;h i are L &#946; i g i and L h i Lipschitz continuous, respectively, we can show by triangle inequality that &#8711; f</p><p>In contrast, by exploiting the asymmetry between lower and upper curvature, Proposition 4 derived a slightly sharper bound on the gradient Lipschitz constant.</p><p>Throughout this section, we choose specific &#946; i to ensure</p><p>. Hence, we can define the additive approximation factor above as</p><p>(5.10)</p><p>Note that (5.4) provides an approximation error for function values, or the so-called zeroth-order oracle of function g i . However, convergence results for nonconvex optimization are generally given in terms of first-order stationarity measure, implying that we need approximation for first-order oracle for the function f i and consequently function g i . Below we discuss a widely used approximate subdifferential for convex functions and generalize it for nonsmooth nonconvex functions.</p><p>Definition 5 [&#957;-subdifferential] We say that a vector v &#8712; R n is a &#957;-subgradient of the convex function p(&#8226;) at x if for any z, we have</p><p>The set of all &#957;-subgradients of p at x is called the &#957;-subdifferential, denoted by &#8706; &#957; p(x). Moreover, we define &#957; subdifferential of nonconvex function f i as &#8706; &#957; f i (x) := &#8706; &#957; g i (x) + {-&#8711;h i (x)} where the addition of sets is in Minkowski sense.</p><p>Finally, we define a generalization of type-I KKT convergence criterion for structured nonsmooth nonconvex function constrained optimization problem: Definition 6 We say that a point x is an ( , &#957;) type-III KKT point of (1.1) if there exists &#955; &#8805; 0 satisfying the following conditions:</p><p>(5.13)</p><p>Moreover, we say that x is a randomized ( , &#957;) type-III KKT point of (1.1) if (5.11), (5.12) and (5.13) are satisfied in expectation.</p><p>The -subdifferential and the type-III KKT point are essential for associating smooth approximation with the original nonsmooth problem. We build some important properties in the following proposition.</p><p>Proposition 5 Let &#946; i and &#957; satisfy (5.10). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">For any x &#8712; R d , we have &#8711; f</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Suppose that x is a (randomized) Type-I -KKT point of problem</head><p>where the first inequality follows from the first relation in <ref type="bibr">(5.4)</ref>, and the third inequality follows from second relation in <ref type="bibr">(5.4)</ref>. Noting the definition of &#957;-subgradient, we conclude the proof.</p><p>Part 2. It suffices to show the conversion of randomized Type-I KKT points to randomized Type-III KKT points. Suppose that x is a randomized Type-I -KKT point and we have &#955;</p><p>Using Proposition 4, we have</p><p>This implies</p><p>Summing up this inequality over i = 1, 2, 3, ..., m and taking expectation with all the randomness, we have the</p><p>Moreover, we have</p><p>Now, we are ready to discuss the convergence rate of LCPG for nonsmooth nonconvex function constrained optimization. Theorem 6 Assume that &#946; i , &#957; satisfy (5.10) and set &#948; k = &#951;-&#951; 0 (k+1)(k+2) when running LCPG to solve problem <ref type="bibr">(5.1)</ref>.</p><p>Proof Our analysis resembles the proof of Theorem 2. Using a similar argument in (3.33), we have</p><p>(5.14)</p><p>Combining this result with Lemma 3 we obtain <ref type="bibr">15)</ref>, we see that x k+1 is a Type-I -KKT point for</p><p>), we have</p><p>Using the definition of k and Proposition 5 we obtain the desired result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Inexact LCPG</head><p>LCPG requires the exact optimal solution of subproblem (3.1), which, however, poses a great challenge when the subproblem is difficult to solve. To alleviate such an issue, we consider an inexact variant of LCPG method for which the update of x k+1 only solves problem <ref type="bibr">(3.1)</ref> approximately. This section is organized as follows. First, we present a general convergence property of inexact LCPG when the subproblem solutions satisfy certain approximation criterion. Next, we analyze the efficiency of inexact LCPG when the subproblems are handled by different external solvers. When the subproblem is a quadratically constrained quadratic program (QCQP), we propose an efficient interior point algorithm by exploiting the diagonal structure. When the subproblem has general proximal components, we propose to solve it by first-order methods. Particularly, we consider solving the subproblem by the constraint extrapolation (ConEx) method and develop the total iteration complexity of ConEx-based LCPG.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Convergence analysis under an inexactness criterion</head><p>Throughout the rest of this section, we will denote the exact primal-dual solution of (3.1) as ( x k+1 , &#955; k+1 ). We use the following criterion for measuring the accuracy of subproblem solutions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 7</head><p>We say that a point x is an -solution of (3.1) if</p><p>The following theorem shows asymptotic convergence to stationarity for inexact LCPG method under mild assumptions. Since the proof is similar to the previous argument, we present the details in Appendix B for the sake of completeness. Note that the theorem applies to a general nonconvex problem and hence applies to convex problems as well.</p><p>Theorem 7 Suppose that Assumption 3 holds and let x k+1 be an k -solution of (3.1)</p><p>Then all the conclusions of Theorem 1 still hold. Then the dual sequence { &#955;k } is uniformly bounded by a constant B &gt; 0. Moreover, every limit point of inexact LCPG is a KKT point.</p><p>Under the inexactness condition in Definition 7, we establish the complexity of inexact LCPG in the following theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 8 Under the assumptions of Theorem 7, we have</head><p>)</p><p>(6.4)</p><p>In particular, using</p><p>Proof Using (3.8) with x k+1 replaced by x k+1 (the optimal solution of problem (3.1)) and adding f (x k ) + L 0 2 x kx k+1 2 on both sides, we have</p><p>where the second inequality follows from &#968; k 0 (x k ) = &#968; 0 (x k ) as well as x k+1 being an k -solution (see Definition 7) of subproblem (3.1), and the third inequality follows from the fact that &#968; k 0 (x) &#8805; &#968; 0 (x) for all x &#8712; dom &#967; 0 . Using Lemma 3 (again x k+1 is replaced by x k+1 ), noting that k satisfies the requirements of Theorem 7 implying that &#955; k &#8804; B and using (6.6), we have</p><p>Similar to the argument of (3.34), we have</p><p>Multiplying (6.6), (6.7) and (6.8) by &#945; k and summing over k = 0, 1, . . . , K give (6.1), (6.2) and (6.3). We derive a convergence rate based on the specified parameters. First, from relation (6.6), we note that &#968; 0 (x k+1 ) &#8804; &#968; 0 (x k ) + k . Hence, we have by induction that</p><p>By setting</p><p>for all k &#8805; 0 (note that k satisfies the requirement of Theorem 7), we have</p><p>Here, (i), (ii) follows from (6.9) and &#945; k+1 -</p><p>and last inequality follows since</p><p>2 and (6.10) inside (6.4), we have (6.5). Hence, we conclude the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 8</head><p>Compared to the convergence result (3.31) for exact LCPG, we have to control the accumulated error in &#916; for the inexact case (6.4). However, we need an even more stringent condition on the error to ensure asymptotic convergence. Specifically, we assume &#949; k to be smaller than the level increments &#948; k i to ensure that each subsequent subproblem is strictly feasible. As long as the subproblems are solved deterministically with sufficient accuracy, we can ensure such feasibility as well as the boundedness of the dual.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 9</head><p>Note that the convergence analysis of the inexact method for the stochastic case will go through in a similar fashion. In particular, the subproblems of LCSPG are still deterministic in nature. Hence, a deterministic error can be easily incorporated into the analysis of the stochastic outer loop. In particular, Proposition 3 will have an additional k in the RHS. We can use k = min i&#8712;[m] &#948; k i 2 to ensure the strict feasibility. Following the analysis in Theorem 4, we will get the additional term K k=0 &#945; k k . Note that we have identical policies for &#945; k in the above analysis and Corollary 2. Furthermore, since &#948; k used above and in Corollary 2 are the same, we have identical values of k as well. Following the above development, we can easily bound the additional K k=0 &#945; k k term.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Solving the subproblem with the interior point method</head><p>Our goal is to develop an efficient interior point algorithm to solve problem (3.1)</p><p>Without loss of generality, we express the subproblem as the following QCQP:</p><p>We assume that the initial solution x of such problem is strictly feasible, namely, there exists &#948; &gt; 0 such that</p><p>Let e 1 = [1, 0, . . . , 0] T &#8712; R d+1 . With a slight abuse of notation, we can formulate (6.11) as the following problem</p><p>Here we set artificial variables L m+1 = 1, a m+1 = 0 and b m+1 = 1 2 R 2 for some sufficiently large R. We explicitly add such a ball constraint to ensure bound on (&#951;, x). Note that the bound R always exists since our domain is compact and the objective is Lipschitz continuous. Our goal is to apply the path-following method to solve (6.13). We denote</p><p>The key idea of the path-following algorithm is to approximately solve a sequence of penalized problems</p><p>with increased values of &#964; , and generate a sequence of strictly feasible solution u &#964; close to the central path-a trajectory composed of the minimizers u * &#964; = argmin u &#966; &#964; (u). We apply a standard path-following algorithm (See <ref type="bibr">[28,</ref><ref type="bibr">Chapter 4]</ref>) for solving (6.13), and outline the overall procedure in Algorithm 4. This algorithm consists of two main steps:</p><p>1. Initialization: We seek a solution u 0 near the analytic center (i.e. minimizer of &#966;(u)). To this end, we solve a sequence of auxiliary problems &#966;&#964; (u) = &#964; w T u + &#966;(u) where w = -&#8711;&#966;( &#251;). It can be readily seen that &#251; is in the central path of this auxiliary problem with &#964; = 1. Performing a reverse path-following scheme ( decreasing rather than increasing &#964; ), we gradually converge to the analytic center. 2. Path-following: We solve a sequence of penalized problems with an increasing value of &#964; by a damped version of Newton's method, which ensures the solutions in the proximity of the central path.</p><p>Algorithm 4 Path-following Interior Point Method ( <ref type="bibr">[28]</ref>)</p><p>Phase Zero: Approximate analytic center 2: for i = 0, 1, . . . do 3:</p><p>Set u * = u i+1 and Break; 7:</p><p>end if 8: end for Phase One: Path-following scheme 9: Set u 0 = u * , &#964; 0 = max{&#964; : n(&#966; &#964; , u 0 ) &#8804; &#954;}, s = &#8730; &#965; &#947; ln 2&#965; &#964; 0 &#949; -1; 10: for i = 0, 1, . . . , s do 11:</p><p>12: Obtain u i+1 from calling Newton(&#966; &#964; i+1 , u i , &#964; i+1 , &#954;); 13: end for 14: Output: u s+1 .</p><p>Damped Newton method for solving the subproblem. 15: function Newton( f , v 0 , &#964; , ) 16:</p><p>for s = 0, 1, 2, . . . do 17:</p><p>if n( f , v s ) &#8804; then Break; 18: end if 19:</p><p>end for 21: end function</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1">Solving the Newton equation</head><p>First, we calculate the gradient and Hessian map of &#966; t (&#8226;):</p><p>where &#952; i = -gi (u) -1 , and</p><p>Note that computing the gradient &#8711;&#966; t (u) takes O(dm), hence the computation burden is from forming and solving the Newton systems. This is divided into two cases.</p><p>1. m &lt; d. Then the Hessian is the sum of a low rank matrix and a diagonal matrix.</p><p>Based on the Sherman-Morrison-Woodbury formula, we have</p><p>Computing the product N T &#915; -1 N takes O(m 2 d) while performing Cholesky factorization takes O(m 3 ). Therefore, the overall complexity of each Newton step is</p><p>In such case, we can directly compute N N T in O(md 2 ) and then perform Cholesky factorization &#8711; 2 &#966; &#964; (x) = L L T in O(d 3 ), followed by two triangle systems. Hence the overall complexity of a Newton step is</p><p>Due to the above discussion, the cost of computing each Newton system is O min{d, m} &#8226; md . (6.17)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2">Complexity</head><p>Before deriving the complexity of solving the subproblems, we require some additional assumptions. We assume that M = max x &#8711;g i (x) and max g 0 (x)min x g 0 (x) &#8804; V . Note that these assumptions are easily satisfied if we assume functions in the original problem have bounded level sets.</p><p>According to [28, Theorem 4.5.1], the complexity of interior point methods depends not only on the time to follow the central path, but also on the time to arrive near the analytic center from an arbitrary initial point. Let us put it in the context of Algorithm 1. Despite the strict feasibility guarantee, we do not know whether x k is near the analytic center of each subproblem. It remains to show how to control the complexity of approximating the analytic center.</p><p>To measure the strict feasibility of the initial point, we use the Minkowsky function of the domain, which is defined by &#960; x (y) = inf{t &gt; 0 : x + t -1 (yx) &#8712; D} for any given x in the interior of the domain. With the help of the Minkowsky function, we bound the distance between the initial point and the boundary in the following proposition.</p><p>Proposition 6 Let &#251; = (g 0 ( &#951;), x) where &#951; = g 0 ( x) + &#948;. If u -&#251; &#8804; &#948; M+1 , then u is feasible for problem <ref type="bibr">(6.13)</ref>. Moreover, we have</p><p>where u * is defined in phase zero of Algorithm 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof</head><p>We have</p><p>Analogously, for i = 0, 1, . . . , m, we have</p><p>Using triangle inequality, we have gi (u) = g i (x) &#8804; g i ( x) + &#948;=0. The last constraint in (6.13) is trivially satisfied for sufficiently large R. Therefore, u is a feasible point of (6.13).</p><p>Let t + = (M+1) &#251;-u * (M+1) &#251;-u * +&#948; , then from the above analysis, we know that the point</p><p>must be a feasible solution. Using the last constraint u &#8804; R, we immediately obtain the bound <ref type="bibr">(6.18)</ref>.</p><p>Using <ref type="bibr">[28,</ref><ref type="bibr">Theorem 4.5.1]</ref> and Proposition 6, we can derive the total complexity of solving the diagonal QCQP.</p><p>Theorem 9 Under the assumptions of Proposition 6, the total number of Newton steps to get an &#949; solution is</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 3</head><p>In the inexact LCPG method, assume that the subproblems are solved by Algorithm 4 and the returned solution satisfies the inexactness requirement in Theorem 8. Then, to get an O( , ) Type-II KKT point, the overall arithmetic cost of Algorithm 4 is</p><p>Proof According to Theorem 8, the total number of LCPG is K = O(1/&#949;). In the kth iteration of LCPG, we set the error criteria &#957; = O( 1 k 2 ) and &#949; = O( 1 k 2 ). Theorem 9 implies that the number of Newton steps is N k = O( &#8730; m ln(k)). Therefore, the total number of Newton steps in LCPG is</p><p>Combining this result with (6.17) gives us the desired bound.</p><p>10 First, at the kth step of LCPG, we need log(k) iterations of interior point methods, of which the complexity order is equally contributed by the two phases of IPM. Specifically, we first require O(ln(k)) Newton steps to pull the iterates from near the boundary to the proximity of the central path, and then require O(ln(k)) to obtain an O(1/k 2 )-accurate solution. Second, it is interesting to consider the case when the constraint is far less than the feature dimensionality, namely, m d. We observe that the total computation O dm 2.5 K ln K is linear in dimensionality. Third, despite the simplicity, the basic barrier method offers a relatively stronger approximate solution than what is needed in Theorem 8, the feasibility of the solution path allows us to weaken the assumption to &#949;k = 0. Nevertheless, besides our approach, it is possible to employ long-step and infeasible primal-dual interior point methods which may give a better empirical performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Solving subproblems with the first-order method</head><p>In this section, we use a previously proposed ConEx method <ref type="bibr">[4]</ref> to solve the subproblem (3.1) when general proximal functions &#967; i are present. Then, we analyze the overall complexity of LCPG method with ConEx method as a subproblem solver. First, we formally state the extended version of problem (6.11) as follows: min x&#8712;X &#966; 0 (x) := g 0 (x) + &#967; 0 (x) s.t. &#966; i (x) := g i (x) + &#967; i (x) &#8804; 0, i = 1, . . . , m. <ref type="bibr">(6.19)</ref> For the application of ConEx for the subproblem, we need access to a convex compact set X such that &#8745; i dom &#967; i &#8838; X . Moreover, X is a "simple" set in the sense that it allows easy computation of the proximal operator of &#967; 0 (x) + i=1 w i &#967; i (x) for any given weights w i , i = 1, . . . , m. Such assumptions are not very restrictive as many machine learning and engineering problems explicitly seek the optimal solution from a bounded set. Under these assumptions, we apply ConEx to solve the subproblem (3.1) of LCPG. We now reproduce a simplified optimality guarantee of the ConEx method below without necessarily going into the details of the algorithm.</p><p>Theorem 10 <ref type="bibr">[4]</ref> Let x be the output of ConEx after T iterations for problem <ref type="bibr">(6.19)</ref>. Assume that &#966; 0 is a strongly convex function and ( x, &#955;) is the optimal primal-dual solution. Moreover, Let B be a parameter of the ConEx method which satisfies B &gt; &#955; . Then, the solution x satisfies</p><p>123</p><p>Even though ConEx can be applied to a wider variety of convex function constrained problems, it has two vital and intricate issues that need to be addressed in our context:</p><p>1. The solution path of ConEx can be arbitrarily infeasible in the early iterations, while the successive iterations make the solutions infeasibility smaller. Note that the approximation criterion in Definition 7 requires guarantees on the amount of infeasibility. This implies ConEx has to run a significant number of iterations before getting sufficiently close to the feasible set. 2. Since ConEx is a primal-dual method, its convergence guarantees depend on the optimal dual solution &#955; * . Moreover, a bound on the dual, B(&gt; &#955; * ), is required to implement the algorithm to achieve an accelerated convergence rate of O(1/T</p><p>2 ) for strongly convex problems. From Theorem 10, it is clear that ConEx requires a bound B. This requirement naturally leads to two cases: (1) bound B can be estimated apriori, e.g., see Lemma 2; and (2) bound B is known to exist but cannot be estimated, e.g., see Theorem 1. Both cases have different convergence rates for the subproblem which leads to different overall computational complexity. Case 1: B can be estimated apriori. In this case, we do not need to estimate B k as in (6.20). Using the bound B, we can get accelerated convergence of ConEx in accordance with Theorem 10 which leads to better performance of the LCPG method. The corollary below formally states the total computational complexity of LCPG method for this case. Corollary 4 If an explicit value of B is known, the LCPG method with ConEx as subproblem solver obtains O( 1 K , 1 K ) type-II KKT point in O(K 2 ) computations. Proof According to Theorem 10, the required ConEx iterations for each subproblem can be bounded by</p><p>Since B is a constant, we have</p><p>. Hence, we conclude the proof. Case 2: B is known to exist but cannot be estimated. For the subproblem (3.1), we can easily find B k &gt; &#955; k+1 by using the difference in levels of successive iterations. This bound is weak, especially in the limiting case as it does not take into account Proposition 7 For subproblem (3.1), we have</p><p>(6.20)</p><p>Proof By Slater's condition, we know that &#955; k+1 exists. Then, due to saddle point property of ( x k+1 , &#955; k+1 ), we have for all x &#8712; X</p><p>where the equality follows by complementary slackness. Using x = x k in the above relation and noting that</p><p>where second inequality follows from &#955; k+1 &#8805; 0 and &#948; k &gt; 0 and last inequality follows due to the fact that &#955; k+1 1 &#8805; &#955; k+1 . We can further upper bound the LHS of the above relation as follows</p><p>where the last inequality follows since x k+1 is feasible for the original problem (1.1).</p><p>Combining the above two relations, we obtain (6.20). Hence, we conclude the proof.</p><p>We now state the final computation complexity of LCPG withConEx which uses the bound in <ref type="bibr">(6.20)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 5</head><p>If an explicit value of B is not known, the LCPG method with ConEx as subproblem solver obtains O( 1 K , 1 K ) type-II KKT point in O(K 4 ) computations.</p><p>Proof Using Proposition 7, we can set B k :=</p><p>where i * := argmin i&#8712;m &#951; i&#951; 0 i . Then, required ConEx iterations T k can be bounded by</p><p>Finally, in view of (6.9) and the fact that</p><p>) which is the overall computational complexity of LCPG method with ConEx as subproblem solver to obtain</p><p>Remark 11 [Gradient complexity vs. computational complexity] Note that evaluating the gradient of &#968; k i (x) is relatively simple since it does not involve any new computation of &#8711; f i (x). In that sense, the entire inner loop requires only one &#8711; f i computation; hence the total gradient complexity of &#8711; f i equals the total outer loops of inexact LCPG. On the other hand, inner loop computation does contribute to the problem's computational complexity. However, such iterations are expected to be very cheap given the ease of obtaining gradients for the QP subproblem (3.1) with identity hessian matrices.</p><p>where the last inequality uses the fact that -1 2 a + b 2 &#8804; -1 2 a 2a, b &#8804; - 1  2 a 2 + a b with a = xz + and b = z +x + . Finally, combining the above two results gives the desired inequality (7.1).</p><p>Using the above lemma, we provide the main convergence property of LCPG for convex optimization.</p><p>Lemma 9 Let x be feasible solution. Then, we have</p><p>where (i) follows from the definition of &#968; k 0 , (ii) follows since x k+1 is an k solution of (3.1), (iii) follows by complementary slackness for the optimal primal-dual solution for (3.1) and (iv) follows from Lemma 8. In particular, we use g(x)</p><p>Finally, note that</p><p>Using the above two relations in (7.7), we obtain (7.6). Hence, we conclude the proof.</p><p>Let x * be an optimal solution of (1.1) and D := max{ xy : x, y &#8712; dom &#967; 0 , &#968; i (x) &#8804; &#951; i , &#968; i (y) &#8804; &#951; i , for all i &#8712; [m]}. Now, we show convergence rate guarantees.</p><p>Theorem 11 Consider general convex optimization problems with &#956; 0 = 0. Suppose Assumption 3 is satisfied and set &#948; k = (&#951;-&#951; 0 ) (k+1)(k+2) . Then we have</p><p>123</p><p>Proof From Lemma 9 with &#956; 0 = 0 for convex part and &#968;(x * ) &#8804; &#951;, we have</p><p>Dividing both sides by L 0 + &#955; k+1 ,L 2 , we have</p><p>Note that the sequence { &#955; k+1 } is uniformly bounded above such that &#955; k+1 &#8804; B for all k &#8805; 0. Using this fact and the above relation, we have</p><p>Using &#948; k = &#951;-&#951; 0 (k+1)(k+2) and k = &#951;-&#951; 0 2(k+1)(k+2) , we have x k is strictly feasible solutions for (3.1) for all k. Hence, under Assumption 3, we can follow the steps of Theorem 1 to show uniform bound B on sequence { &#955; k }. Using these values in (7.9), we have</p><p>Due to the optimality of the exact solution x k+1 , we have</p><p>Combining these two relations, we get:</p><p>Effectively, inexact LCPG method is almost (up to an additive error of k ) a descent method. Using this relation recursively, we have</p><p>Using the above relation in (7.10), we have</p><p>Multiplying the above relation by k + 1 and summing from k = 0 to K -1, we have</p><p>After rearranging, this relation implies <ref type="bibr">(7.8)</ref>. Hence, we conclude the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 12</head><p>Consider strongly convex problems (&#956; 0 &gt; 0) and suppose that Assumption 3 is satisfied. Set &#948; k = &#961; k (1 -&#961;)(&#951; -&#951; 0 ) where &#961; = L 0 -&#956; 0 2(L 0 -a&#956; 0 ) , 2 k &#8804; a(1 -&#961;)&#961; k &#951; -&#951; 0 and a &#8712; (0, 1). Then we have</p><p>Moreover, if k = 0, we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>.12)</head><p>Proof Proceeding similar to the convex case, using Lemma 9, we obtain</p><p>Combining the above two results, we have</p><p>Let us denote</p><p>Multiplying both sides of the above inequality by</p><p>. Using these relations in (7.13), we have</p><p>Similar to the convex part, we also have</p><p>Using the above relation into (7.14), we have</p><p>Summing the above relation from k = 0 to K -1, we have</p><p>Note that</p><p>Combining the above two relations we obtain the desired result <ref type="bibr">(7.11)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Numerical study</head><p>In this section, we conduct some preliminary studies to examine our theoretical results and the performance of the LCPG method. The experiments are run on CentOS with Intel Xeon (2.60 GHz) and 128 GB memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">A simulated study on the QCQP</head><p>In the first experiment, we compare LCPG with some established open-source solvers such as CVXPY <ref type="bibr">[11]</ref> and DCCP <ref type="bibr">[32]</ref>. We consider the penalized Quadratically Constrained Quadratic Program (QCQP) described as follows, min</p><p>where each Q i (0 &#8804; i &#8804; m) is an n &#215; n matrix, b 0 , b 1 , . . . , b m are n-dimensional real vectors, &#945; is a positive weight on the &#8467; 1 norm penalty, which helps to promote sparse solution. In the first setting, we consider a convex constrained problem where each Q i is a positive semidefinite matrix. We set Q i = V DV T where V is an n &#215; n random sparse matrix with density 0.01, and its nonzero entries are uniformly distributed in [0, 1]. D is a diagonal matrix whose diagonal elements are uniformly distributed in between [0, 100]. We set b i = 10e +v, where e is a vector of ones and v &#8712; N (0, I n&#215;n ) is sampled from standard Gaussian distribution. We set c i = -10 to make x = 0 a strictly feasible initial solution. Furthermore, we add a ball constraint to ensure that the domain is a compact set. We set r = &#8730; 20 and &#945; = 1. We fix m = 10 and explore different dimensions n from the set {500, 1000, 2000, 3000, 4000}.</p><p>We solve Problem (8.1) by both CVXPY and LCPG. Both use the initial solution x = 0. For CVXPY, we use MOSEK as the internal solver due to its superior performance in quadratic optimization. In LCPG, for simplicity, we also solve the diagonal quadratic subproblem by MOSEK through CVXPY. Note that calling the external API repetitively for each LCPG subproblem only causes more overheads to run LCPG. Nonetheless, as we shall see, the standard IPM solvers can still fully leverage the diagonal structure and exhibit fast convergence.</p><p>In Table <ref type="table">3</ref>, we present the experiment results of the compared algorithms. The final objective, the norm of the dual solution (DNorm), and for LCPG, the maximum dual norm in the solution path (Max DNorm) are reported. All values represent the average of 5 independent runs. From the results, we observe that while LCPG does not outperform CVXPY for the small-size problem (n = 500), LCPG becomes increasingly favorable as the problem dimension increases. This justifies the empirical advantage of our proposed approach as we do not need to construct a full Hessian matrix. Moreover, interestingly, we observe that the dual solution norm { &#955; k } is increasing, reaching the maximum at the last iteration. This accounts for the equal values of DNorm and Max DNorm. Meanwhile, in all the cases, the dual remains bounded and the reported dual norm closely aligns with the solution returned by CVXPY. This result confirms our intuition that the dual bound is intricately tied to the nature of problems.</p><p>In the second setting of this experiment, we examine the performance of LCPG on nonconvex constrained optimization. Specifically, we express Q i as the difference of two matrices: Q i = P i -S i , where P i is generated in the same manner as Q i in the first setting, and S i = 10I n&#215;n . Given the construction of the quadratic components, it is natural to view the function 1  2 x T Q i x + b T i x + c i as a difference of two convex quadratic functions: 1  2 x T P i x + b T i x + c i -1 2 x T S i x. Leveraging this decomposition, we apply the DC programming, and more specifically, the DCCP framework to solve and LCSVRG. We apply all these algorithms to a sparsity-induced finite-sum problem, wherein a nonconvex constraint is incorporated into the supervised learning framework to actively enforce a sparse solution. The optimization problem is as follows min where f i (x) is a smooth loss function associated with the ith sample, &#968; 1 (x) is the difference between &#8467; 1 -penalty and a convex smooth function g(x). Employing a difference-of-convex constraint is seen as a tighter relaxation of the cardinality constraint x 0 &#8804; &#954; than the &#8467; 1 relaxation. The appealing properties of difference-ofconvex penalties have been demonstrated in various studies <ref type="bibr">[5,</ref><ref type="bibr">14,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref>.</p><p>In view of the concave structure of -g(x), there is a strong asymmetry between the lower and upper curvature of -g(x), namely, the following -L g 2 yx 2 -&#8711;g(y) T (xy)g(y) &#8804; -g(x) &#8804; -g(y) -&#8711;g(y) T (xy) <ref type="bibr">(8.3)</ref> holds for certain L g &gt; 0. Note that this is much stronger than the L g smoothness condition which adds an extra L g 2 yx 2 on the right-hand side of <ref type="bibr">(8.3)</ref>. Due to this feature, one can impose a tighter piece-wise linear surrogate function constraint &#946; x 1g(x k ) -&#8711;g(x k )(xx k ) &#8804; &#951; k 1 in the LCPG subproblem. It should be noted that our analysis is readily adaptable to accommodate this scenario since it is the smoothness, as opposed to concavity/convexity, that plays a central role in our convergence analysis and that remains valid. An empirical advantage of this approach is that we now have a tractable subproblem solvable in nearly linear time. See more discussion in <ref type="bibr">[5]</ref>.</p><p>Our experiment considers the task of binary classification with logistic loss, denoted by f i (x) = log(1 + exp(-b i (a T i x)), where a i &#8712; R d , b i &#8712; {1, -1}, 1 &#8804; i &#8804; n. We use the SCAD penalty g(x) = d j=1 h &#946;,&#952; (x j ) where h &#946;,&#952; (&#8226;) is defined in <ref type="bibr">(3.22)</ref>. We use the real-sim dataset from the LibSVM repository <ref type="bibr">[10]</ref> and the covtype data from the UCI repository <ref type="bibr">[19]</ref>. For the latter, we formulate a binary classification task by distinguishing class "3" from the other classes. We set &#946; = 2, &#952; = 5, and &#951; 1 = &#963; d, with &#963; &#8712; {0.4, 0.6} for covtype and &#963; &#8712; {0.1, 0.2} for real-sim dataset. For each algorithm, we use its theoretically suggested batch size and stepsize. for a fair comparison, we count n evaluations of the stochastic gradient as an effective pass over the dataset and plot the objective value over the number of effective passes. Figure <ref type="figure">3</ref> plots the convergence result of the compared algorithms. It can be readily seen that LCSPG performs better than LCPG in terms of the number of gradient samples, and LCSVRG achieves the best performance among the three algorithms. The empirical findings further confirm our theoretical complexity analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusion</head><p>In this work, we presented a new LCPG method for nonconvex function constrained optimization which can achieve gradient complexity of the same order as that of unconstrained nonconvex problems. The key ingredient in our algorithm design is the use of constraint levels to ensure the subproblem feasibility, which allows us to overcome a well-known difficulty in bounding the Lagrange multipliers in the presence of nonsmooth constraints. Moreover, a merit of our convergence analysis is its striking similarity with that of gradient descent methods for unconstrained problems. Therefore, we can easily extend our method to minimizing stochastic, finite-sum, and structured nonsmooth functions with nonconvex function constraints; many of the complexity results were not known before. Another important feature of our work is that the method can deal with complex scenarios where the subproblems are not exactly solvable. To the best of our knowledge, existing work on sequential convex optimization (SQP, MBA) only assumes the subproblems to be exactly solved. We provided a detailed complexity analysis of LCPG when the subproblems are inexactly solved by customized interior point method and first-order method. Finally, we clearly distinguished the notion of gradient complexity from that of computational complexity. In terms of gradient complexity, all of our proposed methods are state-of-the-art and easy to implement. Whether the computational complexity can be further improved for composite cases remains an open problem that we leave as a future direction.</p><p>Funding Open access funding provided by SCELC, Statewide California Electronic Library Consortium Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit <ref type="url">http://creativecommons.org/licenses/by/4.0/</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Proof of Lemma 7</head><p>Let {&#948; n } n&#8805;1 be a function of C &#8734; -smooth, real-valued mollifier functions over R d where, for every n &#8805; 1, we have: (i) &#948; n &#8805; 0, (ii) &#948; n (&#964; )d&#964; = 1, and (iii) and &#948; n (&#964; ) = 0 for &#964; satisfying &#964; &#8805; for all x, y. Note that p n is C &#8734; -smooth. Hence, using Taylor's theorem, there exist &#958; &#8712; [x, y] such that p n (x)p n (y) -&#8711; p n (y), xy = 1 2 xy, &#8711; 2 p n (&#958; )(xy) + o( xy 2 ). Using this relation along with (A.1) and denoting v := yx, we have</p><p>Now, diving both sides of the above relation by v 2 , taking y &#8594; x which implies &#958; &#8594; x and &#8711; 2 p n (&#958; ) &#8594; &#8711; 2 p n (x), we have</p><p>for all v and x. The above relation is equivalent to the fact that &#8711; 2 p n (x) &#8804; max{L, &#956;} for all x. From here, for any x, y, we have &#8711; p n (x) -&#8711; p n (y) = 1 t=0 &#8711; 2 p(y + t(xy))(xy)dt &#8804; 1 t=0 &#8711; 2 p(y + t(xy))(xy) dt &#8804; 1 t=0 &#8711; 2 p(y + t(xy)) xy dt &#8804; max{L, &#956;} xy . Now, taking n &#8594; &#8734; and noting that &#8711; p n (x) &#8594; &#8711; p(x) for all x, we have <ref type="bibr">(5.3)</ref>. Hence, we conclude the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Theorem 7</head><p>First of all, using the definition of k and the fact that &#968; k i (x k+1 ) &#8804; &#951; k i + k for all i &#8712; [m], we have</p><p>Hence x k is strictly feasible solution of (3.1). Due to Slater condition, there exists a pair of optimal primal and dual solutions, which we denote by xk+1 and &#955; k+1 . We first prove the following lemma. Our argument will be based on the following key result. Proof Using optimality of x k+1 , strong convexity of &#968; k 0 , feasibility of x k for the subproblem (3.1) and &#968; k 0 (x k+1 ) &#8804; &#968; k 0 ( x k+1 ) + k , we have,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 10 lim</head><p>Since k is summable, we have that x kx k+1 2 is summable. This implies x kx k+1 &#8594; 0. Since k &#8594; 0 and x k+1 is a unique optimal solution of (3.1), we have x k+1x k+1 &#8594; 0. Then, using Cauchy-Schwarz inequality, we have x k+1 -x k &#8804; x k+1x k+1 + x k+1x k and hence, x k+1x k &#8594; 0. Hence, we conclude the proof. Now, we show boundedness of &#955;k+1 . Assume, for the sake of contradiction, that &#955;k+1 is unbounded. Let x be a limit point of {x k }. Passing to a subsequence if necessary, we have x k &#8594; x. Using Lemma 10, we have x k+1 &#8594; x and x k+1 &#8594; x. Then, we have</p><p>Note that the above relation is comparable to (3.10) up to an error term of k . Following the arguments in the proof of Theorem 1 (Part 1, (3.10) onwards) and noting that k is summable, we conclude that { &#955;k+1 } is bounded. Now, we prove limit point of {x k } is a KKT point. Since L k (x k+1 , &#955;k+1 ) &#8804; L k ( x k+1 , &#955;k+1 ) + k , we rewrite <ref type="bibr">(3.18)</ref> as</p><p>Let x be a limit point of sequence {x k }. Since &#955;k+1 is bounded, we assume limit point &#955;. Without loss of generality, we have x k &#8594; x and &#955;k+1 &#8594; &#955;. Then, in view of Lemma 10, we have lim k&#8594;&#8734; x k+1 &#8594; x and lim k&#8594;&#8734; x k+1 &#8594; x. Taking limit k &#8594; &#8734; in (B.2), we have &#8711; f 0 ( x) + m i=1 &#955;i &#8711; f i ( x), xx + &#967; 0 ( x) -&#967; 0 (x) + &#955;, &#967; ( x) -&#967;(x) &#8804; 0.</p><p>Note that the above equation matches with <ref type="bibr">(3.19)</ref> exactly. From here, we follow the proof of Theorem 1 (Part 2, (3.19) onwards) to conclude first-order stationarity of</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>This assumption is used to ensure the limit point of a certain sequence in the analysis. There are other conditions that can ensure the existence of a limit point as we will discuss in Remark 1.</p></note>
		</body>
		</text>
</TEI>
