<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/07/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10428485</idno>
					<idno type="doi">10.1007/s10107-023-01981-1</idno>
					<title level='j'>Mathematical Programming</title>
<idno>0025-5610</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Frank E. Curtis</author><author>Michael J. O’Neill</author><author>Daniel P. Robinson</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A worst-case complexity bound is proved for a sequential quadratic optimization (commonly known as SQP) algorithm that has been designed for solving optimization problems involving a stochastic objective function and deterministic nonlinear equality constraints. Barring additional terms that arise due to the adaptivity of the monotonically nonincreasing merit parameter sequence, the proved complexity bound is comparable to that known for the stochastic gradient algorithm for unconstrained nonconvex optimization. The overall complexity bound, which accounts for the adaptivity of the merit parameter sequence, shows that a result comparable to the unconstrained setting (with additional logarithmic factors) holds with high probability.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>We present a worst-case complexity analysis of an algorithm for minimizing a smooth objective function subject to nonlinear equality constraints. (Due to the nature of the algorithm, this worst-case complexity analysis holds in terms of iterations, function evaluations, and derivative evaluations.) Problems of this type arise in various important applications throughout science and engineering, including optimal control, PDE-constrained optimization, and resource allocation <ref type="bibr">[3,</ref><ref type="bibr">4,</ref><ref type="bibr">20,</ref><ref type="bibr">29]</ref>. However, unlike the vast majority of the literature on equality constrained optimization, the algorithm that we consider has been designed to solve problems in which the objective function is stochastic, in the sense that it is defined by the expectation of a function that has a random variable argument. The algorithm that we consider assumes that evaluations of the objective function and its gradient are intractable to obtain, but that it has access to (unbiased) stochastic gradient estimates.</p><p>A few algorithms have been proposed recently for solving problems of this type. These approaches fall into two categories: penalty methods <ref type="bibr">[9,</ref><ref type="bibr">25,</ref><ref type="bibr">30]</ref> (which includes the class of augmented Lagrangian methods) and sequential quadratic optimization (commonly known as SQP) methods <ref type="bibr">[2,</ref><ref type="bibr">23]</ref>. Penalty methods aim to solve the constrained optimization problem by adding a term to the objective function, weighted by a penalty parameter, that penalizes constraint violation. Unconstrained optimization techniques are then applied to minimize the resulting penalty function, after which the penalty parameter may be modified and the minimization is performed again in an iterative manner until a solution is obtained that (approximately) satisfies the original constraints. Methods of this type perform well in some situations, but in others they perform poorly, e.g., due to ill-conditioning and/or nonsmoothness of the subproblems. Such methods also often suffer due to their sensitivity to the particular scheme used for updating the penalty parameter. See <ref type="bibr">[28]</ref> for further commentary on the advantages and disadvantages of penalty methods.</p><p>In practice in both deterministic and stochastic optimization contexts, penalty methods are frequently outperformed by SQP methods. Indeed, it is commonly accepted in the deterministic optimization literature that a state-of-the-art algorithm is an SQP method that chooses stepsizes based on a line search applied to a merit function. In this deterministic setting, such an algorithm is intimately connected with applying Newton's method to the first-order primal-dual necessary conditions for optimality of the problem <ref type="bibr">[32]</ref>.</p><p>In this paper, we present a worst-case complexity analysis of the SQP method proposed in <ref type="bibr">[2]</ref>, which can be seen as an extension of an SQP method from the deterministic to the stochastic setting. A consequence of our analysis is that, in an idealized setting in which (i) one knows a threshold for the merit parameter below which knowledge of the exact gradient would not lead to a merit parameter decrease in any iteration (which, for one thing, is a threshold for the merit function to be exact <ref type="bibr">[18]</ref>) and (ii) the algorithmic rule that might otherwise modify the merit parameter is skipped, the number of iterations required until the method generates a point at which first-order necessary conditions for optimality hold in expectation with accuracy &#949; &#8712; (0, &#8734;) is O(&#949; -4 ). This is the type of result that one should expect, since this is the same bound proved to hold for a stochastic gradient method employed to solve an unconstrained nonconvex problem <ref type="bibr">[14]</ref>. However, our analysis does not only consider this idealized setting; we go further and prove a worst-case complexity bound for the algorithm when the merit parameter threshold is unknown and the algorithm adaptively updates a monotonically nonincreasing merit parameter sequence. We prove under reasonable assumptions that the aforementioned worst-case bound, with additional logarithmic factors, holds with high probability. The high-probability aspect of this result arises due to the uncertainty of the behavior of the adaptive merit parameter sequence that is a consequence of the uncertainty due to the stochastic gradient estimates, and it does not reflect any uncertainty of the behavior of the method during situations in which the merit parameter sequence remains constant. A major challenge in our analysis is accounting for the transient behavior of the adaptive merit parameter sequence in the sense that our analysis accounts for the real possibilities that (i) different runs of the algorithm may decrease the merit parameter by varying amounts and (ii) it is not possible to bound the number of iterations until the merit parameter settles on a sufficiently small value. In other words, since our loose assumptions do not allow for one to presume that the algorithm experiences two distinct phases (e.g., one in which-over a bounded number of iterations-the merit parameter is reduced down to a threshold, then a second in which it remains fixed), our analysis must account for the possibility that the merit parameter is transient through any number of iterations, which leads to significant complications that are not present in the context of unconstrained optimization.</p><p>To the best of our knowledge, ours is the first worst-case complexity result for an SQP algorithm that operates in the highly stochastic regime (where one merely presumes that the stochastic gradient estimates have bounded variance) for solving stochastic optimization problems involving deterministic nonlinear equality constraints. Prior to this work, the only known complexity results for stochastic constrained optimization were for algorithms for solving problems with simple constraint sets that enable projection-based methods <ref type="bibr">[14,</ref><ref type="bibr">16,</ref><ref type="bibr">26]</ref> and Frank-Wolfe type methods <ref type="bibr">[19]</ref>. (One exception is a complexity bound proved for the SQP algorithm proposed in <ref type="bibr">[23]</ref>, although that result only holds for the idealized setting in which the algorithm has a priori knowledge of a threshold for the merit function parameter.) After the initial release of this paper, the article <ref type="bibr">[24]</ref> appeared that proposes an SQP method that uses stochastic estimates of the objective gradient and Hessian of the Lagrangian and computes search directions using a sketch-and-project framework. The algorithm in that paper does not use a merit function and requires that the stepsizes remain within prescribed intervals that are non-adaptive to the stochastic gradients. Under standard assumptions, that algorithm offers a worst-case complexity result that is comparable to the one we derive in this paper, as well as an asymptotic convergence rate. Our analysis focuses a great deal on the complications that arise due to the adaptivity of the merit parameter sequence, which essentially means that the algorithm in our consideration is aiming to reduce a merit function that changes during the optimization process. Hence, many aspects of our analysis are quite distinct from the analyses that have been presented for stochastic gradient methods in the context of unconstrained optimization or optimization over simple constraint sets, for which the tool for measuring the progress of an algorithm-namely, the objective function itself-remains the same throughout the optimization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Problem formulation</head><p>The algorithm that we consider is designed to solve problems of the form min x&#8712;R n f (x) s.t. c(x) = 0, with f (x) = E[F(x, &#969;)], <ref type="bibr">(1)</ref> where f : R n &#8594; R, c : R n &#8594; R m , &#969; is a random variable with associated probability space (&#937;, F, P), F : R n &#215; &#937; &#8594; R, and E represents expectation with respect to P.</p><p>In particular, similar to <ref type="bibr">[2]</ref>, we make the following assumption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 1</head><p>The objective function f : R n &#8594; R is continuously differentiable and bounded below by f low &#8712; R and the corresponding gradient function &#8711; f : R n &#8594; R n is bounded and Lipschitz continuous with constant L &#8712; (0, &#8734;). The constraint function c : R n &#8594; R m (where m &#8804; n) and the corresponding Jacobian function J := &#8711;c : R n &#8594; R m&#215;n are bounded, each gradient function &#8711;c i : R n &#8594; R n is Lipschitz continuous with constant &#947; i for all i &#8712; {1, . . . , m}, and the singular values of J &#8801; &#8711;c are bounded below and away from zero.</p><p>Defining the Lagrangian : R n &#215; R m &#8594; R corresponding to (1) by (x, y) := f (x) + c(x) y, first-order primal-dual stationarity conditions for <ref type="bibr">(1)</ref>, which are necessary for optimality under Assumption 1, are given by 0 = &#8711; x (x, y) &#8711; y (x, y) = &#8711; f (x) + &#8711;c(x)y c(x) .</p><p>(2)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Notation</head><p>We adopt the notation that &#8226; denotes the 2 -norm for vectors and the vector-induced 2 -norm for matrices. We denote by S n the set of n &#215; n dimensional real symmetric matrices. The set of nonnegative integers is denoted as N := {0, 1, 2, . . . , }. For any integer k &#8712; N, we use <ref type="bibr">[k]</ref> to denote the subset of nonnegative integers up to k, namely, [k] := {0, . . . , k}. Correspondingly, to represent a set of vectors {v 0 , . . . , v k }, we define v [k] := {v 0 , . . . , v k }.</p><p>Given &#966; : R &#8594; R and &#981; : R &#8594; [0, &#8734;), we write</p><p>The algorithm that we analyze is iterative, generating in each realization a sequence {x k }. (See Sect. 4.1 for a complete description of the stochastic process generated by the algorithm.) We also append the iteration number to other quantities corresponding an iteration, e.g., f k := f (x k ) for all k &#8712; N.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Outline</head><p>Section 2 provides a worst-case complexity result for the algorithm from <ref type="bibr">[2]</ref> for the deterministic setting, and uses this result and further commentary to provide an overview of our main result for the stochastic setting. Details of the algorithm for the stochastic setting are presented in Sect. 3, followed by our main result and analysis, which are provided in Sect. 4. Finally, we provide concluding thoughts and mention future directions in Sect. 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Outline of main results</head><p>Our algorithm of consideration is derived from Algorithm 3.1 in <ref type="bibr">[2]</ref>, which is proposed for the stochastic setting. As a precursor, Algorithm 2.1 was proposed in <ref type="bibr">[2]</ref> for the deterministic setting, many of the features of which are used in Algorithm 3.1 in <ref type="bibr">[2]</ref> for the stochastic setting. In Algorithm 2.1 in <ref type="bibr">[2]</ref> for the deterministic setting, the kth search direction d k &#8712; R n is computed by solving a subproblem defined by a quadratic approximation of the objective function and an affine approximation of the constraints using derivative information at the current iterate x k &#8712; R n . This computation also results in a Lagrange multiplier vector y k &#8712; R m . The subsequent iterate is set by</p><p>In particular, based on properties of the search direction d k , a value of the merit parameter &#964; k &#8712; (0, &#964; k-1 ] is set by the algorithm, after which &#945; k &#8712; (0, &#8734;) is computed to ensure that &#966;(x k , &#964; k ) -&#966;(x k+1 , &#964; k ) is sufficiently positive. This description of the algorithm suffices for providing an outline of our main results; for further details about the algorithm that we analyze, see Sect. 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Complexity of the deterministic algorithm</head><p>To motivate our main result for the stochastic setting, it is instructive to state a worstcase complexity bound for the deterministic algorithm. Such a result is the following; further details and a proof are provided in Appendix A. The approximate stationarity conditions in (3) have been defined for consistency between the primal and dual stationarity measures; the reason that this choice leads to such consistency is revealed in the analysis in Appendix A. <ref type="bibr">[2]</ref> and suppose that Assumption 1 holds along with Assumption 2.4 from <ref type="bibr">[2]</ref> (see also the similar Assumption 2 on page 8 for our stochastic setting). Let &#964; -1 &#8712; R &gt;0 be the initial value of the merit parameter sequence and let &#964; min &#8712; (0, &#964; -1 ] be a positive lower bound for the merit parameter sequence (the existence of which follows from Lemma 2.16 in <ref type="bibr">[2]</ref>). Then, for any &#949; &#8712; (0, 1), there exists</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 1 Consider Algorithm 2.1 in</head><p>in a number of iterations no more than</p><p>Theorem 1 is notable since it shows that Algorithm 2.1 in <ref type="bibr">[2]</ref> has the standard worst-case iteration, function-evaluation, and derivative-evaluation complexity for a first-order-derivative-based algorithm for solving equality constrained optimization problems; see <ref type="bibr">[8]</ref> for a comparable complexity bound for another "short-step" algorithm for solving equality constrained problems. That said, the O(&#949; -2 ) bound in Theorem 1 is not surprising. After all, such a complexity bound is well-known for gradient-based algorithms for solving unconstrained nonconvex optimization problems as well. Since Algorithm 2.1 in <ref type="bibr">[2]</ref> and its corresponding analysis do not exploit the use of exact higher-order derivative information, this complexity bound is on the order of what could be expected for such a method. There are ways for achieving improved worst-case complexity if higher-order derivatives are employed in a constrained-optimization algorithm (see, e.g., <ref type="bibr">[7,</ref><ref type="bibr">13,</ref><ref type="bibr">17]</ref>), but the use of such higherorder derivatives-especially in the stochastic setting-is outside of our scope.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Preview of the complexity of the stochastic algorithm</head><p>Moving to the stochastic setting, there are a few major technical hurdles that need to be addressed, all of which relate to the adaptivity of the merit parameter sequence. In particular, the analysis for the deterministic setting relies heavily on the facts that (i) each step of the algorithm yields a sufficient reduction in the merit function for the current value of the merit parameter, (ii) each such reduction in the merit function can be tied to a first-order primal-dual stationarity measure for the current iterate, and (iii) under Assumption 1, one can be certain of the existence of a positive lower bound for the merit parameter sequence. This lower bound for the merit parameter is referenced directly in the proof of the worst-case bound for the deterministic algorithm; in particular, it is shown (see Lemma 13 and the beginning of the proof of Theorem 4) that the improvement in the merit function from any iterate that is not &#949;-stationary (see <ref type="bibr">(3)</ref>) is at least proportional to min{1, &#964; min }&#949; -2 , even if the current value of the merit parameter is greater than &#964; min . Unfortunately, these properties of the steps and merit parameter sequence are not certain in the stochastic setting. For example, as discussed in <ref type="bibr">[2]</ref>, it is possible-even under Assumption 1-for the merit parameter sequence to vanish or for it to eventually remain constant at a value that is not sufficiently small, and for there to be iterations in which the expected reduction in the merit function cannot be tied to a first-order primal-dual stationarity measure. As a result, we have devised new analytical approaches that confront the fact that {&#964; k } is a random process, the ultimate behavior of which is uncertain.</p><p>To aid the reader, we provide here an overview and commentary about our ultimate complexity bound; see Corollary 1 on page 27. Our result is proved under Assumption 1 along with others that are introduced in the subsequent sections. For one thing, as is common in SQP methods for deterministic optimization, we assume that the subproblem defining the search direction in each iteration is defined by a matrix that is positive definite in the null space of the constraint Jacobian; see Assumption 2 on page 8. We also assume, as is common for stochastic gradient methods, that the stochastic gradient estimates are unbiased with variance bounded by M &#8712; (0, &#8734;), along with some related assumptions; see Assumption 3 on page 14. Furthermore, our analysis conditions on the occurrence of an event that we call E (see <ref type="bibr">(15)</ref>); this event captures situations in which, over a total of k max + 1 &#8712; N iterations, the merit parameter is reduced at most s max &#8712; [k max ] times and the merit parameter is bounded below by &#964; min &#8712; (0, &#8734;). Under these conditions, our main complexity result shows that, within k max + 1 iterations, it holds with probability 1 -&#948; &#8712; (0, 1) that the algorithm generates x k * &#8712; R n corresponding to which there exists an associated Lagrange multiplier</p><p>This form of the result is commonly called a convergence rate since it bounds the expected stationarity error from above by a function that decreases with the number of iterations performed, namely, k max + 1. This bound can be used to form a worst-case complexity result. Specifically, the result above and Jensen's inequality imply that, within k max + 1 iterations and as long as s max = O(log(k max )) (more on this below), it holds with probability 1 -&#948; that the algorithm requires at most O(&#949; -4 ) iterations to generate x k * with corresponding</p><p>The probability in this result is with respect to the distribution of the stochastic gradients conditioned on the occurrence of the event E.</p><p>The first three quantities on the right-hand side of the convergence rate, namely, in (5a), representing the initial objective function gap, initial constraint violation, and the variance of the stochastic gradient estimates, mirror the presence of similar terms that appear for comparable results for the stochastic gradient method in an unconstrained or simple-constraint-set setting <ref type="bibr">[14,</ref><ref type="bibr">16,</ref><ref type="bibr">26]</ref>. The final term in (5b), on the other hand, as well as the fact that the result is stated as a high-probability result, are unique to our setting and arise due to the adaptivity of the merit parameter sequence. If one were to have prior knowledge of &#964; min , then one could set &#964; -1 = &#964; min (and disable the update mechanism for the merit parameter in the algorithm), in which case our analysis would show that the expected stationarity error is bounded above by (5a) (surely, not only with high probability).</p><p>In the context of an adaptive merit parameter sequence, the particular form of our complexity result depends on the magnitude of s max relative to k max +1, i.e., the bound on the number of times that the merit parameter is decreased relative to the total number of iterations performed. One setting in which our result is relatively straightforward is when, over all realizations of the algorithm, the differences between the stochastic gradient estimates and the true gradients are bounded deterministically, in which case the merit parameter sequence is provably bounded below, which in turn means that s max is bounded by a value that is independent from k max + 1 (at least as long as k max is sufficiently large relative to &#964; -1 /&#964; min ); this follows from a deterministic lower bound on &#964; min <ref type="bibr">[2,</ref><ref type="bibr">Proposition 3.18]</ref> and the fact that whenever the merit parameter is decreased, it is done so by a constant factor. Beyond this setting, for another concrete example of a situation in which s max is guaranteed to be sufficiently small relative to k max , we prove in Sect. 4.5 that if the distributions of the stochastic gradient estimates are sub-Gaussian, then with probability 1-&#948; one finds that s max = O(log(log( k max &#948; ))), meaning that our proved convergence rate is not ruined by the term in (5b). There are certainly other special cases in which similar types of relationships between s max and k max hold, but for our purposes we simply provide the example in Sect. 4.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Algorithm</head><p>For ease of reference, in this section we present Algorithm 3.1 from <ref type="bibr">[2]</ref> (slightly modified, as explained at the end of this section), which is designed to solve problems of the form (1) and is our focus for the remainder of the paper. In the spirit of an SQP method, the algorithm computes a search direction d k and Lagrange multiplier vector y k in iteration k &#8712; N by solving</p><p>where g k is a stochastic gradient estimate at x k and H k &#8712; S n is chosen independently from g k . Under Assumption 1 and the following Assumption 2 (that we make throughout the remainder of the paper), the solution of ( <ref type="formula">6</ref>) can be obtained from the unique solution of the linear system</p><p>, where for all k &#8712; N the matrix H k &#8712; S n is chosen independently from g k . In addition, there exists &#950; &#8712; R &gt;0 such that, for all k &#8712; N, the matrix</p><p>After computation of (d k , y k ), the remainder of the kth iteration involves (i) updating the merit parameter, (ii) updating an auxiliary parameter needed for the stepsize computation, and (iii) computing a positive stepsize. These algorithmic components are designed with the aim of yielding a sufficiently positive reduction in a model of the merit function, which in turn is aimed at yielding a sufficiently positive reduction in the merit function itself. The algorithm employs the model</p><p>and the reduction function &#916;q :</p><p>Specifically, in order to ensure in iteration k that &#964; k &#8804; &#964; k-1 and</p><p>holds for all &#964; &#8804; &#964; k , the algorithm sets, for user-defined &#963; &#8712; (0, 1), the value</p><p>and then sets, for some &#964; &#8712; (0, 1), the merit parameter value</p><p>Then, for use in the stepsize computation (as motivated in <ref type="bibr">[2]</ref>) it sets</p><p>for some &#958; &#8712; (0, 1), which, for one thing, ensures &#958; k &#8804; &#958; trial k . The last component in the kth iteration is to set the stepsize, the magnitude of which is controlled by a prescribed sequence {&#946; k } &#8834; (0, 1], which is employed in the following projection interval that is used in the stepsize computation:</p><p>where Proj(&#8226; | B) represents the projection operator onto the interval B &#8834; R. As in other stochastic-gradient-based methods, the convergence properties of the method depend on properties of {&#946; k }, which in many analyses is considered to be a constant or diminishing sequence. We establish our complexity result for the case of constant</p><p>Overall, the algorithm that we consider is stated as Algorithm 1. The only changes from Algorithm 3.1 in <ref type="bibr">[2]</ref> are the fixed iteration limit (k max ) and the concluding step for producing the return value (x k * ). This method of sampling k * to produce the return value is consistent with other approaches in the literature on complexity analyses for algorithms for solving nonconvex optimization problems; see, e.g., <ref type="bibr">[14]</ref>. It amounts to uniform sampling over the iterates when constant {&#946; k } is considered, as in our analysis. Finally, we remark that Algorithm 1 presumes knowledge of Lipschitz constants for the objective and constraint gradients, although in practice one might only estimate these values using standard procedures <ref type="bibr">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Stochastic SQP Algorithm</head><p>Require:</p><p>Compute (d k , y k ) as the solution of (7) 3: Set</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Complexity analysis</head><p>We begin our complexity analysis by describing the algorithm as a stochastic process (Sect. 4.1), then formalizing the assumptions that we make about the stochastic gradient estimates (Sect. 4.2). We then state, in some cases in a slightly modified form, some key lemmas from <ref type="bibr">[2]</ref> that are needed for our analysis (Sect. 4.3). Our generic complexity result, which has been outlined in Sect. 2, is then stated and proved (Sect. 4.4). Consequences and extensions of our generic complexity result are then discussed for some special cases of distributions for the stochastic gradient estimates for which our required assumptions hold with high probability (Sect. 4.5). Finally, we conclude this section by outlining a form of our generic complexity result that relaxes one of our minor simplifying assumptions (Sect. <ref type="bibr">4.6)</ref>.</p><p>Similarly as for the convergence analysis in <ref type="bibr">[2]</ref>, our complexity analysis makes use of orthogonal decompositions of the search directions computed by the algorithm; in particular, for all k &#8712; N, we express d k = u k + v k , where u k &#8712; Null(J k ) and v k &#8712; Range(J k ). We note here that conditioned on the algorithm having reached x k at iteration k, the normal component v k is deterministic, depending only on the constraint value c k and the Jacobian J k .</p><p>In addition to the quantities that are computed explicitly in Algorithm 1, our analysis also refers to the quantities that would have been computed in each iteration k &#8712; N, conditioned on the event that the algorithm has reached x k as the kth iterate, if the true gradient &#8711; f (x k ) is used in place of the stochastic gradient g k . These quantities are denoted by a "true" superscript. For example, in iteration k, the true search direction and corresponding true Lagrange multiplier estimate are the solution of the linear system</p><p>which may be decomposed as</p><p>Here, we write v k (without a superscript) since the normal component of the search direction is defined in a manner that makes it independent of the objective gradient (estimate). Similarly, the true value of the merit parameter that would have been computed is denoted</p><p>This definition of &#964; trial,true k guarantees that, for any &#964; &#8804; &#964; trial,true k , one finds</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Stochastic process</head><p>Henceforth, for the sake of formality, we shall refer in our analysis to the stochastic process generated by Algorithm 1. Specifically, in terms of values that are computed by the algorithm itself, we have the stochastic process</p><p>where, for all k &#8712; N, the random variables are: the algorithm iterate X k , stochastic gradient estimate G k , search direction D k , Lagrange multiplier estimate Y k , merit parameter T k , ratio parameter &#926; k , and stepsize A k . For all k &#8712; N, a realization of the corresponding element of this process has been denoted</p><p>Similarly, in terms of "true" values and step decomposition values that are not computed by the algorithm, but are defined for the sake of our analysis, we have the simultaneously generated process</p><p>where, for all k &#8712; N, the random variables are: the normal search direction component V k , the tangential search direction component U k , the true search direction D true k , the true tangential search direction component U true k , the true Lagrange multiplier estimate Y true k , and the true trial merit parameter T trial,true k . For all k &#8712; N, a realization of the corresponding element of this process has been denoted</p><p>Finally, for the sake of tracking the number of merit parameter updates that occur during runs of the algorithm, we define the stochastic process {S k }, where for all k &#8712; N the random variable S k represents the number of merit parameter decreases up to the end of the kth iteration, i.e., the number of iterations in which</p><p>In any run, the behavior of Algorithm 1 is dictated entirely by the initial conditions and the sequence of stochastic gradient estimates that are generated. Let G k denote the &#963; -algebra generated by the random variables {G 0 , . . . , G k-1 }, a realization of which (along with all initial conditions of the algorithm, including X 0 = x 0 ) determines the realizations of</p><p>For completeness, let G 0 = &#963; (x 0 ). As a result, {G k } k&#8805;0 is a filtration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Assumptions</head><p>Our analysis presumes certain good behavior of the sequences of merit and ratio parameters that are set adaptively by the algorithm. Formally, given</p><p>, our main result characterizes the worst-case behavior of Algorithm 1 conditioned on the event denoted as</p><p>which we define as the event such that</p><p>Consideration of this event as a focus for proving a worst-case complexity result for Algorithm 1 is justifiable for the following reasons.</p><p>-The condition in E that the ratio parameter sequence {&#926; k } is constant over all iterations is not actually essential for our analysis; rather, it is made for the sake of simplicity. Indeed, in Sect. 4.6, we present an extension of our main result to the setting in which this parameter sequence is not constant. Observe that, as proved in [2, Lemma 3.5], the sequence {&#926; k } is bounded below by a positive real number whose value is deterministic, i.e., it is independent of the sequence of stochastic gradient estimates that are generated by the algorithm. Hence, for the sake of simplicity, we assume for now that {&#926; k } is constant and leave the statement of the more complicated version of our main result to a subsection at the end of our analysis. -The conditions in E pertaining to the behavior of the merit parameter sequence are not necessarily minor, since stochasticity in the gradient estimates can cause the merit parameter to vanish, even in settings when the merit parameter would remain bounded below in the deterministic algorithm. That said, in Sect. 4.5, we consider a particular setting in which the distributions of the stochastic gradient estimates are sub-Gaussian over any run of the algorithm, in which case we show that the merit parameter remains bounded below with high probability, meaning that our main worst-case complexity bound-which holds with high probability due to the adaptivity of the merit parameter sequence-remains essentially unchanged in this setting when we do not presume upfront that the merit parameter sequence remains bounded above a positive real number. -The condition in E pertaining to the existence of s max is not actually an additional requirement beyond the existence of &#964; min in the event. After all, by the construction of Algorithm 1, it follows that when the merit parameter is decreased, it is decreased by at least a constant factor, from which it follows (under the existence of &#964; min ) that s max exists and satisfies</p><p>That said, for simplicity and generality in our analysis, we define s max as a quantity that is decoupled from the above (conservative) inequality.</p><p>In summation, while our analysis requires the existence of a lower bound for the merit parameter sequence, and correspondingly an upper bound on the number of potential decreases in the merit parameter, it does not presume other convenient behavior of the merit parameter sequence. For example, our analysis does not presume that the merit parameter sequence eventually settles on a sufficiently small value in a number of iterations that can be bounded in a convenient manner. Rather, our analysis respects the stochastic, transient behavior of the merit parameter sequence that one finds in the actual behavior of the algorithm in practice. We have strived to derive a complexity bound that, under relatively loose assumptions in the highly stochastic regime, matches (up to logarithmic factors) in a constrained setting the results that are considered stateof-the-art for the unconstrained setting, even though the algorithm needs to discover for itself, in an adaptive manner, an appropriate balance between the objective function and constraint violation measure. Given our focus on event E, we now introduce the filtration defined by</p><p>where, for all k &#8712; N, we use</p><p>Here and throughout the remainder of the paper, we let</p><p>) denote probability (respectively, expectation) conditioned on F k , which encodes information up to the start of iteration k &#8712; N, i.e., we define</p><p>Our analysis assumes the following about the stochastic gradient estimates. Such an assumption, namely, that conditioned on the filtration one has that the stochastic gradient is unbiased and has bounded variance, is common in analyses of stochastic optimization methods.</p><p>In addition, there exists</p><p>Observe that (19b) follows from <ref type="bibr">(18)</ref> if there exists p &#8712; (0, 1] such that, for any</p><p>After all, in this setting with <ref type="figure"/>and<ref type="figure">Z c</ref> k representing the complement of Z k , one finds along with Jensen's inequality that</p><p>so (19b) holds with M &#964; = &#8730; M/ p. Such a p exists, for example, when the objective of ( <ref type="formula">1</ref>) is a finite sum of N terms and each stochastic gradient estimate is computed as a so-called mini-batch estimate through the uniform (random) selection of b indices, in which case the above holds with p = b/N .</p><p>We make one additional assumption for our analysis, namely, the following.</p><p>Assumption 4 There exists p &#964; &#8712; (0, 1] such that, for all k &#8712; [k max ], one finds</p><p>Similar to [2, Proposition 3.16], Assumption 4 allows us to prove that, with high probability, the number of iterations in which T k &gt; T trial,true k is not too large. In [2, Example 3.17], it was shown that the inequality in this assumption holds with p &#964; = 1 2 when, conditioned on having reached a realized iterate x k , the stochastic gradient G k has a Gaussian distribution. We show in Sect. 4.5 that this result can be extended to other settings as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Properties of algorithm 1</head><p>In this section, we state key preliminary results from <ref type="bibr">[2]</ref> that are needed for our analysis. It is important to note that these results are written in <ref type="bibr">[2]</ref> in the context of conditioning, for all k &#8712; N, on a particular realization of the algorithm up to the beginning of iteration k, which is different from our setting in which we condition on F k . That said, one finds that these results from <ref type="bibr">[2]</ref> carry over, with nearly the same line of arguments, to our setting. After all, for any random variable X that is F k -measurable and any random variable</p><p>This property, combined with the arguments found in <ref type="bibr">[2]</ref>, is sufficient to prove the results of this section, so we state them without proof. By [2, Lemma 2.10], there exists</p><p>, where &#950; is defined in Assumption 2. Correspondingly, let us define</p><p>The following Lemmas 1 and 2 show that there exists a common quantity that both bounds from above the squared-norm of the true search direction plus the constraint violation and bounds from below the reduction in the model of the merit function. Essentially, the combination of these results shows that the search direction offers sufficient decrease relative to its norm. Here, Lemma 1 is stated using a different norm for c(X k ) than in [2, <ref type="bibr">Lemma 2.11]</ref>. The result holds in the same manner (with a different value for &#954; &#936; ) due to the norm equivalence between &#8226; and &#8226; 1 in R m . We state the result in this manner for consistency with the measure used in our final results.</p><p>Lemma 1 ([2, Lemma 2.11]) Let Assumptions 1 and 2 hold. Then, there exists &#954; &#936; &#8712; R &gt;0 such that, for all k &#8712; [k max ], the true search direction and constraint violation satisfy</p><p>Lemma 2 ([2, Lemma 2.12]) Let Assumptions 1 and 2 hold. Then, there exists &#954; q &#8712; R &gt;0 such that, for all k &#8712; [k max ] and</p><p>The next lemma shows that the reduction in the merit function is at least a reduction in the model of the merit function defined with respect to the true gradient and true search direction, except for two terms that may be attributed to the noise in the stochastic gradient estimates. Observe that the requirement in the lemma that</p><p>can be enforced in practice, despite the fact that {&#926; k } and {T k } evolve randomly. After all, since {&#926; k } and {T k } are monotonically nonincreasing, one need only choose &#946; &#8712; R &gt;0 sufficiently small such that</p><p>Lemma 3 ([2, Lemma 3.7]) Let Assumptions 1 and 2 hold and suppose that</p><p>The next two lemmas bound (in expectation) differences and products between stochastic and true quantities. These bounds are critical in the analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 4 Let Assumptions 1, 2, and 3 hold. Then, for all k</head><p>, and</p><p>Proof Except for the last inequality, the result follows from the assumptions and (the proof of) [2, <ref type="bibr">Lemma 3.8]</ref>. As for the last inequality, observe as in [2, Lemma 3.8] that</p><p>as desired.</p><p>Lemma 5 ([2, Lemma 3.9]) Let Assumptions 1, 2, and 3 hold. Then, for all k &#8712; [k max ], it follows that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Complexity result</head><p>In this section, we present our main complexity results. We derive our results in largely the same manner as the global convergence result in <ref type="bibr">[2]</ref>, but with two major changes that stem from the need to characterize the behavior of the algorithm in the context of an adaptive merit parameter sequence. At a high level, the two modifications are as follows:</p><p>1. We derive, in Lemma 6, an upper bound for the last term in <ref type="bibr">(21)</ref>, the derivation of which is complicated by the fact that, conditioned on F k , this term is the product of three correlated random variables:</p><p>A critical aspect of our derived bound is that we isolate a term for the event when</p><p>, since this happens to be an event that complicates subsequent aspects of our analysis. In Lemma 9, we prove a highprobability bound on the sum of the probabilities of the occurrences of this event over a run of the algorithm. 2. A critical aspect of the analysis in <ref type="bibr">[2]</ref> for the deterministic setting is that one can always tie the reduction in the model of the merit function to a first-order stationarity error measure (with respect to the constrained optimization problem) due to the fact that T trial,true k &#8805; T k for all k &#8712; N. Unfortunately, however, this inequality is not guaranteed to hold in the stochastic setting, which is problematic for our purposes in this paper. To account for this issue, we define an auxiliary sequence { T k } (not generated by the algorithm) such that Tk := min{T k , T trial,true</p><p>In Lemmas 7 and 8, we analyze behaviors of the algorithm with respect to this auxiliary sequence, and in Lemma 9 we provide a high-probability bound on the total number of iterations in which T trial,true k &lt; T k may occur. (More precisely, Lemma 9 considers a superset of the iterations in which this bound may occur, which serves our purposes just as well.) The first few results in this section consider properties of algorithmic quantities conditioned on F k . Conditioned on F k , let us define three events:</p><p>-</p><p>. We now derive an upper bound on the final term in <ref type="bibr">(21)</ref>. In the following lemma, we use, given k &#8712; [k max ], the stepsize values</p><p>The first value here represents a lower bound on the smallest stepsize that may be computed in the event that T k &lt; T k-1 , whereas the second value is the smallest stepsize that may be computed in the event that T k = T k-1 ; it is easily verified that &#945; &lt; min,k &lt; A = min,k . Hence, A max,k represents an upper bound on the largest stepsize that may be computed. Lemma 6 Suppose that Assumptions 1, 2, and 3 hold, and let &#954; d &#8712; R &gt;0 be defined by Lemma 4. Then, for all k &#8712; [k max ], and with the stepsizes (&#945; &lt; min,k , A = min,k , A max,k ) defined as in <ref type="bibr">(22)</ref>, one finds that</p><p>and for ease of exposition, let us denote</p><p>] for all j &#8712; {1, 2, 3}. By the Law of Total Expectation, the fact that 0 &lt; &#964; min &#8804; T k &#8804; T k-1 under E, <ref type="bibr">(20)</ref>, and the definitions of &#945; &lt; min,k , A = min,k , and A max,k , one finds that</p><p>Using this inequality, the Law of Total Expectation, <ref type="bibr">(20)</ref>, and Lemma 4 (E k [D k ] = D true k ), one obtains three upper bounds by adding and subtracting like terms:</p><p>From averaging these upper bounds And the definition of A max,k , one obtains</p><p>This bound can be rewritten as follows. By the Law of Total Expectation, Lemma 4, and</p><p>and along with Lemma 4 one finds that</p><p>In addition, Lemma 4 also yields that</p><p>Combining ( <ref type="formula">23</ref>), ( <ref type="formula">24</ref>), <ref type="bibr">(25)</ref>, and ( <ref type="formula">26</ref>), the desired result follows.</p><p>Now, for all k &#8712; [k max ], let us define</p><p>Lemma 7 Suppose that Assumptions 1, 2, and 3 hold and let &#954; d &#8712; R &gt;0 be defined by Lemma 4. Then, for all k &#8712; [k max ], one finds that</p><p>Proof By the definition of &#916;q in ( <ref type="formula">8</ref>), one has that</p><p>Now, for simplicity of notation, define</p><p>Let E Q denote the event that Q k &#8805; 0 occurs and let E c Q denote the event that Q k &lt; 0 occurs. By the Law of Total Expectation and ( <ref type="formula">20</ref>), one has that</p><p>Therefore, by the Law of Total Probability, Lemma 5, (20), Jensen's inequality, and convexity of max{&#8226;, 0}, it follows that</p><p>and by similar reasoning one finds that</p><p>Averaging these two upper bounds, one finds that</p><p>Our goal now is to bound the latter two terms in <ref type="bibr">(29)</ref>. Toward this end, observe that by the triangle and Cauchy-Schwarz inequalities, the proof of Lemma 4, Assumption 3, (20), Jensen's inequality, and concavity of the square root over R &#8805;0 ,</p><p>In addition, by the Cauchy-Schwarz inequality, the proof of Lemma 4, Assumption 3, (20), and since</p><p>By the Law of Total Expectation, <ref type="bibr">(30)</ref>, and ( <ref type="formula">31</ref>), it follows that</p><p>and by a similar argument, one finds that</p><p>The conclusion follows by combining these equations, <ref type="bibr">(28)</ref>, and (29).</p><p>Our next lemma bounds differences between expected reductions in the model of the merit function that account for cases when Tk &lt; T k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 8</head><p>Let Assumptions 1, 2, and 3 hold and let &#954; d &#8712; R &gt;0 be defined by Lemma 4.</p><p>Then, for all k &#8712; [k max ], with (&#945; &lt; min,k , A = min,k , A max,k ) defined as in <ref type="bibr">(22)</ref> and Tk defined in <ref type="bibr">(27)</ref>, one finds that</p><p>and</p><p>Proof Under the stated assumptions and definitions, one finds that</p><p>Hence, under the stated assumptions, it follows from the stated lemma and definitions, along with the definition of &#916;q in <ref type="bibr">(8)</ref> and equation <ref type="bibr">(20)</ref>, that</p><p>and, along with</p><p>which together are the desired conclusions.</p><p>Our next lemma is a critical element of our analysis. For any s &#8712; N and &#948; &#8712; (0, 1), let</p><p>and</p><p>This lemma provides a bound on two quantities with high probability, the first of which is the sum of the probabilities of the occurrences of event E k,3 over the run of the algorithm.</p><p>The following lemma also provides a bound on the cardinality of the random index set,</p><p>By the manner in which {T k }, {T trial,true k }, and { Tk } are defined, this set is always a superset of the iterations in which Tk &lt; T k ; hence, by bounding the cardinality of (34), one bounds the cardinality of the set of iterations in which Tk &lt; T k , which is needed for our main theorem. The reason that we consider the set K &#964; in (34) is the fact that the event T trial,true k &lt; T k-1 and its complement are members of F k ; in other words, the occurrence of the event defined by this inequality does not depend on G k . Lemma 9 Suppose Assumptions 1, 2, and 3 hold. Then, for any &#948; &#8712; (0, 1),</p><p>In addition, suppose Assumption 4 holds as well. Then, for any &#948; &#8712; (0, 1), (35) holds and</p><p>Proof This result is proved in Appendix B.</p><p>Our proof of Lemma 9 uses a novel tree structure for analyzing the behavior of the adaptive merit parameter sequence in the context of an algorithm for solving constrained stochastic optimization problems; see also <ref type="bibr">[1,</ref><ref type="bibr">Appendix B]</ref> for the use of such a tree structure in a different context. Starting with the initialization at a root node, each subsequent level of the tree captures different sets of realizations of the algorithm in terms of the number of decreases of the merit parameter and the probability that the parameter will be decreased further in the next iteration. The leaves of the tree represent either bad situations when the sum of the probabilities that the merit parameter could decrease exceeds a critical threshold (defined with respect to the function defined in (33)) or good situations in which this sum remains below the threshold and the maximum number of merit parameter decreases has occurred and/or the iteration limit has been reached. Essentially, in this manner, bad situations are when the algorithm repeatedly has a high probability of decreasing the merit parameter, but does not do so a sufficient number of times. The proof of Lemma 9 ultimately relies on applications of Chernoff's bound-employed with respect to independent random variables that are defined carefully with respect to the non-independent random variables in the stochastic process defined by the algorithm-to show that the probability is small that the algorithm ends at a bad leaf node.</p><p>We are now prepared to prove a convergence rate result.</p><p>Theorem 2 Suppose Assumptions 1, 2, 3, and 4 hold, let s max &#8712; N \ {0}, let &#954; d &#8712; R &gt;0 be defined by Lemma 4, define</p><p>where</p><p>and, for all k &#8712; [k max ], let Tk be defined as in <ref type="bibr">(27)</ref>. Then, for any &#948; &#8712; (0, 1), it follows with K * having a discrete uniform distribution over [k max ] and &#948; and defined as in <ref type="bibr">(32)</ref> and (33) that, with probability at least 1 -&#948;,</p><p>Proof First, consider arbitrary k &#8712; [k max ]. By Lemmas 3 and 6 and equation ( <ref type="formula">20</ref>), one has that</p><p>Our next aim is to prove that, roughly speaking, one in fact finds that</p><p>Such a bound does not follow directly from (39) since the first term on the right-hand side in (39) involves a model reduction with respect to T k (which cannot be tied to a stationarity measure), whereas the first term on the right-hand side of the bound in (40) involves a model reduction with respect to Tk (which can be tied to a stationarity measure). Toward the aim of proving a bound of the form in (40), first observe that it follows with Lemma 7 that</p><p>where the final inequality follows due to the fact that (A max + &#952;&#946;)&#946; &#8804; A min holds by the definitions of &#946;, &#947; , A min , and A max .</p><p>Let us now combine (39) and (41) to prove a bound of the form in (40) by considering two complementary events. In particular, let E k,&#964; be the event that T trial,true</p><p>and let E c k,&#964; be the event that T trial,true</p><p>Observe that E k,&#964; and E c k,&#964; only depend on the history of the algorithm prior to iteration k and thus the &#963; -algebras generated by E k,&#964; and E c k,&#964; are included in F k . Therefore, by [15, Theorem 4.1.13], for any random variable Z , we have</p><p>Hence, we can use Lemma 7-and Lemma 8 as well, which is used below-even if one conditions on the occurrence of E k,&#964; or of E c k,&#964; . Let us now consider E c k,&#964; and E k,&#964; in turn. Conditioning on E c k,&#964; , one finds from (39), ( <ref type="formula">41</ref>), (42), and the fact that</p><p>On the other hand, one finds from (39), ( <ref type="formula">41</ref>), (42), and Lemma 8 that</p><p>Hence, by the laws of total probability and expectation, one finds that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Summing this inequality for all</head><p>123</p><p>Next, by Lemma 9, it follows that, with probability at least 1 -&#948;, one finds</p><p>Taking the total expectation (conditioned on E) of the above inequality,</p><p>The left-hand side of this inequality satisfies</p><p>Since {T k } is a non-decreasing sequence and f (X k ) &#8805; f min , one finds</p><p>Thus, from (44), it follows that</p><p>Combining this with (43) and dividing by A min k max k=0 &#946;, one obtains that</p><p>Hence, by the definitions of K * and &#946;, the desired conclusion follows.</p><p>The following corollary translates the result of the preceding theorem to a result pertaining to a stationary measure of (1); recall <ref type="bibr">(2)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 1 Under the assumptions, conditions, and definitions of Theorem 2, it holds with probability at least</head><p>Hence, the complexity bound described in Sect. 2.2 (see <ref type="bibr">(5)</ref>) holds.</p><p>Proof The result follows by Lemma 1, Lemma 2, Theorem 2, and ( <ref type="formula">13</ref>), which implies that &#8711; f</p><p>The worst-case complexity bound in Sect. 2.2 follows by combining this result with the definitions of &#954; E 3 , &#954; &#916;q,1 , &#954; &#916;q,2 , and Lemma 21 in Appendix C. This result, as well as that of Theorem 2, is proven under the assumption that s max &#8805; 1. When s max = 0, this result simplifies to a deterministic complexity bound with the terms dependent on s max and &#948; omitted. Under the condition s max = 0, the proof follows by noting that</p><p>is defined in the proof of Theorem 2) along with a similar argument to the proof of Theorem 2.</p><p>Again, we remark that this result, when viewed in terms of the squared norm of the gradient of the Lagrangian, matches the worst-case complexity of the stochastic gradient method for the unconstrained setting <ref type="bibr">[14]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Complexity result for symmetric sub-Gaussian distributions</head><p>In this section, we show that if each stochastic gradient is unbiased with a symmetric, sub-Gaussian distribution and (for simplicity) the ratio parameter sequence remains constant, then the conditions involved in Assumptions 3 and 4 and the event E in <ref type="bibr">(15)</ref> occur with high probability. Specifically, this is shown under Assumptions 1, 2, and Assumption 5 below, where it is important to note that Assumption 5 conditions on elements of G k , not F k . Overall, our analysis in this section shows that if one makes Assumptions 1, 2, and Assumption 5, then one can be assured that the conditions required for Assumptions 3 and 4 and the event E occur with high probability, meaning that if, in turn, one makes Assumptions 3 and 4 and assumes that E occurs, then our main results of the prior subsection hold.</p><p>and the random vectors</p><p>Our first lemma shows that, under Assumptions 1, 2, and 5, an inequality of the form in Assumption 4 holds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 10 Under Assumptions 1, 2, and 5, it follows for all k &#8712; [k max ] that</head><p>Proof Consider arbitrary k &#8712; [k max ]. Let Z k be a basis for the null space of J (X k ), which under Assumption 1 is a matrix in R n&#215; (n-m) . Then, let</p><p>Hence, when conditioned on G k , the random variables</p><p>are equivalent in distribution by Assumption 5. Therefore,</p><p>which combined leads to the desired conclusion.</p><p>Next, we state a result based on well-known properties of sub-Gaussian random variables. This lemma follows in the same manner as [21, <ref type="bibr">Lemma 5]</ref>.</p><p>Lemma 11 Suppose Assumption 5 holds. Then, for any &#948; &#8712; (0, 1),</p><p>We conclude by showing that, under Assumptions 1, 2, and 5, the conditions involved in Assumption 3 and E occur with high probability.</p><p>Lemma 12 Suppose that Assumptions 1, 2, and 5 hold, let</p><p>all k &#8712; N (whose existence follows from Assumption 1 and [2, Lemma 2.9]), let &#954; c be an upper bound for c(X k ) 2 for all k &#8712; N (whose existence follows from Assumption 1), and define</p><p>Then, for any &#948; &#8712; (0, 1), it follows with probability at least 1 -&#948; that the conditions in Assumption 3 and event E hold with</p><p>and s max = min</p><p>Proof By Lemma 11, the event considered in that lemma holds with probability at least 1 -&#948;. Hence, for the purposes of this proof, suppose that event holds. By Jensen's inequality, convexity of exp(&#8226;), and (45), it follows that</p><p>In addition, it follows from the event in Lemma 11 that ( <ref type="formula">19</ref>) holds with M &#964; as stated in the lemma. This accounts for Assumption 3. Now consider event E. First, it follows from the arguments of [2, Lemma 2.15 and 2.16] combined with the event in Lemma 11 that T k &#8805; &#964; min and T trial,true k &#8805; &#964; min for all k &#8712; [k max ] for &#964; min as stated in the lemma. Second, it follows from the stated value of &#964; min and ( <ref type="formula">16</ref>) that |{k &#8712; [k max ] : T k &lt; T k-1 }| &#8804; s max for s max as stated in the lemma. Finally, the desired behavior of {&#926; k } follows from Assumption 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Adaptive ratio parameter</head><p>In this section, we state a convergence rate result, which can be translated to a worstcase complexity result, that relaxes the definition of the event E considered in prior sections. In particular, we remove the assumption that</p><p>Importantly, it has been proved in [2, Lemma 3.5] that, under our remaining assumptions, there still exists</p><p>Therefore, by the manner in which the ratio parameter sequence is set, it follows that there exists a maximum number of k &#8712; [k max ] such that &#926; k &lt; &#926; k-1 . Denoting this limit as r max &#8712; N, it follows (for the same reasons as the bound for s max in ( <ref type="formula">16</ref>)) that</p><p>To formalize our new assumption, we define</p><p>as the event such that</p><p>Since {&#926; k } is bounded below deterministically, this event is identical to the event E defined in <ref type="bibr">(15)</ref>, except that one may have &#958; -1 &gt; &#958; min .</p><p>For the purposes of this section, redefining</p><p>our analysis of this case considers the following replacement of Assumption 3. Like in Assumption 3, the latter part of the assumption only needs to involve probabilities and expectations conditioned on the event that one or the other parameter decreases, i.e., T k &lt; T k-1 and/or &#926; k &lt; &#926; k-1 .</p><p>Assumption 6 There exists</p><p>In addition, there exist</p><p>We claim that this assumption holds with high probability under Assumption 5 (without the assumption that &#926; k = &#958; min for all k &#8712; [k max ]), a result that can be derived by modification of the techniques used in Sect. 4.5.</p><p>The complexity analysis for this case follows by essentially the same arguments as those used to derive a complexity result under Assumption 3. A slight modification of Lemma 6 is needed to include the three events related to the sign of &#8711; f (x k ) (D k -D true k ) that appear in Assumption 6 (as opposed to the one in Assumption 3), which yields a result in terms of the probabilities of these three events. Then, a slightly modified Lemma 9 and the union bound can be applied two additional times to derive a complexity result. Since the analysis is a similar, but tedious extension of the results in Sect. 4.4, we simply state the extension of (5) to this case without proof.</p><p>Theorem 3 Suppose that Assumptions 1, 2, 4, and 6 hold and consider arbitrary &#948; &#8712; (0, 1). Then, within k max + 1 iterations, it holds with probability at least 1 -&#948; that the algorithm generates x k * &#8712; R n corresponding to which there exists an associated Lagrange multiplier y true k * &#8712; R m such that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We proved a worst-case complexity bound (in terms of iterations, function evaluations, and (stochastic) derivative evaluations) for the stochastic sequential quadratic optimization method for solving optimization problems involving a stochastic objective function and deterministic equality constraints proposed in <ref type="bibr">[2]</ref>. While key to the practical performance of the algorithm, the adaptivity of the merit parameter introduced a number of theoretical challenges to overcome. Under mostly standard assumptions, we proved that, with high probability, a measure of primal-dual stationarity decays at a rate of k -4 (ignoring log factors), which translates into a worst-case complexity bound on par with the stochastic gradient method in the unconstrained setting. While our analytical approach has been developed for an SQP method that uses an 1 -norm exact merit function, it may be applicable to a wide variety of algorithmic frameworks for constrained stochastic optimization. For example, our approach may be modified to apply to methods that adaptively update critical parameters at each iteration, such as adaptive penalty methods <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">22]</ref>, adaptive augmented Lagrangian methods <ref type="bibr">[11]</ref>, adaptive barrier methods <ref type="bibr">[27]</ref>, and penalty-interior point methods <ref type="bibr">[10]</ref>.</p><p>In addition, many constrained optimization algorithms generate (often unconstrained) subproblems defined by an auxiliary parameter sequence that is updated dynamically based off of the solution to the previous subproblem. Algorithms of this type include penalty methods, augmented Lagrangian methods, and interior point methods <ref type="bibr">[28]</ref>. In cases when the objective is stochastic, this auxiliary sequence would also be a random process, in which case analyzing the behavior of such a process would be paramount to proving a complexity result for such a method. We believe that the techniques that we have devised for this paper are broadly applicable and foundational for performing complexity analyses of deterministically constrained stochastic optimization methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Proof of Theorem 1 (Deterministic Algorithm Complexity)</head><p>In this appendix, we prove Theorem 1, which states a worst-case complexity bound for Algorithm 2.1 of <ref type="bibr">[2]</ref>. We refer to quantities defined and employed in the analysis in <ref type="bibr">[2]</ref>. In particular, in this appendix, for all k &#8712; N, we suppose that</p><p>is the search direction computed by solving the SQP subproblem with g k = &#8711; f (x k ). As seen in <ref type="bibr">[2]</ref>, the convergence properties of Algorithm 2.1 in that paper are driven by reductions in a model of the merit function in each iteration. Our first lemma proves a useful lower bound for such a reduction.</p><p>as in <ref type="bibr">[2]</ref> and let</p><p>Then, for any &#949; &#8712; (0, 1),</p><p>Let us consider two cases. First, suppose that c k 1 &gt; &#954;&#949; 2 . Then, by [2, equation (2.9)],</p><p>which implies (47), as desired. Second, suppose that c k 1 &#8804; &#954;&#949; 2 &#8804; &#949; 2 , which by the definition of (&#949;, k) implies that g k + J k y k &gt; &#949;. It follows from this fact that </p><p>From this fact and the definition of &#954;, it follows that</p><p>which implies (47), as desired.</p><p>We now prove Theorem 1, further details of which are provided in the statement below.</p><p>Theorem 4 Define (&#964; -1 , f low , &#945; min , &#964; min , &#951;, &#963; ) &#8712; (0, &#8734;) 4 &#215; (0, 1) 2 as in <ref type="bibr">[2]</ref> and ( &#954;, &#954;) &#8712; (0, 1] &#215; (0, &#8734;) as in <ref type="bibr">(46)</ref>. Then, for any &#949; &#8712; (0, 1), Theorem 1 holds with <ref type="bibr">(4)</ref> given by</p><p>Proof To derive a contradiction, suppose (3) does not hold for all k &#8712; {0, . . . , K &#949; }.</p><p>Then, along with Lemma 13 and [2, equation (2.10) and Lemma 2.17], it follows for all such k that</p><p>By the definition of &#966;, this means for all such k that</p><p>Summing this inequality for all k &#8712; {0, . . . , K &#949; }, one can deduce that</p><p>Since {&#964; k } is monotonically nonincreasing, c K &#949; +1 1 &#8805; 0, and</p><p>Rearranging this inequality, one arrives at the conclusion that</p><p>which is a contradiction. Therefore, one arrives at the desired conclusion that Algorithm 2.1 yields an iterate satisfying (3) in at most K &#949; iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proofs of Lemma 9</head><p>In this appendix, we prove Lemma 9. Toward this end, we prove for any &#948; &#8712; (0, 1) with &#948; as defined in <ref type="bibr">(32)</ref> and (s max , &#948;) as defined in (33), one finds</p><p>We build to this result, ultimately proved as Theorem 5, with a series of preliminary lemmas.</p><p>As our first preliminary result, we state a particular form of Chernoff's bound <ref type="bibr">[31]</ref> in the following lemma, which will prove instrumental in deriving (48).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 14</head><p>For any k &#8712; N, let {Y 0 , . . . , Y k } be independent Bernoulli random variables. Then, for any s &#8712; N and &#948; &#8712; (0, 1), it follows that</p><p>Proof Suppose that &#956; &#8805; (s, &#948;). By the multiplicative form of Chernoff's bound, it follows for &#961; := 1s/&#956; (which is in the interval (0, 1) by ( <ref type="formula">49</ref>)) that</p><p>Hence, to prove the result, all that remains is to show that e -1 2 &#956;(1-s/&#956;) 2 &#8804; &#948;, i.e., that</p><p>which holds if and only if &#956; 2 -2&#956;(s + log(1/ &#948;)) + s 2 &#8805; 0. Viewing the left-hand side of this inequality as a convex quadratic function in &#956;, one finds that the inequality holds as long as &#956; is greater than or equal to the positive root of the quadratic, i.e.,</p><p>This holds since &#956; &#8805; (s, &#948;); hence, the result is proved. Now, we turn our attention to proving (48). For any realization of a run of the algorithm up to iteration k &#8712; [k max ], let w k denote the number of times that the merit parameter has been decreased up to the beginning of iteration k and let p k denote the Notationally, since the behavior of the algorithm up to iteration k &#8712; N is determined by the initial conditions and stochastic gradients in G [k-1] , we write</p><p>to denote the event that the signature of the algorithm up to k is a member of N ( p [k] , w <ref type="bibr">[k]</ref> ). The initial condition, denoted for consistency as G [-1] &#8712; N ( p 0 , w 0 ), occurs with probability one. Based on the description above, the nodes of our tree satisfy: For any node at a depth of k &#8805; 2, the event</p><p>and</p><p>where I[&#8226;] denotes the indicator function of any given event.</p><p>Let us now define certain important sets of nodes in the tree. First, let</p><p>be the set of nodes at which the sum of the elements of p [k] is sufficiently small and either w k has reached s max or k has reached k max . A node in this set is of interest since, due to the iteration and/or merit parameter decrease limit having been reached, the probability is zero that a certain "bad" event can occur over all realizations with signatures that are members of the node; see Lemma 15 on page 38. Second, let</p><p>be the nodes in the complement of L good at which the sum of the elements of p [k] has exceeded the threshold (s max , &#948;) + 1. A node in this set is of interest since, due to this threshold having been exceeded, all realizations with signatures that are members of this node correspond to poor behavior of the algorithm (and there is no need to consider the behavior of the algorithm beyond this point). Going forward, we restrict attention to the tree defined by the root node and all paths from the root node that terminate at a node contained in L good &#8746; L bad . It is clear from this restriction and the definitions of L good and L bad that this tree is finite with the elements of L good &#8746; L bad being leaves.</p><p>Let us now define relationships between nodes. The parent of a node is defined as</p><p>On the other hand, the children of node N ( p [k] , w <ref type="bibr">[k]</ref> ) are defined as</p><p>This ensures that paths down the tree terminate at nodes in L good &#8746; L bad , making these nodes the leaves of the tree. For convenience in the remainder of our discussions, let</p><p>We define the height of node</p><p>) as the length of the longest path from</p><p>) to a leaf node, i.e., the height is denoted as</p><p>Next, let us define two more sets of nodes that will be useful later. Let</p><p>) such that the merit parameter decreases and let</p><p>) such that it does not decrease, so</p><p>and</p><p>Finally, let us define the event E bad,B as the event that for some j &#8712;</p><p>With respect to our goal of proving (48), the event E bad,B is of interest since it is the event that the given probabilities accumulated up to iteration j &#8712; [k max ] (and beyond) exceed the threshold found in (48) plus a factor that is inversely proportional to B.</p><p>Let us now prove some properties of the leaf nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 15 For any</head><p>On the other hand, for all k &#8712; [k max ] and ( p</p><p>Proof Consider an arbitrary index k &#8712; [k max ] and an arbitrary pair</p><p>By the definition of L good , it follows that</p><p>Since the maximum depth of a node is k max and the gap between the discrete values in (50) is 1 B , it follows along with (55) that</p><p>Therefore, for any j &#8712; {1, . . . , k}, one finds from conditional probability that</p><p>In addition, (54) cannot hold for j = 0 since (s max , &#948;)+1 &gt; 1. Hence, along with the conclusion above, it follows that E bad,B does not occur when a signature up to iteration j &#8712; {1, . . . , k} falls into a node along any path from the root node to</p><p>). Now, by the definition of L good , at least one of w k = s max or k = k max holds. Let us consider each case in turn. If k = k max , then it follows by the preceding arguments that</p><p>Otherwise, if w k = s max , then it follows by the definition of the event E that P[T i &lt; T i-1 |F i ] = 0 for all i &#8712; {k, . . . , k max }, and therefore the equation above again follows. Overall, it follows that</p><p>) &#8743; E bad,B |E] = 0, as desired. Next, we remark that</p><p>Hence, using the initial condition that G</p><p>Our goal is to bound (56). Toward this end, define</p><p>which by the definition of w [k] form a partition of {1, . . . , k}. For any i &#8712; I dec ,</p><p>On the other hand, for any i &#8712; I c dec ,</p><p>Thus, it follows that the latter term in (56) satisfies</p><p>Now let us bound this term. By the definition of L bad , one finds that</p><p>In addition, by the definition of s max , it follows that w k &#8804; s max for all k &#8712; [k max ], from which it follows that |I dec | &#8804; s max . Now, let {Z 0 , . . . , Z k-1 } be independent Bernoulli random variables such that, for all i &#8712; {0, . . . , k -1}, one has</p><p>By (57), it follows from the definition of these random variables that k-1 i=0 P[Z i = 1] &#8805; (s max , &#948;). Then, it follows by Lemma 14 and the preceding argument that</p><p>Combining this with (56), the desired conclusion follows.</p><p>Next, we present a lemma about nodes in the sets defined in (52) and (53). The lemma essentially states that a certain probability of interest, defined as the product of probabilities along a path to a child node, can be reduced to a product of probabilities to the child's parent node by partitioning the childen into those at which a merit parameter decrease has occurred and children at which a merit parameter decrease has not occurred.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 16 For all</head><p>and, similarly, one finds that</p><p>where by the definitions of C dec and C c dec it follows that the value of w k+1 in the sum in the former equation is one greater than the value of w k+1 in the sum in the latter equation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof One finds that</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The desired conclusion follows since, by the definition of C dec (N ( p [k] , w [k]</head><p>)), all elements in the latter sum have S k = w k+1 = w k + 1, meaning that the sum is exhaustive over all possible outcomes with the same conditions, and hence the sum is 1.</p><p>The proof of the second desired conclusion follows in the same manner with C c dec in place of C dec and S k = w k+1 = w k in place of S k = w k+1 = w k + 1.</p><p>Next, we derive a result for certain nodes containing realizations with w k = s max -1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 17 For any</head><p>Proof By the supposition that</p><p>in which case the desired conclusion follows from Lemma 15. With this base case being established, we now prove the result by induction. Suppose that the result holds for all ( p</p><p>)) &#8804; j for some j &#8712; N. Our goal is to show that the same statement holds with j replaced by j +1. For this purpose, consider arbitrary ( p</p><p>Observe that by the definition of the child operators C, C dec , and C c dec , it follows that</p><p>Since w k = s max -1, it follows from the definition of C dec that for any ( p k+1 , w k+1 ) with</p><p>By the definition of s max , this implies that P[T k+1 &lt; T k |F k+1 ] = 0, so p k+1 = 0. In addition, since</p><p>) &#8712; L good . Consequently, from above and Lemma 15, one finds</p><p>)). Therefore, by the induction hypothesis and the result of Lemma 16, it follows that</p><p>which completes the proof.</p><p>Using the preceding lemma as a base case, we now perform induction on the difference s max -w k to prove a similar result for arbitrary s max .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 18 For any</head><p>The result holds in this case according to Lemma 15 since one finds that</p><p>&#8712; L good and s max -w k = 1, the result follows from Lemma 17. Hence, to prove the remainder of the result by induction, one may assume that it holds for all ( p</p><p>)) &#8804; t for some t &#8712; N, and s max -w k = r for some r &#8712; N\{0}, and show that it holds for all ( p (59) ,w [k] ) = t + 1 &gt; r -1 = s max -w k-1 -1, which combined with (59) proves the result for this case as well.</p><p>We now prove our first main result of this section.</p><p>Theorem 5 For any &#948; &#8712; (0, 1) with &#948; as defined in <ref type="bibr">(32)</ref> and (s max , &#948;) as defined in (33), one finds that (48) holds.</p><p>Proof First, consider the case where s max = 0. Then, by the definition of s max ,</p><p>for all k = [k max ], so the result holds trivially. Now, let s max &#8712; N \ {0}. By construction of our tree and the definitions of L good and L bad , one finds that h(N ( p 0 , w 0 )) &#8804; k max . In addition, by the definition of s max , s max -1 &lt; k max , so min{s max -w 0 -1, h(N ( p 0 , w 0 ))} = s max -1 &#8805; 0. Consider arbitrary B &#8712; N \ {0} (see (54)). By <ref type="bibr">Lemma 18 and (32)</ref>, Next, we present some preliminary results that are required to prove the second statement of Lemma 9. Recall the random index set K &#964; defined in (34). Our next lemma shows a property about any iteration k &#8712; [k max ] in which k &#8712; K &#964; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 19 Let Assumption 4 hold. Then, one finds that</head><p>Proof In any iteration where T trial,true k &lt; T k-1 , it follows that T trial,true k &lt; &#8734;, so</p><p>and thus</p><p>By the definition of T k , if</p><p>in an iteration such that T trial,true k &lt; T k-1 , then</p><p>meaning that T k &lt; T k-1 . Observe that the event k &#8712; K &#964; only depends on the history of the algorithm prior to iteration k and thus the &#963; -algebra generated by k &#8712; K &#964; is included in F k . Therefore, by <ref type="bibr">[15,</ref><ref type="bibr">Theorem 4.1.13]</ref>, for any random variable Z , we have</p><p>Therefore, with 1( &#7868;) denoting the indicator of event &#7868;, it follows from Assumption 4 that</p><p>as desired.</p><p>The previous lemma guarantees that in any iteration in which T trial,true k &lt; T k-1 , the probability is at least p &#964; that the merit parameter decreases. By the scheme for setting Proof By the law of total probability,</p><p>Now, similar to the previous lemma, we note that k &#8712; K &#964; only depends on the history of the algorithm prior to iteration k. Denoting the indicator of the event k &#8712; K &#964; as 1(k &#8712; K &#964; ), it follows by Lemma 19 that </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Lemma Required for the Proof of Corollary 1</head><p>This appendix provides the following lemma, which shows that the order notation result in (5a) and (5b) holds, as required in the proof of Corollary 1.</p><p>Lemma 21 Let &#948; &#8712; (0, 1), &#948; be defined in <ref type="bibr">(32)</ref>, s max &#8712; N \ {0} and (s max , &#948;) be defined in <ref type="bibr">(33)</ref> </p><note type="other">.</note></div></body>
		</text>
</TEI>
