<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Stochastic Inexact Sequential Quadratic Optimization Algorithm for Nonlinear Equality-Constrained Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>INFORMS</publisher>
				<date>07/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10565084</idno>
					<idno type="doi">10.1287/ijoo.2022.0008</idno>
					<title level='j'>INFORMS Journal on Optimization</title>
<idno>2575-1484</idno>
<biblScope unit="volume">6</biblScope>
<biblScope unit="issue">3-4</biblScope>					

					<author>Frank E Curtis</author><author>Daniel P Robinson</author><author>Baoyu Zhou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>A stochastic algorithm is proposed, analyzed, and tested experimentally for solving continuous optimization problems with nonlinear equality constraints. It is assumed that constraint function and derivative values can be computed but that only stochastic approximations are available for the objective function and its derivatives. The algorithm is of the sequential quadratic optimization variety. Distinguishing features of the algorithm are that it only employs stochastic objective gradient estimates that satisfy a relatively weak set of assumptions (while using neither objective function values nor estimates of them) and that it allows inexact subproblem solutions to be employed, the latter of which is particularly useful in large-scale settings when the matrices defining the subproblems are too large to form and/or factorize. Conditions are imposed on the inexact subproblem solutions that account for the fact that only stochastic objective gradient estimates are employed. Convergence results are established for the method. Numerical experiments show that the proposed method vastly outperforms a stochastic subgradient method and can outperform an alternative sequential quadratic programming algorithm that employs highly accurate subproblem solutions in every iteration.</p> <p>Funding: This material is based upon work supported by the National Science Foundation [Awards CCF-1740796 and CCF-2139735] and the Office of Naval Research [Award N00014-21-1-2532].</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We propose, analyze, and present experimental results with a stochastic inexact sequential quadratic optimization (SISQO) algorithm for minimizing an objective function subject to (s.t.) nonlinear equality constraints. Specifically, our algorithm is designed to solve problems of the form</p><p>where f : R n &#8594; R and c : R n &#8594; R m are continuously differentiable, &#969; is a random variable with probability space (&#8486;, F , P), F : R n &#215; &#8486; &#8594; R, and E &#969; [&#8226;] represents expectation taken with respect to the distribution of &#969;. Problems of this type arise in numerous important application areas. A partial list is the following: (i) learning a deep convolutional neural network for image recognition that imposes properties (e.g., smoothness) of the systems of partial differential equations (PDEs) that the convolutional layers are meant to interpret <ref type="bibr">(Ruthotto and Haber 2020)</ref>; (ii) multiple deep learning problems (see <ref type="bibr">M&#225;rquez-Neila et al. 2017)</ref>, including physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data <ref type="bibr">(Zhu et al. 2019)</ref>, natural language processing with constraints on output labels <ref type="bibr">(Nandwani et al. 2019)</ref>, image classification, detection, and localization <ref type="bibr">(Ravi et al. 2019)</ref>, deep reinforcement learning <ref type="bibr">(Achiam et al. 2017)</ref>, deep network compression <ref type="bibr">(Chen et al. 2018)</ref>, and manifold-regularized deep learning <ref type="bibr">(Tomar and</ref><ref type="bibr">Rose 2014, Kumar Roy et al. 2018</ref>); (iii) accelerating the solution of PDE-constrained inverse problems by using a reduced-order model in place of a full-order model coupled with techniques to learn the discrepancy between the reduced-and full-order models <ref type="bibr">(Sheriffdeen et al. 2019</ref>); (iv) multistage modeling <ref type="bibr">(Shapiro et al. 2014</ref>); (v) portfolio selection <ref type="bibr">(Shapiro et al. 2014</ref>); (vi) optimal power flow <ref type="bibr">(Summers et al. 2015)</ref>; and (vii) statistical problems, such as maximum likelihood estimation with constraints <ref type="bibr">(Geyer 1991</ref><ref type="bibr">, Chatterjee et al. 2016</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Notation</head><p>Let R denote the set of real numbers, R &#8805;p (respectively, R &gt;p ) denote the set of real numbers greater than or equal to (respectively, strictly greater than) p &#8712; R, and N :&#65533; {0, 1, 2, : : : } denote the set of natural numbers. Let R n denote the set of n-dimensional real vectors, R m&#215;n denote the set of m-by n-dimensional real matrices, and S n denote the set of n-by n-dimensional symmetric real matrices. For any p &#8712; N \ {0}, let [p] :&#65533; {1, : : : , p}. The &#8467; 2 -norm is written simply as &#8214; &#8226; &#8214;. Any run of our algorithm generates a sequence of iterates {x k }, where x k &#8712; R n for all k &#8712; N. For all k &#8712; N, we append the subscript k to other quantities defined with respect to the kth iteration, and for brevity, we define &#8711;f k :&#65533; &#8711;f (x k ), c k :&#65533; c(x k ), and &#8711;c k &#65533; &#8711;c(x k ). We refer to the range space of &#8711;c k as Range(&#8711;c k ) and refer to the null space of &#8711;c T k as Null(&#8711;c T k ). The fundamental theorem of linear algebra provides that these spaces are orthogonal and Range(&#8711;c k ) + Null(&#8711;c T k ) &#65533; R n . Finally, recall (see, e.g., <ref type="bibr">Nocedal and Wright 2006</ref>) that a primal point x &#8712; R n and a dual point y &#8712; R m constitute a first-order stationary point for Problem (1) if and only if c(x) &#65533; 0 and &#8711;f (x) + &#8711;c(x)y &#65533; 0. These conditions are necessary for x to be a local minimizer when the constraint functions satisfy a constraint qualification as is assumed in the paper; see Assumption 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Organization</head><p>Our algorithm is presented in Section 2, with a convergence analysis in Section 3. The results of numerical experiments are presented in Section 4, and concluding remarks are presented in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">SISQO Algorithm</head><p>A run of our proposed SISQO algorithm generates a sequence</p><p>where for all k &#8712; N, (x k , y k ) &#8712; R n &#215; R m is a primal-dual iterate pair; g k &#8712; R n is a stochastic gradient estimate; v k &#8712; R n is a normal primal direction that aims to reduced infeasibility by reducing a local derivative-based model of the &#8467; 2 -norm constraint violation measure; u k &#8712; R n is a tangential primal direction that aims to maintain the reduction in linearized infeasibility achieved by the normal primal direction while also aiming to reduce the objective by reducing a stochastic gradient-based quadratic approximation of the objective; d k :&#65533; v k + u k &#8712; R n is a full primal direction; &#948; k &#8712; R m is a dual direction; (&#961; k , r k ) &#8712; R n &#215; R m is a primal-dual linear system residual pair; (&#964; trial k , &#964; k ) &#8712; R &gt;0 &#8746; {&#8734;} &#215; R &gt;0 is a pair of trial and employed merit parameter values; (&#958; trial k , &#958; k ) &#8712; R &gt;0 &#215; R &gt;0 is a pair of trial and employed ratio parameter values; and (&#945; k, min , &#945; k, max , &#945; k ) &#8712; R &gt;0 &#215; R &gt;0 &#215; R &gt;0 is a tuple of minimum, maximum, and employed step size values, the last of which aims to produce a subsequent iterate x k+1 &#8592; x k + &#945; k d k yielding sufficient reduction in the &#8467; 2 -norm merit function. We present our algorithm in the context of the generation of a realization of the sequence (2), although our analysis ultimately considers the stochastic process defined by the algorithm, namely</p><p>of which ( <ref type="formula">2</ref>) is a realization. In the rest of this section, we present our algorithm in the context of the generation of a realization (2) toward our complete statement of Algorithm 1.</p><p>For the remainder of the paper, we make the following assumption.</p><p>Assumption 1. Let X &#8838; R n be an open convex set containing {x k } generated by every run of Algorithm 1. The objective f : R n &#8594; R is continuously differentiable and bounded below over X , and its gradient &#8711;f : R n &#8594; R n is Lipschitz continuous with constant L &#8712; R &gt;0 (with respect to the &#8467; 2 -norm) and bounded over X . The constraint c : R n &#8594; R m (with m &#8804; n) is continuously differentiable and bounded over X, and its Jacobian &#8711;c(&#8226;) T : R n &#8594; R m&#215;n is Lipschitz continuous with constant &#915; &#8712; R &gt;0 (with respect to the vector-induced &#8467; 2 -norm) and bounded over X . In addition, for all x &#8712; X , the singular values of &#8711;c(x) T are bounded uniformly below by a positive real number.</p><p>The elements of this assumption are standard in the continuous constrained optimization literature. Note that it does not include an assumption that X is bounded. One could relax the assumption to say that X contains {X k } almost surely, but because such a relaxation would only make it necessary to remark constantly on the probabilityzero event that {X k } &#8836; X without adding much value to our ultimate results, we employ Assumption 1 as it is stated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Merit Function</head><p>Motivated by the success of numerous line-search-SQO methods for solving deterministic equality constrained optimization problems, our algorithm employs an exact penalty function as a merit function; in particular, it</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS</p><p>employs the &#8467; 2 -norm merit function &#966; : R n &#215; R &gt;0 &#8594; R defined by &#966;(x, &#964;) &#65533; &#964;f (x) + &#8214;c(x)&#8214;, where &#964; is a positive merit parameter that is updated adaptively. (The choice of the &#8467; 2 -norm in &#966; is not essential. Another norm could be used. The choice of the &#8467; 2 -norm merely makes certain calculations simpler for our presentation and analysis.) A model l : T d&#8214;, with which we define the model reduction function &#8710;l :</p><p>The merit and model reduction functions play critical roles in our inexactness conditions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Computing a Search Direction</head><p>During the kth iteration, the algorithm computes a normal direction</p><p>Instead of solving (5) exactly, the algorithm allows for an inexact solution to be employed by only requiring the computation of v k &#8712; Range(&#8711;c k ) satisfying the Cauchy decrease condition</p><p>where &#603; c &#8712; (0, 1] is user defined. In ( <ref type="formula">6</ref>), v c k :&#65533; &#65533;&#8711;c k c k is the negative-gradient direction for the objective of ( <ref type="formula">5</ref>) at v &#65533; 0, and</p><p>otherwise, if c k &#65533; 0, then &#8711;c k c k &#65533; 0, and v k &#65533; 0 is the unique solution to (5). Popular choices for computing a normal direction satisfying the aforementioned conditions include Krylov subspace methods, such as the linear conjugate gradient (CG) method; see, for example, <ref type="bibr">Nocedal and Wright (2006)</ref>. For describing the tangential direction computation, let us first describe what would be the computation in a deterministic variant of our approach. Given (x k , y k ), &#8711;f k , v k &#8712; Range(&#8711;c k ), and H k &#8712; S n satisfying Assumption 2, consider the quadratic optimization subproblem</p><p>which has the unique solution</p><p>We make the following assumption pertaining to {H k } throughout the paper.</p><p>Assumption 2. There exists &#950; &#8712; R &gt;0 and &#954; H &#8712; R &gt;&#950; such that for all k &#8712; N in any run of the algorithm, &#8214;H k &#8214; &#8804; &#954; H and u T H k u &#8805; &#950;&#8214;u&#8214; 2 for all u &#8712; Null(&#8711;c T k ). The introduction of ( <ref type="formula">8</ref>) and ( <ref type="formula">9</ref>) allows us to define, for the purposes of our analysis only (i.e., not for actual computation), the true and exact primal-dual search direction conditioned on the behavior of the algorithm up to the kth iteration as</p><p>Because our algorithm only presumes access to a stochastic gradient estimate g k of &#8711;f k , the corresponding exact, but not true, primal-dual search direction is</p><p>(For the description of our algorithm and our initial analysis, it suffices that g k &#8712; R n . Our ultimate required assumption about the stochastic gradient estimators is Assumption 6.) Our algorithm, to avoid having to form or factorize the matrix in (10), computes a tangential direction by computing (u k , &#948; k ) through iterative linear algebra techniques applied to the symmetric indefinite system (10). In particular, our algorithm computes (u k , &#948; k ) such that the full primal search direction d k :&#65533; v k + u k , dual direction &#948; k , and residual</p><p>satisfy at least one of two sets of conditions. Next, we describe these conditions that the algorithm employs to determine what constitutes an acceptable search direction and corresponding residuals.</p><p>In the deterministic setting, line-search-SQO methods commonly combine the search direction with an updating strategy for the merit parameter in a manner that ensures that the computed direction is one of sufficient descent for the merit function. The required descent condition is guaranteed to be satisfied by choosing the merit parameter to be sufficiently small so that the reduction in a model of the merit function (recall (4)) is sufficiently large; see, for example, <ref type="bibr">Byrd et al. (2008, lemma 3.1)</ref>. Following such an approach, our algorithm requires that (u k , &#948; k ) (yielding d k :&#65533; v k + u k ) be computed and &#964; be set such that the model reduction condition</p><p>holds for some user-defined &#963; u &#8712; (0, 1), &#603; u &#8712; (0, &#950;) (see Assumption 2), and &#963; c &#8712; (0, 1). The value for &#964; for which ( <ref type="formula">12</ref>) is required to hold depends on one of two different situations as described next. Condition (12) plays a central role in the conditions that we require (u k , &#948; k ) to satisfy. We define these in the context of termination tests (TTs) because they dictate conditions that once satisfied, can cause termination of an iterative linear system solver applied to (10). (The tests are inspired by the sufficient merit approximation reduction termination tests developed in <ref type="bibr">Byrd et al. (2008</ref><ref type="bibr">Byrd et al. ( , 2010) )</ref> and <ref type="bibr">Curtis et al. (2009)</ref> for a deterministic SQO method.) Our first termination test states that an inexact solution is acceptable if ( <ref type="formula">12</ref>) is satisfied with the current merit parameter value (i.e., &#964; &#8801; &#964; k &#8592; &#964; k&#65533;1 ), the norms of the residuals satisfy certain upper bounds, and either the tangential direction is sufficiently small in norm compared with the normal direction or the tangential direction is one of sufficiently positive curvature for H k and yields a sufficiently small objective value for (8) (with g k in place of &#8711;f k ). The test makes use of a sequence {&#946; k } that will also play a critical role in our step size selection scheme that is described in the next subsection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Termination Test</head><p>, and v k &#8712; Range(&#8711;c k ) computed to satisfy <ref type="bibr">(6)</ref>, the pair (u k , &#948; k ) satisfies TT1 if with the pair (&#961; k , r k ) defined in (11), it holds that</p><p>and ( <ref type="formula">12</ref>) is satisfied with &#964; &#8801; &#964; k&#65533;1 . (In this case, &#964; k &#8592; &#964; k&#65533;1 , so (12) holds with &#964; &#8801; &#964; k .) TT1 cannot be enforced in every iteration, even in the deterministic setting, because there may exist points in the search space at which all of the conditions required cannot be satisfied simultaneously, even if the linear system (10) is solved accurately. In short, the algorithm needs to allow for the computation of a search direction for which (12) can only be satisfied with a decrease of the merit parameter. That said, the algorithm needs to be careful in terms of the situations in which such a decrease is allowed, or else, the merit parameter sequence may behave in a manner that ruins a convergence guarantee for solving the original constrained optimization problem. For our algorithm, we employ the following termination test for this situation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Termination Test</head><p>), and v k &#8712; Range(&#8711;c k ) computed to satisfy <ref type="bibr">(6)</ref>, the pair (u k , &#948; k ) satisfies Termination Test 2 if with the pair (&#961; k , r k ) defined in (11), ( <ref type="formula">13</ref>)-( <ref type="formula">15</ref>) and</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS</p><p>hold. (In this case, for user-defined &#603; &#964; &#8712; (0, 1), the algorithm will set</p><p>where</p><p>(18) so ( <ref type="formula">12</ref>) is satisfied with &#964; &#8801; &#964; k . See Lemma 3 for a proof.)</p><p>In Lemma 1, we show under a loose assumption about the iterative linear system solver that for all k &#8712; N, the algorithm can compute a pair (u k , &#948; k ) satisfying at least one of TT1 or TT2. Therefore, the index of each iteration of each realization of our method is contained in one of two index sets:</p><p>It is worthwhile to emphasize that in terms of the stochastic process defined by the algorithm, the index sets K 1 and K 2 are random (i.e., they may contain different indices in different runs). This randomness is handled as part of our convergence analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Computing a Step Size</head><p>Upon computation of d k &#8592; v k + u k , our algorithm computes a positive step size &#945; k to set x k+1 . Given positive Lipschitz constants L and &#915; (recall Assumption 1), it follows for all &#945; &#8712; R &gt;0 that</p><p>Combining these with (4) yields</p><p>This derivation provides a convex piecewise quadratic upper-bounding function for the change in the merit function corresponding to a step from x k to x k + &#945;d k . Given user-defined &#951; &#8712; (0, 1) and the aforementioned sequence {&#946; k } &#8834; (0, 1], our algorithm's step size selection scheme makes use of</p><p>The definition of &#945; suff k can be motivated as follows. Its value, when &#946; k &#65533; 1, is the largest in [0, 1] such that for all &#945; &#8712; [0, &#945; suff k ], the right-hand side of (20) (with &#8711;f k replaced by g k ) is less than or equal to &#65533;&#951;&#945;&#8710;l(x k , &#964; k , g k , d k ). Such an inequality is representative of one enforced in deterministic line-search-SQO methods. Otherwise, with &#946; k &#8712; (0, 1] introduced and not necessarily equal to one, the value of &#945; suff k can be diminished during the optimization, which allows for step size control as is required for convergence guarantees for certain stochastic gradient-based methods; see, for example, <ref type="bibr">Bottou et al. (2018)</ref>. The first term inside the min appearing in ( <ref type="formula">21</ref>) is important for the convergence guarantees that we prove for our method, but it can behave erratically because of the algorithm's use of stochastic gradients. To account for this, given user-defined &#603; &#958; &#8712; (0, 1), our algorithm defines</p><p>Combining this inequality with (21), the monotonically nonincreasing behaviors of {&#958; k } and {&#964; k }, and assuming that</p><p>where &#958; &#65533;1 and &#964; &#65533;1 initialize the sequences {&#958; k } and {&#964; k }, respectively, one finds that</p><p>The value &#945; min k serves as a minimum value (i.e., a lower bound) for our choice of step size. In our analysis, we show that the sequence {&#958; k } is bounded below and away from zero by a positive real number that is common to all runs of the algorithm (see Lemma 9).</p><p>Next, let us derive a maximum value (i.e., an upper bound) for our algorithm's choice of step size. If d k &#65533; 0, then without loss of generality, the algorithm can set &#945; k &#8592; &#945; k, min . Otherwise, if d k &#8800; 0, consider the strongly convex function &#966; : R &#8594; R defined by</p><p>Notice that when &#946; k &#65533; 1, it holds that &#966;(&#945;) &#8804; 0 for all &#945; &#8712; R &#8805;0 if and only if the right-hand side of (20) with &#8711;f k replaced by g k is less than or equal to &#65533;&#951;&#945;&#8710;l(x k , &#964; k , g k , d k ). Thus, following a similar argument, one can be motivated as to the fact that our algorithm never allows a step size larger than &#945;</p><p>Finally, again to mitigate adverse effects caused by the use of stochastic gradients, our algorithm employs the maximum step size</p><p>where &#952; &#8712; R &#8805;0 is user defined. Overall, our algorithm allows any step size with</p><p>Lemma 4 in our analysis shows that this interval is nonempty.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Updating the Primal-Dual Iterate</head><p>In the primal space, our algorithm employs the iterate update x k+1 &#8592; x k + &#945; k d k . However, in the dual space, it allows additional flexibility. For consistency with the deterministic setting (see, e.g., <ref type="bibr">Curtis et al. 2009, equation 2.19</ref>), our algorithm is stated to require y k+1 to satisfy</p><p>Clearly, choosing y k+1 &#8592; y k + &#948; k is one particular option satisfying (27), although other choices, such as leastsquares multipliers, could also be used.</p><p>Algorithm 1 (Stochastic Inexact Sequential Quadratic Optimization (SISQO))</p><p>Require:</p><p>and &#964; k by ( <ref type="formula">17</ref>) and ( <ref type="formula">18</ref>)</p><p>using the definitions in ( <ref type="formula">24</ref>) and ( <ref type="formula">26</ref>) 15: set x k+1 &#8592; x k + &#945; k d k , and choose y k+1 satisfying ( <ref type="formula">27</ref>) 16: end for</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Analysis</head><p>Our analysis is presented in three parts. In Section 3.1, we show that Algorithm 1 is well posed, which is followed by Section 3.2, in which a set of lemmas is proved that hold for every run of the algorithm. The analysis in Sections 3.1 and 3.2 is written generically in terms of a realization of a run of the algorithm. Then, in Section 3.3, we prove convergence properties for the algorithm that are written in terms of the stochastic process defined by the algorithm. This analysis focuses on an event that presumes certain behavior of the merit and ratio parameter sequences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Well-Posedness</head><p>Our aim in this subsection is to prove that each procedure in each iteration of every run of Algorithm 1 is performed in a manner that terminates finitely under mild assumptions. Along the way, we establish other useful properties. In this subsection and the following one, our analysis merely requires for all k &#8712; N that g k &#8712; R n and that H k &#8712; S n satisfies Assumption 2.</p><p>For the sake of generality, we make the following assumption about the iterative linear system solver employed in line 6 of Algorithm 1, which merely requires that the residual of the linear system solve vanishes asymptotically over any run of the solver. We emphasize, however, that there exist approaches, such as MINRES <ref type="bibr">(Paige and Saunders 1975)</ref>, that guarantee that an exact solution-which would satisfy our termination tests-can be computed in a number of linear system solver iterations that is bounded uniformly for all linear systems arising throughout any realization of the algorithm. However, we merely make the following assumption because it is all that is required by our analysis, and it offers more flexibility in the choice of linear system solver. Assumption 3. For all k &#8712; N in any run, the iterative linear system solver employed in line 7 of Algorithm 1 to compute</p><p>We also make the following assumption concerning the algorithm iterates and corresponding stochastic gradient estimates computed in each iteration. Assumption 4. For all k &#8712; N in any run, it holds that c k &#8800; 0 or g k &#8713; Range(&#8711;c k ).</p><p>We justify Assumption 4 in the following manner. In the deterministic setting, the algorithm encounters a point x k such that c k &#65533; 0 and &#8711;f k &#8712; Range(&#8711;c k ) if and only if there exists y k such that (x k , y k ) is first-order stationary for Problem (1). In such a scenario, it is reasonable to require that an exact solution of ( <ref type="formula">10</ref>) is computed or at least that a sufficiently accurate solution is computed such that a practical termination condition for (10) is triggered and the algorithm terminates. In the stochastic setting, the algorithm encounters c k &#65533; 0 and g k &#8712; Range(&#8711;c k ) if and only if x k is exactly feasible and the stochastic gradient lies exactly in the range space of &#8711;c k . Because g k is a stochastic gradient, we contend that it is unlikely that it will lie exactly in Range(&#8711;c k ), except in special circumstances. Thus, for simplicity in our analysis, we impose Assumption 4. Note that if Assumption 4 did not hold, then one of the following could be employed in a practical implementation. (i) If a sufficiently accurate solution of (10) satisfies neither TT1 nor TT2, then a new stochastic gradient could be sampled, perhaps following a procedure to ensure that if multiple new stochastic gradients are computed, then each is computed with lower variance, or (ii) random (e.g., Gaussian) noise could be added to g k for all k &#8712; N so that Assumption 4 holds almost surely in all iterations, in which case the convergence result that we prove holds almost surely.</p><p>We can now show that the search direction computation is well posed. We remark in passing that if one was to employ a linear system solver, such as MINRES, that would produce an exact solution of the linear system within a uniformly bounded number of iterations, then the arguments in the proof of the following lemma would show that the linear system solver computes (u k , &#948; k ), satisfying at least one of TT1 or TT2 in a uniformly bounded number of iterations.</p><p>Lemma 1. For all k &#8712; N in any run, the iterative linear system solver computes (u k , &#948; k ) satisfying at least one of TT1 or TT2 in a finite number of iterations.</p><p>Proof. We prove the result by considering two cases.</p><p>Case 1. c k &#8800; 0. For this case, we show that (u k , &#948; k ) &#8801; (u k, t , &#948; k, t ) satisfies TT2 for sufficiently large t &#8712; N. Let us first observe that it follows from Assumption 3, Assumption 4, and &#946; k &#8712; (0, 1] that ( <ref type="formula">13</ref>) and ( <ref type="formula">14</ref>) hold with (&#961; k , r k ) &#8801; (&#961; k, t , r k, t ) for all sufficiently large t &#8712; N.</p><p>Let us now show that ( <ref type="formula">15</ref>) holds for all sufficiently large t &#8712; N. Because c k &#8800; 0, it follows under Assumption 1 that v k &#8800; 0. If u k, * &#65533; 0, then Assumption 3 implies {u k, t } &#8594; u k, * &#65533; 0, in which case it follows from &#954; u &#8712; R &gt;0 that the former condition in (15) holds with u k &#8801; u k, t for all sufficiently large t &#8712; N. On the other hand, if u k, * &#8800; 0, then (10) and Assumption 2 imply</p><p>Combining this inequality with &#603; u &#8712; (0, &#950;), &#954; v &#8712; R &gt;0 , v k &#8800; 0, and Assumptions 2 and 3, it follows that the latter set of conditions in (15) holds with u k &#8801; u k, t for all sufficiently large t &#8712; N.</p><p>Finally, let us show that ( <ref type="formula">16</ref>) holds for all sufficiently large t &#8712; N, which combined with the previous conclusions, shows that TT2 is satisfied by (u k , &#948; k ) &#8801; (u k, t , &#948; k, t ) for all sufficiently large t &#8712; N. By Assumption 3, ( <ref type="formula">6</ref>), and</p><p>, which shows that ( <ref type="formula">16</ref>) holds with r k &#8801; r k, t for all sufficiently large t &#8712; N.</p><p>Case 2. c k &#65533; 0. For this case, we show that (u k , &#948; k ) &#8801; (u k, t , &#948; k, t ) satisfies TT1 for all sufficiently large t &#8712; N. First, recall that c k &#65533; 0 implies that v k &#65533; 0. We also claim that u k, * &#8800; 0. To prove this by contradiction, suppose that u k, * &#65533; 0. Combining this fact with v k &#65533; 0 and (10), it follows that g k + &#8711;c k (y k + &#948; k, * ) &#65533; 0, which with c k &#65533; 0, violates Assumption 4. Thus, u k, * &#8800; 0.</p><p>Next, notice that the argument used in the beginning of case (1) still applies in this case, which allows us to conclude that both ( <ref type="formula">13</ref>) and ( <ref type="formula">14</ref>) hold with (&#961; k , r k ) &#65533; (&#961; k, t , r k, t ) for all sufficiently large t &#8712; N. Combining (29) with Assumptions 2 and 3 and &#603; u &#8712; (0, &#950;) allows us to deduce that the second set of conditions in (15) holds with u k &#8801; u k, t for all sufficiently large t &#8712; N. Next, from the fact that v k &#65533; 0 and (10), it follows that &#8711;c <ref type="formula">10</ref>), and Assumption 2 shows that <ref type="formula">12</ref>) holds with &#964; &#8801; &#964; k&#65533;1 for all sufficiently large t &#8712; N. In summary, we have shown that for all sufficiently large t &#8712; N, the pair (u k , &#948; k ) &#8801; (u k, t , &#948; k, t ) satisfies TT1. w Next, we prove that every full primal direction is nonzero.</p><p>Lemma 2. For all k &#8712; N in any run, it holds that d k &#8800; 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof. By contradiction, suppose that</head><p>then this shows that the inequality in (13) cannot hold, meaning that (u k , &#948; k ) satisfies neither TT1 nor TT2, which contradicts Lemma 1. Hence, the only possibility is that c k &#8800; 0, which we assume for the rest of the proof.</p><p>Notice that from <ref type="formula">16</ref>) is not satisfied; thus, (u k , &#948; k ) does not satisfy TT2. Also, observe from v k &#8800; 0 (which follows from c k &#8800; 0 and Assumption 1), d k &#65533; 0, and ( <ref type="formula">6</ref> <ref type="formula">12</ref>) is not satisfied with &#964; &#65533; &#964; k&#65533;1 ; thus, (u k , &#948; k ) does not satisfy TT1. Overall, we have reached a contradiction to Lemma 1, and because we have reached a contradiction in all cases, the original supposition that d k &#65533; 0 cannot be true. w</p><p>We now show that our update strategy for the merit parameter ensures that the model reduction condition (12) always holds for &#964; &#8801; &#964; k . We also show another important property of {&#964; k }. Lemma 3. For all k &#8712; N in any run, the inequality in (12) holds with &#964; &#8801; &#964; k . In addition, for all k &#8712; N such that</p><p>Proof. The desired conclusion follows for k &#8712; K 1 because of the manner in which TT1 is defined and the fact that the algorithm sets &#964; k &#8592; &#964; k&#65533;1 for all k &#8712; K 1 . Hence, let us proceed under the assumption that k &#8712; K 2 . The inequality in (12) holds for &#964; &#8801; &#964; k with</p><p>We now proceed to show that this inequality holds by considering two cases.</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS Case 1. g</p><p>In this case, the algorithm sets &#964; k &#8592; &#964; k&#65533;1 . Combining this with ( <ref type="formula">16</ref>), &#8711;c T k u k &#65533; r k , and &#603; r &#8712; (&#963; c , 1) yields</p><p>The update (17) yields &#964; k &#8804; &#964; trial k , which combined with ( <ref type="formula">16</ref>), ( <ref type="formula">18</ref>), &#8711;c T k u k &#65533; r k , and &#603; r &#8712; (&#963; c , 1) yields</p><p>as desired. Moreover, from (17), we have</p><p>We conclude this subsection by showing that the interval defining our step size selection scheme (i.e., [&#945; min k , &#945; max k ]) is positive and nonempty for all k &#8712; N. We also show a useful property of the computed step size that is needed in our analysis.</p><p>Proof. It follows from ( <ref type="formula">24</ref>) and the fact that {&#946; k }, {&#958; k }, and {&#964; k } are positive sequences that &#945; min k &gt; 0 for all k &#8712; N. Hence, considering ( <ref type="formula">24</ref>) and ( <ref type="formula">26</ref>), to prove that 0 </p><p>, which with (25), shows that</p><p>For this case, it follows from (25), &#945; suff k &#8712; (0, 1], and the triangle inequality that <ref type="formula">4</ref>) and ( <ref type="formula">25</ref>), one finds (as previously mentioned) that &#966; is strongly convex. In addition, one finds that &#966;(0</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">General Results</head><p>In this subsection, we prove general results about the behavior of Algorithm 1. As in the previous subsection, our analysis here merely requires for all k &#8712; N that g k &#8712; R n and H k &#8712; S n satisfy Assumption 2. The next lemma gives a lower bound on &#8214;c k &#8214; &#65533; &#8214;c k + &#8711;c T k v k &#8214; relative to &#8214;c k &#8214;. Lemma 5. There exists &#954; 1 &#8712; R &gt;0 (a constant common to all runs of the algorithm) such that for all k &#8712; N in any run, one has that</p><p>Proof. This result follows as in <ref type="bibr">Curtis et al. (2009, lemma 3.5</ref>) but with small straightforward modifications to account for the fact that in our analysis here, the singular values of {&#8711;c T k } are bounded away from zero uniformly over all runs as part of Assumption 1. w The next lemma shows that &#8214;v k &#8214; is bounded below and above proportionally to &#8214;c k &#8214;. Lemma 6. There exists (&#954; 2 , &#954; 3 ) &#8712; R &gt;0 &#215; R &gt;0 (constants common to all runs of the algorithm) such that for all k &#8712; N in any run, one has that &#954; 2 &#8214;c k &#8214; &#8804; &#8214;v k &#8214; &#8804; &#954; 3 &#8214;c k &#8214;.</p><p>Proof. Assumption 1 ensures the existence of &#955; min &#8712; R &gt;0 such that &#8711;c T k &#8711;c k &#8829; &#955; min I for all k &#8712; N in any run. We now prove each desired inequality. First, consider the former inequality. Because this holds trivially when c k &#65533; 0, let us proceed under the assumption that c k &#8800; 0. One finds</p><p>It follows from this inequality, the triangle inequality, and ( <ref type="formula">6</ref>) that</p><p>Substituting in for the value of &#945; c k (recall ( <ref type="formula">7</ref>)) and</p><p>Again, substituting the value of &#945; c k and using the definition of &#955; min , one finds</p><p>It follows from these inequalities and Assumption 1 that there exists &#954; 2 &#8712; &gt;0 as claimed.</p><p>Let us now turn to the latter inequality. It follows from the normal direction computation that</p><p>Putting these facts together shows that</p><p>which combined with Assumption 1-namely, that the Jacobian function &#8711;c(&#8226;) T is bounded over the set X containing the iterates-establishes the existence of &#954; 3 &#8712; R &gt;0 as claimed. w</p><p>The next result gives a useful bound on the size of the search direction.</p><p>Lemma 7. There exists &#954; 4 &#8712; R &#8805;2 (a constant common to all runs of the algorithm) such that for all k &#8712; N in any run, one finds that</p><p>, the triangle inequality, and Lemma 6, it follows that</p><p>The existence of the required &#954; 4 &#8712; R &#8805;2 now follows from Assumption 1 because max{2, 2&#954; 2 3 &#8214;c k &#8214;} is uniformly bounded for all k &#8712; N in any run, which completes the proof. w</p><p>The next lemma shows that the model reduction &#8710;l(x k , &#964; k , g k , v k + u k ) is bounded below by a similar quantity as the upper bound for &#8214;d k &#8214; 2 in the previous lemma.</p><p>Lemma 8. There exists &#954; 5 &#8712; R &gt;0 (a constant common to all runs of the algorithm) such that for all k &#8712; N in any run, one has that &#8710;l</p><p>Lemma 3 shows that (12) holds with &#964; &#8801; &#964; k . Combining this fact with Lemma 5 and the monotonically nonincreasing behavior of {&#964; k } shows that</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS which proves the existence of the claimed &#954; 5 &#8712; R &gt;0 because &#963; u , &#603; u , &#963; c , &#954; 1 , and &#964; &#65533;1 are positive real numbers. The remaining inequalities follow from Lemmas 2 and 7. w</p><p>We next prove a lower bound for {&#958; k } that is uniform over every run of the algorithm.</p><p>Lemma 9. There exists &#958; min &#8712; R &gt;0 (a constant common to all runs of the algorithm) such that in any run, there exists k &#958; &#8712; N and</p><p>Proof. For all k &#8712; N, it follows from ( <ref type="formula">22</ref>) and Lemmas 7 and 8 that</p><p>Now, consider any iteration such that &#958; k &lt; &#958; k&#65533;1 . For such iterations, it follows from ( <ref type="formula">22</ref>) and ( <ref type="formula">30</ref>) that</p><p>Combining this fact with the initial choice of &#958; &#65533;1 shows that &#958; k &#8805; &#958; min :&#65533; min{(1 &#65533; &#603; &#958; )&#954; 5 =&#954; 4 , &#958; &#65533;1 } for all k &#8712; N. Combining this result with the that &#958; k &lt; &#958; k&#65533;1 implies that &#958; k &#8804; (1 &#65533; &#603; &#958; )&#958; k&#65533;1 (it decreases by at least a factor of 1 &#65533; &#603; &#958; ) gives the desired result. w</p><p>The next lemma gives a bound on the change in the merit function in each iteration.</p><p>Lemma 10. For all k &#8712; N in any run, one has that</p><p>Proof. By Lemma 4, one has &#966;(&#945; k ) &#8804; 0. Hence, starting as in (20); adding and subtracting the terms</p><p>; using the definition of &#966;(&#8226;); and using the fact that &#8711;c</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Convergence Analysis</head><p>Our goal in this subsection is to prove a convergence result for our algorithm. In general, in a run of the algorithm, one of three possible events can occur with respect to the merit parameter sequence. One possible event is that the merit parameter sequence eventually remains constant at a value that is sufficiently small. This is the event that we consider in our analysis here, where the meaning of sufficiently small is defined formally in the event E that is introduced shortly. The other two possible events are that the merit parameter sequence vanishes or eventually remains constant at a value that is too large. As discussed in Berahas et al. (2021, section 3.2.2), the former of these two events does not occur for the algorithm in that paper if the differences between the stochastic gradient estimates and the true gradients of the objective are uniformly bounded in norm; in particular, see <ref type="bibr">Berahas et al. (2021, proposition 3.18)</ref>. It is straightforward to show that such a conclusion also holds for Algorithm 1 in this paper because the merit parameter update strategy follows the same kind of approach as for the algorithm in <ref type="bibr">Berahas et al. (2021)</ref>; in particular, see the consistency between <ref type="bibr">Berahas et al. (2021, equations (3.3</ref>) and (3.4)) and in this paper, ( <ref type="formula">17</ref>) and (18) as well as <ref type="bibr">Byrd et al. (2008, lemma 4.7)</ref>, which considers the setting of inexact linear system solutions using inexactness tolerance conditions of the same type as in this paper. Moreover, in Berahas et al. (2021, section 3.2.2), it is shown that the latter type of event (namely, that the merit parameter remains constant at a value that is too large) occurs with probability of zero if one makes a reasonable assumption about the influence of the stochastic gradient estimates on the computed search directions; see also <ref type="bibr">Berahas et al. (2023, section 4.3)</ref> for additional discussion of this case in the context of an algorithm that employs a step decomposition approach, like in Algorithm 1. Again, it is straightforward to see that such a conclusion also holds for Algorithm 1 because the merit parameter update strategy is of the same form. Consequently, for our purposes here, we do not consider these latter events because we contend that for practical purposes, one can focus on the first event for the same reasons as in <ref type="bibr">Berahas et al. (2021</ref><ref type="bibr">Berahas et al. ( , 2023))</ref>.</p><p>Our main convergence result for Algorithm 1 considers an assumption that combines all of the assumptions required for our analysis until this point and assumes certain behavior of the merit parameter sequence through an event denoted as E. For this event and the subsequent analysis, recall the stochastic process (3) defined by the algorithm. Consider for each k &#8712; N the condition</p><p>similar to the one appearing in ( <ref type="formula">18</ref>). (For the sake of brevity in our notation, we overload the meaning of H k ; here, it may be a random variable satisfying Assumption 6 introduced shortly, which is consistent with the previously introduced and employed Assumption 2.) With this condition, let us define the following trial value of the merit parameter that would be computed in iteration k &#8712; N if the algorithm was to employ &#8711;f (X k ) in place of G k and solve (8) exactly:</p><p>(To be clear, the quantity T trial, true k never needs to be computed by our algorithm; it is only used in our analysis.) Using this quantity, we define our event of interest, namely E, as the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">Event E.</head><p>For some (k min , &#964; min , f sup ) &#8712; N &#215; R &gt;0 &#215; R, the event E :&#65533; E(k min , &#964; min , f sup ) occurs if and only if f (X k min ) &#8804; f sup , and there exists (K, T &#8242; , &#926; &#8242; ) &#8712; N &#215; R &gt;0 &#215; R &gt;0 with K &#8242; &#8804; k min , T &#8242; &#8805; &#964; min , and &#926; &#8242; &#8805; &#958; min (see Lemma 9) such that</p><p>In other words, event E is the event in which by iteration k min , the merit and ratio parameter sequences become constant at values at least &#964; min and &#958; min , respectively, and the objective value at iteration k min is bounded. With respect to this event, we make the following assumption for the rest of our analysis, our ultimate focus of which will be on the behavior of the algorithm starting in iteration k min , at which point the adaptive merit and ratio parameters are constant.</p><p>Assumption 5. For some (k min , &#964; min , f sup ) &#8712; N &#215; R &gt;0 &#215; R, the event E :&#65533; E(k min , &#964; min , f sup ) occurs, and conditioned on the occurrence of E, Assumptions 1, 2, 3, and 4 hold. In addition, along with the restrictions that {&#946; k } &#8834; (0, 1] and (23) holds for all k &#8712; N, the sequence {&#946; k } k&#8805;k min is chosen in a manner that is F k min -measurable.</p><p>A few remarks are in order with respect to Assumption 5. First, that the event E includes that the merit parameter sequence is bounded below can, as previously mentioned, be justified for the same reasons as in <ref type="bibr">Berahas et al. (2021)</ref>; the additional requirement that it eventually remains constant can be justified by Lemma 3, which shows that if the merit parameter is decreased, then it is decreased by at least a constant factor. Note that it is the inequality T &#8242; &#8804; T trial, true k that represents the aforementioned notion of the merit parameter ultimately being sufficiently small for all large k. Second, that E includes that the ratio parameter sequence is bounded below and eventually remains constant is not a strong assumption; it follows under our prior assumptions (that are carried forward in Assumption 5) because of Lemma 9. That said, the critical aspect here is that E requires that this sequence has become constant by iteration k min . Third, that E includes that f (X k min ) is bounded above is a relatively weak assumption but</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS</p><p>necessary for our purposes of ultimately showing that a sequence of stationarity measures vanishes in expectation. Finally, with respect to {&#946; k } k&#8805;k min , we state in Theorem 1 particular choices satisfying Assumption 5 for which our convergence guarantees hold. Precise strategies for setting these values that are consistent with our convergence guarantees are stated after Theorem 1.</p><p>Let G 0 be the &#963;-algebra defined by the initial conditions of Algorithm 1, and for all k &#8712; N, let G k be the &#963;-algebra generated by the initial conditions and {G 0 , : : : , G k&#65533;1 }. In addition, for all k &#8712; N, let the trace &#963;-algebra of E on G k be F k :&#65533; G k &#8745; E. Hence, {F k } is a filtration. For brevity, let</p><p>where P &#969; denotes probability with respect to the distribution of &#969; (and as for ( <ref type="formula">1</ref>), E &#969; denotes expectation with respect to the distribution of &#969;). Observe that conditioned on E, one has that</p><p>and one has that the random variables T &#8242; and &#926; &#8242; are F k -measurable for k &#65533; k min &#8805; K &#8242; . We make Assumption 6 about {G k } and {H k }. That the stochastic gradient estimators are unbiased is standard for algorithms based on stochastic approximation. One may be able to relax the so-called bounded-variance assumption introduced here, but we contend that this assumption is sufficient for showing the general type of convergence guarantee that our algorithm offers. Hence, we make a bounded-variance assumption here so as not to obfuscate the other details. Assumption 6. There exists M g &#8712; R &gt;0 such that for all k &#8712; N, the gradient estimator</p><p>In addition, for all k &#8712; N, the matrix H k (satisfying Assumption 5; i.e., the bounds in Assumption 2) is F k -measurable.</p><p>Combining Assumption 6 with Jensen's inequality, it holds for all k &#8712; N that</p><p>ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi</p><p>We now return to our analysis. First, let us derive bounds on the expected difference between U k and U true k . To that end, let us define Z k &#8712; R n&#215; (n&#65533;m) as an (F k -measurable) matrix whose columns form an orthonormal basis for Null(&#8711;c(X k ) T ), which implies that Z T k Z k &#65533; I and &#8711;c(X k ) T Z k &#65533; 0. Under Assumption 5 (namely, Assumption 1), let U k, 1 &#8712; R m and U k, 2 &#8712; R n&#65533;m be vectors forming the orthogonal decomposition of U k into Range(&#8711;c(X k )) and Null(&#8711;c</p><p>The corresponding values for D k and D true k are found to be</p><p>In the proof of our next lemma, we use the fact that</p><p>which can be seen as follows. The nonzero eigenvalues of AB are equal to those of BA when the products are valid, meaning that the nonzero eigenvalues of</p><p>, which are all one; hence, the bound in (37) holds.</p><p>Lemma 11. There exists &#954; 6 &#8712; R &gt;0 such that for all k &#8712; N with k &#8805; k min , one finds</p><p>ffi ffi ffi ffi ffi ffi ffi M g q + &#954; 6 &#946; k :</p><p>Proof. It follows from ( <ref type="formula">35</ref>) that</p><p>which combined with Assumption 6, shows that</p><p>Combining this equation with the triangle inequality, Assumption 5 (specifically, Assumptions 1 and 2), ( <ref type="formula">14</ref>), and (37) ensures the existence of &#954; 6 &#8712; R &gt;0 such that for all k &#8712; N,</p><p>which is the first desired result. Next, to derive the desired bound on</p><p>one can combine the expression for U k &#65533; U true k with the triangle inequality to obtain</p><p>Taking conditional expectation and using Assumption 6, (34), (37), and ( <ref type="formula">14</ref>), one finds</p><p>where &#954; 6 is the same value as used, which completes the proof. w</p><p>We now bound the difference in expectation between &#8711;f (X k</p><p>For the first term on the right-hand side, the Cauchy-Schwarz inequality,</p><p>Lemma 11, and Assumption 5 (i.e., Assumption 1) imply that there exists</p><p>Now, for the second term, first observe from Assumption 6 that</p><p>Combining this fact with (35), the Cauchy-Schwarz inequality, Assumption 5 (namely, Assumptions 1 and 2),</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS (37), and (34) shows that there exists (&#954; 8 , &#954; 8</p><p>Combining the results gives the desired result. w</p><p>We now proceed to bound in expectation the last few terms appearing in the right-hand side of the inequality proved in Lemma 10. The next lemma considers the last pair of terms.</p><p>Lemma 13. There exists &#954; 9 &#8712; R &gt;0 such that for all k &#8712; N with k &#8805; k min , one finds</p><p>Proof. From ( <ref type="formula">11</ref>), ( <ref type="formula">14</ref>), the fact that</p><p>, ( <ref type="formula">24</ref>), ( <ref type="formula">23</ref>), and the monotonically nonincreasing behavior of {T k } and {&#926; k }, it follows that there exists &#954; 9 &#8712; R &gt;0 such that</p><p>which gives the desired conclusion. w</p><p>Our next result provides an upper bound in expectation for the second term appearing on the right-hand side of the inequality in Lemma 10. Lemma 14. There exists &#954; 10 &#8712; R &gt;0 such that for all k &#8712; N with k &#8805; k min , one finds</p><p>Proof. Let I k be the event that &#8711;f (X k ) T (D k &#65533; D true k ) &#8805; 0, and let I c k be its complementary event. It follows from (32), the definition of I k , the fact that</p><p>and the law of total expectation that for all k &#8805; k min , one finds Combining this with the fact that (26</p><p>and the law of total expectation shows for all k &#8805; k min that</p><p>Combining this with Lemma 11, (23</p><p>and Assumption 1 shows that there exists &#954; 10 &#8712; R &gt;0 such that for all k &#8805; k min , one finds</p><p>which is the desired conclusion. w</p><p>We now use the model reduction based on the true step D true k to show an upper bound on the expected reduction in the model based on the step D k .</p><p>Proof. It follows from Lemma 12; (4); the fact that</p><p>, and D true k are all F k -measurable for k &#8805; k min ; (9); and ( <ref type="formula">14</ref>) that for all k &#8805; k min ,</p><p>which is the desired result. w</p><p>We now prove our main result. In the result, the quantity &#8710;l(X k , T k , &#8711;f (X k ), D true k ) serves as a measure of stationarity with respect to (1); after all, the proof for Lemma 8 shows, with (&#8711;f</p><p>Thus, in a run, if there exists infinite K &#8838; N with lim k&#8712;K, k&#8594;&#8734; &#8710;l(x k , &#964; k , &#8711;f k , d true k ) &#65533; 0, then (38) and Lemma 6 imply that lim k&#8712;K, k&#8594;&#8734; &#8214;c k &#8214; &#65533; lim k&#8712;K, k&#8594;&#8734; &#8214;u true k &#8214; &#65533; lim k&#8712;K, k&#8594;&#8734; &#8214;v k &#8214; &#65533; 0, which combined with (9), shows that any limit of {(x k , y k + &#948; true k )} is a first-order stationary point for (1). In our stochastic setting, we prove for two different choices of {&#946; k } k&#8805;k min that an expected average measure of stationarity exhibits desirable properties. These properties match those ensured by a stochastic gradient method in the unconstrained setting (where &#8214;&#8711;f (X k )&#8214; 2 is the measure of stationarity).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 1. Define</head><p>Then, the following results hold.</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS i. If &#946; k &#65533; &#946; &#65533; &#968;A &#8242;</p><p>(1&#65533;&#951;)(A &#8242; +&#952;) for some &#968; &#8712; (0, 1) for all k &#8805; k min , then</p><p>where &#966; min &#8712; R is a lower bound for &#966;(&#8226;, T &#8242; ) over X by Assumption 5 (namely, Assumption 1).</p><p>ii.</p><p>for some &#968; &#8712; (0, 1) for all k &#8805; k min , then</p><p>In either case, if in a run, there exists K &#8838; N with |K | &#65533; &#8734; and lim k&#8712;K, k&#8594;&#8734; &#8710;l(x k , &#964; k , &#8711;f k , d true k ) &#65533; 0, then any limit point of {(x k , y k + &#948; true k )} is a first-order stationary point for (1). Proof. By the definition of A &#8242; , {&#946; k } &#8834; (0, 1], and line 15 of Algorithm 1, it follows that <ref type="formula">38</ref>)); Lemmas 10, 14, 8, 13, and 15; and the fact that {&#946; k } &#8834; (0, 1] that for all k &#8805; k min , one finds</p><p>where </p><p>Then, taking total expectation conditioned only on the event E, it follows for k &#8712; N that</p><p>After rearrangement, one finds that (39) holds, where the limit as k &#8594; &#8734; holds because of the aforementioned fact that E[&#966;(X k min , T &#8242; ) | E] is bounded under Assumption 5. Case ii. By the conditions on {&#946; k } k&#8805;k min , it follows in a similar manner as in case (i) that</p><p>which after rearrangement and taking limits as k &#8594; &#8734;, proves that (40) holds.</p><p>The final conclusion of the theorem follows by the arguments provided before the theorem. w</p><p>We close this section by remarking that as described in <ref type="bibr">Berahas et al. (2021)</ref>, the elements of {&#946; k } can be chosen to satisfy Assumption 5 and the conditions of Theorem 1. Specifically, to obtain the convergence guarantees in case (i) of Theorem 1, the algorithm can set</p><p>for all k &#8712; N, which clearly ensures that &#946; k &#8712; (0, 1] and that (23) holds. In addition, assuming that event E occurs, the value &#945; &#8242; k for sufficiently large k becomes the realization of A &#8242; stated in the theorem, in which case &#946; k &#65533; &#946; for all subsequent k satisfies the condition stated in the theorem. To obtain the convergence guarantees in case (ii) of Theorem 1, the algorithm can "reset" a diminishing sequence after each iteration, in which the merit parameter and/or the ratio parameter are decreased. Specifically, in any iteration, say k &#8712; N, in which the merit parameter and/or the ratio parameter were reduced from the prior iteration, one can set</p><p>. If the value for k is reset after every time the merit parameter and/or the ratio parameter are decreased, under event E, the value for k eventually will not be reset, meaning that in subsequent iterates, {&#946; k } will satisfy the conditions of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Results</head><p>In this section, we demonstrate the performance of a MATLAB implementation of Algorithm 1 for solving (i) a subset of the constrained and unconstrained testing environment with safe threads (CUTEst) set <ref type="bibr">(Gould et al. 2015</ref>) and (ii) two optimal control problems from Hinterm &#252;ller et al. <ref type="bibr">(2003)</ref>. The goal of our testing is to demonstrate the computational benefits of using inexact subproblem solutions obtained based on our termination tests from Section 2.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Iterative Solvers</head><p>To obtain the normal direction v k as an inexact solution of (5), we applied CG to &#8711;c k &#8711;c T k v &#65533; &#65533;&#8711;c k c k . Denoting the tth CG iterate as v k, t , where v k, 0 &#65533; 0, the method sets v k &#8592; v k, t , where t is the first CG iteration such that</p><p>. The properties of CG as a Krylov subspace method ensure that v k, t &#8712; Range(&#8711;c k ) for all t &#8712; N; hence, v k &#8712; Range(&#8711;c k ).</p><p>To obtain the tangential direction u k and associated dual search direction &#948; k , we applied the MINRES method <ref type="bibr">(Paige and</ref><ref type="bibr">Saunders 1975, Choi et al. 2011</ref>) to (10). (We discuss our choice of H k along with each set of experiments.) Letting (u k, t , &#948; k, t ) denote the tth MINRES iterate, where (u k, 0 , &#948; k, 0 ) &#65533; (0, 0), the method sets (u k , &#948; k ) &#8592; (u k, t , &#948; k, t ), where t is the first MINRES iteration such that for some &#954; &#8712; (0, 1) (recalling the definition of (&#961; k, t , r k, t ) in ( <ref type="formula">28</ref>)),</p><p>and TT1 and/or TT2 hold. The choice of &#954; &#8712; (0, 1) is discussed with each experiment.</p><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS 4.2. Choosing the Step Size Algorithm 1 (see line 15) stipulates that the step size &#945; k chosen for the kth iteration satisfies &#945; k &#8712; [&#945; min k , &#945; max k ]. Keeping in mind that &#945; min k &#8804; &#945; suff k &#8804; min{&#945; &#966; k , 1} (see Lemma 4), we take advantage of this flexibility in choosing the step size by defining</p><p>(1:1)</p><p>t k &#945; min k otherwise, 8 &gt; &lt; &gt; : where t k is the largest value of t &#8712; N such that (1:1) t &#945; min k &#8804; min{&#945; &#966; k , &#945; min k + &#952;&#946; 2 k , 1}. We do not explicitly compute &#945; &#966; k in our code. Instead, we can verify whether (1:1) t &#945; min k &#8804; &#945; &#966; k (as needed) because it is equivalent to verifying whether &#966;((1:1) t &#945; min k ) &#8804; 0, which is computable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Algorithm Variants Tested</head><p>To test the utility of using inexact subproblem solutions in Algorithm 1, we consider two algorithm variants: SISQO and SISQO_exact. SISQO is Algorithm 1 with inexact solutions computed as described in Section 4.1 with a relatively large value for &#954; in (42). On the other hand, SISQO_exact is identical to SISQO with the exception that it uses a relatively small value for &#954; in ( <ref type="formula">42</ref>). (Because of the similarities of the algorithms, SISQO_exact acts as a proxy for the stochastic sequential quadratic programming (SQP) algorithm from <ref type="bibr">Berahas et al. (2021)</ref>, although because it employs iterative linear algebra techniques, we are able to compare SISQO_exact with SISQO more readily.) We specify the values of &#954; &#8712; (0, 1) used along with each of our tests in Sections 4.5 and 4.6. Our reason for comparing these two variants is to focus attention on the numerical gains obtained as a result of using inexact subproblem solutions. Both variants use the same computation for the normal step, so the performance difference can be attributed directly to the inexact tangential step computation. Additionally, we compare SISQO with a stochastic subgradient method employed to minimize the merit function &#966; directly (for various fixed values of &#964;). We refer to our implementation of this algorithm as Subgrad. Because H k is a diagonal matrix for all k &#8712; N in all of our experiments, one CG or MINRES iteration is comparable computationally with two iterations of Subgrad.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Metrics Used for Comparison</head><p>Our metrics of interest are infeasibility and stationarity. Given any iterate x k in a run of SISQO, we consider the termination conditions &#8214;c(x k )&#8214; &#8734; &#8804; 10 &#65533;6 and &#8214;&#8711;f k + &#8711;c k y k, ls &#8214; &#8734; &#8804; 10 &#65533;2 , where y k, ls is the least-squares multiplier at x k . If an iterate satisfying these conditions is found in the first 1,000 iterations, then SISQO terminates and returns x SISQO &#8592; x k . Otherwise, SISQO terminates after the 1,000th iteration and sets k &#8242; &#8592; arg min i&#8712;{0}&#8746; <ref type="bibr">[1,</ref><ref type="bibr">000]</ref> </p><p>This allows us to associate with each run of SISQO the two measures err feasibility &#65533; &#8214;c(x SISQO )&#8214; &#8734; and err stationarity &#65533; &#8214;&#8711;f (x SISQO ) + J(x SISQO ) T y SISQO &#8214; &#8734; , where y SISQO &#8712; R m is the least-squares multiplier at x SISQO . We use the total number of CG and MINRES iterations performed by SISQO as a budget for the total number of CG and MINRES iterations performed by SISQO_exact; no other termination condition is used for SISQO_exact. Upon termination of SISQO_exact, we define x exact -the iterate with which we define the feasibility and stationarity errors-using the same strategy as for setting x SISQO . Finally, we ran Subgrad for multiple instances of &#964;. (Further details on the choices of &#964; and the iteration budget for Subgrad are given with each experiment.) Upon termination of Subgrad, we define x subgrad -the iterate with which we define the feasibility and stationarity errors-using the same strategy as for setting x SISQO . In all cases, we define the KKT error as the maximum of the feasibility and stationarity errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Results on the CUTEst Problems</head><p>In the CUTEst set <ref type="bibr">(Gould et al. 2015)</ref>, there are 138 equality constrained problems with m &#8804; n. From these, we selected those such that (i) (n + m) &#8712; [500, 10, 000], (ii) the objective function is not constant, (iii) the objective function remained above &#65533;10 50 over the sequences of iterates generated by runs of our algorithm, and (iv) the linear independence constraint qualification was satisfied at all iterates encountered in each run of our algorithm. This process of elimination resulted in the following 11 test problems: ELEC, LCH, LUKVLE1, LUKVLE3, LUKVLE4, LUKVLE6, LUKVLE7, LUKVLE9, LUKVLE10, LUKVLE13, and ORTHREGC.</p><p>The function and derivative evaluations from CUTEst are deterministic, and for the purpose of these experiments, we exploited this fact to compute values as needed by our algorithm, including using function evaluations to estimate Lipschitz constants. However, we introduced noise into the computation of the objective gradients for each application of our stochastic algorithm. In particular, we generated stochastic gradients as</p><p>&#65533; , where for testing purposes, we considered the three noise levels &#603; N &#8712; {10 &#65533;4 , 10 &#65533;2 , 10 &#65533;1 }. This particular choice for defining the stochastic gradients ensured that an appropriate value for M g as indicated in Assumption 6 would be given by M g &#65533; {10 &#65533;8 , 10 &#65533;4 , 10 &#65533;2 }, corresponding to the values for &#603; N .</p><p>We set &#954; &#65533; 0:1 for SISQO and &#954; &#65533; 10 &#65533;7 for SISQO_exact. All of the remaining parameters were set identically: &#964; &#65533;1 &#65533; &#963; &#65533; &#954; v &#65533; 0:1, &#951; &#65533; 0:5, &#603; uv &#65533; 1, 000, &#958; &#65533;1 &#65533; &#603; c &#65533; 1, &#603; &#964; &#65533; &#603; &#958; &#65533; 0:01, &#954; &#961; &#65533; &#954; r &#65533; 100, &#603; 2 &#65533; 0:9, &#954; u &#65533; 10 &#65533;8 , &#967; &#65533; 1 &#65533; 10 &#65533;8 , &#952; &#65533; 10 4 , and &#946; k &#65533; 1 for all k &#8712; N. For all k &#8712; N, we randomly generated a sample point near x k , and then, we estimated L k and &#915; k using finite differences of the objective gradients and constraint Jacobians between x k and the sampled point. These values were used in place of L and &#915;, respectively, in our step size selection. Here, H k &#65533; I for all k &#8712; N in all runs.</p><p>For each test problem, we ran SISQO, SISQO_exact, and Subgrad with five different random seeds. As previously justified, the iteration budget for Subgrad was set to be twice the total numbers of CG and MINRES iterations used by SISQO. Also, for Subgrad, we ran the algorithm with the 11 merit parameter values in &#964; &#8712; {10 0 , 10 &#65533;1 , : : : , 10 &#65533;10 } with step sizes set as &#945; k &#65533; &#964; &#964;L k +&#915; k for all k &#8712; N, and then, we selected the best iterate over all of these runs. We computed the feasibility and KKT errors for all algorithms as described in Section 4.4; see Figure <ref type="figure">1</ref>.</p><p>From Figure <ref type="figure">1</ref>, one finds that SISQO performs better than SISQO_exact and Subgrad in terms of both feasibility and KKT errors. SISQO achieves smaller errors for smaller noise levels, which may be expected because of the fact that these experiments are run with constant {&#946; k }.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Results on Optimal Control Problems</head><p>In our second set of experiments, we considered two optimal control problems motivated by those in Hinterm &#252;ller et al. <ref type="bibr">(2003)</ref>. In particular, we modified the problems to have equality constraints only and finite sum objective functions. Specifically, given a domain &#926; &#8712; R 2 , a constant N &#8712; N &gt;0 , reference functions w ij &#8712; L 2 (&#926;) and z ij &#8712; L 2 (&#926;) for (i, j) &#8712; {1, : : : , N} &#215; {1, : : : , N}, and a regularization parameter &#955; &#8712; R &gt;0 , we first considered the problem min w, z</p><p>&#65533; &#65533; s:t: &#65533; &#8710;w &#65533; z in &#926; and w &#65533; 0 on &#8706;&#926;: (43)</p><p>Second, with the same notation but z ij &#8712; L 2 (&#8706;&#926;), we also considered min w, z</p><p>&#8706;&#926;) &#65533; &#65533; s:t: &#65533; &#8710;w + w &#65533; 0 in &#926; and &#8706;w &#8706;p &#65533; z on &#8706;&#926;, <ref type="bibr">(44)</ref> where p represents the unit outer normal to &#926; along &#8706;&#926;. As reference functions for both problems, we chose z ij &#65533; 0 and</p><p>x 2 &#65533; for all (i, j) &#8712; {1, : : : , N} &#215; {1, : : : , N} for some Note. Subgrad, stochastic subgradient method. Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization</p><p>(&#603; S , &#603; N ) &#8712; R &gt;0 &#215; R &gt;0 . We selected the following values for the constants: N &#65533; 3, &#955; &#65533; 10 &#65533;5 , &#603; S &#65533; 50, and &#603; N &#8712; {10 &#65533;4 , 10 &#65533;2 , 10 &#65533;1 }. Because the objective functions of ( <ref type="formula">43</ref>) and ( <ref type="formula">44</ref>) are finite sums, to generate stochastic gradients as unbiased estimates of the true gradient, we first uniformly generated random (i, j) &#8712; {1, : : : , N} &#215; {1, : : : , N}, and then, we computed the gradient corresponding to the (i, j)th term in the objective function. We note that with the choice of parameters, it follows that an appropriate value for M g in Assumption 6 is given by M g &#8776; {10 &#65533;8 , 10 &#65533;4 , 10 &#65533;2 } to correspond, respectively, to the values for &#603; N .</p><p>Because the optimal control problems have a quadratic objective function and linear constraints, we used the exact second derivative matrix H k &#65533; diag(I, &#955;I) for all k &#8712; N. For this choice, the curvature condition on H k in Assumption 2 is trivially satisfied.</p><p>In terms of algorithm parameters, we set &#954; &#65533; 10 &#65533;4 for SISQO and &#954; &#65533; 10 &#65533;7 for SISQO_exact. All of the remaining parameters were set identically for the two variants in the same manner as in the previous section with the following exceptions: &#964; &#65533;1 &#65533; 10 &#65533;4 , L &#65533; 1, and &#915; &#65533; 0, where the latter choice is valid because the objectives are quadratic and the constraints are linear.</p><p>For each of the two optimal control problems in ( <ref type="formula">43</ref>) and (44), we ran SISQO, SISQO_exact, and Subgrad with five different random seeds, and then, we computed their average feasibility and KKT errors as described in Section 4.4. We observed that {&#964; k } was constant in all runs of SISQO and SISQO_exact. Therefore, we ran Subgrad with only three merit parameter values, namely &#964; &#8712; {10 &#65533;2 , 10 &#65533;4 , 10 &#65533;6 }, and we choose step sizes as &#945; k &#8592; &#964; &#964;L+&#915; for all k &#8712; N. (In these experiments, the budget for Subgrad iterations was set to the total numbers of CG and MINRES iterations used by SISQO because the constraint function evaluations, required in each iteration of Subgrad, are as expensive computationally as each CG and MINRES iteration.) In Table <ref type="table">1</ref>, we report average feasibility and KKT errors as well as the average number of iterations performed by Algorithm 1 before termination ("iter.") and the number of CG and MINRES iterations ("C + M iter."), with the latter discussed in Section 4.1. The results are given in Table <ref type="table">1</ref>. One can observe that SISQO performs better than the others in terms of average feasibility and KKT errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We have proposed, analyzed, and tested an inexact stochastic SQP algorithm for solving stochastic optimization problems involving deterministic, smooth, nonlinear equality constraints. We proved a convergence guarantee (in expectation) for our algorithm that is comparable with that proved for the exact stochastic SQP method recently presented by <ref type="bibr">Berahas et al. (2021)</ref>, which in turn, is comparable with that known for the stochastic gradient in unconstrained settings <ref type="bibr">(Bottou et al. 2018)</ref>. Our MATLAB implementation, SISQO, illustrated the benefits of allowing inexact step computation for solving problems from the CUTEst set <ref type="bibr">(Gould et al. 2015)</ref> and two optimal control problems.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from informs.org by [98.7.209.7] on 24 May 2024, at 09:40 . For personal use only, all rights reserved.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS Downloaded from informs.org by [98.7.209.7] on 24 May 2024, at 09:40 . For personal use only, all rights reserved.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>Curtis, Robinson, and Zhou: Stochastic Inexact Sequential Quadratic Optimization INFORMS Journal on Optimization, Articles in Advance, pp. 1-23, &#169; 2024 INFORMS</p></note>
		</body>
		</text>
</TEI>
