<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>The Cost of Nonconvexity in Deterministic Nonsmooth Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>Institute for Operations Research and the Management Sciences (INFORMS)</publisher>
				<date>11/29/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10495482</idno>
					<idno type="doi">10.1287/moor.2022.0289</idno>
					<title level='j'>Mathematics of Operations Research</title>
<idno>0364-765X</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Siyu Kong</author><author>A. S. Lewis</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>We study the impact of nonconvexity on the complexity of nonsmooth optimization, emphasizing objectives such as piecewise linear functions, which may not be weakly convex. We focus on a dimension-independent analysis, slightly modifying a 2020 black-box algorithm of Zhang-Lin-Jegelka-Sra-Jadbabaie that approximates an ϵ-stationary point of any directionally differentiable Lipschitz objective using [Formula: see text] calls to a specialized subgradient oracle and a randomized line search. Seeking by contrast a deterministic method, we present a simple black-box version that achieves [Formula: see text] for any difference-of-convex objective and [Formula: see text] for the weakly convex case. Our complexity bound depends on a natural nonconvexity modulus that is related, intriguingly, to the negative part of directional second derivatives of the objective, understood in the distributional sense.</p> <p>Funding: This work was supported by the National Science Foundation [Grant DMS-2006990].</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We consider the problem of minimizing a Lipschitz objective function f : R n ! R. We suppose that we are given a Lipschitz constant, an initial point x 0 2 R n , and an upper bound on the gap f (x 0 ) inf f . We have access to f at input points x 2 R n through an oracle that outputs only local information, such as the function value f(x) and a subgradient in @f (x). It is easy to see that the problem of approximating the minimum value inf f within a given tolerance &#9999; &gt; 0 suffers from the curse of dimensionality; it requires what amounts to grid search, needing a number of oracle calls growing like O 1 &#9999; n &#8677; , so depending exponentially on the dimension n.</p><p>Relaxing our goals, rather than minimization, we may instead seek points that are, in some sense, nearly critical. A well-known example is the case of a smooth but nonconvex objective function f : R n ! R, for which finding a point x 2 R n satisfying |rf (x) | &#63743; &#9999; is relatively easy. Assuming that f is bounded below and L-smooth, meaning that its gradient has a known Lipschitz constant L, elementary calculus shows that the gradient descent iteration x</p><p>x 1 L rf (x) always reduces the objective value f(x) by at least 1 2L |rf (x) | 2 . Assuming a bound M on the gap between the initial objective value and inf f , the algorithm succeeds after no more than 2LM  &#9999; 2 iterations, independent of the dimension n.</p><p>Less well known is an interesting analogous result for objectives f that are nonsmooth but convex; Davis and Drusvyatskiy [4] present a randomized algorithm that finds a point within a distance &#9999; of some point at which f has a subgradient with norm no larger than &#9999; using &#213; 1 &#9999; 2 &#8677; subgradient evaluations. The &#213;(&#8226;) notation suppresses logarithmic factors, but again, the complexity estimate is dimension independent. When f is just weakly convex, meaning that the function f + &#961; 2 | &#8226; | 2 is convex for some constant &#961; &gt; 0, the analogous algorithm (see <ref type="bibr">Davis and Drusvyatskiy [5]</ref>) still has a dimension-independent complexity bound, now of the form O 1</p><p>For general Lipschitz functions f, however, the analogous problem is intractable (see <ref type="bibr">Kornowski and Shamir [15]</ref>). More precisely, any algorithm guaranteed to approximate within a distance &#9999; a point with a Clarke subgradient of norm less than &#9999; must suffer from the same curse of dimensionality as grid search, requiring a number of subgradients growing like O 1</p><p>Although that intractability might seem the end of the question, there is a more relaxed proxy approach to minimizing Lipschitz functions, dating back to work of Goldstein <ref type="bibr">[10]</ref> in the 1970s. This method can be viewed as seeking a point in R n around which f is differentiable on some cluster of nearby points at which the gradients have a small convex combination-a small "Goldstein subgradient" in the language of Goldstein <ref type="bibr">[10]</ref>. Although some published algorithms, such as that of Mahdavi-Amiri and Yousefpour <ref type="bibr">[16]</ref>, accomplished this goal, no complexity analysis existed until recently.</p><p>A 2020 breakthrough of Zhang et al. <ref type="bibr">[24]</ref> presented an algorithm for this Goldstein subgradient problem with a complexity analysis depending on the radius &#948; of the cluster and the size &#9999; of the subgradient but independent of the dimension n. The algorithm assumes directional differentiability of the objective f and relying on an associated directional subgradient oracle, uses an innovative randomized line search to achieve an efficiency guarantee of essentially O(&#9999; 3 &#948; 1 ). The development of Zhang et al. <ref type="bibr">[24]</ref> may have practical as well as theoretical interest. The basic algorithm in Zhang et al. <ref type="bibr">[24]</ref> employs "null" steps, rather like the traditional bundle methods that have long enjoyed considerable success in large-scale convex optimization (see Sagastiz&#225;bal <ref type="bibr">[21]</ref>). The practicality of the basic algorithm of Zhang et al. <ref type="bibr">[24]</ref> remains unclear, but a related algorithm performs at least comparably with stochastic gradient descent in the authors' preliminary experiments. Two subsequent papers, Davis et al. <ref type="bibr">[6]</ref> and Tian and So <ref type="bibr">[22]</ref>, point out that small random perturbations allow a standard subgradient oracle to replace the directional version of Zhang et al. <ref type="bibr">[24]</ref>.</p><p>Two recent developments, Jordan et al. <ref type="bibr">[13]</ref> and Kornowski and Shamir <ref type="bibr">[15]</ref> (see also Jordan et al. <ref type="bibr">[14]</ref> 1 ), raise the question of deterministic algorithms for this problem. Although both papers prove positive results in the smooth case and Jordan et al. <ref type="bibr">[14]</ref> thereby develop a "white-box" deterministic smoothing approach to the nonsmooth problem, both manuscripts also prove negative results for the general dimension-independent question. However, the negative results of Jordan et al. <ref type="bibr">[13]</ref> and Kornowski and Shamir <ref type="bibr">[15]</ref> concern general Lipschitz optimization. In contrast, by modestly restricting the class of nonsmooth objectives, we are able to develop a simple deterministic black-box version of the algorithm of Zhang et al. <ref type="bibr">[24]</ref> with increased but still dimensionindependent complexity. Specifically, our contribution is an algorithm that achieves, up to a nonconvexity modulus for the objective, a dimension-independent complexity of O(&#9999; 4 &#948; 1 ), thus derandomizing the method of Zhang et al. <ref type="bibr">[24]</ref> at the expense of an extra order of &#9999;. When &#948; &#224; &#9999;, we arrive at the estimates noted in the abstract: O(&#9999; 4 ) for the original method of Zhang et al. <ref type="bibr">[24]</ref> and O(&#9999; 5 ) for our deterministic modification (which strengthens to O(&#9999; 4 ) in the weakly convex case). A higher level of nonconvexity corresponds to a larger nonconvexity modulus and hence, to a larger complexity bound; in this sense, the modulus measures a complexity "cost" for finding nearly critical points of nonconvex functions. Our analysis covers interesting objectives, such as piecewise linear functions, which are not even weakly convex. We relate the nonconvexity modulus of the objective with its distributional second derivative, hinting at an intriguing relationship between such derivatives and algorithmic complexity in general optimization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The Optimization Problem and Oracle</head><p>Primarily to emphasize the elementary nature of our development, we adopt a rudimentary setting for our optimization problem. On a real inner product space X with corresponding norm | &#8226; |, we consider the problem of minimizing a function f : X ! R. The objective f may be neither smooth nor convex, and the space X may be neither finite dimensional nor even complete. The method we develop, following Zhang et al. <ref type="bibr">[24]</ref>, relies on an the following underlying idea. Definition 1. We say that an objective function f : X ! R has a directional subgradient map G : X 2 ! X when for all points x 2 X and directions e 2 X, the G&#226;teaux directional derivative</p><p>exists and satisfies hG(x, e), ei &#224; f 0 (x; e):</p><p>We say that G is L-bounded for some constant L &gt; 0 if its norm |G(x, e)| is never larger than L.</p><p>In applications, the objective function f is L-Lipschitz, and the vector G(x, e) is a subgradient of some kind for f at the point x associated with the direction e; therefore, we loosely refer to G(x, e) as a "subgradient." Nonetheless, we choose this rudimentary setting to emphasize again the elementary nature of our development, which makes no recourse to variational or Lipschitz analysis.</p><p>Example 1 (Differentiable Functions). For any function f that is L-Lipschitz and has a G&#226;teaux derivative rf (x) at every point x 2 X, the equation</p><p>defines an L-bounded directional subgradient map.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Kong and Lewis: The Cost of Nonconvexity in Deterministic Nonsmooth Optimization</head><p>Example 2 (Convex Functions). For an L-Lipschitz convex function f with convex subdifferential @f , any map G satisfying G(x, e) 2 argmax{hg, ei : g 2 @f (x)} for all x, e 2 X is an L-bounded directional subgradient map.</p><p>More generally, we have the following example, which covers many interesting objectives, including the weakly convex case. For a Lipschitz function f : R n ! R, the Clarke subdifferential @ c f (x) is the convex hull of the set of all limits of the form limrf (x r ) for points x r ! x in R n . The function f is subdifferentially regular when its G&#226;teaux directional derivative satisfies f 0 (x; e) &#224; max{hg, ei : g 2 @ c f (x)} for all x, e 2 R n :</p><p>Example 3 (Subdifferentially Regular Functions). Consider any L-Lipschitz subdifferentially regular function f : R n ! R. Then, any map G satisfying G(x, e) 2 argmax{hg, ei : g 2 @ c f (x)} for all x, e 2 R n is an L-bounded directional subgradient map.</p><p>Notwithstanding the generality of this example, we emphasize that our framework is not restricted to objectives that are subdifferentially regular, as the following result shows.</p><p>Proposition 1 (Directional Clarke Subgradient Maps). Any locally Lipschitz function f : R n ! R that is directionally differentiable has a directional subgradient map G : R n &#8677; R n ! R n satisfying G(x, e) 2 @ c f (x) for all x, e 2 R n .</p><p>Proof. We just need to prove that if f has a G&#226;teaux directional derivative at the point x 2 R n in the direction e 2 R n , then there exists a Clarke subgradient g 2 @ c f (x) satisfying hg, ei &#224; f 0 (x; e). For r &#224; 1, 2, 3, : : : , by the nonsmooth mean value theorem, there exists a point x r 2 x, x + 1 r e &#8676; &#8965; and a subgradient g r 2 @ c f (x r ) satisfying</p><p>Because the subdifferential @ c f mapping is closed and locally bounded, any limit point g of the sequence {g r } has the desired property. w</p><p>As discussed in Section 1, rather than trying to minimize the objective f, we instead seek a point x 2 X that is, in some sense, approximately critical. To this end, we make the following definition. We denote the closed ball in X of radius &#948; and center x by B &#948; (x).</p><p>Definition 2. Consider the setting of Definition 1. Corresponding to any constant &#948; &gt; 0, a Goldstein subgradient at x is a vector of the form</p><p>for a positive integer k, positive weights &#955; i summing to one, points x i 2 B &#948; (x), and directions e i 2 X for i &#224; 1, 2, : : : , k. The set of all Goldstein subgradients is denoted by @ &#948; f (x).</p><p>Loosely speaking, the Goldstein subdifferential @ &#948; f (x) consists of all convex combinations of subgradients at nearby points. Strictly speaking, our notion is potentially smaller than the standard definition for Lipschitz f : R n ! R, namely the closed convex hull of the set @ c f (B &#948; (x)).</p><p>We can now state our goal, which relies on a second constant &#9999; &gt; 0.</p><p>Aim. Find a point x 2 X and a Goldstein subgradient g 2 @ &#948; f (x) such that | g| &#63743; &#9999;.</p><p>The development of Zhang et al. <ref type="bibr">[24]</ref> accomplishes this goal, explicitly in the setting of Proposition 1 and assuming that f is directionally differentiable in the (stronger) Hadamard sense. It relies on the following oracle.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Oracle 1 (Directional Subgradient)</head><p>Input:</p><p>&#8226; a point x 2 X, &#8226; a direction e 2 X. Output:</p><p>&#8226; the objective value f(x),</p><p>&#8226; the directional derivative f 0 (x; e), &#8226; a subgradient-like vector G(x, e).</p><p>In this work, we rely on the same directional subgradient oracle. We emphasize that this oracle is stronger than the standard subgradient oracle. We cannot directly compare our approach with algorithms relying only on the standard oracle because the extra computational cost of the directional oracle is hidden in our analysis, just as it is in Zhang et al. <ref type="bibr">[24]</ref>. How generally available a directional oracle might be is unclear; a cautionary point of comparison is the NP hardness of deciding the existence of descent directions (see Nesterov <ref type="bibr">[18]</ref>).</p><p>In generic practice, however, for a Lipschitz objective f, we may expect that the algorithm we describe never encounters points x where f is nondifferentiable, in which case any subgradient oracle simply returns g &#224; rf (x) (see <ref type="bibr">Bianchi et al. [1]</ref>). More formally, nonetheless, we must consider nonsmooth points. Undeterred, Zhang et al. <ref type="bibr">[24]</ref> argue that availability of the directional oracle, although a nontrivial restriction, may be a reasonable assumption, directional subgradients being potentially computable via nonsmooth calculus rules. In contrast with Zhang et al. <ref type="bibr">[24]</ref>, our aim here is a fully deterministic algorithm. Accordingly, although we use this same stronger oracle, we instead use a deterministic line search, much as in Kornowski and Shamir <ref type="bibr">[15]</ref>. To ensure termination, we make a mild assumption about the directional behavior of the objective function f, similar in spirit to the idea of semismoothness (see Mifflin <ref type="bibr">[17]</ref>) common in nonsmooth computation but simpler and weaker.</p><p>Rather than the directional subgradient oracle on which we here rely, combined with a line search, one might instead consider other potential oracles. In particular, given an oracle that returns a descent direction, one might try to mimic smooth techniques, simply searching along that direction. However, even laying aside the NP hardness of a general descent direction oracle noted, such a basic approach is flawed. A classical example of Wolfe <ref type="bibr">[23]</ref> shows that gradient descent with exact line search on a nonsmooth continuous convex function can encounter only smooth points and yet, converge to a point that is not a minimizer. Nonsmoothness necessitates a more robust approach. Definition 3. Consider a directionally differentiable function f : X ! R. We call f directionally semismooth if all points x 2 X and directions e 2 X satisfy lim</p><p>On the other hand, following the approach of Facchinei and Pang [9], f is semismooth if it is Lipschitz, and the following stronger property holds: f 0 (x + e; e) f 0 (x; e) &#224; o(e) as e ! 0: Most Lipschitz functions in practice are semismooth, including in particular, difference-of-convex functions and semialgebraic functions; see <ref type="bibr">Bolte et al. [3]</ref>. As we shall see, directional semismoothness suffices to guarantee termination of our algorithm, but before describing it, we focus first on the line search.</p><p>Following Zhang et al. <ref type="bibr">[24]</ref>, as is usual in black-box-style analysis, we suppose that the optimizer has no access to the implementation of Oracle 1 (directional subgradient). In one interesting class of practical examples, the output "subgradients" may be generated cheaply but opaquely by standard autodifferentiation algorithms, like that in PyTorch (see <ref type="bibr">Paszke et al. [19]</ref>). Indeed, in the image classification experiments in Zhang et al. <ref type="bibr">[24]</ref>, where precise implementation of the directional subgradient oracle seems challenging, the authors simply use as a heuristic an autodifferentiation routine, assuming that it should never encounter nonsmooth ingredients in practice. In fact, even in the presence of nonsmooth ingredients, the occasional failure of autodifferentiation to output correct subgradients seems to have no impact on practical optimization, for reasons discussed by Bolte and Pauwels</p><p>As a more precise illustration of the computational cost of the directional subgradient oracle, we present a simple formal example of an easily implementable version-a special case of Example 3. Consider an objective of the form f (x) &#224; max i2I f i (x) for smooth functions f i indexed by a finite set I. Given an input consisting of a point x and direction e, the oracle works by first finding the active subset I(x) of indices i maximizing f i (x) and then, chooses as the output subgradient any gradient rf i (x) maximizing the inner product hrf i (x), ei over i 2 I(x). Typically, the gradient computations dominate, so the cost of the directional oracle is proportional to the size of the set I. We emphasize, however, that to be interesting, this computation should be invisible to the optimizer; given the same access to the functions f i as the oracle, the optimizer could solve the problem easily using classical nonlinear programming tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">A Simple Line Search</head><p>We pose the line search problem as a self-contained question. Consider points p &lt; q in R and a function h :</p><p>. Suppose that h is right differentiable on the interval [p, q), and an oracle returns, for any input t 2 [p, q), the value h(t) and the right derivative</p><p>(possibly extended valued). How difficult is it to find a point t satisfying h 0 + (t) &lt; 0? When h is Lipschitz, the most basic randomized strategy-uniformly sampling random points t in the interval-solves this problem with high probability. Denoting the Lipschitz constant by L, the right derivative h 0 + always lies in the interval [ L, L]. Denote the measure of the subset S of the interval [p, q] where h 0 + 0 by &#955;. Then, providing that the average slope satisfies h(q) h(p) q p &#224; &#963; &lt; 0, the fundamental theorem of calculus implies</p><p>Because the integrand is bounded below by zero on S and by -L on the complement S c , a set of measure (q p) &#955;, we deduce</p><p>Hence, the probability &#955; q p that a uniformly distributed random point t 2 [p, q] fails to satisfy h 0 (t) &lt; 0 is no larger than 1 &#963; L . Thus, for small &#963;, using at least L &#963; independent samples, the probability of success is at least 1 2 . However, we seek a deterministic algorithm, so we instead consider the following simple method, similar in spirit to one described by <ref type="bibr">Davis et al. [6]</ref>. We repeatedly bisect the interval [p, q], each time discarding the subinterval over which the function h decreases the least. The algorithm checks whether the right derivative at the midpoint of the current interval is negative, in which case it terminates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 (Search h by Bisection for Negative Derivative)</head><p>input:</p><p>Notice that the algorithm initially calls the oracle at the left end point p of the given interval, calculates the function value at the right end point q, and then, calls the oracle once during each bisection.</p><p>In general, this algorithm may fail to terminate. It is easy to construct a Lipschitz function h satisfying h(p) &gt; h(q), and yet, the derivative of h at the initial end point p and at every midpoint m is positive. To rule out such oscillatory examples, we can rely on directional semismoothness of h, which in this univariate setting, Semismoothness is more than enough to prove termination of the line search. The simple argument also applies to non-Lipschitz functions.</p><p>Proposition 2 (Termination of the Line Search). Suppose that the function h : [p, q] ! R satisfies h(p) &gt; h(q) and that its left and right derivatives satisfy the semismoothness conditions</p><p>Then, Algorithm 1 terminates.</p><p>Proof. If the algorithm does not terminate, then it generates monotonic sequences l k " and r k #, satisfying r k l k ! 0,</p><p>for each iteration k &#224; 0, 1, 2, : : : . (The line search ensures that the ratio in the second inequality never increases.) Denote the two sequences' mutual limit by m. Semismoothness ensures</p><p>which is a contradiction. Hence, for all large k, we have q &gt; r k &gt; m. The right end points r k decrease to m and so, are always updated eventually; hence, the line search guarantees</p><p>Adding now gives a contradiction. w</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The Optimization Algorithm</head><p>To minimize a locally Lipschitz function f : X ! R using the directional subgradient oracle described, we study the following algorithm. The method we describe is essentially that of Zhang et al. <ref type="bibr">[24]</ref> but with the deterministic line search described in the preceding section.</p><p>Apply Algorithm 1 (bisection) using the formula</p><p>Notice that, in contrast with the classical subgradient method but like many classical approaches involving line searches, trust regions, or subgradient bundling techniques, this algorithm only updates the current iterate when it satisfies a sufficient decrease condition. Intermediate "null" steps serve to shorten the current Goldstein subgradient g.</p><p>When the objective f is directionally semismooth, Algorithm 1 terminates by Proposition 2, which in turn, guarantees termination of Algorithm 2, as we shall now prove. We use the following simple tool, following Zhang et al. <ref type="bibr">[24]</ref>.</p><p>Lemma 1. If two vectors g, g 0 2 B L (0) satisfy hg 0 , gi &#63743; 1 2 |g | 2 , then the shortest vector g 00 in the line segment [g, g 0 ] satisfies</p><p>Proof. For all t 2 [0, 1], we have</p><p>8L 2 proves the result. w We can now prove the validity of the algorithm, again imitating parts of the argument in Zhang et al. <ref type="bibr">[24]</ref>, which we reproduce for ease of reading.</p><p>Theorem 1 (Finite Termination). Suppose that we apply Algorithm 2 to a directionally semismooth function f : X ! R that is bounded below, with tolerance &#9999; &gt; 0, radius &#948; &gt; 0, and initial point x 0 2 X. Suppose that the directional subgradient map in Oracle 1 is L-bounded. Then, the algorithm terminates with a point x 2 X and a subgradient g 2 @ &#948; f (x) satisfying |g | &#63743; &#9999;. The number of line searches required does not exceed</p><p>Proof. Suppose that the current subgradient g is inadequate in the sense that, in the terminology of the algorithm description, it neither is small nor generates sufficient decrease. We then apply the bisection method, Algorithm 1, to the given function h. The initial interval is [p, q] &#224; [0, &#948;], and the average slope satisfies</p><p>We thus arrive at a subgradient g 0 2 @ &#948; f (x) satisfying hg 0 , &#285;i &lt; &#9999; 2 . We deduce hg 0 , gi &lt;</p><p>2 . The algorithm replaces the current subgradient g by the shortest vector g 00 in the line segment [g, g 0 ]. Because g 00 2 @ &#948; f (x), we can repeat this shortening process, providing that g 00 is also inadequate. Suppose that the subgradient g remains inadequate after completing k such shortening steps. Let &#961; i denote the quantity</p><p>16L 2 after i &#224; 0, 1, 2, : : : , k steps. Then, &#961; 0 &#63743; 1 16 , and for each I, we have</p><p>Hence, after no more than</p><p>shortening steps, each requiring one line search, we arrive at an adequate subgradient g 2 @ &#948; f (x). To summarize, starting at any point with an inadequate subgradient, we require no more than 16L 2 &#9999; 2 line searches before finding an adequate subgradient g.</p><p>There are now two possibilities. Either the subgradient g satisfies | g| &#63743; &#9999;, in which case we stop, or we perform a reduction step, replacing the current point x by x &#948; g | g | , thereby decreasing the objective value by at least the quantity &#948;&#9999; 3 . Because the objective is bounded below, beginning from the initial point x 0 , this procedure terminates after no more than d 3 &#948;&#9999; (f (x 0 ) inf f ))e reduction steps, from which the line search bound (1) follows. Proposition 2 ensures that each line search requires only finitely many oracle calls, completing the proof. w</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Complexity of the Line Search</head><p>To complete our complexity analysis for the minimization algorithm, we simply need to bound the number of oracle calls needed by each line search and multiply by our bound (1) on the number of line searches. Consider, therefore, the bisection method. When the function h : [p, q] ! R is convex, the problem is trivial; because h(p) &gt; h(q) ) h 0 + (p) &lt; 0, the algorithm terminates at the first oracle call. More generally, we proceed by correcting any lack of convexity in h by adding a convex perturbation s : [p, q] ! R.</p><p>Recall that, for any interval J, a function h : J ! R is difference of convex when there exists a convex function s : J ! R such that h + s is also convex. For such functions, we have the following tool.</p><p>Lemma 2. Consider a function h : [p, q] ! R and a convex function s : [p, q] ! R such that h + s is also convex. For any points x &lt; y in the interval [p, q], if h 0 + (x) 0, then h(y) h(x) y x s 0 + (x) s 0 (y): Proof. Denote the convex function h + s by r. The convex functions r and s satisfy r 0 + (x) 2 @r(x) and s 0 (y) 2 @s(y):</p><p>Hence,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>s(y) &#224; r(x) h(x) r(y) + h(y)</head><p>&#63743; h(y) h(x) + r 0 + (x)(x y) &#63743; h(y) h(x) + s 0 + (x)(x y), and the result follows. w</p><p>We can then use the change in derivative of the necessary perturbation s to bound the number of iterations in the line search.</p><p>Theorem 2. Consider a function h : [p, q] ! R and a convex function s : [p, q] ! R such that h + s is also convex. If the bisection method, Algorithm 1, evaluates h 0 + , the right derivative, k 1 times without terminating, then h(q) h(p) q p s 0 + (p) s 0 (q) k :</p><p>Proof. We proceed by induction on the number of evaluations k &#224; 1, 2, 3, : : : . The case k &#224; 1 follows immediately from Lemma 2 by setting x &#224; p and y &#224; q. Suppose that the result holds for any points p &lt; q and for any number of evaluations no larger than k. Now, consider an instance of the algorithm that completes k + 1 evaluations. After the first evaluation, consider the midpoint m &#224; 1 2 (p + q). There are two possible cases depending on whether 2h(m) &lt; h(p) + h(q):</p><p>(</p><p>We consider them in turn. Suppose first that Inequality (2) holds. After the first evaluation, the algorithm makes k further evaluations, beginning with the initial interval [p, m]. Hence, by the induction hypothesis,</p><p>Because the algorithm did not terminate during the first two evaluations, we know h 0 + (m) 0. Applying Lemma 2 with x &#224; m and y &#224; q shows h(q) h(m) q m s 0 + (m) s 0 (q): Hence,</p><p>The case where Inequality (2) fails is similar. After the first bisection, the algorithm makes k further bisections, beginning with the initial interval [m, q]. Hence, by the induction hypothesis, h(q) h(m) q m s 0 + (m) s 0 (q) k : Hence,</p><p>Definition 4. Given any interval J &#8674; R, the concave deviation of a function h : J ! R is the infimum of the Lipschitz constants of convex functions s : J ! R such that the sum h + s is also convex. Consider, for example, a &#961;-weakly convex function h, for some constant &#961; 0, meaning that the function t</p><p>&#8677; 2 is convex, with Lipschitz constant &#961; 2 (q p), and h + s is also convex. w The concave deviation for functions that are not weakly convex may shrink more slowly than the length of the interval. For example, on the interval [ &#948;, &#948;], the function h(t) &#224; |t| The concave deviation of a function h : [p, q] ! R that is difference of convex may not be finite. An example is the concave function &#199; &#199; &#8226; p on the interval [0, 1]. However, if h extends to a difference-of-convex function on an open interval containing the interval [p, q], then its nonconvexity bound on [p, q] must be finite because we can write h as a difference-of-convex functions, each of which must be Lipschitz on [p, q].</p><p>Consider any continuous semialgebraic function h : R ! R. We can partition R into finitely many closed intervals J such that each restriction h| J is either convex or concave. In general, such a partition may not guarantee that h is difference of convex; an example is the function x 1 3 . However, if h is also Lipschitz, then each convex or concave ingredient h| J extends to a corresponding convex or concave Lipschitz function on R, and from these, we can easily decompose h into a difference-of-convex Lipschitz functions. Thus, all semialgebraic Lipschitz functions on R are difference of convex, with finite concave deviation on any bounded interval.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 1 (Line Search Complexity). If a function h : [p, q] ! R has finite concave deviation M and average rate of decrease</head><p>then the number of evaluations of h 0 + , the right derivative, required before the bisection method, Algorithm 1, terminates is no more than 1 + b 2M &#963; c. Proof. Suppose that the bisection method evaluates the right derivative k 1 times without terminating. Fix any value M 0 &gt; M. By assumption, there exists a convex function s with Lipschitz constant less than M 0 such that the sum h + s is also convex. From Theorem 2, we deduce the inequalities Given an open interval I, consider a difference-of-convex function h : I ! R. As observed by Hartman <ref type="bibr">[11]</ref>, such functions are characterized by having left and right derivatives everywhere, which furthermore, are of bounded variation on every compact interval in I. Any such function h also has a second derivative D 2 h in the distributional sense; in general, it is a signed Radon measure on the interval I (see <ref type="bibr">Dudley [7]</ref>).</p><p>We review briefly the underlying construction. Consider any convex function s : I ! R. Its right derivative s 0 + is nondecreasing and right continuous, and hence, it defines a nonnegative Radon measure D 2 s on the interval I via the property (D 2 s)(p, q] &#224; s 0 + (q) s 0 + (p) for all p &lt; q in I:</p><p>(We could equivalently work from the property (D 2 s)(p, q) &#224; s 0 (q) s 0 + (p).) More generally, for any difference-ofconvex function h : I ! R, consider any convex function s : I ! R such that h + s is also convex. The second derivative D 2 h is just the signed measure D 2 (h + s) D 2 s, which is independent of the choice of s. Convexity of h is characterized by the property D 2 h 0. More generally, the Jordan decomposition decomposes D 2 h uniquely into a difference of nonnegative Radon measures,</p><p>with the minimality property (see Rudin <ref type="bibr">[20, p. 127]</ref>) that any other decomposition into a difference of nonnegative Radon measures D 2 h &#224; &#955; &#181; satisfies &#955; (D 2 h) + and &#181; (D 2 h) .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 3. If the function h</head><p>Proof. Consider any convex function s : [p, q] ! R such that h + s is also convex. The second derivatives satisfy</p><p>so the minimality of the Jordan decomposition implies D 2 s (D 2 h) . If s is M-Lipschitz on [p, q], then</p><p>If the right-hand side is infinite, this completes the proof; so, suppose that it is finite. Define a function g : (p, q] ! R by</p><p>Then, g is a nonnegative nondecreasing left-continuous function that is bounded above and g(t) # 0 as t # p. Now, define a convex function s : [p, q] ! R by</p><p>Then, s 0 (t) &#224; g(t) for all t 2 (p, q] and s 0 + (p) &#224; 0. Furthermore, we have</p><p>is also convex, as is h + s, and the function s has Lipschitz constant</p><p>This completes the proof. w</p><p>As an illustration, we have the following result. Corollary 2 (Piecewise Linear Functions). Consider a continuous piecewise linear function h : [p, q] ! R, with m derivative discontinuities t 1 &lt; t 2 &lt;&#8943;&lt; t m in the interval (p, q). Define t 0 &#224; p and t m+1 &#224; q, and let g i be the value of the derivative on the interval (t i , t i+1 ) for 0 &#63743; i &#63743; m. Then, h has concave deviation</p><p>If h is L-Lipschitz, then this bound is no larger than d m 2 eL. Proof. Denoting a unit point mass at the point t by &#948; t , we have</p><p>and hence,</p><p>from which the claimed equation follows. The inequality is an easy consequence, using |g i | &#63743; L for each i. w</p><p>As an illustration, we present an example that underlines why the line search complexity estimate in Corollary 1 is the best that we can expect in general. In outline, although the functions h : [p, q] ! R that we consider satisfy h(p) &gt; h(q), their derivatives may often be positive.</p><p>Example 4 (Optimality of the Line Search). Consider any constant M &gt; 0 and a set T &#8674; [0, 1) of cardinality strictly less than 2M. Then, there exists a function h : [0, 1] ! R with concave deviation less than M that satisfies h(0) &#224; 0 and h(1) &#224; 1 and that has strictly positive derivative throughout T.</p><p>To see this, suppose first T &#8674; (0, 1). (The case when T contains zero is an easy modification.) Enumerate the points in increasing order:</p><p>where k &lt; 2M. Define h(0) &#224; 0 and h(1) &#224; 1. Fix any small &#947; &gt; 0, and define h(t i &#947;) &#224; t i &#947; 2 and h(t i + &#947;) &#224; t i + &#947; 2 for i &#224; 1, 2, : : : , k: At intermediate points in [0, 1], define h by linear interpolation. A quick calculation, using Corollary 2, shows that h has concave deviation 1 2</p><p>providing that &#947; is sufficiently small. Now, consider any line search method applicable to functions h : [0, 1] ! R satisfying h(0) &#224; 0 and h(1) &#224; 1, relying on evaluations of the value h and the right derivative h 0 + at points chosen one by one, and terminating once a derivative is negative. Suppose that the method is guaranteed to terminate after at most k queries, providing that the underlying function h has concave deviation strictly less than some given value M &gt; 0. The example proves k 2M.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Multivariate Functions</head><p>To understand the complexity of Algorithm 2 (nonsmooth minimization), we apply our analysis in the previous section to restrictions of multivariate objectives f to line segments. For any convex set C &#8674; X, consider a function f : C ! R. Given any length &#948; &gt; 0, let &#923;(&#948;) denote the supremum over all points x, y 2 C with |x y| &#63743; &#948; of the concave deviation for the function h : [0, &#948;] ! R defined by</p><p>We call the function &#923; : R ++ ! [0, + 1] the nonconvexity modulus for f. The following illustration follows immediately from Proposition 3. More generally, the function f is difference of convex when there exists a convex function q : C ! R such that f + q is also convex.</p><p>Proposition 5. Consider a convex set C &#8674; X and functions f , q : C ! R with both q and f + q convex. If q is M-Lipschitz, then the nonconvexity modulus of f satisfies &#923;(&#948;) &#63743; M for all &#948; &gt; 0.</p><p>Proof. Consider any points x, y 2 C with |x y | &#63743; &#948; and the function h defined by Equation (3). The function s : [0, &#948;] ! R defined by</p><p>is convex and M-Lipschitz, and h + s is convex; therefore, the concave deviation of h is no larger than M. The result follows. w</p><p>As a consequence, we deduce the following result.</p><p>Corollary 3. Consider any convex sets C &#8674; C 0 &#8674; R n , where C is nonempty and compact and C 0 is open, and any differenceof-convex function f : C 0 ! R. Then, the nonconvexity modulus of the restriction f C is uniformly bounded; there exists a finite constant M such that &#923;(&#948;) &#63743; M for all &#948; &gt; 0.</p><p>In particular, because polyhedral functions are globally Lipschitz, we have the following fact.</p><p>Corollary 4. If a function f : X ! R is the difference p q between a convex function p : X ! R and a polyhedral convex function q : X ! R, then the nonconvexity modulus of f is no larger than any Lipschitz constant for q.</p><p>More generally, consider a continuous function f : X ! R that is semilinear in the sense that X is a finite union of polyhedra, on each of which the function f is affine. Any such function has a Lipschitz constant L and furthermore, a uniform upper bound m on the number of possible gradient discontinuities in any function of the form (3). By Corollary 2, we deduce that the nonconvexity modulus &#923;(&#948;) is no larger than d m 2 eL. Returning to our analysis of Algorithm 2, we are ready for our main result.</p><p>Theorem 4 (Complexity of Minimization). Given a tolerance &#9999; &gt; 0 and a radius &#948; &gt; 0, consider a convex set C &#8674; X, a function f : C ! R that is bounded below, an associated L-bounded directional subgradient map G : X 2 ! X, and an initial point x 0 2 C such that f (x) &#63743; f (x 0 ) and |y| &#63743; &#948; ) x + y 2 C: Suppose that f has finite nonconvexity modulus &#923;(&#948;). Then, Algorithm 2 (nonsmooth minimization) requires at most</p><p>calls to Oracle 1 (directional subgradient) to find a point x 2 X and a Goldstein subgradient g 2 @ &#948; f (x) satisfying |g | &#63743; &#9999;.</p><p>Proof. Corollary 1 with &#963; &#224; &#9999; 6 shows that the bisection method requires at most &#8677; noted in the abstract.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In summary, our deterministic algorithm, when applied to difference-of-convex objectives with bounded nonconvexity modulus, returns a point with an &#9999;-Goldstein subgradient of norm no larger than &#9999; after O(&#9999; 5 ) oracle </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix. Distributional Second Derivatives</head><p>We saw previously that the concave deviation of a univariate function h is determined by the negative part of its second distributional derivative D 2 h. In our application, we consider functions h that are restrictions of the underlying objective f to line segments of fixed length &#948;. We would, therefore, expect the nonconvexity modulus of f to be related to its own distributional second derivative. Here, we explore that relationship informally. Consider a locally Lipschitz function f : R n ! R. The distributional derivative of f is an n-vector Df, entries of which are distributions-linear functionals on the space of smooth, compactly supported functions g : R n ! R (test functions)that are continuous with respect to uniform convergence on compact sets. We can define Df through the relationship</p><p>for all vectors u 2 R n and test functions g : R n ! R. However, by a suitable version of Rademacher's theorem (see Evans and Gariepy [8, section 6.2, theorem 1]), the gradient rf exists almost everywhere and is essentially bounded, and it satisfies</p><p>In standard terminology (see Evans and Gariepy [8]), we can identify the classical gradient rf with both the distributional derivative Df and the "weak" derivative of f. The second distributional derivative of f is an n-by-n matrix D 2 f , entries of which are distributions. We can define D 2 f through the relationship</p><p>for all vectors u, v 2 R n and test functions g. If f is smooth, then D 2 f is just the matrix-valued measure with density r 2 f . More generally, we must consider D 2 f as a distribution, but at least for convex functions, we can be more specific; it is a positive semidefinite-valued Radon measure (see <ref type="bibr">Dudley [7]</ref> and Evans and Gariepy [8, section 6.3]).</p><p>Example A.1 (A Piecewise Linear Function). Consider the convex function f : R 2 ! R defined by f (x) &#224; x + 1 . For any vectors u, v 2 R 2 and smooth, compactly supported function g : R 2 ! R, we have</p><p>by the Gauss-Green formula. Thus, D 2 f is the matrix &#181; 0 0 0 , where the measure &#181; is related to Lebesgue measure &#955; via</p><p>for all measurable subsets of S &#8674; R 2 . For a more general understanding, we begin with the univariate case.</p><p>Example A.2 (Univariate Convex Functions). Consider a convex function f : R ! R. For 0 &lt; &#947; &lt; 1, we can construct a smooth approximation h &#947; : R ! [0, 1] of the standard step function, with the following properties:</p><p>and h is convex on [0, &#947; 2 ], linear on [&#947; 2 , &#947; &#947; 2 ], and concave on [&#947; &#947; 2 , &#947;]. For any interval (p, q] &#8674; R, the test function r &#947; :</p><p>Kong and Lewis: The Cost of Nonconvexity in Deterministic Nonsmooth Optimization R ! [0, 1] defined by</p><p>converges pointwise to the characteristic function &#967; (p, q] pointwise as &#947; # 0. By dominated convergence, we deduce</p><p>However, the left-hand side is Z r 0 &#947; f 0 &#224;</p><p>as &#947; # 0. We, thus, reproduce our earlier definition:</p><p>Clearly, this fact also holds for any difference-of-convex function f : R ! R.</p><p>The modulus of nonconvexity for a function f : R n ! R, which we denoted &#923;(&#948;) (for &#948; &gt; 0), is the supremum of the concave deviation of the restriction of f to line segments of the form S &#224; {z + tw : 0 &#63743; t &#63743; &#948;} for some point z 2 R n and unit direction w 2 R n . That concave deviation is the measure of S under the negative part of the measure D 2 (f | S ). We would, therefore, like to compare the distributional second derivative of this restriction with the distributional second derivative D 2 f . As we see in the next result, we should focus specifically on the directional distributional second derivative w T (D 2 f )w.</p><p>A simple approach is furnished by mollification. We fix a mollifier &#966; : R n ! R: a test function satisfying R &#966; &#224; 1 and with the property that, as &#947; # 0, the function &#966; &#947; (x) &#224; &#947; n &#966; 1 &#947; x &#8984; &#10003; converges as a distribution to the Dirac delta function.</p><p>Given a Radon measure &#181; on R n and any test function g : R n ! R, we can define the convolution g ? &#181; : R n ! R by In this result, the effect of the convolution is to focus attention on the line through the point z in the direction w. In informal language, we deduce that the modulus of nonconvexity &#923;(&#948;) is determined by the concentration of the negative parts of the measures w T (D 2 f )w around line segments of length &#948; in unit directions w.</p><p>For more intuition on directional distributional second derivatives of the form w T (D 2 f )w, let us consider a convex function f : R 2 ! R. After a suitable choice of basis, we can suppose that w is the first unit vector e 1 and therefore, consider the Radon measure (Df ) 11 . To understand this measure, consider the integral Z More generally, this argument suggests, loosely, that w T (D 2 f )w measures the variation of the directional derivative f 0 (&#8226;; w) along the direction w.</p><p>Endnote 1 We became aware of these concurrent independent works after completing the initial draft of this manuscript.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Mathematics of Operations Research, Articles in Advance, pp. 1-17, &#169; 2023 INFORMS Downloaded from informs.org by [31.4.245.160] on 21 December 2023, at 05:59 . For personal use only, all rights reserved.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Kong and Lewis: The Cost of Nonconvexity in Deterministic Nonsmooth Optimization</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>Downloaded from informs.org by [31.4.245.160] on 21 December 2023, at 05:59 . For personal use only, all rights reserved.</p></note>
		</body>
		</text>
</TEI>
