<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Parallel Algorithm for Non-Monotone DR-Submodular Maximization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/12/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10204873</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Machine Learning</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Alina Ene</author><author>Huy L Nguyen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this work, we give a new parallel algorithm for the problem of maximizing a non-monotone diminishing returns submodular function subject to a cardinality constraint. For any desired accuracy $\epsilon$, our algorithm achieves a $1/e - \epsilon$ approximation using $O(\log{n} \log(1/\epsilon) / \epsilon^3)$ parallel rounds of function evaluations. The approximation guarantee nearly matches the best approximation guarantee known for the problem in the sequential setting and the number of parallel rounds is nearly-optimal for any constant $\epsilon$. Previous algorithms achieve worse approximation guarantees using $\Omega(\log^2{n})$ parallel rounds. Our experimental evaluation suggests that our algorithm obtains solutions whose objective value nearly matches the value obtained by the state of the art sequential algorithms, and it outperforms previous parallel algorithms in number of parallel rounds, iterations, and solution quality.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In this paper, we study parallel algorithms for the problem of maximizing a non-monotone DR-submodular function subject to a single cardinality constraint. The problem is a generalization of submodular maximization subject to a cardinality constraint. Many recent works have shown that DR-submodular maximization has a wide-range of applications beyond submodular maximization. These applications include maximum a-posteriori (MAP) inference for determinantal point processes (DPP), mean-field inference in log-submodular models, quadratic programming, and revenue maximization in social networks <ref type="bibr">(Kulesza et al., 2012;</ref><ref type="bibr">Gillenwater et al., 2012;</ref><ref type="bibr">Bian et al., 2016;</ref><ref type="bibr">Ito &amp; Fujimaki, 2016;</ref><ref type="bibr">Soma &amp; Yoshida, 2017;</ref><ref type="bibr">Bian et al., 2017;</ref><ref type="bibr">2018)</ref>.</p><p>The problem of maximizing a DR-submodular function subject to a convex constraint is a notable example of a non-convex optimization problem that can be solved with provable approximation guarantees. The continuous Greedy algorithm <ref type="bibr">(Vondr&#225;k, 2008)</ref> developed in the context of the multilinear relaxation framework applies more generally to maximizing DR-submodular functions that are monotone increasing (if x &#8407; &#8804; y &#8407; coordinate-wise then f (x &#8407; ) &#8804; f (y &#8407;)). Feldman et al. <ref type="bibr">(Feldman et al., 2011)</ref> developed a variant of continuous Greedy for non-monotone objectives. Chekuri et al. <ref type="bibr">(Chekuri et al., 2015)</ref> developed algorithms for both monotone and non-monotone DR-submodular maximization subject to packing constraints that are based on the continuous Greedy and multiplicative weights update framework. The work <ref type="bibr">(Bian et al., 2017)</ref> generalized continuous Greedy for submodular functions to the DR-submodular case and developed Frank-Wolfe-style algorithms for maximimizing non-monotone DR-submodular function subject to general convex constraints.</p><p>A significant drawback of these algorithms is that they are inherently sequential and adaptive. In fact the highly adaptive nature of these algorithms goes back to the classical greedy algorithm for submodular functions: the algorithm sequentially selects the next element based on the marginal gain on top of previous elements. In certain settings such as feature selection <ref type="bibr">(Khanna et al., 2017)</ref> evaluating the objective function is a time-consuming procedure and the main bottleneck of the optimization algorithm and therefore, parallelization is a must. Recent lines of work have focused on addressing these shortcomings and understanding the trade-offs between approximation guarantee, parallelization, and adaptivity. Starting with the work of Balkanski and Singer <ref type="bibr">(Balkanski &amp; Singer, 2018)</ref>, there have been very recent efforts to understand the tradeoff between approximation guarantee and adaptivity for submodular maximization <ref type="bibr">(Balkanski &amp; Singer, 2018;</ref><ref type="bibr">Ene &amp; Nguyen, 2019;</ref><ref type="bibr">Balkanski et al., 2019;</ref><ref type="bibr">Fahrbach et al., 2019;</ref><ref type="bibr">Chekuri &amp; Quanrud, 2019;</ref><ref type="bibr">Balkanski et al., 2018)</ref>. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Recently, the work <ref type="bibr">(Fahrbach et al., 2018)</ref> gave an algorithm for maximizing a submodular function subject to a cardinality constraint in O(log n/&#1013;) rounds and 0.031&#1013; approximation. For the general setting of DR-submodular functions with m packing constraints, the work <ref type="bibr">(Ene et al., 2019)</ref> gave an algorithm with O(log(n/&#1013;) log(1/&#1013;) log(m + n)/&#1013; 2 ) rounds and 1/e-&#1013; approximation. In the special case of m = 1 constraint, this algorithm uses O(log 2 (n) log(1/&#1013;)/&#1013; 2 ) rounds.</p><p>In this work, we develop a new algorithm for DRsubmodular maximization subject to a single cardinality constraint using O(log n log(1/&#1013;)/&#1013; 3 ) rounds of adaptivity and obtaining 1/e&#1013; approximation. For small n compared to 1/&#1013;, this is not an improvement over previous work. However, for large n, the number of rounds is almost a quadratic improvement from O(log 2 n) in the previous work to the nearly optimal O(log n) rounds. Our experimental evaluation shows that our algorithm outperforms the state of the art even in the small n regime that is advantageous for the existing algorithms.</p><p>Theorem 1. Let f : [0, 1] n &#8594; R + be a DR-submodular function and k &#8712; R + . For every &#1013; &gt; 0, there is an algorithm for the problem max x &#8407; &#8712;[0,1] n : &#8741;x &#8407; &#8741;1&#8804;k f (x &#8407; ) with the following guarantees:</p><p>&#8226; The algorithm is deterministic if provided oracle access for evaluating f and its gradient &#8711;f ;</p><p>&#8226; The algorithm achieves an approximation guarantee of 1 e&#1013;; &#8226; The number of rounds of adaptivity is</p><p>)&#65026; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head><p>Let f : [0, 1] n &#8594; R + be a non-negative function. The function is diminishing returns submodular</p><p>where 1 &#8407; i is the i-th basis vector, i.e., the vector whose i-th entry is 1 and all other entries are 0.</p><p>Throughout the paper, we assume that f is differentiable. We assume that we are given black-box access to an oracle for evaluating f and its gradient &#8711;f . We extend the function</p><p>An example of a DR-submodular function is the multilinear extension of a submodular function g. The multilinear extension is defined as</p><p>where R(x &#8407; ) is a random subset of V where each i &#8712; V is included independently at random with probability x &#8407; i .</p><p>Basic notation. We use e.g. x &#8407; = (x &#8407; 1 , . . . , x &#8407; n ) to denote a vector in R n . We use the following vector operations:</p><p>&#8407; ) be the n-dimensional all-zeros (resp. all-ones) vector. Let 1 &#8407; S &#8712; {0, 1} V denote the indicator vector of S &#8838; V , i.e., the vector that has a 1 in entry i if and only if i &#8712; S.</p><p>We will use the following result that was shown in previous work <ref type="bibr">(Chekuri et al., 2015)</ref>.</p><p>Lemma 2 ( <ref type="bibr">(Chekuri et al., 2015)</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Algorithm</head><p>In this section, we present an idealized version of our algorithm where we assume that we can compute exactly the step size on line 16. The idealized algorithm is given in Algorithm 1. In the supplement (Section B), we show how to implement that step efficiently and incur only O(&#1013;) additive error in the approximation.</p><p>The algorithm takes as input a target value M and it achieves the desired 1/e -O(&#1013;) approximation if M is an (1 + &#1013;) approximation of the optimal function value f (x &#8407; * ), i.e., we have f</p><p>As noted in previous work <ref type="bibr">(Ene et al., 2019)</ref>, it is straightforward to approximately guess such a value M using a single parallel round.</p><p>The algorithm maintains two solutions: the solution x &#8407; that will be returned, and a second solution z &#8407; &#8805; x &#8407; . We use z &#8407; to handle the non-monotonicity of the objective. Throughout the algorithm, we have z &#8407; &#8805; x &#8407; and thus &#8711;f (z &#8407;) &#8804; &#8711;f (x &#8407; ) by DR-submodularity. Instead of using the gradient at x &#8407; to decide which coordinates to update, we use the gradient at z &#8407;. By doing so, we may under-estimate the potential gain of some of the coordinates, but it allows us to deal with the non-monotonicity of the function. This strategy is reminiscent of the gradient lookahead technique of <ref type="bibr">(Ene et al., 2019)</ref>, although we use it in a very different way.</p><p>The algorithm iteratively increases the two solutions over 1/&#1013; phases (iterations of the outer for loop). During a single phase, the algorithm uses a descending thresholds approach.</p><p>Algorithm 1 Algorithm for max x &#8407; &#8712;[0,1] n : &#8741;x &#8407; &#8741;1&#8804;k f (x &#8407; ), where f is a non-negative DR-submodular function. The algorithm takes as input a target value M such that f</p><p>while v &gt; &#1013;v start and &#8741;z &#8407;&#8741; 1 &lt; &#1013;jk do 10:</p><p>else 15:</p><p>For a given &#951; &#8712; [0, &#1013; 2 ], we define:</p><p>end if</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>25:</head><p>end if</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>26:</head><p>end while 27: end for 28: return x &#8407;</p><p>Starting with an initial threshold v start , the algorithm only updates coordinates whose gradient is above the threshold.</p><p>The algorithm only updates a carefully chosen subset of these high-gain coordinates: the set S on line 11. The definition of S imposes two further restrictions on the fractional value of each coordinate in S. These restrictions allow us to control the &#8467; &#8734; norm of the solution and cope with the non-monotonicity of the objective function (cf. Lemma 2).</p><p>If the set S is empty, the algorithm lowers the threshold by an 1&#1013; factor on line 13. Otherwise, the algorithm simultaneously increases a large fraction of the coordinates in S using an appropriately chosen step size &#951;. The step size is carefully chosen to ensure several competing desiderata.</p><p>On one hand, we want to take large steps for fast convergence/high parallelism. On the other, we want to take small steps to ensure high function value gain, since the gradient approximates the function value only locally and we need to be conservative on how much we increase due to the non-monotonicity of the objective. We choose our step size so that, after the update, the size of S decreases by an 1&#1013; factor or we reach the budget of &#1013;jk allocated to the phase or we reach our cap of &#1013; 2 on the step size. Thus we converge fast by either shrinking S or filling up the budget, while simultaneously achieving high gains in function value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Analysis of the Approximation Guarantee</head><p>In this section, we show that Algorithm 1 achieves a 1 e -O(&#1013;) approximation. Recall that we assume that &#951; 1 is computed exactly on line 16. In Section B of the supplement, we show how to extend the algorithm and the analysis so that the algorithm efficiently computes a suitable approximation to &#951; 1 that suffices for obtaining a 1 e -O(&#1013;) approximation. The omitted proofs can be found in the supplement (Section A).</p><p>In the following, we refer to each iteration of the outer for loop as a phase. We refer to each iteration of the inner while loop as an iteration. Note that the update vectors are non-negative in each iteration of the algorithm, and thus the vectors x &#8407; , z &#8407; remain non-negative throughout the algorithm and they can only increase. Additionally, since S(&#951;) &#8838; T (&#951;) &#8838; S, we have x &#8407; &#8804; z &#8407; throughout the algorithm. We will also use the following observations repeatedly, whose straightforward proofs are deferred to Section A of the supplement. By DR-submodularity, since the relevant vectors can only increase in each coordinate, the relevant gradients can only decrease in each coordinate. This implies that, for every &#951; &#8804; &#951; &#8242; , we have S(&#951;) &#8839; S(&#951; &#8242; ). Additionally, for every i &#8712; T (&#951;), we have</p><p>We will need an upper bound on the &#8467; 1 and &#8467; &#8734; norms of x &#8407; and z &#8407;. Since x &#8407; &#8804; z &#8407;, it suffices to upper bound the norms of z &#8407; (the &#8467; 1 norm bound will be used to show that the final solution is feasible, and the &#8467; &#8734; norm bound will be used to derive the approximation guarantee). The omitted proofs can be found in the supplement.</p><p>Lemma 3. Consider phase j of the algorithm (the j-th iteration of the outer for loop). Throughout the phase, the algorithm maintains the invariant that &#8741;z &#8407;&#8741; &#8734; &#8804; 1 -(1&#1013;) j + &#1013; 2 and &#8741;z &#8407;&#8741; 1 &#8804; &#1013;jk.</p><p>The following theorem is the heart of the analysis and it shows that each phase makes sufficient gain in function value. This is reminiscent of the gain made by the measured continuous greedy algorithm in the sequential setting and it implies the claimed approximation guarantee via induction. As discussed in Section 3, in order to achieve high parallelism, our algorithm proceeds very differently from the sequential continuous greedy algorithm. As a result, we need a very careful analysis to show the claimed gain.</p><p>Theorem 4. Consider phase j of the algorithm (the j-th iteration of the outer for loop). Let x &#8407; start and x &#8407; end be the vector x &#8407; at the beginning and end of the phase. We have</p><p>Proof. Recall that, within a single phase, the algorithm uses a descending thresholds approach and updates a carefully selected subset of the coordinates whose gradient/marginal gain is above the current threshold (line 11). The subset is chosen so that no coordinate increases too much. This allows us to control the &#8467; &#8734; norm of the solution throughout the phase, which in turn allows us to deal with the nonmonotonicity of the function. The main technical challenge is to show that we gain enough throughout the phase despite that:</p><p>(1) we are restricting the set of high-gain coordinates that are updated to ensure that no coordinate increases too much, and (2) we terminate phase j when we fill up &#1013;jk budget to ensure that we get a feasible solution after 1/&#1013; phases. We divide the analysis into two cases, depending on whether the threshold is ever updated on line 13.</p><p>Case 1: we have v end = v start . Note that the phase terminates with &#8741;z &#8407; end &#8741; 1 = &#1013;jk in this case.</p><p>We start with an overview of the analysis. All of the coordinates that were updated throughout the phase had gradient at least v start , and v start is chosen so that the gain is high.</p><p>The step size &#951; is chosen so that we update an 1&#1013; fraction of the coordinates in S. Thus, ignoring 1&#1013; factors, an iteration of the phase increases the function value by v start times the increase in the &#8467; 1 -norm of the solution. Since the phase fills up at least &#1013;k of the budget, the increase in value over all iterations is at least &#1013;kv start . Finally, the choice of v start ensures that we reached the desired gain.</p><p>We now give the formal argument. We fix an iteration of the phase that updates x &#8407; and z &#8407; on lines 20-23, and analyze the gain in function value in the current iteration. We let x &#8407; , z &#8407; denote the vectors right before the update on lines 20-23. Let x &#8407; &#8242; be the vector x &#8407; right after the update on line 20, and let z &#8407; &#8242; be the vector z &#8407; right after the update on line 21.</p><p>We have:</p><p>In (a), we used the fact that x &#8407; &#8242;x &#8407; &#8805; 0 and f is concave in non-negative directions.</p><p>We can show (b) as follows. We have x &#8407; &#8242; &#8804; z &#8407; &#8242; = z &#8407;(&#951;) and thus &#8711;f (x &#8407; &#8242; ) &#8805; &#8711;f (z &#8407;(&#951;)) by DR-submodularity. Additionally, for every coordinate i &#8712; T (&#951;), we have</p><p>In (c), we have used that S(&#951;) &#8838; T (&#951;), g &#8407;(&#951;) i &gt; 0 for all i &#8712; T (&#951;), and g &#8407;(&#951;</p><p>We can show (d) as follows. Since &#951; &#8804; &#951; 1 , we have</p><p>where the second inequality is by the choice of &#951; 1 .</p><p>Let &#951; t and S t denote &#951; and S in iteration t of the phase (note that we are momentarily overloading &#951; 1 and &#951; 2 here, and they temporarily stand for the step size &#951; in iterations 1 and 2, and not for the step sizes on lines 16 and 18). By summing up the above inequality over all iterations, we obtain:</p><p>We can show (a) as follows. Recall that we have &#8741;z</p><p>In (b), we used the definition of v start on line 7. In (c), we used that f (x &#8407; * ) &#8804; (1 + &#1013;)M .</p><p>Case 2: we have v end &#824; = v start . Note that this implies that v end &#8804; (1&#1013;)v start , since line 13 was executed at least once during the phase.</p><p>The analysis for this case is more involved. The difficulty stems from the fact that the algorithm updates only a subset of the high-gain coordinates to ensure that no coordinate increases too much. This means that, at the end of the phase, there could still be coordinates whose gain is above the threshold but we cannot increase them further during the current phase. We now formally define this problematic set of coordinates:</p><p>The coordinates in A are allowed to increase, and thus we can account for their gain similarly to the previous case: their contribution to the overall gain is at least v end times the increase in &#8467; 1 -norm of the coordinates in A. For the coordinates in A, we lower bound their contribution to the gain by the gradient times the increase in those coordinates.</p><p>The following lemma precisely states the lower bound that we obtain. Note that, for the coordinates in A, we use the gradient value and not v end , to get a tighter bound.</p><p>Lemma 5. We have</p><p>Proof. Fix an iteration of the phase that updates x &#8407; and z &#8407; on lines 20-23. Let x &#8407; , z &#8407; denote the vectors right before the update on lines 20-23. Let x &#8407; &#8242; be the vector x &#8407; right after the update on line 20, and let z &#8407; &#8242; be the vector z &#8407; right after the update on line 21.</p><p>We have:</p><p>In (a), we used the fact that x &#8407; &#8242;x &#8407; &#8805; 0 and f is concave in non-negative directions.</p><p>We can show (b) as follows. We have x &#8407; &#8242; &#8804; z &#8407; &#8242; = z &#8407;(&#951;) and thus &#8711;f (x &#8407; &#8242; ) &#8805; &#8711;f (z &#8407;(&#951;)) by DR-submodularity. Additionally, for every coordinate i &#8712; T (&#951;), we have &#8711; i f (z &#8407;(&#951;)) &gt; 0. Therefore, for every i &#8712; T (&#951;), we have</p><p>We can show (c) as follows. Since &#951; &#8804; &#951; 1 , we have</p><p>where the second inequality is by the choice of &#951; 1 . By the definition of S(&#951;), we have g &#8407;(&#951;) i &#8805; v &#8805; v end for every i &#8712; S(&#951;). By the definition of T (&#951;), we have g &#8407;(&#951;) i &gt; 0 for every i &#8712; T (&#951;).</p><p>Equality (d) follows from the fact that S &#8745; A = T (&#951;) &#8745; A, which we can show as follows. We have T (&#951;) &#8838; S, and S \ T (&#951;) is the set of all coordinates with negative gradient g &#8407;(&#951;). For every i &#8712; A, we have</p><p>where the first inequality is by DR-submodularity (since z &#8407;(&#951;) &#8804; z &#8407; end ) and the second inequality is by the definition of A and the fact that (z &#8407; end ) i &lt; 1 for all i &#8712; [n] (Lemma 3). Moreover, we have z &#8407;(&#951;) i &lt; 1 for all i &#8712; [n] (Lemma 3). Thus g &#8407;(&#951;) i &gt; 0 for all i &#8712; A, and hence S &#8745; A = T (&#951;) &#8745; A.</p><p>In (e), we have used that S(&#951;) &#8838; T (&#951;) and g &#8407;(&#951;) is nonnegative on the coordinates of T (&#951;). In (f), we have used that g &#8407;(&#951;) i &#8805; v &#8805; v end for all i &#8712; S(&#951;) . In (g), we have used that z &#8407;(&#951;) &#8804; z &#8407; end and thus &#8711;f (z &#8407;(&#951;)) &#8805; &#8711;f (z &#8407; end ) by DR-submodularity. Let &#951; t , S t , T t (&#951;), z &#8407; t (&#951;), g &#8407; t (&#951;) denote &#951;, S, T (&#951;), z &#8407;(&#951;), g &#8407;(&#951;) in iteration t of the phase (note that we are overloading &#951; 1 and &#951; 2 in this proof only, and they temporarily stand for the step size &#951; in iterations 1 and 2, and not for the step sizes on lines 16 and 18). By summing up the above inequality over all iterations, we obtain:</p><p>With the lower bound on the gain given by Lemma 5 in hand, the remaining work is to relate it to the optimal solution. Before doing so, we show that each coordinate in A increases sufficiently. Recall that the coordinates in A were prevented from increasing further by the two conditions on z &#8407; i in the definition of S: z</p><p>In the latter case, the lemma follows immediately. In the former case, we use that each iteration increases the coordinates by at most &#1013; 2 . Lemma 6. For every i &#8712; A, we have</p><p>Proof of Lemma 6. Since S was empty at the previous threshold v end /(1&#1013;), we have</p><p>Therefore we may assume it is the former. By Lemma 3,</p><p>where in the second inequality we used that (1&#1013;) j-1 &#8805; 1/3 for sufficiently small &#1013; (since</p><p>Recall that the goal of the remainder of the analysis is to relate the lower bound on the gain given in Lemma 5 to the optimal solution x &#8407; * . To this end, we use the inner product between the gradient at the end of the phase and the optimal solution as our potential. To complete the analysis, we sandwich this potential between the gain guaranteed by Lemma 5 and our target gain. We show that the potential is at least our target gain in Lemma 7. In Lemmas 8 and 9, we relate the gain from Lemma 5 to the potential.</p><p>As noted above, the following lemma shows that the potential is lower bounded by our target gain. The argument is similar to the analysis of the measured continuous greedy. In this part of the analysis, we use the fact that we carefully controlled the &#8467; &#8734; norm of the solution, which is crucial when working with non-monotone objectives.</p><p>Proof. We have</p><p>In (a), we used that (1a)b &#8805; max{a, b}a for all a, b &#8712; [0, 1].</p><p>In (b), we used the fact that f is concave in non-negative directions.</p><p>In (c), we used the fact that the algorithm maintains the invariant that f (x &#8407; ) &#8805; f (z &#8407;) via the update on line 23.</p><p>In (d), we used Lemma 2.</p><p>In (e), we used Lemma 3.</p><p>The next two lemmas relate the gain guaranteed by Lemma 5 to our potential and use Lemma 7 to complete the analysis.</p><p>Recall that the phase terminates with either v end &#8804; &#1013;v start or &#8741;z &#8407; end &#8741; 1 = &#1013;jk. We consider each of these cases in turn.</p><p>Lemma 8. Suppose that v end &#8804; &#1013;v start . We have:</p><p>Proof. By Lemma 5, we have:</p><p>In (a), we used Lemma 6 and the fact that 0</p><p>In (b), we used that &#8741;x &#8407; * &#8741; 1 &#8804; k and the definition of A.</p><p>In (c), we used Lemma 7. Inequality (d) follows from the definition of v start and the fact that f (x &#8407; * ) &#8805; M .</p><p>The analysis of the second case uses a similar but more involved argument.</p><p>Lemma 9. Suppose that &#8741;z &#8407; end &#8741; 1 = &#1013;jk. We have:</p><p>Proof. By Lemma 5, we have:</p><p>In (a), we used Lemma 6 and the fact that 0 &#8407; &#8804; x &#8407; * &#8804; 1 &#8407; and the fact that &#8711; i f (z &#8407; end ) &#8805; v end for all i &#8712; A.</p><p>We can show (b) as follows. Recall that we are in the case &#8741;z</p><p>In (c), we used that, for all i / &#8712; A, we have</p><p>In (d), we used Lemma 7.</p><p>Thus in both cases we achieve the desired gain, and the proof is complete.</p><p>Using induction and Theorem 4, we can show that the final solution returned by the algorithm is a 1/e -O(&#1013;) approximation. By construction, the final solution satisfies &#8741;x &#8407; &#8741; 1 &#8804; k, and thus it also satisfies the constraint.</p><p>Theorem 10. Let x &#8407; be the final solution returned by Algorithm 2. We have &#8741;x &#8407;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Analysis of the Number of Iterations</head><p>Recall that we refer to each iteration of the outer for loop as a phase. We refer to each iteration of the inner while loop as an iteration.</p><p>Theorem 11. The total number of iterations of the algorithm is O(log(n) log(1/&#1013;)/&#1013; 3 ).</p><p>Proof. There are O(1/&#1013;) phases. In each phase, there are O(log(1/&#1013;)/&#1013;) different thresholds v: the initial threshold is v start , the threshold right before the final one is at least &#1013;v start , and each update on line 13 decreases the threshold by a (1-&#1013;) factor. Thus it only remains to bound the number of iterations with the same threshold.</p><p>In the following, we fix a single threshold and we consider only the iterations of the phase at that threshold. Over these iterations, z &#8407; is non-decreasing in every coordinate, g &#8407; is nonincreasing in every coordinate by DR-submodularity and 1 &#8407;z &#8407; &#8805; 0 &#8407; , and the set S can only lose coordinates and thus |S| is non-increasing. Additionally, for each coordinate i &#8712; [n], the increase (z &#8407; end ) i -(z &#8407; start ) i over the entire phase is at most &#1013; + &#1013; 2 : the increase in each iteration is &#951; if i &#8712; S and 0 otherwise; since &#951; &#8804; &#1013; 2 and z</p><p>We say that an iteration is a large-step iteration if &#951; = &#1013; 2 and it is a smaller-step iteration if &#951; &lt; &#1013; 2 .</p><p>We first consider the large-step iterations. Let t be the last large-step iteration, and let S t be the set S in that iteration. Let i &#8712; S t . Note that i &#8712; S t &#8242; for all iterations t &#8242; &#8804; t, since S t &#8838; S t &#8242; . Thus every large-step iteration increases z</p><p>Since z &#8407; i increases by at most &#1013; + &#1013; 2 over the entire phase, it follows that the number of large-step iterations is O(1/&#1013;).</p><p>Next, we consider the smaller-step iterations. Note that, in every smaller-step iteration except possibly the last one, we have |S(&#951;)| &#8804; (1&#1013;)|S| (if &#951; = &#951; 2 &lt; &#1013; 2 , at the end of the iteration we have &#8741;z &#8407;&#8741; 1 = &#1013;jk and thus the phase ends; if &#951; = &#951; 1 , our choice of &#951; 1 ensures that |S(&#951; 1 )| &#8804; (1&#1013;)|S|). Thus every smaller-step iteration decreases |S| by at least an (1&#1013;) factor. Now note that |S| &#8804; n in the first iteration, |S| &#8805; 1 in the last iteration, and |S| can only decrease with each iteration. Thus the number of smaller-step iterations is O(log n/&#1013;).</p><p>In summary, there are O(1/&#1013;) phases, O(log(1/&#1013;)/&#1013;) different thresholds per phase, and O(log(n)/&#1013;) iterations per threshold. Thus the total number of iterations of the algorithm is O(log(n) log(1/&#1013;)/&#1013; 3 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experimental Results</head><p>We experimentally evaluate our parallel algorithm on instances of non-concave quadratic programming (NQP) and softmax extension of determinantal point processes (DPP) that were randomly generated similarly to previous work <ref type="bibr">(Bian et al., 2017)</ref>.</p><p>NQP instances are functions of the form f (</p><p>, where H &#8712; R n&#215;n is a matrix with non-positive entries, h &#8407; &#8712; R n . We randomly generated such instances as follows: we sampled each entry of H uniformly at random from [-10, 0], and we set h &#8407; = -0.2H &#8868; 1 &#8407; .</p><p>DPP instances are functions of the form f (x &#8407; ) = log det(diag(x &#8407; )(L -I) + I), where L &#8712; R n&#215;n is a psd matrix and I is the identity matrix. We randomly generated such instances as follows. We sampled the eigenvalues of L as follows: the i-th eigenvalue is &#8467; i = e ri , where r i was sampled uniformly from [-0.5, 1]. We sampled a random orthogonal matrix V . We set L = V diag(&#8467; 1 , . . . , &#8467; n )V &#8868; .</p><p>Algorithms, implementation details, and parameter choices.</p><p>We empirically compared our parallel algorithm with the sequential continuous greedy algorithm <ref type="bibr">(Chekuri et al., 2015;</ref><ref type="bibr">Bian et al., 2017)</ref> and the parallel multiplicative weights update algorithm <ref type="bibr">(Ene et al., 2019)</ref> (see Section C in the supplement for pseudocode descriptions). We implemented the sequential continuous greedy algorithm using a step size of &#1013;/n, leading to O(n/&#1013;) iterations and adaptive evaluations. We implemented our algorithm with a more aggressive update of the thresholds on line 13: instead of the update v &#8592; (1&#1013;)v, we performed the update v &#8592; 0.75 &#8226; v.</p><p>We used error &#1013; = 0.05 and budget k = 10 in all of the experiments.</p><p>Computing infrastructure. We implemented the algorithms in C++ and ran the experiments on an iMac with a 3.3 GHz Intel Core i5 processor and 8 GB of memory. The code used for generating the instances and evaluating the algorithms can be found in the supplement.</p><p>Results. The experimental results are shown in Figure <ref type="figure">1</ref>. Each value is the average value for 5 independently sampled instances and the error bar is &#177;1 standard deviation. The sequential continuous greedy algorithm achieved the highest solution value in all of the runs, and we report the value obtained by the parallel algorithm as the fraction of the continuous greedy solution value.</p><p>In all of the runs, our parallel algorithm achieves higher function value than the parallel multiplicative weights update algorithm, while the number of evaluations is significantly lower. The running time of the MWU algorithm is prohibitive on larger instances, and thus we were only able to compare the algorithms in the small n regime, which is advantageous for the MWU algorithm. </p></div>		</body>
		</text>
</TEI>
