<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Convergence of the EDIIS Algorithm for Nonlinear Equations</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10088784</idno>
					<idno type="doi">10.1137/18M1171084</idno>
					<title level='j'>SIAM Journal on Scientific Computing</title>
<idno type="ISSN">1064-8275</idno>
<biblScope unit="volume">41</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Xiaojun Chen</author><author>C. T. Kelley</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The Energy Direct Inversion on the Iterative Subspace (EDIIS) algorithm was designed to globalize Anderson acceleration, a method for improving the performance of fixed point iteration. The motivating application is electronic structure computations. In this paper we prove a convergence result for that algorithm and illustrate the theory with a computational example.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction.</head><p>The purpose of this paper is to analyze the convergence of the Energy Direct Inversion on the Iterative Subspace (EDIIS) algorithm <ref type="bibr">[21]</ref>. EDIIS is a modification of Anderson acceleration <ref type="bibr">[1]</ref>, also known as the Direct Inversion on the Iterative Subspace (DIIS) method <ref type="bibr">[34,</ref><ref type="bibr">37,</ref><ref type="bibr">22,</ref><ref type="bibr">21]</ref>, that relaxes the need for a sufficiently accurate initial iterate. EDIIS is the default solver for the self-consistent field (SCF) iteration in the widely used Gaussian <ref type="bibr">[12]</ref> quantum chemistry code. We prove convergence from any starting point in a convex set in which the fixed point map is a contraction and then analyze local convergence. Our local convergence result improves the one in <ref type="bibr">[41]</ref> and applies to both EDIIS and Anderson acceleration.</p><p>We begin this introductory section with a review of Anderson acceleration and some of the recent convergence results. We then describe the EDIIS algorithm. In section 2 we prove our convergence results. Finally, in section 3 we report on a computation which both illustrates the theory and, as is also done in <ref type="bibr">[21]</ref>, shows how the convergence speeds of EDIIS and Anderson acceleration, while identical in theory, can differ significantly in practice.</p><p>Our notational convention is that vectors and vector-valued functions in R^N are in bold. Scalars and elements of infinite dimensional spaces (e.g., integral operators and the functions acted upon by those operators) are in the usual italic math font.</p><p>Anderson acceleration <ref type="bibr">[1]</ref> is an iterative method for fixed point problems of the form</p><p>(1.1) u = G(u),</p><p>where u ∈ R^N and G : R^N → R^N.
The method was designed to accelerate Picard or fixed point iteration, i.e.,</p><p>(1.2) u_{k+1} = G(u_k).</p><p>Anderson acceleration was originally designed for integral equations and has been widely used in electronic structure computations (see <ref type="bibr">[9]</ref> and many references since then); it is now very common in that field. Anderson acceleration is essentially the same as Pulay mixing <ref type="bibr">[33,</ref><ref type="bibr">32]</ref>, DIIS <ref type="bibr">[34,</ref><ref type="bibr">37,</ref><ref type="bibr">22,</ref><ref type="bibr">21]</ref>, and nonlinear GMRES <ref type="bibr">[25,</ref><ref type="bibr">30,</ref><ref type="bibr">45,</ref><ref type="bibr">4]</ref>. Other applications include nuclear reactor design <ref type="bibr">[42,</ref><ref type="bibr">16]</ref>, stiff dynamics <ref type="bibr">[13]</ref>, hydrology <ref type="bibr">[24]</ref>, and fluid-structure interaction <ref type="bibr">[10,</ref><ref type="bibr">15,</ref><ref type="bibr">23]</ref>, where the method is called interface quasi-Newton. The analysis of Anderson acceleration is far from complete. In this paper we assume, as do all theoretical results about this algorithm, that the map G is a contraction. In practice, however, Anderson acceleration does very well for problems in which G is either definitely not a contraction <ref type="bibr">[41]</ref> or not provably a contraction. The results here do not explain those cases.</p><p>Anderson acceleration was designed for problems where Newton's method is not practical because obtaining approximate Jacobians or Jacobian-vector products is too costly. One should expect that Newton's method would perform better when derivative information can be had at a reasonable cost, and we have certainly found that to be the case in our own recent work <ref type="bibr">[16]</ref>. Anderson iteration maintains a history of residuals</p><p>F(u_j) = G(u_j) − u_j, k − m_k ≤ j ≤ k,</p><p>of size at most m + 1, where the depth m is an algorithmic parameter.
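The bookkeeping just described can be sketched in a few lines. The following Python sketch is our own illustration, not the paper's Algorithm 1.1 or any published implementation; it uses the 2-norm formulation with the sum-to-one constraint eliminated, and the function name anderson and the test map are hypothetical.

```python
import numpy as np

def anderson(G, u0, m=3, maxit=50, tol=1e-10):
    """Anderson(m) sketch: keep the last m+1 iterates and residuals
    F(u_j) = G(u_j) - u_j, choose coefficients alpha summing to 1 that
    minimize the 2-norm of the combined residual, and set
    u_{k+1} = sum_j alpha_j G(u_j)."""
    u = np.asarray(u0, dtype=float)
    us, fs = [u], [G(u) - u]
    for _ in range(maxit):
        if not np.linalg.norm(fs[-1]) > tol:
            break                            # converged
        F = np.column_stack(fs)              # N x (m_k + 1) residual history
        if F.shape[1] == 1:
            alpha = np.array([1.0])          # Anderson(0) step = Picard step
        else:
            # eliminate the constraint: alpha_{m_k} = 1 - sum of the others
            dF = F[:, :-1] - F[:, -1:]
            beta = np.linalg.lstsq(dF, -F[:, -1], rcond=None)[0]
            alpha = np.append(beta, 1.0 - beta.sum())
        # u_{k+1} = sum_j alpha_j G(u_j), using G(u_j) = u_j + F(u_j)
        u = sum(a * (v + f) for a, v, f in zip(alpha, us, fs))
        us.append(u)
        fs.append(G(u) - u)
        us, fs = us[-(m + 1):], fs[-(m + 1):]
    return u
```

On a simple scalar contraction such as G(u) = cos(u), this sketch converges to the fixed point near 0.739 in far fewer steps than Picard iteration needs.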
When m is important, we will call the iteration Anderson(m). Anderson(0) is Picard iteration by definition.</p><p>The formal description in Algorithm 1.1 is most convenient for analysis and exposition, but not for implementation. We refer the reader to <ref type="bibr">[44,</ref><ref type="bibr">41,</ref><ref type="bibr">7,</ref><ref type="bibr">38,</ref><ref type="bibr">43,</ref><ref type="bibr">39]</ref> for examples of efficient implementations.</p><p>The iteration uses the most recent m_k + 1 residuals F(u_j) for k − m_k ≤ j ≤ k, where m_k ≤ min(k, m). The key step in the iteration is solving the optimization problem</p><p>(1.3) minimize ‖∑_{j=0}^{m_k} α_j F(u_{k−m_k+j})‖ subject to ∑_{j=0}^{m_k} α_j = 1</p><p>for the coefficients {α_j^k}. Any vector norm can be used in the optimization problem with no change in the convergence theory <ref type="bibr">[41]</ref>. In particular, the optimization problem for the coefficients in either the ℓ^1 or the ℓ^∞ norm can be formulated as a linear programming problem <ref type="bibr">[8]</ref>. The optimization problem is easier to solve if one uses the ℓ^2 norm, and that is standard practice. In this case the optimization problem for the coefficients can be expressed as a linear least squares problem and solved very inexpensively. One way to do this is to eliminate the constraint by setting α_{m_k} = 1 − ∑_{j=0}^{m_k−1} α_j and solve the linear least squares problem</p><p>(1.4) minimize ‖F(u_k) + ∑_{j=0}^{m_k−1} α_j (F(u_{k−m_k+j}) − F(u_k))‖_2.</p><p>The choice of m_k is, in the original form, simply min(m, k). One can adapt m_k as the iteration progresses to, for example, enforce well-conditioning of the linear least squares problem (1.4) <ref type="bibr">[44,</ref><ref type="bibr">39]</ref>.</p><p>One can also <ref type="bibr">[11,</ref><ref type="bibr">35,</ref><ref type="bibr">34,</ref><ref type="bibr">44,</ref><ref type="bibr">31]</ref> show that Anderson acceleration is related to multisecant quasi-Newton methods or, in the case of linear problems, GMRES.
None of these results lead to a convergence proof, even in the linear case, unless the available storage is large enough to allow GMRES to take a number of iterations equal to the dimension of the problem. The recent work of one of the authors and his students <ref type="bibr">[41,</ref><ref type="bibr">40,</ref><ref type="bibr">39]</ref> contains the first convergence theory for Anderson acceleration as it is applied in practice.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Convergence theory.</head><p>Theorem 1.1 is one of the convergence results from <ref type="bibr">[41]</ref>. That paper also has results for several special cases. We assume that G is a contraction with contractivity constant c ∈ (0, 1) in a closed set D, i.e.,</p><p>‖G(u) − G(v)‖ ≤ c‖u − v‖</p><p>for all u, v ∈ D. The contraction mapping theorem implies that G has a unique fixed point u^* ∈ D. As is standard, we let e = u − u^* and make the assumption from <ref type="bibr">[41]</ref> on the smoothness of G and the Anderson iteration coefficients. The convergence of the Picard iteration for a contraction map is q-linear <ref type="bibr">[19]</ref> with q-factor c, i.e.,</p><p>‖e_k‖ ≤ c‖e_{k−1}‖.</p><p>We will show in this paper that Anderson acceleration is r-linear with r-factor c, which means</p><p>‖e_k‖ ≤ r c^k</p><p>for some r &gt; 0.</p><p>Assumption 1.1. G is Lipschitz continuously differentiable in D, and there is M_α such that for all k ≥ 0</p><p>∑_{j=0}^{m_k} |α_j^k| ≤ M_α.</p><p>The differentiability assumption is needed in the analysis, but not in the formulation or implementation of the algorithm. Our convergence result in section 2.2 relaxes the assumption to continuous differentiability. Theorem 1.1 (<ref type="bibr">[41]</ref>). Let Assumption 1.1 hold, and let c &lt; r &lt; 1. Then if u_0 is sufficiently close to u^*, the Anderson iteration converges to u^*. In fact, for all k ≥ 0,</p><p>‖F(u_k)‖ ≤ r^k ‖F(u_0)‖.</p><p>The interpretation of this result is that if the initial data are sufficiently good, then the r-factor for Anderson iteration is no worse than the q-factor of Picard iteration as predicted by the contractivity constant c.
While r-linear convergence is weaker than q-linear convergence, Anderson acceleration is often faster than Picard iteration in practice. The requirement that the initial iterate be near the solution is also meaningful in practice <ref type="bibr">[36,</ref><ref type="bibr">46,</ref><ref type="bibr">47]</ref> and motivated the EDIIS algorithm <ref type="bibr">[21]</ref>, which is the subject of this paper.</p><p>Both Picard iteration and Anderson acceleration can perform better than the prediction (see section 3). In practice, Anderson acceleration is often significantly better than Picard iteration, but there is no theory that explains this under practical (i.e., very limited storage) operating conditions.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">The EDIIS algorithm.</head><p>Anderson acceleration performs poorly for some applications. One example is electronic structure computations for metallic systems, where the HOMO-LUMO gap is small and a good initial iterate is difficult to obtain. In this case both Picard iteration and Anderson acceleration perform poorly <ref type="bibr">[21]</ref>. In such cases one can sometimes use a small mixing parameter to ensure convergence, especially when the initial iterate is poor. However, a small mixing parameter may degrade the performance of the iteration, especially near the solution. The role of the damping parameter β in Picard iteration is simple damping:</p><p>u_{k+1} = (1 − β)u_k + βG(u_k).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>If one applies EDIIS or Anderson acceleration to the damped map</p><p>G_β(u) = (1 − β)u + βG(u),</p><p>then <ref type="bibr">[40]</ref> one obtains</p><p>u_{k+1} = ∑_{j=0}^{m_k} α_j^k G_β(u_{k−m_k+j}),</p><p>which is how damping is done in Anderson acceleration <ref type="bibr">[1]</ref>.</p><p>One attempt to solve these problems for small systems is the EDIIS algorithm from <ref type="bibr">[21]</ref>. In <ref type="bibr">[21]</ref> the authors also formulated the fixed point problem to directly minimize energy (hence the name of the method), but that does not affect the convergence analysis in this paper.</p><p>EDIIS differs from Anderson acceleration by imposing a nonnegativity constraint on the coefficients. So the optimization problem becomes</p><p>(1.7) minimize ‖∑_{j=0}^{m_k} α_j F(u_{k−m_k+j})‖ subject to ∑_{j=0}^{m_k} α_j = 1 and α_j ≥ 0 for all j.</p><p>In <ref type="bibr">[21]</ref> the authors present an example where EDIIS does well and both Picard and Anderson acceleration fail and another example where Anderson acceleration is successful and EDIIS, while converging, does not perform as well. We present another such example in section 3. One reason EDIIS might perform worse than Anderson acceleration could be that the optimization problem (1.7) for EDIIS has a more restricted feasible set and therefore a larger optimal value.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Convergence results.</head><p>Our global convergence result is Theorem 2.1. The proof does not require differentiability, but the convergence speed estimate is very pessimistic, with an r-factor of c^{1/(m+1)}. We follow the global theorem with a local theorem that shows how the convergence behavior becomes locally r-linear with r-factor c, improving on the local results in <ref type="bibr">[41]</ref>. In fact,</p><p>Proof. The proof does not use the optimality properties of the coefficients and only requires that the iteration {u_k} have the form</p><p>u_{k+1} = ∑_{j=0}^{m_k} α_j^k G(u_{k−m_k+j}),</p><p>where m_k ≤ m, α_j^k ≥ 0, and ∑_{j=0}^{m_k} α_j^k = 1. We induct on k.
Clearly (2.1) holds both for k = 0, by definition, and for k = 1, where m_1 = 0, because the iteration in that case is a single Picard iteration (i.e., one step of Anderson(0)). Assume that the result holds for k ≤ K. Then (2.2) and</p><p>Theorem 2.1 implies that for any δ &gt; 0 there is K such that all iterations {u_k}_{k≥K} are in the set</p><p>B(δ) = {u : ‖u − u^*‖ ≤ δ}.</p><p>Hence, starting an Anderson acceleration iteration after sufficiently many EDIIS iterations will result in local convergence at the rate predicted by Theorem 1.1, which is better than (2.1) since r can be arbitrarily near c and does not depend on m. However, it is not clear how to decide when to restart. The main result in section 2.2 improves the local convergence theory from <ref type="bibr">[41]</ref> by both weakening the assumptions and improving the r-factor.</p><p>We will assume that an iteration begins with a history that lies in B(δ) for δ sufficiently small. This history could come either from the EDIIS iteration or from the Anderson acceleration iteration itself. Hence the assumption covers not only EDIIS but also allows us to improve the convergence theory from <ref type="bibr">[41]</ref>. We will show that the residuals converge r-linearly to zero with an r-factor of c. Formally our assumption is as follows.</p><p>Assumption 2.1. G is a continuously differentiable contraction on D ⊂ R^N with contractivity constant c, and u^* is the unique fixed point of G in D.</p><p>The iteration begins with</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>and</p><p>(2.4)</p><p>Finally, there is ĉ ∈ (c, 1) so that</p><p>‖F(u_l)‖ ≤ ĉ^l ‖F(u_0)‖ for 0 ≤ l ≤ m.</p><p>Theorem 2.1 implies that Assumption 2.1 will hold after sufficiently many EDIIS iterations. In the theorem there is no history if m = 0, and in that case the iteration is a Picard iteration. While we are motivated by a local iteration from the EDIIS algorithm, the local theory does not require that the coefficients be nonnegative.</p><p>Assumption 2.1 weakens the ones in <ref type="bibr">[41]</ref> in two ways. The first is that we no longer assume that G is Lipschitz continuously differentiable. The second is that we do not assume that the coefficients {α_j^k} come from any particular optimization problem, only that the linear combination of residuals has norm no larger than that of the most recent residual.</p><p>The idea of the analysis is that as the iteration converges, the upper bound for the r-factor will approach c, and therefore the r-factor is no larger than c. In the case where there is no history, this fact was implicit in the results from <ref type="bibr">[41]</ref>. Adding the history makes the bookkeeping more difficult, and the proof of Theorem 2.2 must account for that.</p><p>Theorem 2.2. Let Assumption 2.1 hold. Assume that there is M_α such that</p><p>∑_{j=0}^{m_k} |α_j^k| ≤ M_α</p><p>for all k ≥ 0. Then if δ is sufficiently small, the iteration given by (2.3) and (2.4) converges to u^* and</p><p>Proof. Let 0 &lt; ε &lt; ĉ − c. We will show that for ‖e_0‖ sufficiently small,</p><p>This will complete the proof since ε is arbitrary and we can restart the proof once we have m vectors in the history which are near enough to u^* to reduce ε further. We induct on k. Define L = (c/ĉ)^m. We will show that (2.9)</p><p>holds for all k. Our assumption on the history, ‖F(u_l)‖ ≤ ĉ^l ‖F(u_0)‖, implies that (2.9) holds for 0 ≤ k ≤ m. Now suppose that (2.9) holds for all 0 ≤ l ≤ k with k ≥ m.
We will establish the bound for k + 1. The analysis has three steps. We first set δ small enough for the iteration to remain in D. We then derive an estimate for ‖F(u_{k+1})‖ and finally use that estimate to continue the induction.</p><p>Step 1, initialization of δ. Since G′ is continuous in D, there is a nondecreasing function ρ ∈ C[0, ∞) with ρ(0) = 0 so that (2.10)</p><p>for all u ∈ D. This implies that for all u ∈ D,</p><p>(2.11) G(u) = u^* + G′(u^*)e + Δ(e),</p><p>where ‖Δ(e)‖ ≤ ρ(‖e‖)‖e‖.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Contractivity of G implies that</p><p>Reduce δ if necessary so that (2.12)</p><p>This implies that (2.13)</p><p>for sufficiently small δ because (2.14)</p><p>Step 2, estimation of ‖F(u_{k+1})‖. We may write, for k ≥ m − 1,</p><p>We will estimate the two parts of the sum,</p><p>and</p><p>separately.</p><p>Using only contractivity of G and (2.4), we have (2.15)</p><p>We now estimate B_k. Using (2.12), we have for all u ∈ D with</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The final stage in the proof is to show that, reducing δ if needed, (2.17)</p><p>Recall that</p><p>We use (2.11) to obtain</p><p>Hence</p><p>We will estimate the terms separately. First, (2.18)</p><p>Finally, using (2.14) and (2.16), (2.19)</p><p>Adding the two estimates (2.18) and (2.19) leads to (2.17).</p><p>Step 3, continuation of the induction. Combining (2.15), (2.17), (2.9), and the induction hypotheses, we have (2.20)</p><p>This implies (2.8), which in turn implies (2.7) because ε is arbitrary.</p><p>Theorem 2.2 and nonsingularity of F′(u^*) also imply r-linear convergence of the errors with r-factor c. This extends and sharpens (1.6). Proof. We will use Lemma 5.2.1 from <ref type="bibr">[19]</ref>, which states that if u is sufficiently near u^* and F′(u^*) is nonsingular, then</p><p>which is (2.21).</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Numerical example.</head><p>We will use an example from <ref type="bibr">[41]</ref> to show how the actual performance of EDIIS and Anderson acceleration can differ, even though the theoretical limiting convergence estimates are identical. Another point of this section is that the solver for the optimization problem can significantly affect the results.</p><p>The results in <ref type="bibr">[21]</ref> also illustrate this point. Our example is simple enough to directly compare the iteration histories for Picard iteration, EDIIS, and Anderson acceleration with the worst-case prediction given by the contractivity constant. We find that when Anderson acceleration performs well, as it does in this example, EDIIS offers no advantage. Moreover, the additional constraint on the optimization problem for the coefficients leads to slower convergence, exactly matching Picard iteration in this case.</p><p>The optimization problem for EDIIS requires more care than the linear least squares problem one must solve for Anderson acceleration.
The reason for this is that one cannot simply use a QR factorization, as one does to solve (1.4); instead one must apply a more sophisticated solver. The approach of <ref type="bibr">[21]</ref> is a direct examination of the boundary of the feasible simplex, which is not practical for a depth much greater than m = 3. Since m is small in practice, expressing the optimization problem as a bound-constrained quadratic program is an efficient alternative. References <ref type="bibr">[27,</ref><ref type="bibr">26]</ref> survey the literature on this topic. For example, a bound-constrained quadratic programming code such as MINQ <ref type="bibr">[29]</ref> is a reasonable choice. However, this approach squares the condition number and can (and did in our testing) result in a singular or nearly singular KKT system and failure of the optimization code's internal linear solvers. The method of <ref type="bibr">[6]</ref>, while still squaring the condition number, is more robust and terminated without error for this example. The classic method from <ref type="bibr">[14]</ref> uses an active set method and the QR factorization to avoid using the normal equations. The approach in <ref type="bibr">[14]</ref> performed better in the example here, where the least squares coefficient matrix for the optimization problem is ill-conditioned <ref type="bibr">[41]</ref>.</p><p>The example is the midpoint rule discretization of the Chandrasekhar H-equation <ref type="bibr">[5,</ref><ref type="bibr">3]</ref></p><p>(3.1) H(μ) = G(H)(μ) ≡ (1 − (ω/2) ∫_0^1 (μ/(μ + ν)) H(ν) dν)^{−1}.</p><p>We seek a solution H^* ∈ C[0, 1]. When the parameter ω is important we will write H^* as a function H^*(μ, ω) of both μ and ω.</p><p>The integral equation and its midpoint discretization share the property that the fixed point map G in (3.1) is a contraction for 0 ≤ ω &lt; 1, but not for ω = 1.
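To make the test problem concrete, the midpoint rule discretization just described can be sketched as follows. This is our own code, not the authors' MATLAB setup; it writes the H-equation in the standard fixed point form above, and the function name is hypothetical.

```python
import numpy as np

def h_equation_fixed_point_map(omega, N=100):
    """Return the fixed point map G for the N-point midpoint rule
    discretization of the Chandrasekhar H-equation:
        G(h)_i = 1 / (1 - (omega/(2N)) * sum_j mu_i h_j / (mu_i + mu_j)),
    with midpoint quadrature nodes mu_i = (i - 1/2)/N."""
    mu = (np.arange(1, N + 1) - 0.5) / N
    A = (omega / (2 * N)) * mu[:, None] / (mu[:, None] + mu[None, :])
    return lambda h: 1.0 / (1.0 - A @ h)

# Picard iteration h = G(h) for the omega = 0.5 problem
G = h_equation_fixed_point_map(0.5)
h = np.ones(100)
for _ in range(60):
    h = G(h)
residual = np.linalg.norm(G(h) - h)   # tiny: G is a contraction here
```

For ω = 0.5 the Picard loop above drives the residual down to rounding level; as ω approaches 1 the map ceases to be a contraction and convergence degrades.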
The Fréchet derivative (and the Jacobian for the discrete case) is singular at the solution for ω = 1, which is a simple fold singularity <ref type="bibr">[17,</ref><ref type="bibr">28]</ref>.</p><p>In this section we will compare the performance of Picard iteration, Anderson acceleration, and EDIIS for the case ω = 0.5 on an N = 100 point mesh. We terminated the iteration when the residual had decreased by a factor of 10^{−12}.</p><p>One interesting result from <ref type="bibr">[41]</ref> is that Anderson(m) is more efficient than Newton's method for this example, even in the singular case. In the context of this paper it is important to note that Picard iteration converges faster than one would expect from estimating the contractivity parameter by the spectral radius of the Fréchet derivative of G at the solution, which is a lower bound for the operator norm of G′(H^*). From <ref type="bibr">[41]</ref> we have an estimate of ρ(G′(H^*)).</p><p>However <ref type="bibr">[18,</ref><ref type="bibr">20,</ref><ref type="bibr">2]</ref>, the solution is analytic in ω and Picard iteration exploits that property to obtain q-linear convergence with q-factor ≤ ρ(G′(H^*)) and much less for small ω. In fact, if</p><p>H^*(μ, ω) = ∑_{m=0}^{∞} a_m(μ) ω^m</p><p>is the Taylor expansion of H^* in ω, then the coefficient functions {a_m(μ)} are nonnegative for 0 ≤ μ ≤ 1. Moreover, the series converges for ω = 1.
Hence, if H_k is the kth Picard iterate and H_0 ≡ 0, then for all k ≥ 0 and ω, μ ∈ [0, 1],</p><p>All of the above statements about the singularity at ω = 1, the spectral radius of the Fréchet derivative, and the performance of Picard iteration apply to the discrete problem</p><p>(3.2) h_i = G(h)_i ≡ (1 − (ω/(2N)) ∑_{j=1}^{N} μ_i h_j/(μ_i + μ_j))^{−1}, 1 ≤ i ≤ N.</p><p>In (3.2), μ_i = (i − 1/2)/N is the ith quadrature node for the N-point composite midpoint rule, the vector h^* is the solution of the discrete problem h = G(h), G(h^*)_i is the ith component of G(h^*), and the ith component of h^* is h^*_i ≈ H^*(μ_i). As noted above, the optimization problem (1.7) for EDIIS is harder than the one for Anderson acceleration, and the choice of solver can be important. We compare the method of <ref type="bibr">[14]</ref>, as implemented in the MATLAB lsqlin code with the "active-set" option, with the method from <ref type="bibr">[6]</ref>, as implemented with the "interior-point" option in lsqlin. The method of <ref type="bibr">[6]</ref> uses the normal equations and did exhibit problems with ill-conditioning. The computations were done on an Apple Macintosh. Table <ref type="table">3.1</ref> compares ρ(G′(H^*)) to the r-factors of the residuals for Anderson acceleration, Picard iteration, and EDIIS. We estimate the r-factors by</p><p>(‖F(h_k)‖/‖F(h_0)‖)^{1/k},</p><p>where the final iteration upon termination is h_k. Note that, as discussed above, the q-factor for Picard iteration is smaller than the spectral radius. Anderson acceleration also does better than the theory predicts and, in fact, is more efficient than Newton-GMRES <ref type="bibr">[41]</ref>.</p><p>EDIIS-I is the only one of the methods which is sensitive to the ill-conditioning of the optimization problem. We examined this sensitivity by solving the problem twice, once with no limit on the condition number and again by reducing m_k if necessary to limit the condition number to 10^5.
This has no effect on EDIIS-A and slightly slows Anderson acceleration down. We show the residual histories in Figure <ref type="figure">3.1</ref>, where one can clearly see the effect of limiting the condition number. As reported in <ref type="bibr">[41]</ref>, the optimization problem becomes more ill-conditioned as the iteration progresses. The figures show that the convergence of EDIIS-I degrades at the 6th iteration, but to a lesser degree when the condition number is limited. Note that the estimated r-factor seems to stabilize near the end and, in the condition-number-limited case, is back to the Picard rate for the final three iterations, albeit from a worse starting point.</p></div>
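For readers who want to experiment, the EDIIS coefficient problem (1.7) in the 2-norm is a small convex quadratic program over the unit simplex. The sketch below is our own illustration, not the solvers of [6], [14], or [29]; it uses SciPy's general-purpose SLSQP routine, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def ediis_coefficients(F):
    """Minimize ||F @ alpha||_2 subject to sum(alpha) = 1 and alpha nonnegative,
    where the columns of F are the stored residuals F(u_{k-m_k}), ..., F(u_k).
    This is the EDIIS problem (1.7) posed over the unit simplex."""
    n = F.shape[1]
    res = minimize(
        lambda a: 0.5 * np.dot(F @ a, F @ a),   # (1/2)||F a||^2
        np.full(n, 1.0 / n),                    # start at the simplex center
        jac=lambda a: F.T @ (F @ a),
        bounds=[(0.0, None)] * n,               # nonnegativity: the EDIIS constraint
        constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

When the unconstrained Anderson coefficients happen to be nonnegative, the two coefficient solves agree; when they do not, the EDIIS solution lies on the boundary of the simplex, which is one source of the slower convergence observed in this section.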
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions.</head><p>The EDIIS algorithm was designed to improve the global convergence properties of the DIIS algorithm, which is also known as Anderson acceleration. We prove global convergence of the iteration and prove a local convergence result that applies to both EDIIS and Anderson acceleration and improves the results in <ref type="bibr">[41]</ref>. We observe, as did the inventors of the method <ref type="bibr">[21]</ref>, that the unmodified version of Anderson acceleration can have better local convergence in practice.</p></div></body>
		</text>
</TEI>
