<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Non-asymptotic global convergence rates of BFGS with exact line search</title></titleStmt>
			<publicationStmt>
				<publisher>Springer Nature</publisher>
				<date when="2025-08-19">2025-08-19</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10643872</idno>
					<idno type="doi">10.1007/s10107-025-02256-7</idno>
					<title level='j'>Mathematical Programming</title>
					<idno type="issn">0025-5610</idno>
					<biblScope unit="volume"/>
					<biblScope unit="issue"/>

					<author>Qiujiang Jin</author><author>Ruichen Jiang</author><author>Aryan Mokhtari</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab>In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.</ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In this paper, we consider the unconstrained minimization problem</p><p>min x ∈ R d f (x), (1)</p><p>where f : R d → R is strongly convex and twice continuously differentiable. We focus on the non-asymptotic global convergence properties of quasi-Newton methods for solving Problem <ref type="bibr">(1)</ref>. The core idea behind quasi-Newton methods is to mimic the update of Newton's method using only first-order information, i.e., the gradients of f . Specifically, the update rule at the k-th iteration is</p><p>x k+1 = x k − η k B k −1 ∇ f (x k ), (2)</p><p>where η k is the step size and B k ∈ R d×d is a matrix constructed from the gradients of f to approximate the Hessian ∇ 2 f (x k ). Various quasi-Newton methods have been developed, each distinguished by its strategy for constructing the Hessian approximation B k and its inverse. The key methods among them are the Davidon-Fletcher-Powell (DFP) method <ref type="bibr">[11,</ref><ref type="bibr">17]</ref>, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method <ref type="bibr">[5,</ref><ref type="bibr">15,</ref><ref type="bibr">19,</ref><ref type="bibr">39]</ref>, the Symmetric Rank-One (SR1) method <ref type="bibr">[10,</ref><ref type="bibr">26]</ref>, and the Broyden method <ref type="bibr">[3]</ref>. Notably, these quasi-Newton methods directly maintain and update the inverse matrix B k −1 using a constant number of matrix-vector multiplications. 
This results in a computational cost of O(d 2 ) per iteration and thus makes quasi-Newton methods more efficient than Newton's method, which involves computing the Hessian and solving a linear system that could incur a computational cost of O(d 3 ) per iteration.</p><p>Compared to other first-order methods, such as gradient descent and accelerated gradient descent, the primary advantage of quasi-Newton methods is their ability to achieve Q-superlinear convergence, i.e.,</p><p>lim k→∞ ‖x k+1 − x * ‖/‖x k − x * ‖ = 0,</p><p>where x * ∈ R d denotes the optimal solution of Problem <ref type="bibr">(1)</ref>. Specifically, <ref type="bibr">[6]</ref> and <ref type="bibr">[13]</ref> have established that both DFP and BFGS converge Q-superlinearly with unit step size η k = 1, where the initial point x 0 is required to be within a local neighborhood of the optimal solution x * . These results have since been extended to various settings <ref type="bibr">[1,</ref><ref type="bibr">12,</ref><ref type="bibr">18,</ref><ref type="bibr">21,</ref><ref type="bibr">28,</ref><ref type="bibr">31,</ref><ref type="bibr">42,</ref><ref type="bibr">45</ref>]. However, these local convergence results are all asymptotic and fail to provide an explicit convergence rate after a finite number of iterations.</p><p>Recently, there has been progress regarding non-asymptotic local convergence analysis of quasi-Newton methods. The authors of <ref type="bibr">[38]</ref> showed that, if the initial point x 0 is in a local neighborhood of the optimal solution x * and the initial Hessian approximation matrix B 0 is initialized as L I , then BFGS with unit step size attains a local superlinear convergence rate of the form (dL/(μk))^k , where d is the problem's dimension, L is the Lipschitz parameter of the gradient, and μ is the strong convexity parameter. Later in <ref type="bibr">[37]</ref>, the local convergence rate of BFGS was improved to ((d log(L/μ))/k)^k under similar initial conditions. 
Similar local superlinear convergence analysis has also been established for the SR1 method <ref type="bibr">[43]</ref>. In a concurrent work <ref type="bibr">[25]</ref>, it was demonstrated that, if x 0 is in a local neighborhood of the optimal solution x * and B 0 is sufficiently close to the exact Hessian at the optimal solution (or selected as the exact Hessian at x 0 ), then BFGS with unit step size achieves a local superlinear rate of (1/k)^{k/2} , which is independent of the dimension d and the condition number L/μ. While these non-asymptotic results successfully characterize an explicit superlinear rate, they rely heavily on local analysis, requiring the initial point to be sufficiently close to the optimal solution x * and imposing conditions on the step size and the initial Hessian approximation matrix B 0 . Consequently, these results cannot be directly extended to a global convergence guarantee. We discuss this issue in detail in Sect. <ref type="bibr">6</ref>.</p><p>To guarantee global convergence, quasi-Newton methods must be combined with line search or trust-region techniques. The first global result for quasi-Newton methods was derived by Powell in <ref type="bibr">[34]</ref>, where it was established that DFP with exact line search converges globally and Q-superlinearly. Later, Dixon <ref type="bibr">[14]</ref> proved that all quasi-Newton methods from the convex Broyden's class generate the same iterates using exact line search, thus extending Powell's result to the convex Broyden's class including BFGS. In order to relax the exact line search condition, the work in <ref type="bibr">[35]</ref> considered BFGS using inexact line search based on Wolfe conditions and showed that it retains global superlinear convergence. This result was later extended in <ref type="bibr">[9]</ref> to the convex Broyden class except for DFP. 
Moreover, <ref type="bibr">[7,</ref><ref type="bibr">10,</ref><ref type="bibr">26]</ref> showed that the SR1 method with trust-region techniques achieves global and superlinear convergence.</p><p>However, all these results lack an explicit global convergence rate; they only provide asymptotic convergence guarantees and fail to characterize the explicit global convergence rate of classical quasi-Newton methods. The only exception is a recent work in <ref type="bibr">[27]</ref>, where the authors also studied the global convergence rate of BFGS with exact line search. Specifically, it was shown that BFGS attains a global linear rate of (1 − (2μ^3/L^3)(1 + Tr(B 0 )/(Lk))^{−1})^k , where Tr(·) denotes the trace of a matrix. We note that after k = O(d) iterations, their linear rate approaches the rate of (1 − 2μ^3/L^3)^k , which is substantially slower than gradient descent-type methods. More importantly, their study does not extend to demonstrating any superlinear convergence rate and fails to fully characterize the behavior of BFGS.</p><p>The discussions above reveal a major gap in classical quasi-Newton methods: the lack of an explicit global convergence rate characterization. Contributions. In this paper, we present the first results that contain explicit non-asymptotic global linear and superlinear convergence rates for the BFGS method with exact line search. Note that due to the equivalence result by Dixon <ref type="bibr">[14]</ref>, our results also hold for other quasi-Newton methods in the convex Broyden class with exact line search. At a high level, our convergence analysis sharpens the potential function-based framework first introduced in <ref type="bibr">[8]</ref>, leading to a unifying framework for proving both the global linear convergence rates and the superlinear convergence rates. 
Our convergence results are global as they hold for any initial point x 0 ∈ R d and any initial Hessian approximation matrix B 0 that is symmetric positive definite. Specifically, our analysis divides the convergence process into three phases, characterized by different convergence rates: (i) First linear phase: We show that</p><p>Here, B0 = (1/L)B 0 is the scaled initial Hessian approximation matrix, Ψ (·) is a potential function defined later in <ref type="bibr">(18)</ref>, κ = L/μ denotes the condition number, and</p><p>is based on the initial optimality gap with M as the Hessian's Lipschitz parameter. In particular, when k ≥ Ψ ( B0 ), this leads to a linear rate of</p><p>κ}, the algorithm attains an improved linear rate matching that of standard gradient descent:</p><p>is the normalized initial Hessian approximation matrix.</p><p>To make our convergence rates easily interpretable, we focus on the global linear and superlinear convergence rates of the special case where B 0 = α I for a given scalar α &gt; 0. We also study the practical initialization, B 0 = cI , where c = (s ᵀ y)/(s ᵀ s), with s = x 1 − x 2 , y = ∇ f (x 1 ) − ∇ f (x 2 ), and x 1 , x 2 being two randomly chosen vectors. We further consider B 0 = L I and B 0 = μI as two specific cases. The global convergence results with these initializations are summarized in Table <ref type="table">1</ref>. Our analysis reveals a trade-off between the linear and the superlinear rates, depending on the choice of the initial matrix B 0 . Specifically, while both initializations lead to the same linear convergence rates, initiating with B 0 = L I allows the algorithm to reach this rate d log κ iterations earlier than with B 0 = μI . On the other hand, for the superlinear convergence phase, the difference between B 0 = L I and B 0 = μI essentially boils down to comparing dκ against (1 + 4C 0 )d log κ. 
Thus, when C 0 ≪ κ, initializing with B 0 = μI enables an earlier transition to superlinear convergence compared to B 0 = L I , as well as a faster superlinear convergence rate. As we shall see in Sect. 7, our experiments also demonstrate this trade-off.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Additional related work.</head><p>In addition to the standard quasi-Newton methods such as BFGS, the superlinear convergence of other variants of quasi-Newton methods has also been studied in the literature. The greedy variants of quasi-Newton methods were first introduced in <ref type="bibr">[36]</ref> and developed in subsequent works <ref type="bibr">[22,</ref><ref type="bibr">29,</ref><ref type="bibr">30]</ref>. Instead of using the difference of successive iterates to update the Hessian approximation matrix, the key idea is to greedily select basis vectors to maximize a certain measure of progress. In <ref type="bibr">[36]</ref>, greedy BFGS is shown to achieve a local superlinear convergence rate of (dκ(1 − 1/(dκ))^{k/2})^k and the superlinear convergence phase begins after dκ ln (dκ) iterations. Similar superlinear convergence rates are extended to other greedy quasi-Newton updates in <ref type="bibr">[22,</ref><ref type="bibr">29,</ref><ref type="bibr">30]</ref>. However, we note that their results are all local and require the initial point to be sufficiently close to the optimal solution x * . Recently, along a different line of work, the authors in <ref type="bibr">[23,</ref><ref type="bibr">24]</ref> proposed quasi-Newton-type methods based on the hybrid proximal extragradient framework <ref type="bibr">[32,</ref><ref type="bibr">40]</ref> and studied their global convergence rates. Specifically, it was shown that the quasi-Newton proximal extragradient method in <ref type="bibr">[23]</ref> achieves a global linear convergence rate of (1 − 1/κ)^k and a global superlinear rate of the form (1 + k/O(κ^2 d))^{−k} . However, these methods are distinct from the classical quasi-Newton methods such as BFGS analyzed in this paper, since they formulate the update of the Hessian approximation matrices B k as an online convex optimization problem and follow an online learning algorithm to update B k . 
Outline. In Sect. 2, we provide an overview of the BFGS method with exact line search, outline our assumptions, and introduce some preliminary lemmas for the exact line search scheme. Sect. 3 presents our general analytical framework, which is employed to establish global linear and superlinear convergence results for the BFGS method, along with the intermediate results for the update of quasi-Newton methods. In Sect. 4, we establish the global linear convergence rate of BFGS using exact line search that applies to any choices of B 0 and x 0 and we consider specific initializations with B 0 = &#945; I , B 0 = cI , B 0 = L I and B 0 = &#956;I . Building on the linear convergence rates, Sect. 5 details our global superlinear convergence results. In Sect. 6, we compare our analytical framework to both classical asymptotic analysis and recent local non-asymptotic analysis of BFGS. Sect. 7 displays our numerical experiments that corroborate our theoretical findings. Finally, we finish the paper by presenting some concluding remarks in Sect. 8. Notation. We use &#8226; to denote the &#8467; 2 -norm of a vector or the spectral norm of a matrix. We denote S d + and S d ++ as the set of symmetric positive semidefinite and symmetric positive definite matrices with dimension d &#215; d, respectively. Given two symmetric matrices A and B, we denote A B if and only if B -A is positive semidefinite. Given a matrix A, we use Tr(A) and Det(A) to denote its trace and determinant, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>In this section, we first outline the assumptions, notations, and lemmas essential for our convergence proof. Following this, we explore the general framework of quasi-Newton methods incorporating exact line search and provide an overview of the principal concepts underpinning the update mechanism in the convex Broyden's class of quasi-Newton methods, which encompasses both the BFGS and DFP algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Assumptions</head><p>To begin with, we state our assumptions on the objective function f .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 1</head><p>The objective function f is strongly convex with parameter μ &gt; 0, i.e., ‖∇ f (x) − ∇ f (y)‖ ≥ μ‖x − y‖ for any x, y ∈ R d .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 2</head><p>The objective function gradient ∇ f is Lipschitz continuous with parameter L &gt; 0, i.e., ‖∇ f (x) − ∇ f (y)‖ ≤ L‖x − y‖ for any x, y ∈ R d .</p><p>Both Assumptions 1 and 2 are standard in the convergence analysis of first-order methods. Moreover, since f is twice differentiable, they imply that μI ⪯ ∇ 2 f (x) ⪯ L I for any x ∈ R d . Additionally, the condition number of f is defined as κ := L/μ. We also remark that Assumptions 1 and 2 are sufficient in our analysis to prove a global linear convergence rate of BFGS with exact line search. In order to achieve a superlinear convergence rate, we need to impose an additional assumption on the Hessian of the function f , stated below.</p></div>
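To make Assumptions 1 and 2 concrete, a toy verification (an assumed example, not from the paper) on the quadratic f (x) = 0.5 x ᵀ A x, where ∇ f (x) = A x, so μ‖x − y‖ ≤ ‖∇ f (x) − ∇ f (y)‖ ≤ L‖x − y‖ with μ, L the extreme eigenvalues of the symmetric matrix A:

```python
import math

# For f(x) = 0.5 x^T A x the gradient difference is A(x - y), and the
# strong-convexity / smoothness constants are the extreme eigenvalues of A.
A = [[3.0, 1.0], [1.0, 2.0]]
a, b, c = A[0][0], A[0][1], A[1][1]
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
mu, L = (a + c - disc) / 2, (a + c + disc) / 2   # lambda_min, lambda_max

def norm(v):
    return math.sqrt(v[0] ** 2 + v[1] ** 2)

x, y = [1.0, -2.0], [0.5, 3.0]                   # arbitrary test points
d = [x[0] - y[0], x[1] - y[1]]
gd = [a * d[0] + b * d[1], b * d[0] + c * d[1]]  # grad f(x) - grad f(y) = A(x - y)
print(norm(gd) >= mu * norm(d), L * norm(d) >= norm(gd))  # True True
```

The matrix A and the test points are hypothetical; any symmetric positive definite A gives the same sandwich.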
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 3</head><p>The objective function Hessian ∇ 2 f is Lipschitz continuous along the direction of the optimal solution x * with parameter M &gt; 0, i.e.,</p><p>‖∇ 2 f (x) − ∇ 2 f (x * )‖ ≤ M‖x − x * ‖ for any x ∈ R d .</p><p>Assumption 3 is commonly used in the analysis of quasi-Newton methods, such as in <ref type="bibr">[8]</ref>, as it provides a necessary smoothness condition for the Hessian of the objective function. Importantly, we do not require Hessian smoothness for arbitrary points x, y ∈ R d ; rather, we impose a weaker condition, assuming that the Hessian is Lipschitz continuous only along the direction of the optimal solution x * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Quasi-Newton methods with exact line search</head><p>Next, we briefly review the template for updating quasi-Newton matrices, focusing specifically on the DFP and BFGS algorithms. Specifically, at the k-th iteration, the update in (2) can be equivalently written as</p><p>x k+1 = x k + η k d k , where d k = −B k −1 g k and g k = ∇ f (x k ). (4)</p><p>Here, η k ≥ 0 represents the step size, and B k ∈ R d×d is the Hessian approximation matrix. Replacing B k with the exact Hessian ∇ 2 f (x k ) turns the update into the classical Newton's method. Quasi-Newton methods aim to approximate the Hessian with first-order information, typically adhering to a secant condition and a least-change property.</p><p>To elaborate, we define the variable difference s k and gradient difference y k as</p><p>s k = x k+1 − x k , y k = g k+1 − g k .</p><p>The secant condition requires that B k+1 satisfies y k = B k+1 s k , ensuring gradient consistency between the quadratic model m k+1 (x) = f (x k+1 ) + g k+1 ᵀ(x − x k+1 ) + (1/2)(x − x k+1 )ᵀ B k+1 (x − x k+1 ) and f at x k and x k+1 (see <ref type="bibr">[33,</ref><ref type="bibr">Chapter 6]</ref>). That said, the secant condition does not uniquely define B k+1 . Thus, we impose a least-change property to ensure that B k+1 , satisfying the secant condition, is closest to B k in a specific proximity measure. Various proximity measures have been proposed in the literature <ref type="bibr">[16,</ref><ref type="bibr">19,</ref><ref type="bibr">20]</ref> and here we follow the variational characterization in <ref type="bibr">[16]</ref>. Specifically, for any symmetric positive definite matrix A ∈ S d ++ , define the negative log-determinant function Φ(A) = − log Det(A) and define the Bregman divergence generated by Φ by</p><p>D Φ (A, B) = Φ(A) − Φ(B) − ⟨∇Φ(B), A − B⟩ = Tr(B −1 A) − log Det(B −1 A) − d. (6)</p><p>Note that the Bregman divergence can be regarded as a measure of proximity between two positive definite matrices, and D Φ (A, B) = 0 if and only if A = B. 
For the BFGS update, it was shown in <ref type="bibr">[16]</ref> that B k+1 is given as the unique solution of the minimization problem of finding the matrix closest to B k in the Bregman divergence D Φ among all symmetric matrices satisfying the secant condition, which admits the following explicit update rule:</p><p>B k+1 = B k − (B k s k s k ᵀ B k )/(s k ᵀ B k s k ) + (y k y k ᵀ)/(s k ᵀ y k ). (7)</p><p>Moreover, if we define H k := B k −1 as the inverse of the Hessian approximation matrix, it follows from the Sherman-Morrison-Woodbury formula that</p><p>H k+1 = (I − (s k y k ᵀ)/(s k ᵀ y k )) H k (I − (y k s k ᵀ)/(s k ᵀ y k )) + (s k s k ᵀ)/(s k ᵀ y k ).</p><p>The DFP update rule can be regarded as the dual of BFGS, where the roles of the Hessian approximation matrix B k+1 and its inverse H k+1 are exchanged. Specifically, the DFP update rules are given by</p><p>H k+1 = H k − (H k y k y k ᵀ H k )/(y k ᵀ H k y k ) + (s k s k ᵀ)/(s k ᵀ y k ),</p><p>with B k+1 obtained from H k+1 via the Sherman-Morrison-Woodbury formula. Both BFGS and DFP belong to a more general class of quasi-Newton methods, known as the convex Broyden's class <ref type="bibr">[4]</ref>. In this class, the Hessian approximation matrix B k+1 is defined as</p><p>B k+1 = (1 − φ k ) B k+1 BFGS + φ k B k+1 DFP ,</p><p>where φ k ∈ [0, 1] for any k ≥ 0, and B k+1 BFGS and B k+1 DFP denote the BFGS and DFP updates, respectively. Accordingly, there exists ψ k ∈ [0, 1] such that the Hessian inverse approximation matrix H k+1 is given by</p><p>H k+1 = (1 − ψ k ) H k+1 BFGS + ψ k H k+1 DFP .</p><p>The convex Broyden's class exhibits a crucial property: if the initial Hessian approximation matrix B 0 is symmetric positive definite and the objective function f is strictly convex, then all subsequent B k matrices produced by this class maintain symmetric positive definiteness (see <ref type="bibr">[33]</ref>).</p><p>To guarantee the global convergence of quasi-Newton methods in (4), it is necessary to employ a line search scheme to select the step size η k . In this paper, our primary focus is on the exact line search step size, where we aim to minimize the objective function along the search direction d k . Specifically,</p><p>η k = argmin η ≥ 0 f (x k + η d k ). (9)</p><p>Remarkably, it was shown in <ref type="bibr">[14]</ref> that, when employing the exact line search scheme, the convex Broyden's class of quasi-Newton methods produces identical iterates given that the initial point x 0 and the initial matrix B 0 are the same. 
Thus, in the remainder of the paper, we focus on the BFGS update in <ref type="bibr">(7)</ref>, as all our results hold for the other algorithms in the convex Broyden family. Finally, we introduce some intermediate results related to the exact line search step size, as defined in <ref type="bibr">(9)</ref>. These standard results (see, e.g., <ref type="bibr">[34]</ref>) are essential for the forthcoming analysis of the convergence rate of the quasi-Newton method.</p><p>Lemma 1 Consider the standard quasi-Newton method in (4) with the exact line search specified in <ref type="bibr">(9)</ref>. The following results hold for any k ≥ 0: (a) g k+1 ᵀ d k = 0, and hence g k+1 ᵀ s k = 0; (b) −g k ᵀ s k = y k ᵀ s k .</p></div>
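The pieces above, the search direction d k = −H k g k , the exact line search (9), and the inverse BFGS update, can be sketched together in a few lines. The following is a minimal illustrative implementation (an assumed example, not the paper's code): on a strongly convex quadratic f (x) = 0.5 x ᵀ A x − b ᵀ x, the exact step along d k has the closed form η k = −(g k ᵀ d k)/(d k ᵀ A d k), and y k = A s k.

```python
# BFGS with exact line search on a quadratic, maintaining H_k = B_k^{-1}
# via H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T, rho = 1/(y^T s).
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    return [[a * b for b in v] for a in u]

def madd(X, Y, t=1.0):
    return [[X[i][j] + t * Y[i][j] for j in range(len(X))] for i in range(len(X))]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def bfgs_exact_ls(A, b, x, H, iters=20):
    n = len(x)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        Ax = matvec(A, x)
        g = [Ax[i] - b[i] for i in range(n)]       # gradient: A x - b
        if 1e-16 > dot(g, g):                      # converged
            break
        d = [-v for v in matvec(H, g)]             # d_k = -H_k g_k
        Ad = matvec(A, d)
        if not dot(d, Ad) > 0.0:
            break
        eta = -dot(g, d) / dot(d, Ad)              # exact line search step
        s = [eta * di for di in d]
        x = [x[i] + s[i] for i in range(n)]
        y = matvec(A, s)                           # y_k = A s_k on a quadratic
        rho = 1.0 / dot(y, s)
        V = madd(I, outer(s, y), -rho)
        W = madd(I, outer(y, s), -rho)
        H = madd(matmul(matmul(V, H), W), outer(s, s), rho)
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = bfgs_exact_ls(A, b, [10.0, -10.0], [[1.0, 0.0], [0.0, 1.0]])
print(x)   # approaches the solution of A x = b, i.e. (1/11, 7/11)
```

With B 0 = I and exact line search on a quadratic, the iterates coincide with conjugate gradient and reach the minimizer in at most d steps; the specific A, b, and starting point are made up for the demonstration.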
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Convergence analysis framework</head><p>In this section, we introduce our theoretical framework for establishing the global convergence rates of the BFGS algorithm with exact line search. Our framework builds on two key propositions. In Proposition 1, we characterize the amount of function value decrease in one iteration in terms of the angle θ k between the steepest descent direction −g k and the search direction d k given in (4). Subsequently, Proposition 2 presents a potential function for the BFGS update, which leads to a lower bound on cos(θ k ).</p><p>To formally begin the analysis, we first present a weighted version of key vectors and matrices as introduced in the previous work <ref type="bibr">[13]</ref>. Specifically, given a weight matrix P ∈ S d ++ (also referred to as a transformation matrix), we define the weighted gradient ĝk , the weighted gradient difference ŷk , and the weighted iterate difference ŝk as</p><p>ĝ k = P −1/2 g k , ŷ k = P −1/2 y k , ŝ k = P 1/2 s k . (10)</p><p>Similarly, we define the weighted Hessian approximation matrix Bk as</p><p>Bk = P −1/2 B k P −1/2 . (11)</p><p>Note that the weight matrix P can be chosen as any positive definite matrix, and its choice will be evident from the context. In particular, as we shall see later, we use P = L I in Sect. 4 to prove the global linear convergence rate, and use P = ∇ 2 f (x * ) in Sect. 5 to prove the global superlinear convergence rate. Moreover, since the above weighting procedure amounts to a change of the coordinate system, the weighted versions of the vectors and matrices defined in <ref type="bibr">(10)</ref> and <ref type="bibr">(11)</ref> retain the same algebraic relations as their original forms. 
In particular, the weighted Hessian approximation matrices generated by the BFGS algorithm follow the subsequent update rule:</p><p>Bk+1 = Bk − ( Bk ŝ k ŝ k ᵀ Bk )/(ŝ k ᵀ Bk ŝ k ) + (ŷ k ŷ k ᵀ)/(ŝ k ᵀ ŷ k ). (12)</p><p>Before introducing our first key proposition, we define a quantity θk by</p><p>cos( θk ) = −ĝ k ᵀ ŝ k /(‖ĝ k ‖ ‖ŝ k ‖), (13)</p><p>which is the angle between the weighted steepest descent direction −ĝk and the weighted iterate difference ŝk . It is well-known that the convergence of quasi-Newton methods can be established by monitoring the behavior of cos( θk ). We next quantify the link between function value decrease and cos( θk ).</p><p>Proposition 1 Let {x k } k≥0 be the iterates generated by the BFGS method with exact line search. Given a weight matrix P ∈ S d ++ , recall the weighted vectors and matrices defined in <ref type="bibr">(10)</ref> and <ref type="bibr">(11)</ref>. For any k ≥ 0, we have</p><p>where we define</p><p>As a corollary, we have that for any k ≥ 1,</p><p>Proof First, we use the definition of αk in (15) to write</p><p>Moreover, note that we have −ĝ k ᵀ ŝk = ŷ k ᵀ ŝk by Lemma 1(b). Hence, using the definition of θk in ( <ref type="formula">13</ref>) and the definition of mk in <ref type="bibr">(15)</ref>, it follows that</p><p>Furthermore, we can rewrite ‖ĝ k ‖ 2 using the definition of qk in <ref type="bibr">(15)</ref>. Thus, the equality in <ref type="bibr">(17)</ref> can be rewritten as</p><p>By rearranging the terms in the above equality, we obtain <ref type="bibr">(14)</ref>. To prove the inequality in <ref type="bibr">(16)</ref>, note that for any k ≥ 1, we have</p><p>where the last equality is due to <ref type="bibr">(14)</ref>. Notice that the terms 1 − αi qi cos 2 ( θi )/ mi are nonnegative for any i ≥ 0. Thus, by applying the inequality of arithmetic and geometric means twice, we obtain that</p><p>This completes the proof.</p></div>
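The exact line search identities that drive Proposition 1's proof can be checked numerically. A toy sketch (an assumed example, not from the paper): on a quadratic, taking the exact step along any descent direction makes the new gradient orthogonal to the step, g k+1 ᵀ s k = 0, and hence −g k ᵀ s k = y k ᵀ s k.

```python
# f(x) = 0.5 x^T A x - b^T x, grad f(x) = A x - b; exact step along d is
# eta = -(g^T d)/(d^T A d). Then g_new^T s = 0 and -g^T s = y^T s.
def grad(A, b, x):
    return [A[0][0]*x[0] + A[0][1]*x[1] - b[0], A[1][0]*x[0] + A[1][1]*x[1] - b[1]]

A, b = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
x = [10.0, -10.0]
g = grad(A, b, x)
d = [-g[0], -g[1]]                        # any descent direction works here
Ad = [A[0][0]*d[0] + A[0][1]*d[1], A[1][0]*d[0] + A[1][1]*d[1]]
eta = -(g[0]*d[0] + g[1]*d[1]) / (d[0]*Ad[0] + d[1]*Ad[1])
s = [eta*d[0], eta*d[1]]
g_new = grad(A, b, [x[0] + s[0], x[1] + s[1]])
y = [g_new[0] - g[0], g_new[1] - g[1]]
print(g_new[0]*s[0] + g_new[1]*s[1])      # approximately 0
print(-(g[0]*s[0] + g[1]*s[1]) - (y[0]*s[0] + y[1]*s[1]))  # approximately 0
```

Since the weighting in (10) preserves these inner products (ĝᵀŝ = gᵀs, ŷᵀŝ = yᵀs), the same identities hold for the weighted quantities; the quadratic and the starting point are hypothetical.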
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Remark 1</head><p>We note that similar results relating f (x k ) − f (x k+1 ) to cos 2 ( θk ) have appeared in prior work such as [9, Lemma 4.2] and <ref type="bibr">[8]</ref>, though they are used in the analysis of quasi-Newton methods with inexact line search. Compared with these prior results, Proposition 1 is more general in the sense that we consider the weighted iterates using a general weight matrix P. This flexibility enables us to obtain tighter bounds and, more importantly, to obtain a global superlinear convergence rate under the same framework (see Sect. 5). Another subtle yet important difference is that previous works typically upper bound the term mk by L prematurely, leading to a worse dependence on the condition number κ. Instead, we keep mk in ( <ref type="formula">14</ref>) as is and lower bound the term cos 2 ( θk )/ mk together, as later shown in Proposition 2.</p><p>Remark 2 Without loss of generality, we assume that x k ≠ x * for any k ≥ 0 throughout the paper, meaning the exact optimal solution x * is never reached. Consequently, this ensures that ĝk ≠ 0, η k &gt; 0, and ŝk ≠ 0 for any k ≥ 0. As a result, we have −ĝ k ᵀ ŝ k &gt; 0,</p><p>since Bk is symmetric positive definite. Moreover, under this assumption, the definitions in equation ( <ref type="formula">15</ref>) are valid since all the denominators, including ‖ŝk ‖ 2 , are nonzero. Similarly, mk = ŷ k ᵀ ŝk /‖ŝk ‖ 2 is well-defined and nonzero.</p><p>Proposition 1 shows that BFGS's convergence rate hinges on four quantities: αk , qk , mk , and cos( θk ). Note that αk and qk can be bounded using Assumptions 1, 2 and 3, independent of the quasi-Newton update, with details deferred to Sect. 3.1. The focus here is to establish a lower bound for cos 2 ( θk )/ mk . 
This involves analyzing the dynamics of the Hessian approximation matrices {B k } k≥0 through their trace and determinant, leveraging the following potential function from <ref type="bibr">[8]</ref> that integrates both:</p><p>Ψ (A) = Tr(A) − log Det(A) − d. (18)</p><p>Given <ref type="bibr">(6)</ref>, Ψ (A) can be regarded as the Bregman divergence generated by Φ(A) = − log Det(A) between the matrix A and the identity matrix I . In particular, Ψ (A) ≥ 0, and Ψ (A) = 0 if and only if A = I . Now we are ready to state Proposition 2, which is a classical result in the quasi-Newton literature (e.g., see <ref type="bibr">[33,</ref><ref type="bibr">Section 6.4]</ref>). For completeness, we provide its proof in Appendix A.</p><p>Proposition 2 Given a weight matrix P ∈ S d ++ , recall the weighted vectors and matrices defined in <ref type="bibr">(10)</ref> and <ref type="bibr">(11)</ref>. Let { Bk } k≥0 be the weighted Hessian approximation matrices generated by the BFGS update in <ref type="bibr">(12)</ref>. Then we have</p><p>where mk and θk are defined in <ref type="bibr">(15)</ref>. As a corollary, we have for any k ≥ 1,</p><p>Exponentiating both sides of <ref type="bibr">(20)</ref>, Proposition 2 provides a lower bound for the product ∏ i=0 k−1 cos 2 ( θi )/ mi in terms of the quantities ‖ŷ i ‖ 2 /(ŝ i ᵀ ŷ i ) and Ψ ( B0 ). We will use Assumptions 1, 2 and 3 to bound the term ‖ŷ k ‖ 2 /(ŝ k ᵀ ŷ k ) for any k ≥ 0, as shown in Lemma 5 of Sect. 3.1. Moreover, the second term Ψ ( B0 ) depends on our choice of the initial Hessian approximation matrix B 0 . Specifically, we will consider two different initializations: (i) B 0 = L I ; (ii) B 0 = μI . As we shall discuss in the upcoming sections, these two choices result in different bounds and thus lead to a trade-off between the initial linear convergence rate and the final superlinear convergence rate.</p><p>Having outlined our key propositions, Sects. 
4 and 5 will combine Proposition 1 and Proposition 2 to demonstrate that BFGS achieves global non-asymptotic linear and superlinear convergence rates, respectively. Our approach involves selecting an appropriate weight matrix P and bounding the quantities in <ref type="bibr">(16)</ref> to derive the overall convergence rate. Specifically, we set P = L I for global linear convergence and P = ∇ 2 f (x * ) for superlinear convergence. The following section presents intermediate lemmas that will be used to establish these convergence bounds.</p></div>
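The trace-minus-logdet potential of Sect. 3 is easy to evaluate directly. A small sketch (formula taken as Ψ (A) = Tr(A) − log Det(A) − d per the definition above; the sample matrices and the condition number are assumptions for illustration):

```python
import math

# Psi(A) = Tr(A) - ln Det(A) - d for a 2x2 matrix: nonnegative, zero iff A = I.
def psi_2x2(A):
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return tr - math.log(det) - 2

print(psi_2x2([[1.0, 0.0], [0.0, 1.0]]))   # 0.0 at the identity
print(psi_2x2([[3.0, 1.0], [1.0, 2.0]]))   # strictly positive away from I

# Initialization trade-off in miniature: with B0 = mu*I and P = L*I, the
# scaled matrix is (1/kappa) I, and Psi((1/kappa) I) = d (1/kappa + ln kappa - 1),
# which grows like d ln kappa, consistent with the d log kappa terms above.
kappa = 16.0
print(psi_2x2([[1.0 / kappa, 0.0], [0.0, 1.0 / kappa]]))
```

Here d = 2; Ψ ((1/κ) I) evaluates to 2(1/κ + ln κ − 1), so larger condition numbers inflate the potential of the μI initialization, matching the discussion of the linear/superlinear trade-off.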
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Intermediate lemmas</head><p>Next, we provide some intermediate results that lower bound the quantities αk and qk defined in <ref type="bibr">(15)</ref> and the term ‖ŷ k ‖ 2 /(ŝ k ᵀ ŷ k ) appearing in <ref type="bibr">(19)</ref>. To do so, we first define the average Hessian matrices J k and G k as</p><p>J k = ∫ 0 1 ∇ 2 f (x k + τ s k ) dτ (21)</p><p>and</p><p>G k = ∫ 0 1 ∇ 2 f (x * + τ (x k − x * )) dτ. (22)</p><p>These two matrices play an important role in our analysis, since the fundamental theorem of calculus implies that y k = J k s k and g k = G k (x k − x * ) for any k ≥ 0. We also define the weighted average Hessian matrix Ĵ k = P −1/2 J k P −1/2 for the given weight matrix P ∈ S d ++ . Moreover, we define a quantity C k that depends on the function value at the iterate x k :</p><p>where M is the Lipschitz constant of the Hessian in Assumption 3 and μ is the strong convexity parameter in Assumption 1. Given these definitions, in the following lemma, we characterize the relationship between different matrices that appear in our convergence analysis.</p><p>Lemma 2 Suppose Assumptions 1, 2, and 3 hold, and recall the definitions of the matrices J k in <ref type="bibr">(21)</ref>, G k in <ref type="bibr">(22)</ref>, and the quantity C k in <ref type="bibr">(23)</ref>. Then, the following statements hold:</p><p>(a) For any k ≥ 0, we have that</p><p>(b) For any k ≥ 0, we have that</p><p>(c) For any k ≥ 0 and any τ ∈ [0, 1], we have that</p><p>(d) For any k ≥ 0 and τ ∈ [0, 1], we have that</p><p>Proof See Appendix B.</p><p>After establishing Lemma 2, in the following three lemmas, we provide bounds on the quantities αk and qk defined in <ref type="bibr">(15)</ref>, as well as on the term ‖ŷ k ‖ 2 /(ŝ k ᵀ ŷ k ).</p><p>Lemma 3 Suppose Assumptions 1, 2, and 3 hold. Then, for any k ≥ 0, we have</p><p>Proof To begin with, note that we have −ĝ k ᵀ ŝk = ŷ k ᵀ ŝk due to Lemma 1(b) and ŷ k ᵀ ŝk = y k ᵀ s k for any choice of the weight matrix P. 
Thus, &#945;k can be equivalently</p><p>. We first prove the first bound in <ref type="bibr">(28)</ref>. By Assumptions 1 and 2, the function f is &#956;-strongly convex and its gradient is L-Lipschitz. Then for any x, y &#8712; R d , it holds that</p><p>This is also known as the interpolation inequality; see, e.g., <ref type="bibr">[41,</ref><ref type="bibr">Theorem 4]</ref>. By setting x = x k , y = x k+1 in (29) and recalling that</p><p>and g k+1 = &#8711; f (x k+1 ), we obtain that</p><p>Moreover, Lemma 1 shows that g k+1 s k = 0 due to exact line search. Thus, we can simplify the above inequality as</p><p>where we used Young's inequality in the second inequality and the fact that s k y k &#8804; s k y k due to Cauchy-Schwartz inequality in the third inequality. Hence, we conclude that &#945;k</p><p>&#954; . Now we proceed to establish the second lower bound on &#945;k . Given Taylor's theorem, there exists</p><p>where we used g k+1 s k = 0. Moreover, based on (26) in Lemma 2(c), we have</p><p>Hence, we obtain that</p><p>By combining the inequalities in <ref type="bibr">(30)</ref> and <ref type="bibr">(31)</ref>, the main claim follows. <ref type="bibr">(15)</ref>. Suppose Assumptions 1, 2, and 3 hold. Then we have the following results: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 4</head><p>Recall the definition of qk in <ref type="bibr">(15)</ref>. Suppose Assumptions 1, 2, and 3 hold. Then we have the following results:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(b) If we choose P</head><p>Proof We first prove (a). When P = L I , we have qk = <ref type="bibr">[2,</ref><ref type="bibr">Section 9.1.2]</ref>). Hence, we conclude that qk &#8805; 2&#956;/L = 2/&#954;.</p><p>Next, we prove (b). When P = &#8711; 2 f (x * ), we have &#285;k</p><p>By applying Taylor's theorem with Lagrange remainder, there exists &#964;k &#8712; [0, 1] such that</p><p>where we used the fact that &#8711; f (x * ) = 0 in the last equality. Moreover, by the fundamental theorem of calculus, we have</p><p>where we use the definition of G k in <ref type="bibr">(22)</ref>. Since &#8711; f (x * ) = 0 and we denote g k = &#8711; f (x k ), this further implies that</p><p>Combining <ref type="bibr">(32)</ref> and (33) leads to</p><p>Based on <ref type="bibr">(27)</ref> in Lemma 2(d), we have</p><p>Moreover, it follows from (25</p><p>Combining <ref type="bibr">(35)</ref> and <ref type="bibr">(36)</ref>, we obtain that</p><p>and hence</p><p>By using <ref type="bibr">(34)</ref> and the fact that &#285;k</p><p>and the claim follows.</p><p>Lemma 5 Suppose Assumptions 1, 2, and 3 hold. Then we have</p><p>As a corollary, we have the following results:</p><p>(a) If we choose P = L I , then &#375;k 2</p><p>Proof Note that by the fundamental theorem of calculus, we have y k = J k s k , which implies that &#375;k = &#308;k &#349;k . Hence, we can bound</p><p>Hence, if P = L I , then &#308;k = 1 L J k &#8804; 1 by Assumption 2, which proves the result in (a). Moreover, if P = &#8711; 2 f (x * ), then <ref type="bibr">24)</ref> in Lemma 2(a), which proves the result in (b).</p></div>
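The identity y k = J k s k used at the start of the proof of Lemma 5 can be checked numerically. The following is a small sketch, not part of the paper: the test function f(x) = 0.5 x&#8868;Ax + 0.25 &#931; x_i&#8308;, the random seed, and the midpoint quadrature grid approximating the integral defining J k are all our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q = rng.standard_normal((d, d))
A = Q @ Q.T + d * np.eye(d)      # A is positive definite, so f is strongly convex

grad = lambda x: A @ x + x**3            # gradient of f(x) = 0.5 x^T A x + 0.25 sum(x^4)
hess = lambda x: A + 3.0 * np.diag(x**2)

xk = rng.standard_normal(d)
sk = rng.standard_normal(d)
yk = grad(xk + sk) - grad(xk)

# Midpoint-rule approximation of J_k = \int_0^1 hess(x_k + tau * s_k) dtau
taus = (np.arange(2000) + 0.5) / 2000
Jk = sum(hess(xk + t * sk) for t in taus) / len(taus)

# Fundamental theorem of calculus: y_k = J_k s_k
assert np.allclose(Jk @ sk, yk, rtol=1e-5)
```

The same construction with the segment from x k to x * in place of s k checks g k = G k (x k &#8722; x * ).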
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Global linear convergence rates</head><p>In this section, we establish the explicit global linear convergence rates for the BFGS method using an exact line search step size, marking one of the first non-asymptotic global linear convergence analyses of BFGS with a line search scheme. The subsequent global superlinear convergence analyses are established based on these linear rates. Specifically, we combine the fundamental inequality ( <ref type="formula">16</ref>) from Proposition 1 with lower bounds of the terms &#945;k , qk , and cos 2 ( &#952;k )/ mk from Lemma 3, 4, 5 and Proposition 2 to prove all the global linear convergence rates. In this section, we set the weight matrix P as P = L I and we define the weighted matrix Bk as:</p><p>In the following lemma, we prove the global linear convergence rate of the BFGS method for any choice of B 0 &#8712; S d ++ .</p><p>Lemma 6 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1 and 2 hold. For any initial point x 0 &#8712; R d and any initial Hessian approximation matrix B 0 &#8712; S d ++ , we have the following global linear convergence rate for any k &#8805; 1,</p><p>Proof Our starting point is applying Proposition 1 with the weight matrix P chosen as P = L I . Specifically, <ref type="bibr">(16)</ref> shows that to obtain a convergence rate, it suffices to prove a lower bound on k-1 i=0 &#945;i qi mi cos 2 ( &#952;i ). It follows from Lemma 3 that &#945;k =</p><p>for any k &#8805; 0. Moreover, by applying Lemma 4 with P = L I , we obtain that qk = &#285;k</p><p>&#954; for any k &#8805; 0. Furthermore, applying Proposition 2 with P = L I , it follows from (20) that</p><p>where in the last inequality we used &#375;i 2 &#349; i &#375;i &#8804; 1 by Lemma 5 with P = L I . 
Taking exponentiation of both sides, this further implies that</p><p>Combining all the pieces above, we get</p><p>Thus, it follows from Proposition 1 that</p><p>This completes the proof.</p><p>Notice that this result holds without the Hessian Lipschitz continuity assumption. In the next lemma, we present another version of the global linear convergence analysis with the additional assumption the Hessian of f is M-Lipschitz. We show that the BFGS method with exact line search will eventually reach a global linear convergence rate of (1 -1/O(&#954;)) k , which is the same as the gradient descent method.</p><p>Lemma 7 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point x 0 &#8712; R d and any initial Hessian approximation matrix B 0 &#8712; S d ++ , we have the following global linear convergence rate for any k &#8805; 1,</p><p>Moreover, when k &#8805; (1</p><p>Proof We follow a similar argument as in the proof of Lemma 6 but with a different lower bound for &#945;k . Specifically, by Lemma 3, we also have &#945;k</p><p>. Combining this with qk &#8805; 2/&#954; and (39) leads to</p><p>To begin with, recall the definition that</p><p>Since the objective function is non-increasing by Lemma 1, it holds that C i &#8804; C 0 for any i &#8805; 0. Thus, from <ref type="bibr">(42)</ref> we have</p><p>Thus, by using Proposition 1 we obtain <ref type="bibr">(40)</ref>.</p><p>To prove the second claim in (41), we use the fact that 1 + x &#8804; e x for any x &#8712; R to get</p><p>Combining ( <ref type="formula">42</ref>) and ( <ref type="formula">43</ref>) leads to</p><p>Next, we prove an upper bound on</p><p>, which implies that k &#8805; &#936; ( B0 ). Then <ref type="bibr">(38)</ref> in <ref type="bibr">Lemma 6 and (40)</ref> together imply that</p><p>where we used the fact that e -&#936; ( B0 ) k &#8805; e -1 &#8805; 1 3 . 
Moreover, we decompose the sum</p><p>For the first part, we have</p><p>For the second part, by the definition of C i , we have</p><p>where we used</p><p>x for all 0 &#8804; x &#8804; 1 in the last inequality. Combining both inequalities, we arrive at</p><p>Thus, when the number of iterations k exceeds (1</p><p>Together with Proposition 1, this proves the second claim in <ref type="bibr">(41)</ref>.</p><p>We summarize all the global linear convergence results from the above two lemmas in the following theorem.</p><p>Theorem 1 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point x 0 &#8712; R d and any initial matrix B 0 &#8712; S d ++ , we have the following global linear convergence rate for any k &#8805; 1,</p><p>where B0 is defined in <ref type="bibr">(37)</ref>. When k &#8805; &#936; ( B0 ), we have that</p><p>Moreover, when k &#8805; (1</p><p>In Theorem 1, we present three distinct linear convergence rates during different phases of the BFGS algorithm with exact line search. Specifically, the linear rate in (46) is applicable from the first iteration, but the contraction factor depends on the quantity e -&#936; ( B0 )/k , which can be exponentially small and thus imply a slow convergence rate. However, this quantity will be bounded away from zero as the number of iterations k increases, resulting in an improved linear rate. In particular, for k &#8805; &#936; ( B0 ), the quantity e -&#936; ( B0 )/k is bounded below by 1/3, leading to the second improved linear convergence rate in (47). 
Furthermore, as shown in Lemma 7, after an additional C 0 &#936; ( B0 ) + 3C 0 &#954; min{2(1 + C 0 ), 1 + &#8730; &#954;} iterations, we achieve the last linear convergence rate in (48), which is comparable to that of gradient descent.</p><p>From the discussions above, we observe that the quantity &#936; ( B0 ) (recall that B0 = 1 L B 0 ) plays a critical role in determining the transitions between different linear convergence phases, and a smaller &#936; ( B0 ) implies fewer iterations required to reach each linear convergence phase. Thus, we consider a special case to simplify our bounds, where B 0 = &#945; I and &#945; &gt; 0 is an arbitrary positive scalar. Given this simplification, we obtain</p><p>We apply this to Theorem 1 to establish the corresponding global linear rates, as stated in Corollary 1.</p><p>Corollary 1 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point x 0 &#8712; R d and the initial Hessian approximation matrix B 0 = &#945; I with &#945; &gt; 0, we have the following global convergence rate for any k &#8805; 1,</p><p>Moreover, when k &#8805; (1</p><p>The above corollary characterizes the behavior of BFGS with exact line search when the initial Hessian approximation is a scaled identity matrix. In the following paragraphs, we refine these results by analyzing the bounds for specific values of &#945;.</p><p>First, we examine two extreme cases for initialization: &#945; = L (where B 0 = L I ) and &#945; = &#956; (where B 0 = &#956;I ). The former corresponds to an upper bound on the eigenvalues of the Hessian, while the latter uses a lower bound. Note that in the case where</p><p>= 0 and we achieve the global linear convergence rate in (50) for any k &#8805; 1. Moreover, when k &#8805; 3C 0 &#954; min{2(1 + C 0 ), (1 + &#8730; &#954;)}, we reach the second linear convergence rate in (51). 
In the case where</p><p>We have the following global convergence rate for any k &#8805; 1,</p><p>When k &#8805; d log &#954;, we achieve the global linear convergence rate in (50). Moreover, when k &#8805; (1</p><p>we reach the second linear convergence rate in (51). Comparing the above results, we observe that BFGS with B 0 = &#956;I requires additional d log &#954; iterations to achieve a similar linear rate as in the first case. However, as we present in the next section, the choice of the initial Hessian approximation matrix B 0 = &#956;I could achieve a superlinear rate faster. This trade-off between the linear and superlinear convergence phase is the fundamental consequence of different choices of the initial Hessian approximation matrix in our convergence analysis.</p><p>While the special cases discussed above are valuable for theoretical comparison, they may not be practical for selecting the initial Hessian approximation, as the constants &#956; and L are often unknown. A more practical choice, which can be easily computed, is to set B 0 = cI , where c is determined based on gradient and variable differences between two randomly selected points. Specifically, c is given by c = s y s 2 where s = x 2x 1 and y = &#8711; f (x 2 ) -&#8711; f (x 1 ), with x 1 and x 2 being two randomly chosen vectors. This choice ensures that c &#8712; [&#956;, L]. For this initialization of B 0 , we can establish the bound d c L -1 + log L c &#8804; d log &#954;. Applying this upper bound to our linear convergence result in Corollary 1, we obtain that when k &#8805; d log &#954;, we have the linear convergence rate in (50). When k &#8805; (1</p><p>we have the linear rate in (51).</p></div>
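The claim above that the practical scaling c = s y / s 2 always lands in [&#956;, L] is a Rayleigh-quotient property of the average Hessian. A minimal NumPy check on a quadratic, where the property is exact; the matrix, seed, and sample points below are arbitrary choices of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
Q = rng.standard_normal((d, d))
A = Q @ Q.T + np.eye(d)          # Hessian of the quadratic f(x) = 0.5 x^T A x
evals = np.linalg.eigvalsh(A)
mu, L = evals[0], evals[-1]      # strong convexity and smoothness constants

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
s = x2 - x1
y = A @ x2 - A @ x1              # gradient difference for the quadratic
c = s @ y / (s @ s)              # practical scaling for B0 = c I

# c is the Rayleigh quotient s^T A s / s^T s, hence in [mu, L]
assert mu <= c <= L
```

For a general strongly convex f with L-Lipschitz gradient, the same bound follows since y = J s for the average Hessian J with &#956;I &#8849; J &#8849; L I.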
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Global superlinear convergence rates</head><p>In this section, we establish the non-asymptotic global superlinear convergence rate of BFGS with exact line search, employing a similar approach to the global linear convergence rate analysis from the previous section. We utilize the framework from Proposition 1 and integrate the lower bounds from Lemmas 3, 4, 5, and Proposition 2.</p><p>The key distinction lies in the choice of the weight matrix: instead of P = L I used in the linear convergence analysis, we opt for P = &#8711; 2 f (x * ) for the global superlinear convergence proof. We define the weighted matrix Bk as:</p><p>In the following proposition, we first provide a general global convergence bound with an arbitrary initial Hessian approximation matrix B 0 &#8712; S d ++ . All the global superlinear convergence rates are based on the following proposition.</p><p>Proposition 3 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. Recall the definition of C k in <ref type="bibr">(23)</ref> and &#936; (&#8226;) in <ref type="bibr">(18)</ref>. For any initial point x 0 &#8712; R d and any initial Hessian approximation matrix B 0 &#8712; S d ++ , the following result holds for any k &#8805; 1,</p><p>Proof Recall that we choose the weight matrix as P = &#8711; 2 f (x * ) throughout the proof. 
From Lemma 3 and Lemma 4(b), we have &#945;k &#8805;</p><p>Hence, using the inequality 1 + x &#8804; e x for any x &#8805; 0, it follows that</p><p>Moreover, by using the inequality (20) in Proposition 2 with P = &#8711; 2 f (x * ), we obtain that</p><p>where in the last inequality we used the fact that &#375;i 2</p><p>This further implies that</p><p>Combining ( <ref type="formula">55</ref>), (56), and ( <ref type="formula">16</ref>) from Proposition 1, we prove that</p><p>where the last inequality is due to the fact that 1e -x &#8804; x for any x.</p><p>The above global result shows that the error after k iterations for the BFGS update with exact line search depends on the potential function of the weighted initial Hessian approximation matrix B0 , i.e., &#936; ( B0 ), and the sum of weighted function value optimality gap, i.e., k-1 i=0 C i . This result forms the foundation of our superlinear result, since if we can demonstrate that the sum k-1 i=0 C i is bounded above, it leads to a superlinear rate of the form O((1/k) k ).</p><p>Having established the non-asymptotic global linear convergence rate of BFGS in the previous section, we can leverage it to show that the sum k-1 i=0 C i is uniformly bounded above, allowing us to establish an explicit upper bound for this finite sum. In the following theorem, we apply the linear convergence results from Sect. 4 to prove the non-asymptotic global superlinear convergence rates of BFGS with exact line search for any initial Hessian approximation matrix B 0 &#8712; S d ++ .</p><p>Theorem 2 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. 
For any initial point x 0 &#8712; R d and any initial Hessian approximation matrix B 0 &#8712; S d ++ , we have the following superlinear convergence rate,</p><p>) where B0 and B0 are defined in <ref type="bibr">(37)</ref> and (53).</p><p>Proof From (45) in Lemma 7, we know that for k &#8805; 1,</p><p>Leveraging ( <ref type="formula">58</ref>) and (54) in Lemma 3, we prove that for k &#8805; 1,</p><p>and the proof is complete. This result indicates that BFGS with exact line search achieves a superlinear convergence rate when the number of iterations satisfies the condition k &#8805; &#936; ( B0</p><p>The initial matrix B 0 critically influences the required iterations to attain this rate, as it appears in the numerator of the upper bound through</p><p>Thus, different choices of B 0 yield different values for &#936; ( B0 ) + 4C 0 &#936; ( B0 ), affecting the number of iterations required for superlinear convergence. Indeed, one can try to optimize the choice of B 0 to make the expression &#936; ( B0 ) + 4C 0 &#936; ( B0 ) as small as possible.</p><p>Now we consider the special case where B 0 = &#945; I with &#945; &gt; 0 as any positive constant. For this case, we have</p><p>from Assumptions 1 and 2. Applying these bounds in Theorem 2, we obtain the following corollary.</p><p>Corollary 2 Let {x k } k&#8805;0 be the iterates generated by the BFGS method with exact line search and suppose that Assumptions 1, 2 and 3 hold. For any initial point x 0 &#8712; R d and the initial Hessian approximation matrix B 0 = &#945; I with &#945; &gt; 0, we have the following superlinear convergence rate,</p><p>The above corollary characterizes the non-asymptotic superlinear convergence rate of BFGS with exact line search when the initial Hessian approximation is an identity matrix multiplied by a constant &#945;. 
Similar to the linear convergence analysis, we present the superlinear convergence rates for specific values of &#945; in the following paragraphs.</p><p>Hence, we obtain the superlinear convergence rate</p><p>Similarly, when &#945; = &#956; (B 0 = &#956;I ), we have</p><p>This leads to the superlinear convergence rate</p><p>As shown in the above two results, choosing B 0 = L I minimizes &#936; ( B0 ), resulting in &#936; ( B0 ) = 0. However, &#936; ( B0 ) in this case could be as large as d&#954;. On the other hand, setting B 0 = &#956;I yields a more favorable upper bound, ensuring that both &#936; ( B0 ) and &#936; ( B0 ) are bounded by d log &#954;. Hence, initializing the Hessian approximation with B 0 = &#956;I instead of B 0 = L I could result in fewer iterations to reach the superlinear convergence phase. Generally, during the initial linear convergence stage, the iterates generated by the BFGS method with B 0 = L I outperform those with B 0 = &#956;I , due to a faster linear convergence speed. However, the BFGS method with B 0 = &#956;I transitions to the ultimate superlinear convergence phase in fewer iterations compared to B 0 = L I . This phenomenon has also been observed in our experiments in Sect. 7.</p><p>As in the linear convergence analysis, we also consider the practical initial Hessian approximation: B 0 = cI , where c is s y s 2 , with s = x 2 -x 1 , y = &#8711; f (x 2 )-&#8711; f (x 1 ), and x 1 , x 2 as two random vectors. For this choice of B 0 , we can derive the following upper bounds:</p><p>Applying these values of &#936; ( B0 ) and &#936; ( B0 ) to our superlinear convergence result in Corollary 2, we can obtain the following convergence guarantees for B 0 = cI :</p><p>(60) While all of our presented results are global and do not impose any initial condition on x 0 , in the following remark, we present a potential local result when B 0 = &#956;I . 
Remark 3 Consider the scenario where BFGS starts at a point x 0 near the optimal solution x * such that the initial error condition</p><p>In this case, we can establish that (1</p><p>. Thus, when B 0 = &#956;I , we obtain the local superlinear convergence rate of O( d log &#954; k ) k , which aligns with the local convergence result in <ref type="bibr">[37]</ref>. It is noteworthy that the local result in <ref type="bibr">[37]</ref> relied on a unit step size, while our local side result is derived using exact line search.</p></div>
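The method analyzed in this section can be sketched compactly on a strongly convex quadratic, where the exact line search step size has a closed form. This is a minimal illustration of BFGS with exact line search under our own toy setup (not the paper's MATLAB implementation, which uses "fminsearch"); on a quadratic the iteration terminates in at most d steps up to roundoff:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Q = rng.standard_normal((d, d))
A = Q @ Q.T + np.eye(d)          # f(x) = 0.5 x^T A x, minimizer x* = 0

x = rng.standard_normal(d)
B = np.eye(d)                    # initial Hessian approximation B0 = I
for _ in range(50):
    g = A @ x                    # gradient at the current iterate
    if np.linalg.norm(g) < 1e-12:
        break
    dirn = -np.linalg.solve(B, g)             # BFGS direction -B_k^{-1} g_k
    eta = -(g @ dirn) / (dirn @ A @ dirn)     # exact line search (closed form on quadratics)
    s = eta * dirn
    y = A @ (x + s) - g                       # y_k = grad(x_{k+1}) - grad(x_k)
    x = x + s
    # Standard BFGS update of the Hessian approximation
    Bs = B @ s
    B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

assert np.linalg.norm(A @ x) < 1e-6
```

Swapping B0 = I for L*I or mu*I (with L, mu the extreme eigenvalues of A) reproduces the initializations compared throughout Sects. 4 and 5.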
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussions</head><p>Comparison with local non-asymptotic analysis. In this section, we discuss the recent non-asymptotic local convergence results for BFGS and DFP in <ref type="bibr">[25,</ref><ref type="bibr">37,</ref><ref type="bibr">38]</ref> and explain why these results cannot be easily extended to achieve global complexity bounds.</p><p>To begin with, note that these results are crucially based on local analysis and only apply when the iterates are close to the optimal solution x * and the step size &#951; k is set to 1 in this local region. Therefore, to extend their results into a global convergence guarantee, one plausible strategy is to employ a line search scheme to ensure global convergence, and then switch to the local analysis when the iterates enter the region of local convergence. However, this approach faces several challenges.</p><p>First, it remains unclear how to explicitly upper bound the number of iterations until the line search subroutine accepts the unit step size &#951; k = 1. Moreover, assume that the iterates enter the region of local convergence after k 0 iterations and we have &#951; k = 1 for all k &#8805; k 0 . Even then, there is no guarantee that the Hessian approximation matrix B k 0 will satisfy the necessary conditions required for the local analysis in <ref type="bibr">[25,</ref><ref type="bibr">37,</ref><ref type="bibr">38]</ref>. Specifically, for the analysis in <ref type="bibr">[25]</ref> to hold, B k 0 must be sufficiently close to the exact Hessian matrix, which is not satisfied in general. Regarding <ref type="bibr">[37,</ref><ref type="bibr">38]</ref>, we note that their analyses depend on the condition number of B k 0 , which could be exponentially large and thus render the superlinear rate meaningless. 
To be more concrete, inspecting the proofs in <ref type="bibr">[37,</ref><ref type="bibr">Lemma 5.4</ref>] and <ref type="bibr">[38,</ref><ref type="bibr">Theorem 4.2]</ref> reveals that the superlinear convergence rate occurs when k = &#937;(&#936; ( B-1 k 0 )) and k = &#937;(&#936; ( Bk 0 )), respectively, where</p><p>with J k 0 defined in <ref type="bibr">(21)</ref> and &#936; (&#8226;) is the potential function defined in <ref type="bibr">(18)</ref>. Consequently, it is essential to establish bounds for the smallest and largest eigenvalues of Bk 0 . However, the current theory indicates (see e.g. <ref type="bibr">[38,</ref><ref type="bibr">Theorem 4.1]</ref>) that e -2&#954; M&#955; 0 I Bk 0 e 2&#954; M&#955; 0 I , where &#955; 0 = (&#8711; 2 f (x 0 )) -1 2 &#8711; f (x 0 ) denotes the initial Newton decrement. This suggests that without a sufficiently small &#955; 0 , the extreme eigenvalues of Bk 0 will be exponentially dependent on the condition number &#954;, leading to &#936; ( B-1 k 0 ), &#936; ( Bk 0 ) = &#937;(de 2&#954; M&#955; 0 ). Hence, a superlinear rate will be achieved only after &#937;(de 2&#954; M&#955; 0 ) iterations.</p><p>Our convergence framework also diverges significantly from the previous works <ref type="bibr">[25,</ref><ref type="bibr">37,</ref><ref type="bibr">38]</ref> in terms of the proof strategy. Specifically, the approach in the aforementioned studies employs an induction argument to control the largest and smallest eigenvalues of the Hessian approximation matrix B k and prove a local linear convergence rate. In comparison, as presented in Sects. 4 and 5, we prove global linear and superlinear convergence rates without explicitly establishing upper or lower bounds on the eigenvalues of B k . This marks a notable departure from the local convergence analysis in <ref type="bibr">[38]</ref>, <ref type="bibr">[37]</ref>, and <ref type="bibr">[25]</ref>.</p><p>Comparison with global asymptotic analysis. As mentioned in Sect. 
3, our convergence analysis framework resembles the approach taken in <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">35]</ref> for proving asymptotic linear convergence rates of classical quasi-Newton methods such as BFGS and DFP. While these works considered inexact line search schemes and thus are different from our exact line search setting, they used a similar inequality as <ref type="bibr">(16)</ref> in Proposition 1 to express the convergence rate in terms of the angle &#952;k . Moreover, the authors in <ref type="bibr">[35]</ref> and <ref type="bibr">[9]</ref> analyzed the traces and the determinants of the Hessian approximation matrices {B k } k&#8805;0 separately to lower bound k-1 i=0 cos ( &#952;i ). Later, this process was simplified in <ref type="bibr">[8]</ref> by introducing the potential function &#936; (&#8226;) given in <ref type="bibr">(18)</ref>, combining the trace and determinant together as in our Proposition 2. However, since their main focus is on asymptotic convergence, we note that these previous works only demonstrate that ( k-1 i=0 cos ( &#952;i )) 1/k is lower bounded by a constant, without giving an explicit form. Furthermore, our work builds upon previous analyses by incorporating a weight matrix P, while earlier works correspond to setting P = I . Another notable difference is that we keep the term mk and lower bound the term cos 2 ( &#952;k )/ mk as shown in Proposition 2, whereas previous works relied on a looser bound for mk . These refinements enable us to provide a tighter linear convergence rate for the BFGS method.</p><p>On the other hand, in demonstrating superlinear convergence, our approach deviates significantly from that of <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">35]</ref>. Specifically, the previous works relied on the Dennis-Mor&#233; condition, i.e., lim k&#8594;&#8734;</p><p>= 0, to establish asymptotic superlinear convergence. 
In comparison, we use the same framework outlined in Sect. 3 to establish both linear and superlinear convergence rates. The key distinction lies in the choice of the weight matrix P: we choose P = L I for showing linear convergence and P = &#8711; 2 f (x * ) for showing superlinear convergence. Thus, we provide a unified framework for studying the global non-asymptotic convergence of BFGS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Numerical experiments</head><p>In this section, we present our numerical experiments to corroborate our convergence rate guarantees, and in particular, we explore the difference between the convergence paths of BFGS under different initializations of B 0 . We further compare these variants of BFGS implementations with the gradient descent algorithm when deployed with exact line search. In our numerical experiments, all the step sizes used in BFGS with different B 0 and gradient descent are computed by the exact line search condition defined in <ref type="bibr">(9)</ref>. Specifically, we use MATLAB's "fminsearch" function from its optimization toolbox to determine the exact line search step size for all algorithms. In our experiments, all initial points are chosen as random vectors in the corresponding Euclidean vector spaces.</p><p>We focus on a hard cubic objective function defined in [44, Sect. 5], i.e.,</p><p>and g : R &#8594; R is defined as</p><p>Fig. <ref type="figure">1</ref> Convergence rates of BFGS with different B 0 and gradient descent for solving the hard cubic objective function when condition number and dimension is varied where &#945;, &#946;, &#955;, &#916; &#8712; R are hyper-parameters and {v i } n i=1 are standard orthogonal unit vectors in R d . This hard cubic function is used to establish a lower bound for secondorder methods.</p><p>In Fig. <ref type="figure">1</ref>, we compare the gradient descent method and BFGS with different initialization of B 0 : B 0 = L I , B 0 = &#956;I , B 0 = 10L I , B 0 = 0.1&#956;I , B 0 = &#8730; L&#956;I and B 0 = cI where c = s y s 2 . Here, s = x 2x 1 , y = &#8711; f (x 2 ) -&#8711; f (x 1 ), and x 1 , x 2 as two randomly selected vectors. Note that c &#8712; [&#956;, L] and the choice of B 0 = cI is the most commonly used initial Hessian approximation matrix in practice <ref type="bibr">[33]</ref>. In (a), (b), and (c) of Fig. 
<ref type="figure">1</ref>, we vary the problem's dimension while keeping the condition number fixed at 1,000. Conversely, in (d), (e), and (f), we fix the problem's dimension at 600 and vary the condition number.</p><p>Several observations are in order.</p><p>-BFGS with B 0 = L I initially converges faster than BFGS with B 0 = &#956;I in most plots, aligning with our theoretical findings that the linear convergence rate of BFGS with B 0 = L I surpasses that of B 0 = &#956;I . -The transition to superlinear convergence for BFGS with B 0 = &#956;I typically occurs around k &#8776; d, as predicted by our theoretical analysis. Interestingly, this transition does not always coincide with the iterates approaching the solution's local neighborhood; in many cases, it occurs for BFGS with B 0 = &#956;I even when its error is larger than that of gradient descent. -Although BFGS with B 0 = L I initially converges faster, its transition to superlinear convergence consistently occurs later than for B 0 = &#956;I . Notably, for a fixed dimension d = 600, the transition to superlinear convergence for B 0 = L I occurs increasingly later as the problem condition number rises, an effect not observed for B 0 = &#956;I . This phenomenon indicates that the onset of the superlinear phase for B 0 = L I is increasingly delayed as the condition number grows.</p><p>Additionally, to analyze the sensitivity of BFGS to different line-search schemes, we compare its performance when B 0 = cI under three distinct line-search strategies, as shown in Fig. <ref type="figure">2</ref>.</p><p>The first approach is the Exact Line Search, which is the primary focus of our paper. It is implemented using MATLAB's "fminsearch" function.</p><p>The second approach is the Inexact Line Search, where the step size is determined by enforcing the well-known strong Wolfe conditions:</p><p>|&#8711; f (x t + &#951; t d t ) d t | &#8804; &#946;|&#8711; f (x t ) d t |.
(64)</p><p>Tr( Bk+1 ) = Tr( Bk ) &#8722; &#8214; Bk &#349;k &#8214;&#178; / ( &#349; k Bk &#349;k ) + &#8214; &#375;k &#8214;&#178; / ( &#349; k &#375;k ), (65)</p><p>Taking the trace on both sides of the equation in <ref type="bibr">(12)</ref> and using the fact that Tr(ab ) = a b for any vectors a and b, we obtain the equality in (65). For the proof of (66), we refer the reader to <ref type="bibr">[38,</ref><ref type="bibr">Lemma 6.2]</ref>. Taking the logarithm on both sides of the above equation, we obtain that log( &#349; k &#375;k / ( &#349; k Bk &#349;k ) ) = log Det( Bk+1 ) &#8722; log Det( Bk ).</p><p>Recalling the definition of mk and combining the above relations, we obtain the claimed bound, where the last inequality holds since log x &#8804; x &#8722; 1 for any x &gt; 0. Hence, <ref type="bibr">(19)</ref> follows from the above inequality. Finally, the result in <ref type="bibr">(20)</ref> follows from summing both sides of <ref type="bibr">(19)</ref> over the iterations, where the last inequality holds since &#936; ( Bk ) &#8805; 0 for any k &#8805; 0.</p></div>
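The potential function &#936; (&#8226;) from (18) is used throughout the analysis, but its displayed definition was lost in extraction. Assuming the Byrd-Nocedal-type potential that the comparison in Sect. 6 suggests (this is our reconstruction, not a quotation of (18)), it would read:

```latex
% Assumed Byrd--Nocedal-type potential for B \in S^d_{++}:
\Psi(B) = \operatorname{Tr}(B) - \log \operatorname{Det}(B) - d \;\ge\; 0,
% with equality iff B = I, since t - \log t - 1 \ge 0 for all t > 0
% applied to each eigenvalue of B.
```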
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Lemma 2</head><p>(a) Recall that J k = 1 0 &#8711; 2 f (x k + &#964; (x k+1x k ))d&#964; . Using the triangle inequality, we have</p><p>Moreover, it follows from Assumption 3 that &#8711; 2 f (x * ) -&#8711; 2 f (x k + &#964; (x k+1x k )) &#8804; M (1&#964; )(x *x k ) + &#964; (x *x k+1 ) for any &#964; &#8712; [0, 1]. Thus, we can further apply the triangle inequality to obtain Since f is strongly convex, by Assumption 1 and f (x k+1 ) &#8804; f (x k ), we have &#956; 2 x kx * 2 &#8804; f (x k )f (x * ), which implies that x kx * &#8804; &#8730; 2( f (x k )f (x * ))/&#956;. Similarly, since f (x k+1 ) &#8804; f (x k ), it also holds that</p><p>Moreover, notice that by Assumption 1, we also have J k &#956;I and &#8711; 2 f (x * ) &#956;I . Hence, (67) implies that where we used the definition of C k in <ref type="bibr">(23)</ref>. By rearranging the terms, we obtain <ref type="bibr">(24)</ref>. (b) Recall that G k = 1 0 &#8711; 2 f (x k + &#964; (x *x k ))d&#964; . Similar to the arguments in (a), we have</p><p>Moreover, notice that by Assumption 1 we also have G k &#956;I and &#8711; 2 f (x * ) &#956;I . The rest follows similarly as in the proof of (a) and we prove <ref type="bibr">(25)</ref>. (c) For any &#964; &#8712; [0, 1], we have</p><p>Together with (67), it follows from the triangle inequality that</p><p>Moreover, notice that by Assumption 1, we also have &#8711; 2 f (x k + &#964; (x k+1x k )) &#956;I and J k &#956;I . The rest follows similarly as in the proof of (a) and we prove <ref type="bibr">(26)</ref>.</p><p>(d) For any &#964; &#8712; [0, 1], we have that</p><p>Together with (68), it follows form the triangle inequality that</p><p>Moreover, notice that by Assumption 1, we also have &#8711; 2 f (x k + &#964; (x *x k )) &#956;I and G k &#956;I . The rest follows similarly as in the proof of (a) and we prove <ref type="bibr">(27)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Approximate Exact Line Search Algorithm</head><p>The approximate exact line search is implemented using the bisection scheme of Algorithm 1, with a prescribed tolerance as the approximation error. The key idea is to select &#951; such that &#8711; f (x t + &#951;d t ) d t &#8776; 0, since the exact line search step size &#951; exact satisfies the condition &#8711; f (x t + &#951; exact d t ) d t = 0. We begin with an initial step size of &#951; = 1 and iteratively double it until &#8711; f (x t + &#951;d t ) d t &gt; 0. Once this condition is met, we apply the bisection algorithm, leveraging the sign of &#8711; f (x t + &#951;d t ) d t to refine &#951;. The bisection algorithm is well-suited for this task because the function h(&#951;) = &#8711; f (x t + &#951;d t ) d t is strictly increasing. This follows from the strong convexity of the objective function f , which ensures that h'(&#951;) = d t &#8711; 2 f (x t + &#951;d t )d t &gt; 0. Additionally, we note that h(0) = &#8711; f (x t ) d t = -g t B -1 t g t &lt; 0, since B t is symmetric positive definite, and that h(&#951; exact ) = 0 with &#951; exact &gt; 0. In our experiments, we set the required accuracy of this scheme to 10^-8, and we observe that the bisection method converges after 15 iterations on average.</p></div>
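The doubling-then-bisection scheme described above can be sketched as follows. This is a hedged Python sketch under our own naming and defaults (Algorithm 1 itself is not reproduced in this excerpt); the sign test, doubling loop, and tolerance-based stopping rule follow the text, and the toy quadratic check at the end compares against the closed-form exact step:

```python
import numpy as np

def exact_line_search(grad, x, d, eps=1e-8, max_iter=100):
    """Bisection on h(eta) = grad(x + eta*d) . d, assuming d is a descent
    direction (h(0) < 0) and f is strongly convex, so h is strictly
    increasing with a unique root eta_exact > 0."""
    h = lambda eta: grad(x + eta * d) @ d
    hi = 1.0
    while h(hi) <= 0:            # double the step until the sign flips
        hi *= 2.0
    lo = 0.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if abs(h(mid)) < eps:    # tolerance on the directional derivative
            return mid
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy check on a quadratic, where the exact step has a closed form.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])
d0 = -grad(x0)
eta = exact_line_search(grad, x0, d0)
eta_closed = -(grad(x0) @ d0) / (d0 @ A @ d0)
assert abs(eta - eta_closed) < 1e-4
```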
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>The MATLAB implementation used is available at https://www.cs.umd.edu/users/oleary/software/.</p></note>
		</body>
		</text>
</TEI>
