<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Provable convergence guarantees for black-box variational inference</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Neural Information Processing Systems</publisher>
				<date>12/06/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10536439</idno>
					<idno type="doi"></idno>
					
					<author>Justin Domke</author><author>Guillaume Garrigos</author><author>Robert Gower</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Black-box variational inference is widely used in situations where there is no proof that its stochastic optimization succeeds. We suggest this is due to a theoretical gap in existing stochastic optimization proofs-namely the challenge of gradient estimators with unusual noise bounds, and a composite non-smooth objective. For dense Gaussian variational families, we observe that existing gradient estimators based on reparameterization satisfy a quadratic noise bound and give novel convergence guarantees for proximal and projected stochastic gradient descent using this bound. This provides rigorous guarantees that methods similar to those used in practice converge on realistic inference problems.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Variational inference tries to approximate a complex target distribution with a distribution from a simpler family <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4]</ref>. If p(z, x) is a target distribution with latent variables z &#8712; R d and observed data x &#8712; R n , and q w is a variational family with parameters w W &#8712; , the goal is to minimize the negative evidence lower bound (ELBO)</p><p>.</p><p>(</p><p>For subsequent discussion, we have decomposed the objective into the free energy l and the negative entropy h . Minimizing f is equivalent to minimizing the KL-divergence between q w to p(&#8226;|x), because KL (qw &#8741;p(&#8226;|x)) = f (w) + log p(x) . Recent research has focused on "black box" variational inference, where the target distribution p is sufficiently complex that one can only access it through evaluating probabilities (or gradients) at chosen points <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11]</ref>. Crucially, one can still get stochastic estimates of the variational objective f and of its gradient, and use these to optimize.</p><p>Still, variational inference can sometimes be unreliable <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>, and some basic questions remain unanswered. Most notably: does stochastic optimization of f converge to a minimum of the objective? There has been various progress towards answering this question. One line of research seeks to determine if the variational objective f has favorable structural properties, such as smoothness or (strong) convexity <ref type="bibr">[13,</ref><ref type="bibr">16,</ref><ref type="bibr">17]</ref> (Sec. 2.1). Another line seeks to control the "noise" (variance or expected squared norm) of different gradient estimators <ref type="bibr">[18,</ref><ref type="bibr">19,</ref><ref type="bibr">20,</ref><ref type="bibr">21]</ref> (Sec. 2.2). However, few full convergence guarantees are known. That is, there are few known cases where applying a given stochastic optimization algorithm to a given target distribution is known to converge at a given rate.</p><p>We identify two fundamental barriers preventing from analysing this VI problem as a standard stochastic optimization problem. First, the gradient noise depends on the parameters w in a nonstandard way (Sec. 2.3). This adds great technical complexity and renders many traditional stochastic optimization proofs inapplicable. Second, stochastic optimization theory typically requires that the (exact) objective function f is Lipschitz continuous or Lipschitz smooth. But in our VI setting, under some fairly benign assumptions, the ELBO is neither Lipschitz continuous nor Lipschitz smooth.</p><p>We obtain non-asymptotic convergence guarantees for this problem, under simple assumptions.</p><p>Central contributions (Informal). Suppose that the target modellog p(&#8226;, x)is concave and Lipschitz-smooth, and that q w is in a Gaussian variational family parameterized by the mean and a factor of the covariance matrix (Eq. 2). Consider minimizing the negative ELBO f using either one of the two following algorithms:</p><p>&#8226; a proximal stochastic gradient method, with the proximal step applied to h and the gradient step applied to l , estimating &#8711;l(w) with a standard reparameterization gradient estimator (Eq. 5), and using a triangular covariance factor;</p><p>&#8226; a projected stochastic gradient method, with the gradient applied to f = l + h , estimating &#8711;f(w) using either of two common gradient estimators (Eq. 7 or Eq. 9), with the projection done over symmetric and non-degenerate (Eq. 3) covariance factors</p><p>Then, both algorithms converge with a 1/ &#8730; T complexity rate (Cor. 12), or 1/T if we further assume that log p(&#8226;, x)is strongly concave (Cor. 13).</p><p>We also give a new bound on the noise of the "sticking the landing" gradient estimator, which leads to faster convergence when the target distribution p is closer to Gaussian, up to exponentially fast convergence when it is exactly Gaussian (Cor. <ref type="bibr">14)</ref>.</p><p>This is achieved through a series of steps, that we summarize below.</p><p>1. We analyze the structural properties of the problem. Existing results show that with a Gaussian variational family, if -log p(&#8226;, x) is (strongly) convex or Lipschitz smooth, then so is the free energy l . This is for instance known to be the case for some generalized linear models, and we give a new proof of convexity and smoothness for some hierarchical models including hierarchical logistic regression (see <ref type="bibr">Appendix 7.3)</ref>. The remaining component of the ELBO, the neg-entropy h , is convex when restricted to an appropriate set. It is not smooth, but it was recently proved to be smooth over a certain non-degeneracy set.</p><p>2. We study the noise of three common gradient estimators. They do not satisfy usual noise bounds, but we show that they all satisfy a new quadratic bound (Definition 5). For the sticking-thelanding estimator, our bound formalizes the longstanding intuition that it should have lower noise when the variational approximation is strong (Thm. 4).</p><p>3. We identify and solve the key optimization challenges posed by the above issues via new convergence results for the proximal and projected stochastic gradient methods, when applied to objectives that are smooth (but not uniformly smooth) and with gradient estimators satisfying our quadratic bound.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Related work</head><p>Recently, Xu and Campbell <ref type="bibr">[22]</ref> analyzed projected-SGD (stochastic gradient descent) for Gaussian VI using the gradient estimator we will later call g ent <ref type="bibr">(7)</ref>, with a particular rescaling. They show that, asymptotically in the number of observed data, their method converges locally with a rate of1/ &#8730; T , under mild assumptions. Our results are less general in requiring convexity, but are non-asymptotic, apply with other gradient estimators, and give a faster 1/T rate with strong convexity.</p><p>Lambert et al. <ref type="bibr">[23]</ref> introduce a VI-like SGD algorithm, derived from a discretisation of a Wasserstein gradient flow, and show it converges at a 1/T rate for the 2-Wasserstein distance when the log posterior is smooth and strongly concave. This line was continued by Diao et al. <ref type="bibr">[24]</ref>, who propose a proximal-SGD method based on the decomposition f = l + h . They obtain a 1/ &#8730; T rate when the log posterior is smooth and convex, and a1/T rate when it is smooth and strongly convex. Unlike typical black-box VI algorithms used in practice, these algorithms require computing the Hessian of the posterior. We analyze more straighforward applications of SGD to the VI problem using standard gradient estimators. Under the same assumptions, our algorithms have the same rates for KL-divergence, which imply the same rates for 2-Wasserstein distance by of Pinsker's inequality and that the total-variation norm upper-bounds Wasserstein distance <ref type="bibr">[25,</ref><ref type="bibr">Remark 8.2]</ref>.</p><p>In concurrent work, Kim et al. <ref type="bibr">[26]</ref>, consider a proximal-SGD method similar to our approach in Sec. 6. They obtain a 1/T convergence rate, similar to what Cor. 12 gives when the log posterior is strongly concave. They also consider alternative parametrizations of the scale parameters that render the ELBO globally smooth, and obtain a nonconvex result: Under a relaxed version of the Polyak-Lojasiewicz inequality, they can guarantee a1/T 4 rate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Properties of Variational Inference (VI) problems</head><p>Traditionally, the ELBO was optimized using message passing methods, which essentially assume that p and q are simple enough that block-coordinate updates are possible <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">4]</ref>. However, in the last decade a series of papers developed algorithms based on a "black-box" model where p is assumed to be complex enough that one can only evaluatelog p(or its gradient) at selected points z . The key observation is that even if p is quite complex, it is still possible to compute stochastic estimates of the gradient of the ELBO, which can be deployed in a stochastic optimization algorithm <ref type="bibr">[ 5,</ref><ref type="bibr">6,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">10]</ref>.</p><p>This paper seeks rigorous convergence guarantees for this problem. We study the setting where the variational family is the set of (dense) multivariate Gaussian distributions, parameterized in terms of w = (m, C) , where m is the mean and C is a factor of the covariance, i.e.</p><p>We optimize over w W &#8712; , where W = {(m, C) : C &#8827; 0} and will further require C to be either symmetric or triangular. This does not really change the problem, since for a given covariance matrix &#931; there always exists a symmetric (or lower triangular) positive definite factor C such that CC &#8868; = &#931; . Now, is it possible to solve this optimization problem? Without further assumptions, it is unlikely any guarantee is possible, because it is easy to encode NP-hard problems into this VI framework <ref type="bibr">[ 27,</ref><ref type="bibr">28]</ref>. We discuss below the assumptions we will make to be able to solve the problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Structural properties and assumptions</head><p>The properties of the free energy l depend on the properties of the target distribution p . It is necessary to assume that p is somehow "nice" to ensure the problem can be solved. Titsias and L&#225;zaro-Gredilla [17, <ref type="bibr">Proposition 1]</ref> showed that if -log p(&#8226;, x) is convex, then l is convex too. Challis and Barber <ref type="bibr">[16,</ref><ref type="bibr">Sec 3.2]</ref> showed that if the likelihoodp(&#8226;|z) is convex and the priorp(z) is Gaussian, then l is strongly-convex. Domke <ref type="bibr">[13,</ref><ref type="bibr">Theorem 9]</ref> showed that if -log p(&#8226;, x) is &#181; -strongly convex, then l is &#181; -strongly convex as well, and that the constant is sharp. Similarly, Domke <ref type="bibr">[13,</ref><ref type="bibr">Theorem 1]</ref> showed that if log p(&#8226;, x)is M -smooth, then l is also M -smooth, and that the constant is sharp.</p><p>In this paper we make two assumptions about the target distribution p : the negative log-probability log p(&#8226;, x) must be convex (or strongly convex), and Lipschitz smooth. Section 7.3 (Appendix) gives some example models where these assumptions are satisfied. For example, if the model is Bayesian linear regression, or logistic regression with a Gaussian prior, then -log p(&#8226;, x) is smooth and strongly convex. In addition, if the target is a hierarchical logistic regression model, then log p(&#8226;, x) is smooth and convex.</p><p>Assumptions on p also impact what a minimizer w * = (m * , C * ) of f (w) can look like. Intuitively, if the target log p(&#8226;, x)is &#181; -strongly concave, then we would expect the target distribution to be "peaky". This means that the optimal distribution would be close to a delta function centered at the MAP solution: m * would be close to some maximum of log p(&#8226;, x)noted m, and the covariance factor C * would not be too large. This intuition can be formalized: in this context we have &#8741;C * &#8741; 2 + m &#8741; * -m&#8741; 2  2 &#8804; d/&#181; <ref type="bibr">[13,</ref><ref type="bibr">Theorem 10]</ref>. Similarly, if log p(&#8226;, x)is M -Lipschitz smooth, then we expect that the target is not too concentrated, so we might expect that the optimal covariance cannot not be too small. Formally, it can be shown that the singular values of the covariance factor C * are greater than 1/ &#8730; M [13, <ref type="bibr">Theorem 7]</ref>.</p><p>The properties of the neg-entropy h are inherited from the choice of the variational family and do not depend on p . Since we consider a Gaussian variational family, h is known in closed-form. To avoid some technical issues, we will restrict h so that the covariance factor is positive definite, so h(w) is equal to -ln det C up to a constant if C is positive definite (see Appendix 7.1), and to+&#8734; otherwise . So h inherits the properties of the negative log determinant, meaning it is a proper closed convex function whenever C is symmetric or triangular (see Appendix 7.2). From an optimization perspective, it is natural to use a gradient-based algorithm. But unfortunately h is not Lipschitz smooth on its domain, because its gradient can change arbitrarily quickly when the singular values of C become small. However, it can be shown <ref type="bibr">[13,</ref><ref type="bibr">Lemma 12]</ref> that h is M -smooth over the following set of non-singular parameters (which contains the solution w * , see the previous paragraph) given by</p><p>Instead of computing gradients, an optimizer might want to use the proximal operator of h , which can also be computed in closed form (see Sec. 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Gradient estimators</head><p>This paper considers gradient estimators based on the "path" method or "reparameterization". These assume some base distribution s is known, together with some deterministic transformation T w , such that the distribution of T w (u) is equal to q w when u s &#8764; . Then, we can write</p><p>In the case of multivariate Gaussians, the most common choice is to use s = N (0, I) and T w (u) = Cu + m , which we will consider in the rest of the paper.</p><p>We will consider three different gradient estimators based on the path-type strategy (Eq. 4). We will provide new noise bounds for each of them (all the proofs are in Appendix 8). Our analysis is mostly based on the following general result <ref type="bibr">[20,</ref><ref type="bibr">Thm. 3]</ref>.</p><p>Theorem 1. Let T w (u) = Cu + m for w = (m, C) . Let &#981; : R d &#8594; R be M -smooth, suppose that &#981; is stationary at m, and define w = ( m, 0). Then</p><p>Furthermore, the first inequality cannot be improved.</p><p>Intuitively, this result says that the noise of a gradient estimator is lower when the scale parameter C is small and the mean parameter m is close to a stationary point.</p><p>The first estimator we consider is &#8711;l(w) only. It is given by taking &#981; = -log p into Eq. 4, i.e.</p><p>The next result gives a bound on the noise of this estimator, which is a direct consequence of Theorem 1. The second line uses Young's inequality to bound the noise in terms of (i) the distance of w from w * and (ii) a constant determined by the distance of w * from some fixed parameters w.</p><p>Theorem 2. Suppose that log p(&#8226;, x)is M -smooth and has a maximum (or stationary point) at m, and define w = ( m, 0). Then, for every w and every solution w * of the VI problem,</p><p>Second, we consider an estimator of the gradient of the full objectivel + h. It is obtained by simply taking the above estimator and adding the true (known) gradient of the neg-entropy, i.e.</p><p>The noise of g ent can be bounded since it only differs from g energy by the deterministic quantity &#8711;h(w) , and the fact that-provided w W &#8712; L -the singular values of w cannot be too small and so &#8711;h(w) cannot be too large. This is formalized in the following theorem, where again, the second line relaxes the result into a term based on the distance of w from w * plus constants. (Another type of noise bound <ref type="bibr">[29]</ref> for g ent was obtained by Kim et al. <ref type="bibr">[21]</ref> for diagonal Gaussians and a variety of parameterizations of the covariance.) Theorem 3. Suppose that log p(&#8226;, x)is M -smooth, that it is maximal at m, and define w = ( m, 0). Then, for every L &gt; 0 , for every w W &#8712; L and every solution w * of the VI problem,</p><p>&#8804; 4(d + 3)M<ref type="foot">foot_1</ref> &#8741;w -w * &#8741; 2 + 4(d + 3)M 2 &#8741;w * -w&#8741; 2 + 2dL.</p><p>While g ent used the exact gradient of h , it may be beneficial to use a stochastic estimator h instead.</p><p>To derive our third estimator, write the gradient of the ELBO as<ref type="foot">foot_0</ref> </p><p>where the parameters v serve as a way to "hold w constant under differentiation". This leads to our third estimator, called the "sticking the landing" (STL) gradient estimator</p><p>Intuitively, we expect thatg STL will tend to have lower noise than g ent when the posterior is wellapproximated by the variational distribution. The reason is that, as observed by Roeder et al. <ref type="bibr">[30]</ref>, if q w (z) were a perfect approximation of p(z|x) , then the two terms in Eq. 9 would exactly cancel (for every u ) and so the estimator would have zero variance.</p><p>Below we formalize this intuition in what we believe is the first noise bound on g STL . Theorem 4. Suppose that log p(&#8226;, x) is M -smooth. Consider the residual r(z) := log p(z, x)log qw * (z) for any solution w * of the VI problem, assume that it has a stationary point m, and define &#373; = ( m, 0). Then r is K -smooth for some K &#8712; [0, 2M ] , and for all w W &#8712; M , E &#8741;g STL &#8741; 2 2 &#8804; 8(d + 3)M 2 &#8741;w -w * &#8741; 2 2 + 2(d + 3)K 2 &#8741;w -&#710;w&#8741; 2 2 (10)</p><p>When the target is Gaussian then K = 0 in Eq. 10, meaning that when w = w * , the STL estimator has no variance. This is a distinguishing feature of g STL , as opposed to g energy or g ent .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Challenges for optimization</head><p>Three major issues bar applying existing analysis of stochastic optimization methods to our VI setting: 1) non-smooth composite objective 2) lack of uniform smoothness and 3) lack of uniformly bounded noise of the gradient estimator.</p><p>The first issue is due to the non-smoothness of neg-entropy function h . This means that under the benign assumption that the target log p is smooth, the full objective l + h cannot be smooth, since a nonsmooth function plus a smooth function is always nonsmooth. This renders stochastic optimization proofs (e.g. those for the "ABC" conditions <ref type="bibr">[29]</ref>) that do not use projections or proximal operators inapplicable.</p><p>One way to tackle this first issue would be to use a non-smooth proof technique, but these rely on having a uniform gradient noise bound (our third issue). Alternatively, one can overcome the non-smoothness of the neg-entropy function by either using a proximal operator, or projecting onto a set where h is smooth, namely W M . This is the strategy we pursue.</p><p>Table 1: Table of known black-box VI properties relevant to optimization. In some cases there could be multiple valid w * or w in which case these results hold for all simultaneously.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description Definition</head><p>Estimator for &#8711;l g energy = -&#8711; w log p(Tw (u), x)</p><p>Stationary point of l w = (&#175;z, 0) for any z that is a stationary point of log p(&#8226;, x)</p><p>g ent , and g STL are quadratically bounded The second issue is that existing analyses for proximal/projected stochastic methods either rely on a uniform noise bound ([31, Cor. 3.6]) or uniform smoothness [32, 33, 34, 35, 36], both of which are not known to be true for VI. By uniform smoothness, we refer to the assumption that log p(Tw (u), x) is M -smooth for every u , with M being independent of u . Instead, we can only guarantee that log p(Tw (u), x) is smooth in expectation, i.e. that l(w) is smooth. Several works [9, Cond. 1][37, Thm. 1][38, Sec. 4][39, Assumption A1][40, Thm. 1][41, Assumption 3.2] assumed that the full objective l + h is smooth, but we are not aware of cases where this holds in practice for VI and in our parameterization (Eq. 2) this cannot be true if log pis smooth.</p><p>The third issue is the lack of a uniform noise bound. Most non-smooth convergence guarantees within stochastic optimization <ref type="bibr">[42]</ref> assume that the noise of the gradient estimator is uniformly bounded by a constant. But this does not appear to be true even under favorable assumptions-e.g. it is untrue if the target distribution is a standard Gaussian. The best that one can hope is that the gradient noise can be bounded by a quadratic that depends on the current parameters w , and-depending on the estimator-even this may only be true when the parameters are in the set W M . Thus, our main optimization contribution is to provide theoretical guarantees for stochastic algorithms under such a specific noise regime, which we make precise in the next definition.</p><p>Definition 5. Let &#981; be a differentiable function. We say that a random vectorg is a quadratically bounded estimator for &#8711;&#981; at w with parameters (a, b, w * ), if it is unbiased E [g] = &#981; &#8711; (w) and if the expected squared norm is bounded by a quadratic function of the distance of parameters w to w * , i.e. E &#8741;g&#8741; 2 2 &#8804; a w -w &#8741; * &#8741; 2 2 + b. The noise bounds derived in Section 2.2 for the estimatorsg energy , g ent , and g STL imply that they are uniformly quadratically bounded estimators. See Appendix 8.2 for a table with the corresponding constants a and b .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Stochastic Optimization with quadratically bounded estimators</head><p>In this section we give new convergence guarantees for the Prox-SGD algorithm and the Proj-SGD algorithm with quadratically bounded gradident estimators. Because these may be of independent interest, they are presented generically, without any reference to the VI setting. We specialize these results to VI in Section 4. For both algorithms, we present results assuming the problem to be strongly convex or just convex. All proofs for this section can be found in the Appendix (Sec. 9).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Stochastic Proximal Gradient Descent</head><p>Here we want to minimize a function which is the sum of two terms l + h , were both l and h are proper closed convex functions, and l is smooth. For this we will use stochastic proximal gradient.</p><p>Definition 6 (Prox-SGD). Let w 0 be a fixed initial parameters and let &#947; 1 , &#947; 2 , &#8226; &#8226; &#8226; be a sequence of step sizes. The stochastic proximal gradient (Prox-SGD) method is given by</p><p>where g t is a gradient estimator for &#8711;l(w t ), and the proximal operator is defined as</p><p>Theorem 7. Let l be a &#181; -strongly convex and M -smooth function, and let w = argmin(l) . Let h be a proper closed convex function, and let w * = argmin(l + h) . Let (w t ) t&#8712;N be generated by the Prox-SGD algorithm, with a constant stepsize &#947; &#8712; 0, min{ &#181; 2a</p><p>, 1 &#181; } . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;l with parameters (a, b, w * ). Then,</p><p>Alternatively, if we use the decaying stepsize</p><p>In both cases, T = O(&#1013; -1 ) iterations are sufficient to guarantee thatE w &#8741; T -w * &#8741; 2 &#8804; &#1013;.</p><p>The above theorem gives an anytime 1/T rate of convergence when the stepsizes are chosen based on the strong convexity and gradient noise constants &#181; and a . Note that we do not need to know precisely those constants: We show in Appendix 9.3 that using any constant step size proportional to T / log T leads to a log T /T rate.</p><p>Theorem 8. Let l be a proper convex and M -smooth function. Let h be a proper closed convex function, and let w * &#8712; argmin(l + h) . Let (w t ) t&#8712;N be generated by the Prox-SGD algorithm, with a constant stepsize &#947; &#8712; (0, 1 M ]. Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;l with parameters (a, b, w * ). Then,</p><p>where &#952; def = 1 1 + 2a&#947; 2 and wT def = P T t=1 &#952; t+1 w t P T t=1 &#952; t+1</p><p>. In particular, if &#947; = 1 &#8730; aT</p><p>, then</p><p>Thus, T = O(&#1013; -2 ) iterations are sufficient to guarantee thatE f ( wT ) -inf f &#8804; &#1013;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Stochastic Projected Gradient Descent</head><p>Here we want to minimize a function f over a set of constraints W , where W is a nonempty closed convex set and f is a proper closed convex function which is differentiable on W . To solve this problem, we will consider the stochastic projected gradient algorithm.</p><p>Definition 9 (Proj-SGD). Let w 0 be some fixed initial parameter, and let &#947; 1 , &#947; 2 , &#8226; &#8226; &#8226; be a sequence of step sizes. The projected stochastic gradient (Proj-SGD) method is given by</p><p>where g t is a gradient estimator for &#8711;f(w t ), and the projection operator is defined as</p><p>Note that we do not require f to be smooth, meaning that this setting is not a particular case of the one considered in Section 3.1. In fact, the arguments we use in the proofs for Proj-SGD are more closely related to stochastic subgradient methods than stochastic gradient methods. Theorem 10. Let W be a nonempty closed convex set. Let f be a &#181; -strongly convex function, differentiable on W . Let w * = argmin W (f ) . Let (w t ) t&#8712;N be generated by the Proj-SGD algorithm, with a constant stepsize &#947; &#8712; 0, min{ &#181; 2a</p><p>, 2 &#181; } . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;f with parameters (a, b, w * ). Then,</p><p>Alternatively, if we use the decaying stepsize</p><p>In both cases, T = O(&#1013; -1 ) iterations are sufficient to guarantee thatE w</p><p>Note that Eq. 13 is a sum of two terms: one that decays exponentially in T and one that decreases only when the stepsize &#947; is sufficiently small. This has an important consequence: if one uses the gradient estimator g STL and the target distribution is exactly a Gaussian, thenb = 0 (Appendix 8.2), meaning that the algorithm will converge at an exponential rate. This is similar to many results in the stochastic optimization literature showing faster rates hold when interpolation holds <ref type="bibr">[35]</ref>. Theorem 11. Let W be a nonempty closed convex set. Let f be a convex function, differentiable on W . Let w * &#8712; argmin W (f ) . Let (w t ) t&#8712;N be generated by the Proj-SGD algorithm, with a constant stepsize &#947; &#8712; (0, +&#8734;) . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;f at w t with constant parameters (a, b, w * ). Then,</p><p>where</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Solving VI with provable guarantees</head><p>We now specialize the optimization results of the previous section to our VI setting. We aim to minimize the negative ELBO, f : R</p><p>and where V is the vector space of matrices in which the covariance factors C belong. Under the assumption that the log-target log p(&#8226;, x)is M -smooth and concave, we propose two strategies to minimize this objective. All the proofs for this section can be found in Appendix 10.</p><p>1. Apply the Prox-SGD algorithm to the sum l + h , using a proximal step with respect to h , and a stochastic gradient step with respect to l . The gradient of l will be estimated with g energy <ref type="bibr">(5)</ref>, which is globally quadratically bounded. The prox of h admits a closed form formula if we choose that V is the space of lower triangular matrices. 2. Apply the Proj-SGD algorithm to minimize f over W M . The gradient of f will be estimated using either g ent <ref type="bibr">(7)</ref> or g STL <ref type="bibr">(9)</ref>, both of which are quadratically bounded on W M . The projection onto W M admits a closed form formula if choose that V is the space of symmetric matrices.</p><p>Corollary 12 (Prox-SGD for VI). Consider the VI problem where q w is a multivariate Gaussian distribution (Eq. 2) with parameters w = (m, C) &#8712; R d &#215; T d , and assume that this problem admits a solution w * . Suppose that log p(&#8226;, x)is M -smooth and concave (resp. &#181; -strongly concave). Generate a sequence w t by using the Prox-SGD algorithm (Def. 6) applied to l and h (Eq. 16), using g energy <ref type="bibr">(5)</ref> as an estimator of &#8711;l . Let the stepsizes &#947; t be constant and equal to 1/( p a energy T ) (resp.</p><p>be decaying as in Theorem 7 with a energy = 2(d + 3)M 2 ). Then, for a certain average wT of the iterates, we have for</p><p>Corollary 13 (Proj-SGD for VI). Consider the VI problem where q w is a multivariate Gaussian distribution (Eq. 2) with parameters w = (m, C) &#8712; R d &#215; S d , and assume that this problem admits a solution w * . Suppose that log p(&#8226;, x)is M -smooth and concave (resp. &#181; -strongly concave). Generate a sequence w t by using the Proj-SGD algorithm (Def. 9) applied to the function f = l+h (Eq. 16) and the constraint W M (Eq. 3), using g ent <ref type="bibr">(7)</ref> or g STL <ref type="bibr">(9)</ref> as an estimator of &#8711;f . Let the stepsizes &#947; t be constant and equal to p 2/(aT ) (resp. be decaying as in <ref type="bibr">Theorem 10)</ref> with a = a ent = 4(d + 3)M 2</p><p>or a = a STL = 24(d + 3)M 2 . Then, for a certain average wT of the iterates, we have for</p><p>Corollary 14 (Proj-SGD for VI -Gaussian target). Consider the setting of Corollary 13, in the scenario that log p(&#8226;, x)is &#181; -strongly concave, that we use the g STL estimator, and that we take a constant stepsize &#947; t &#8801; &#947; &#8712; 0, min{</p><p>Let us now discuss the practical implementation of these two algorithms (more details and an explicit implementation of the methods are given in Appendix 10.2). These require computing either the proximal operator of the neg-entropy h , or the projection onto the set of non-degenerate covariance factors W M . These can be computed <ref type="bibr">[13]</ref> for every w = (m, C) as:</p><p>Our theory for the Prox-SGD algorithm formally assumes that the covariance factor C lives in the vector space of lower-triangular matrices. This can be implemented by letting C &#8712; R d&#215;d and "clamping" the upper-triangular entries to zero throughout computation. Then the gradient g energy (5) can be computed using automatic differentiation, and the proximal operator can be computed as in the above equation. Or, one may choosew = (m, c) where c &#8712; R d&#215;(d+1)/2 is the lower-triangular entries of the matrix so C = tril(c) . Gradients with respect to w = (m, c) can again be estimated using automatic differentiation (including the tril operator), and the proximal operator can be computed by forming C , using the above formula, and then extracting the lower-triangular entries.</p><p>Similarly, our theory for the Proj-SGD algorithm formally assumes the covariance factor C lives in the vector space of symmetric matrices. This can be implemented by letting C &#8712; R d&#215;d , computing the gradient estimator g ent <ref type="bibr">(7)</ref> or g STL (9) over the reals using automatic differentiation, and "symmetrizing"</p><p>C &#8868; after applying each gradient update, right before projection. Or, one may choosew = (m, c) where c &#8712; R d&#215;(d+1)/2 is the lower-triangular entries of the matrix, so C = symm(c) , where symm is a function similar to tril except creating symmetric matrices. The gradient estimator g ent <ref type="bibr">(7)</ref> or g STL (9) can be computed using automatic differentiation (including the symm operator). In this case, to project one would first form C , then use the above formula, and then extract the lower-triangular matrices.</p><p>In terms of cost, computing log qw (z) or T w (u) needs &#920;(d 2 ) operations. Computing the prox of h requires d operations, but in contrast projecting onto W M requires diagonalizing a symmetric matrix, which takes &#920;(d 3 ) operations. We note that this is not necessarily an issue, because 1) eigenvalue decomposition has excellent computational constants in moderate dimensions; 2) computing the target log p(z, x) may itself require &#920;(d 3 ) or more operations; 3) it is common to average a "minibatch" of gradient estimates, meaning each eigenvalue decomposition computation is amortized over many ELBO evaluations. Still, in high dimensions, when log p(z, x) is inexpensive, and small minibatches are used, computing such decomposition in each iteration could become a computational bottleneck. In this scenario, the Proj-SGD algorithm is clearly cheaper, with its &#920;(d) cost for the proximal step.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>While this paper has focused on convergence with dense Gaussian variational distributions, the results in Sec. 3 apply more generally. It is conceivable that the necessary smoothness, convexity, and noise bounds could be established for other variational families, in which case these optimization results would provide "plug in" guarantees. We suspect this might be fairly easy to do with, e.g., elliptical distributions or location-scale families, using similar techniques as done with Gaussians. But it may be difficult to do so for broader variational families.</p><p>If -log p is strongly convex and smooth, another proof strategy is possible: Smoothness guarantees that w * &#8712; W M and strong convexity guarantees that w * &#8712; {w : w -&#8741; &#175;w&#8741; 2 2 &#8804; d/&#181;} . So one could do projected SGD, projecting onto the intersection of these two sets. The negative ELBO would be strongly convex and smooth over that set, and since &#8741;w -&#175;w&#8741; 2 is bounded, gradient noise is upperbounded by a constant, meaning classical projected SGD convergence rates that assume smoothness, strong convexity, and uniform gradient noise apply. However, projecting onto the intersection of those sets poses a difficulty and this uniform noise bound may be loose in practice.</p><p>Several variants of the gradient estimators we consider have been proposed to reduce noise, often based on control variates or sampling from a different base distribution <ref type="bibr">[30,</ref><ref type="bibr">37,</ref><ref type="bibr">43,</ref><ref type="bibr">44,</ref><ref type="bibr">45,</ref><ref type="bibr">46,</ref><ref type="bibr">47,</ref><ref type="bibr">48,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref>. It would be interesting to establish gradient noise bounds for these estimators.</p><p>Various VI variants have been proposed that use natural gradient descent. It would also be interesting to seek convergence guarantees for these algorithms.</p><p>VI can be done with "score" or "reinforce" gradient estimators <ref type="bibr">[6,</ref><ref type="bibr">8]</ref>, which use the identity</p><p>in place of how we used Eq. 4 for the "path" estimators we considered. While path estimators often seem to have lower variance, this is not always true: Take an unnormalized log-posterior&#981;(z) where z is scalar. Now, define &#981; &#8242; (z) = &#981;(z) + &#8730; &#1013; sin(z/&#1013;), where &#1013; is very small. Then, &#981; and &#981; &#8242; represent almost the same posterior, and score estimators for them will have almost the same variance. Yet, the derivative of the added term is cos(z/&#1013;)/ &#8730; &#1013; , so a path estimator for &#981; &#8242; will have much higher variance than one for &#981; . This underlines why smoothness is essential for our guarantees-it excludes posteriors like &#981; &#8242; . Future work further unravel when score estimators perform better than path estimators.</p><p>One attractive feature of VI is that data subsampling can often be used when computing gradients, decreasing the cost of each iteration. While we have not explicitly addressed this, our proofs can be easily generalized, at least for target distributions of the form p(z, x) = p(z) Q n=1 p(xn |z). The only issue is to bound the noise of the subsampled estimator. While our gradient variance guarantees use Theorem 1 from <ref type="bibr">[20]</ref>, a more general result <ref type="bibr">[20,</ref><ref type="bibr">Theorem 6</ref>] considers data subsampling. Very roughly speaking, with uniform data subsampling of 1 datum of a time, the gradient noise bounds in Thms. 2 and 3 would increase by a factor between 1 (no increase) and the number of data, depending on how correlated the data are-less correlation leads to a larger increase. (More precisely, the second term in Thm. 3 would not increase.) These increases would manifest as larger constants a and b in the quadratic bounds, after which exactly the same results hold. However, it is not obvious if the bound for g STL in Thm. 4 can be generalized in this way. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Notations</head><p>We note M d the space of d&#215;d matrices, S d the subspace of symmetric matrices, and T d the subspace of lower triangular matrices. We use</p><p>} to denote the set of parameters with positive definite covariance matrix whose singular values are greater or equal to</p><p>variance of vector-valued random variables</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">VI problems and their structural properties</head><p>This section contains the proofs of all the claims made in Section 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Modeling the VI problem</head><p>We recall and detail here the setup of our problem. Given a target distribution p and observed data x , we want to solve the following Variational Inference problem min</p><p>where q w is a multivariate Gaussian variational family, whose density may be written as</p><p>We make the assumption (VI) has a non-degenerated solution, i.e. a solution for which the covariance matrix is invertible.</p><p>We will impose that our parameters have the form w = (m, C) &#8712; R d &#215; V , where V is some vector subspace of M d , the space of d &#215; d matrices. On top of that, we will also want the covariance factor to be positive definite. In other words, we will impose that our parameters are taken from the set</p><p>For the sake of simplicity, the reader can assume for most of the paper that V = M d . But when coming to analyze our algorithms, we will see that we should take either V = T d (the subspace of d &#215; d lower triangular matrices) or V = S d (the subspace of d &#215; d symmetric matrices). This choice is dictated by two reasons: it guarantees that our problem remains convex (see <ref type="bibr">Lemma 19)</ref>, and it lowers the computational cost of our algorithms (see Section 10.2). Fortunately, requiring C to be symmetric (or triangular) and positive definite does not change our problem, as stated in the next Lemma. Lemma 15 (Equivalent problems for triangular/symmetric covariance factors).Suppose that w &#8712; R d &#215;M d is such that q w minimizes KL (qw &#8741;p(&#8226;|x)) over R d &#215;M d . Then there exists w * = (m * , C * ) such that q w and q w * have the same distribution, with C * being positive definite and symmetric (or lower triangular).</p><p>Proof. Write w = ( m, C) , and take m * = &#175;m. Let &#931; = C C&#8868; . Since we assumed that (VI) has a non-degenerated solution, we can assume without loss of generality that &#931; is invertible. Therefore we can apply the Cholesky decomposition and write &#931; = C * (C * ) &#8868; , where C * is lower triangular and positive definite. We can also define C * = ( &#931;) 1/2 to obtain a symmetric and positive definite factor.</p><p>Once restricted to W , our problem (VI) becomes equivalent to</p><p>where</p><p>This equivalence is mostly standard, but we state it formally for the sake of completeness.</p><p>Lemma 16 (VI reformulated). Problems (VI) and (18) are equivalent.</p><p>Proof. Let w = (m, C) W &#8712; , and let us show that KL (qw &#8741;p(&#8226;|x)) = l(w) + h(w) . By definition of the Kullback-Liebler divergence, and using p(z|x) = p(z, x)/p(x) , we can write</p><p>If we note H(X) the entropy of a random variable, we can see that E z q &#8764; w log qw (z) = -H(z) where z q &#8764; w . We can rewrite z via an affine transform as z = Cu + m with u N &#8764; (0, I) , so we can use <ref type="bibr">[51,</ref><ref type="bibr">Section 8.6]</ref> to write (see also the arguments in <ref type="bibr">[13]</ref>)</p><p>Here H(u) , the entropy of u, is a constant equal to (d/2)(1 + log(2&#960;) . Moreover since w W &#8712; , we know that C &#8827; 0 , which implies that log | det C| = log det C. Because log p(x) and H(u) are both constants, we deduce that minimizing KL (qw &#8741;p(&#8226;|x)) over w W &#8712; is equivalent to minimizing l + h , where h(w) = -log det C if w W &#8712; and +&#8734; otherwise.</p><p>We emphasize that working with a triangular or symmetric representation for C is not restrictive, and that doing computations in this setting is pretty straightforward, as we illustrate next.</p><p>Lemma 17 (Computations with triangular/symmetric covariance factors). Let V be a vector subspace of 1. This is immediate due to the fact that V is convex, since it is a vector subspace of M d .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Let &#953;</head><p>) be the canonical injection. Then we clearly have &#981; V = &#981; &#8226; &#953; . Moreover, &#953; is a linear application whose adjoint is &#953; * is exactly the orthogonal projection proj R d &#215;V . So the computation of the gradient follows after applying the chain rule :</p><p>. As for the norm, it suffices to observe that the orthogonal projection is a linear operator with norm less or equal than 1.</p><p>3. Using that the orthogonal projection is a contractive map gives</p><p>4. This is a standard linear algebra result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Smoothness and convexity for VI problems</head><p>Now we state the main smoothness and convex properties for VI problems. We will see in our analysis that the smoothness properties often revolve around the following subset of W</p><p>where &#963; min (C) refers to the smallest singular value of C .</p><p>Lemma 18 (Smoothness for VI). Assume that log p(&#8226;, x)is M -smooth. Then</p><p>Proof.</p><p>1. Combine [13, Theorem 1] with Lemma 17.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.</head><p>Combine the fact that the gradient of -log det at a matrix X is -X -&#8868; with Lemma 17.2.</p><p>3. Combine [13, <ref type="bibr">Lemma 12]</ref> with Lemma 17.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 19 (Convexity for VI).</head><p>Assume that -log p(&#8226;, x) is convex (resp. &#181; -strongly convex). Then</p><p>Proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Combine</head><p>Titsias and L&#225;zaro-Gredilla[17, Proposition 1] (resp. Domke [13, Theorem 9]) with Lemma 17.1. 2. Convexity of h follows from [52, Example 7.13 and Theorem 7.17]. Convexity of W comes from the fact that it is essentially the set of symmetric positive definite matrices (which is equivalent to &#955; min (C) &gt; 0 ). Let us now prove that W M is convex and closed. First we notice that, because of symmetry, we can rewrite W M as</p><p>This set is closed because &#955; min is a continuous function. To see that W M is convex we are going to use the fact that &#955; min is a concave function. Take C 1 , C 2 &#8712; W M , &#945; &#8712; [0, 1] and write</p><p>Using Sylvester's criterion, it is easy to see that a lower triangular matrix is positive definite if and only if its diagonal has positive coefficients. In other words, W = {(m, C) &#8712; R d &#215; T d : Cii &gt; 0} , which is clearly convex. It remains to verify that h is closed convex even if W is not closed. For all w = (m, C) &#8712; R d &#215; T d , we can use the triangular structure of C to write</p><p>where we used the notation &#948; A for the indicator function of a set A , which is equal to zero on A and +&#8734; outside. Since -log(t) + &#948; (0,+&#8734;) (t) is a closed convex function, we see that h is a separable sum of closed convex functions, and thus is itself closed and convex.</p><p>4. Let d = 2 , let I be the identity matrix, let R = 0 1 -1 0 be a rotation matrix, and define &#968; : R &#8594; R with &#968;(t) = h(I + tR) . Observe that &#968; is well defined because I + tR &#8827; 0 for all t &#8712; R . On the one hand, if h was convex, then &#968; would be convex too, as it is a composition of a convex function with the affine function t 7 &#8594; I + tR . On the other hand, we can compute explicitly that &#968;(t) = -log(1 + t 2 ), which is not convex. <ref type="bibr">Theorem 7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 20 (Solutions of VI). Assume that log p(&#8226;, x)is M -smooth, and that</head><p>Suppose now that V = S d , meaning that w * is a minimizer of f over W &#8834; R d &#215; S d . We see that w * must also be a minimizer of f over R d &#215;M d , because if there was a better solution &#373; = ( m, &#264;) with &#264; M &#8712; d , we could consider ( &#264; &#264;&#8868; ) 1/2 &#8712; S d which would itself be a better solution than C * , which is a contradiction. So we can apply <ref type="bibr">[13,</ref><ref type="bibr">Theorem 7]</ref>  Proof. According to the definition of q w (see <ref type="bibr">(17)</ref>), the Hessian of log qw is &#8711; 2 z (log qw )(z) = -(CC &#8868; ) -1 , and so &#8741;&#8711; 2 z (log qw )(z)&#8741; = &#963; max ((CC &#8868; ) -1 ) = &#963; min (C) -2 . </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">A case study : linear models</head><p>where A is a matrix with a n in the n -th column. Then</p><p>Now, take some arbitrary vector v . We have that</p><p>Putting both the above results together, we have that</p><p>This implies that all eigenvalues of -&#8711; 2 z log pare in the interval [&#181;, &#946;]. All the claimed properties follow from this:</p><p>&#8226; For any eigenvalue &#955; of &#8711; 2 z log p(z, x), we know that &#955; i (&#8711; 2 z log p(z, x)) &#8804; M = max (|&#181;| , |&#946;|) .Thus, log p(z, x) is M -smooth over z .</p><p>&#8226; If &#181; &#8805; 0 , this means all eigenvalues of -&#8711; 2 z log p(z, x) are positive, which implies that log p(z, x) is convex.</p><p>&#8226; If &#181; &gt; 0 , then all eigenvalues of -&#8711; 2 z log(z, x) are bounded below by &#945; , which impiles that log p(z, x) is &#181; -strongly convex.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 25. Take a linear regression model with inputs</head><p>Proof. In this case we have that</p><p>Or, more abstractly,</p><p>Thus we have that &#952; min = &#952; max = 1 &#963; 2 . It follows from Theorem 24 that -log p(z, x) is &#181; -strongly convex for &#181; = &#955; min &#931; -1 + 1 &#963;2 AA &#8868; and M -smooth for M = &#955; max &#931; -1 + 1 &#963;2 AA &#8868; . In the particular case where &#931; = I , we get that &#181; = &#955; min I + 1</p><p>Corollary 26. Take a logistic regression model with inputs a n &#8712; R d and outputs</p><p>Proof. In this case we have that</p><p>Or, more abstractly,</p><p>The second derivative is non-negative and goes to zero if &#945; &#8594; -&#8734; or &#945; &#8594; +&#8734; . It is maximized at &#945; = 0 in which case &#981; &#8242;&#8242; = 1 4 &#946; 2 . But of course, in this model, the second input is &#946; = x n &#8712; {-1, +1} , meaning &#946; 2 = always. Thus we have that &#952; min = 0 and &#952; max = 1 . It follows from Theorem 24 that -log p(z, x) is &#181; -strongly convex for &#181; = &#955; min &#931; -1 and M -smooth for M = &#955; max &#931; -1 + 1 4 AA &#8868; .</p><p>In the particular case where &#931; = I , we get that &#181; = 1 and M = &#955; max <ref type="bibr">(</ref>1 + 1 4 AA &#8868; ) = 1 + 1 4 &#963; max (A) 2 . Theorem 27. Take a heirarchical model of the form p(&#952;, z, x) = p(&#952;) Q i p(zi |&#952;)p(xi |&#952;, z i ), where all log-densities are twice-differentiable. Then, log p(&#952;, z, x)is smooth joinly over (&#952;, z) if and only if the spectral norm of the second derivative matrices &#8711; &#952;&#952; &#8868; log p(&#952;, z, x) , &#8711; &#952;z &#8868; i log p(&#952;, z i , x i ), and &#8711; z i z &#8868; i log p(&#952;, z i , x i ) are all bounded uniformly over (&#952;, z). Proof. The following conditions are all equivalent: 1. log p(&#952;, z, x)is smooth as a function of (&#952;, z) 2. &#8711; 2 log p(&#952;, z, x) 2 is bounded above (uniformly over (&#952;, z), where &#8711; 2 denotes the Hessian with respect to (&#952;, z) and &#8741;&#8226;&#8741; 2 denotes the spectral norm.) 3. &#8711; 2 log p(&#952;, z, x) is bounded above (uniformly over (&#952;, z), where &#8711; 2 denotes the Hessian with respect to (&#952;, z) and &#8741;&#8226;&#8741; denotes a block infinity norm on top of spectral norm inside each block. 4. &#8711; 2 &#952;&#952; &#8868; log p(&#952;, z, x) 2 , &#8711; 2 &#952;z &#8868; i log p(&#952;, z, x) 2 , and &#8711; 2 z i z &#8868; j log p(&#952;, z, x) 2 are all bounded above (uniformly over (&#952;, z)) 5. &#8711; 2 &#952;&#952; &#8868; log p(&#952;, z, x) 2 , &#8711; 2 &#952;z &#8868; i log p(&#952;, z i , x i ) 2 , and &#8711; 2 z i z &#8868; j log p(&#952;, z, x) 2 are all bounded above (uniformly over (&#952;, z)) 6. &#8711; 2 &#952;&#952; &#8868; log p(&#952;, z, x) 2 , &#8711; 2 &#952;z &#8868; i log p(&#952;, z i , x i ) 2 , and &#8711; 2 z i z &#8868; i log p(&#952;, z, x) 2 are all bounded above (uniformly over (&#952;, z)) 7. &#8711; 2 &#952;&#952; &#8868; log p(&#952;, z, x) 2 , &#8711; 2 &#952;z &#8868; i log p(&#952;, z i , x i ) 2 , and &#8711; 2 z i z &#8868; i log p(&#952;, z i , x i ) 2 are all bounded above (uniformly over (&#952;, z)) Corollary 28. Take a hierarchical logistic regression model defined by p(&#952;, z, x) = p(&#952;)</p><p>where</p><p>. Then, -log p(&#952;, z, x)is convex and smooth with respect to (&#952;, z).</p><p>Proof. It's immediate that -log p is convex with respect to(&#952;, z). We can show that this function is smooth, by verifying the three conditions of Theorem 27.</p><p>&#8226; p(zi |&#952;) is a Gaussian with constant covariance, so is smooth with a constant independent of &#952; . And we can write that log p(xij |z i ) = log 1 + exp -x ij z &#8868; i a ij . This function is also smooth over z i with a constant that doesn't depend on &#952; . Thus log p(zi |&#952;) + P j log p(xij |z i ) is smooth over z i with a smoothness constant that is independent of &#952; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Estimators and variance bounds</head><p>We remember that unless specified otherwise, we consider that w = (m, C) &#8712; R d &#215; V , where V is a vector subspace of M d .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Variance bounds for the estimators</head><p>Theorem 1. Let T w (u) = Cu + m for w = (m, C) . Let &#981; : R d &#8594; R be M -smooth, suppose that &#981; is stationary at m, and define w = ( m, 0). Then</p><p>Furthermore, the first inequality cannot be improved.</p><p>Proof. This bound is proven by <ref type="bibr">[20,</ref><ref type="bibr">Thm. 3]</ref> in the case that V = M d . We conclude that this bound holds for every subspace V of M d by using the inequality in Lemma 17.2.</p><p>Theorem 2. Suppose that log p(&#8226;, x)is M -smooth and has a maximum (or stationary point) at m, and define w = ( m, 0). Then, for every w and every solution w * of the VI problem,</p><p>Proof. It is a direct consequence of Theorem 1 and Young's inequality</p><p>Theorem 3. Suppose that log p(&#8226;, x)is M -smooth, that it is maximal at m, and define w = ( m, 0). Then, for every L &gt; 0 , for every w W &#8712; L and every solution w * of the VI problem,</p><p>Proof. The difference between g ent and g energy is the addition of the constant vector &#8711;h(w) . Thus, we have that</p><p>It remains to bound the final term. We can do this using the closed-form expression for the entropy gradient (see Lemma 18.2), plus the assumption that w W &#8712; L &#8834; W , to see that</p><p>We conclude with Young's inequality</p><p>For the next Theorem, we will need the follwing technical Lemma :</p><p>Proof. Develop the squares to write</p><p>Since Eu = 0 , we immediately see that E A &#10216; &#8868; b, u&#10217; = 0 . Because of symmetry, the third-order moment Eu&#8741;u&#8741; 2 is also zero, so we obtain E A &#10216; &#8868; b, u&#8741;u&#8741; 2 &#10217; = 0 . We also have the second-order moment </p><p>The conclusion follows after gathering all those identities. Theorem 4. Suppose that log p(&#8226;, x) is M -smooth. Consider the residual r(z) := log p(z, x)log qw * (z) for any solution w * of the VI problem, assume that it has a stationary point m, and define &#373; = ( m, 0). Then r is K -smooth for some K &#8712; [0, 2M ] , and for all w W &#8712; M ,</p><p>Proof. Let w W &#8712; M be fixed. For the duration of the proof, we take v to be a second copy of w (held constant under differentiation with respect to w ). Now, we can rearrange the estimator in Eq. 9 by making appear q w * (Tw (u)) , namely</p><p>p(Tw (u), x) .</p><p>Introducing &#981;(z) := log qv (z) -log qw * (z) , and recalling r(z) = log p(z, x) -log q w * (z) , we have</p><p>We assume that log p(&#8226;, x)is M -smooth, and that V = M d , T d or S d , which implies that w * &#8712; W M (see <ref type="bibr">Lemma 20)</ref>. This in turn guarantees that log qw * is also M -smooth (see <ref type="bibr">Lemma 21)</ref>. Thus, r is the sum of two M -smooth functions, so it is K -smooth with K &#8712; [0, 2M ] . Using Young's inequality, together with Theorem 1 applied to r , we obtain</p><p>where &#373; = ( m, 0), with m being a stationary point of r . Now it remains to control the last term of inequality <ref type="bibr">(19)</ref>. Apply the chain rule onto &#981;(Tw (u)) to write (with the help of Lemma 37)</p><p>This gives us directly that</p><p>where we used that &#8741;vu &#8868; &#8741; 2 F = v &#8741; &#8741; 2 &#8741;u&#8741; 2 for any vectors v and u . Computing &#8711; z &#981; amounts to computing the gradient of the Gaussian density q w . From its definition (see Eq. 17) we have:</p><p>Recalling the definition of&#981; = log qv -log qw * , and</p><p>In other words, we have</p><p>where</p><p>We can now combine <ref type="bibr">(20)</ref> and ( <ref type="formula">21</ref>) and use Lemma 29 to write</p><p>Let us now introduce the function&#954; : R d &#215;M d &#8594; R { &#8746; +&#8734;} , defined by&#954;(w) := E z q &#8764; w log qw (z)log qw * (z) if C is invertible, +&#8734; otherwise. This function is nothing but the Kullback-Liebler divergence between the two Gaussian distributions q w and q w * , and can be computed in closed-form on its domain:</p><p>where we note as before w = (m, C) ,</p><p>This allows us to compute its gradient at w W &#8712; M : <ref type="formula">22</ref>) and the fact that &#8711; w &#954;(w * ) = 0 (because w * is a minimizer of &#954; ), we deduce that</p><p>Now we remind that w * &#8712; W M which implies that log qw * is M -smooth. This means that E z q &#8764; w log qw * is M -smooth as well, according to Domke <ref type="bibr">[13,</ref><ref type="bibr">Theorem 1]</ref>. On the other hand, we know (see the proof of Lemma 16) thatE z q &#8764; w log qw * = h(w) which is M -smooth on W M (see <ref type="bibr">Lemma 18.3)</ref>. All this implies that &#954; is 2M -smooth on W M , from which we conclude</p><p>The Theorem's main inequality <ref type="bibr">(10)</ref> follows after plugging the above inequality into <ref type="bibr">(19)</ref>. The second inequality follows after using Young's inequality</p><p>To conclude the proof, it remains to check that K = 0 whenever p(&#8226;|x) is Gaussian. In that case, because our main objective function f (w) is the divergence between q w and p(&#8226;|x), it is clear that f is minimized whenever q w = p(&#8226;|x) . Therefore, without loss of generality, we can assume that p(&#8226;|x) = q w * . This implies that the residual r is constant (r(z) = log p(x) ), and so is 0-smooth. </p><p>Proof. Combine the results of Theorems 2, 3 and 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Optimization proofs</head><p>This section contains all the proofs for the Theorems stated in Section 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.1">Anytime Convergence Theorem for Strongly Convex</head><p>Theorem 31 ( Theorem 3.2 in <ref type="bibr">[ 53]</ref> ). Suppose we are given a sequence of w t iterates such that for a step size &#947; t &#8804; 1 2L and given constants &#181; &gt; 0 and &#963; 2 &gt; 0 we have that</p><p>By switching to a decaying stepsize according to 2  for t &#8805; t * where t * = 4 L/&#181; &#8970; &#8971; , we have that</p><p>Proof. We divide the proof for steps t that are great or smaller than t * . For t &#8804; t * , we have by unrolling <ref type="bibr">(23)</ref> starting from t * , . . . , 1 , with &#947; t &#8801; &#947; gives</p><p>Now consider t &#8805; t * for which we have that &#947; t = 1 &#181; 2t+1 (t+1) 2 which when inserted into <ref type="bibr">(23)</ref> gives 4  .</p><p>Multiplying through by (t + 1) 2 and re-arranging gives</p><p>where we used that 2t+1 t+1 &#8804; 2 . Summing up over t = t * , . . . , T and using telescopic cancellation gives</p><p>Now using that for t &#8804; t * we have that &#947; = 1 L and ( <ref type="formula">25</ref>) holds, thus</p><p>where we used that (1 -&#947;&#181;) &#8804; 1. Substituting &#947; = 1 2L , t * = 4 L/&#181; &#8970; &#8971; and multiplying by (T + 1) 2 gives</p><p>which concludes the proof of (24).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.2">Complexity Lemma for Convex</head><p>Lemma 32. Suppose we are given a sequence wT and constants a, b, B &gt; 0 such that</p><p>Thus, T = O(&#1013; -2 ) iterations are sufficient to guarantee thatE f ( wT ) -inf f &#8804; &#1013;.</p><p>Proof. First we show that 1 1-&#952; T &#8804; 2 is equivalent to</p><p>Indeed this follows since</p><p>x for x &#8805; 0 . Applying this gives that</p><p>it is sufficient to enforce that</p><p>Assuming A &#8805; p 2/B , this last condition holds if</p><p>Since we also impose that BA 2 &#8805; 2 we have that 1 BA 2 -1 &#8804; 1 thus for <ref type="bibr">(27)</ref> to hold it suffices that T &#8805; 2. Substituting &#947; = A &#8730; T and the relaxation 1  1-&#952; T &#8804; 2 into the original bound gives the claimed result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.3">Proximal gradient descent with strong convexity and smoothness</head><p>We start by recalling some standard properties of the proximal operator. Lemma 33. Let &#947; &gt; 0 and w * &#8712; argmin w l(w) + h(w). Assume h(w) is convex and that l(w) is continuously differentiable at w * . Then 1. For all w, w &#8242; and &#947; &gt; 0 , prox &#947;h (w)</p><p>Theorem 7. Let l be a &#181; -strongly convex and M -smooth function, and let w = argmin(l) . Let h be a proper closed convex function, and let w * = argmin(l + h) . Let (w t ) t&#8712;N be generated by the Prox-SGD algorithm, with a constant stepsize &#947; &#8712; 0, min{ &#181; 2a</p><p>, 1  &#181; } . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;l with parameters (a, b, w * ). Then,</p><p>Alternatively, if we use the decaying stepsize &#947; t = min</p><p>In both cases, T = O(&#1013; -1 ) iterations are sufficient to guarantee thatE w</p><p>Proof. We start by using the non-expansiveness and fixed-point properties of the proximal operator to write</p><p>Above the first line uses the fixed point property of the proximal operator that w * = prox &#947;h (w *&#947; l &#8711; (w)) , while the second line uses the fact that the proximal operator is non-expansive. Now, we apply our assumption on the gradient estimator to write</p><p>For the last term, we have by strong convexity and the fact that g t is an unbiased estimator that</p><p>where the second line follows from the fact that l is &#181; strongly convex.</p><p>Putting the pieces together, we get that</p><p>Now the following conditions are equivalent:</p><p>Now since &#947; &#8804; 1 &#181; &#8660; 1 -&#947;&#181; &#8805; 0 we can apply the above recursively, which gives that</p><p>But we can bound this geometric sum by</p><p>leading to the result that</p><p>To prove the anytime result, we can now apply Theorem 31. To do so, we need to map the notation we use here to the notation used in Theorem 31. The mapping we need is L = a &#181; , and &#963; 2 = b + M 2 &#8741;w * -w&#8741; 2  2 which when inserting into Theorem 31 gives the result.</p><p>Corollary 34. Under the conditions of Theorem 7, if&#947; = A log T T for 1 &#181; &#8804; A , and T is large enough that</p><p>Proof. The previous theorem can only be applied when &#947; &#8804; &#181; 2a , i.e. when A log T T &#8804; &#181; 2a or 2aA &#181; &#8804; T log T . Supposing that is true, observe that (1 -x) T &#8804; exp(-xT ), from which it follows that (1 -&#947;&#181;) T &#8804; exp (-&#181;A log T ) = 1  T &#181;A &#8804; 1 T .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.4">Proximal gradient descent with convexity and smoothness</head><p>Theorem 8. Let l be a proper convex and M -smooth function. Let h be a proper closed convex function, and let w * &#8712; argmin(l + h) . Let (w t ) t&#8712;N be generated by the Prox-SGD algorithm, with a constant stepsize &#947; &#8712; (0, 1 M ]. Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;l with parameters (a, b, w * ). Then,</p><p>where &#952; def = 1 1 + 2a&#947; 2 and wT def = P T t=1 &#952; t+1 w t P T t=1 &#952; t+1</p><p>. In particular, if &#947; = 1 &#8730; aT</p><p>, then</p><p>Thus, T = O(&#1013; -2 ) iterations are sufficient to guarantee thatE f ( wT ) -inf f &#8804; &#1013;.</p><p>Proof. Let us start by looking at &#8741;w t+1 -w * &#8741; 2 -&#8741;w t -w * &#8741; 2 . Expanding the squares, we have that</p><p>Since w t+1 = prox &#947; t h (w t -&#947; t g t ), we know from Lemma 33 that</p><p>and therefore that w t -w t+1 &#947; t &#8712; g t + &#8706;h(w t+1 ),</p><p>where &#8706;h(w) is the subdifferential of h . Consequently there exists a subgradient b t+1 &#8706; &#8712; h(w t+1 ) such that</p><p>We decompose the last term of (28) as</p><p>(29) For the first term in the above we can use that b t+1 &#8706; &#8712; h(w t+1 ) is a subgradient together with the defining property of subgradient to write</p><p>On the second term we can use the fact that l is M -smooth to write</p><p>On the last term we can use the convexity of f to write</p><p>By inserting ( <ref type="formula">30</ref>), ( <ref type="formula">31</ref>), ( <ref type="formula">32</ref>) into <ref type="bibr">(29)</ref> gives</p><p>Now using the above in <ref type="bibr">(28)</ref>, and our assumption that &#947; t M &#8804; 1 , we obtain</p><p>We now have to control the last term of <ref type="bibr">(33)</ref> in expectation. To shorten our notation we temporarily introduce the operators</p><p>Notice in particular that w t+1 = prox &#947; t h ( T (w t )) . So the the last term of (33) can be decomposed as</p><p>We observe that the last term is, in expectation, is equal to zero. This is becauseprox &#947; t h (T (w t ))-w * is deterministic when conditioned on w t . Since we will later on take expectations, we drop this term now and keep on going. As for the first term, using the nonexpansiveness of the proximal operator (Lemma 33), we have that</p><p>Using the above two bounds in <ref type="bibr">(35)</ref> we have proved that (after taking expectation)</p><p>Injecting the above inequality into <ref type="bibr">(33)</ref> and multiplying through by &#947; t , we obtain</p><p>From now on we assume that the stepsize sequence is constant &#947; t &#8801; &#947; . Now, applying our assumption on the gradient estimator, we have</p><p>Inject this inequality into <ref type="bibr">(36)</ref>, and reordering the terms gives</p><p>where we introduced the constant C def = 1 + 2a&#947; 2 . We will now do a weighted telescoping by introducing a sequence &#945; t+1 &gt; 0 of weights. Multiplying the above inequality by some &#945; t+1 &gt; 0 , and summing over 0, . . . , T -1gives</p><p>To be able to use a telescopic sum in the right-hand term, we need that &#945; t+1 C = &#945; t (an idea we borrowed from <ref type="bibr">[54]</ref> ). This holds by choosing &#945; t def = 1 C t , which allows us to write</p><p>Let us now introduce wT def =</p><p>, using that &#945; 0 = 1 , and by using the convexity of f with Jensen's inequality, we obtain</p><p>To finish the proof, it remains to compute the geometric series P T -1 t=0 &#945; t+1 which is given by</p><p>Using the above and substituting for C = 1 + 2a&#947; 2 and &#952; = 1/C gives Note that with this choice of &#947; we have that</p><p>Thus for T &#8805; 2 the result of Lemma 32 gives</p><p>5 Projected gradient descent with strong convexity Theorem 10. Let W be a nonempty closed convex set. Let f be a &#181; -strongly convex function, differentiable on W . Let w * = argmin W (f ) . Let (w t ) t&#8712;N be generated by the Proj-SGD algorithm, with a constant stepsize &#947; &#8712; 0, min{ &#181; 2a</p><p>, 2 &#181; } . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;f with parameters (a, b, w * ). Then,</p><p>Alternatively, if we use the decaying stepsize</p><p>In both cases, T = O(&#1013; -1 ) iterations are sufficient to guarantee thatE w &#8741; T -w * &#8741; 2 &#8804; &#1013;.</p><p>Proof. Plugging in one step of Proj-SGD we have that</p><p>where we used that projections are non-expansive. Taking expectation conditioned on w t , and expanding squares we have that</p><p>t , where we used E [g t ] = f &#8711; (w t ) and definition 5, and that f is &#181; -strongly convex. Taking full expectation and re-arranging gives</p><p>Re-arranging we have that</p><p>Using &#947; t &#8801; &#947; constant, and since &#947; &#8804; 2 &#181; &#8660; 1 -&#181;&#947; 2 &#8805; 0 we have by unrolling the recurrence that</p><p>we have that</p><p>which concludes the proof of (13).</p><p>To prove <ref type="bibr">(14)</ref> we will apply Theorem 31, for which we will switch our notation to match that of Theorem 31. That is, substituting</p><p>which now holds for &#947; t &#8804; 1 2L and fits exactly the format of ( <ref type="formula">23</ref>), excluding the additional hat on &#181;. Thus we know from following verbatim the proof that continues after (23) that for a stepsize schedule of</p><p>for t &lt; t *</p><p>1 &#956; 2t + 1 (t + 1) 2 for t &#8805; t * for t * = 4 L/ &#8970; &#710;&#181;&#8971; we have that E &#8741;w T +1 -w * &#8741; 2 &#8804; 16 L/ &#8970; &#710;&#181;&#8971; 2 (T + 1) 2 &#8741;w 0 -w * &#8741; 2 + &#963; 2 &#956;2 8 T + 1 . After switching back notation L = a &#181; , &#963; 2 = b 2 and &#956; = &#181; 2 gives (14). 9.6 Projected gradient descent with convexity Theorem 11. Let W be a nonempty closed convex set. Let f be a convex function, differentiable on W . Let w * &#8712; argmin W (f ) . Let (w t ) t&#8712;N be generated by the Proj-SGD algorithm, with a constant stepsize &#947; &#8712; (0, +&#8734;) . Suppose that g t is a quadratically bounded estimator (Def. 5) for &#8711;f at w t with constant parameters (a, b, w * ). Then,</p><p>where</p><p>Thus, T = O(&#1013; -2 ) iterations are sufficient to guarantee thatE f ( wT ) -inf f &#8804; &#1013;.</p><p>Proof. Plugging in one step of the algorithm, we have that</p><p>where we used that projections are non-expansive. Taking expectation conditioned on w t , and expanding squares we have that</p><p>where we used that g t is an unbiased estimator with bounded squared norm. Taking full expectation and re-arranging gives</p><p>From now on we use a constant stepsize &#947; t &#8801; &#947; . Let C = (1 + a&#947; 2 ). Next we would like to setup a telescopic sum. To make this happen, we multiply through by &#945; t+1 and impose that &#945; t+1 C = &#945; t .</p><p>This holds with &#945; t def = 1 C t . Multiplying through by &#945; t+1 and summing from t = 0, . . . , T -1 we have that</p><p>Using that &#945; 0 = 1, dropping the negative term -&#945; T E h w T -w * 2 i , dividing through by 2&#947; P T -1 k=0 &#945; k+1 and using Jensen's inequality we have that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Finally using that</head><p>T -1</p><p>gives that</p><p>10 Solving VI with stochastic optimization</p><p>10.1 Convergence of the algorithms Corollary 12 (Prox-SGD for VI). Consider the VI problem where q w is a multivariate Gaussian distribution (Eq. 2) with parameters w = (m, C) &#8712; R d &#215; T d , and assume that this problem admits a solution w * . Suppose that log p(&#8226;, x)is M -smooth and concave (resp. &#181; -strongly concave). Generate a sequence w t by using the Prox-SGD algorithm (Def. 6) applied to l and h (Eq. 16), using g energy (5) as an estimator of &#8711;l . Let the stepsizes &#947; t be constant and equal to 1/( p a energy T ) (resp. be decaying as in Theorem 7 with a energy = 2(d + 3)M 2 ). Then, for a certain average wT of the iterates, we have for T &#8805; 2 that</p><p>Proof. Our assumption on the log target imply that l is smooth and convex (resp. &#181; -strongly convex) according to Lemmas 18.1 and 19.1. Furthermore, since log p(&#8226;, x) is smooth, we know from Theorem 30 that the gradient estimator g energy is quadratically bounded at every w t with respect to constant parameters (a, b, w * ), with a = 2(d + 3)M 2 . Then, because we work with triangular scale parameters, we know from Lemma 19. We have now verified the hypotheses of Theorem 8 (resp. <ref type="bibr">Theorem 7)</ref>, which gives us the desired bounds.</p><p>Corollary 13 (Proj-SGD for VI). Consider the VI problem where q w is a multivariate Gaussian distribution (Eq. 2) with parameters w = (m, C) &#8712; R d &#215; S d , and assume that this problem admits a solution w * . Suppose that log p(&#8226;, x)is M -smooth and concave (resp. &#181; -strongly concave). Generate a sequence w t by using the Proj-SGD algorithm (Def. 9) applied to the function f = l+h (Eq. 16) and the constraint W M (Eq. 3), using g ent <ref type="bibr">(7)</ref> or g STL <ref type="bibr">(9)</ref> as an estimator of &#8711;f . Let the stepsizes &#947; t be constant and equal to p 2/(aT ) (resp. be decaying as in <ref type="bibr">Theorem 10)</ref> with a = a ent = 4(d + 3)M 2</p><p>or a = a STL = 24(d + 3)M 2 . Then, for a certain average wT of the iterates, we have for T &#8805; 2 that</p><p>Proof. Our assumption on the log target imply that l is smooth and convex (resp. &#181; -strongly convex) according to Lemmas 18.1 and 19.1. Regarding the entropy, since log p(&#8226;, x)is M -smooth, we know from Lemma 18.3 that h is M -smooth on W M . We also know that h is closed convex, since we consider symmetric scale parameters (see <ref type="bibr">Lemma 19.2)</ref>. Furthermore, the smoothness of log p(&#8226;, x) implies that the gradient estimator g ent is quadratically bounded at every w t &#8712; W M , with respect to constant parameters (a, b, w * ). This is stated in Theorem 30, where we can see that for g ent we can take a ent = 4(d + 3)M 2 , and for g STL we can take a STL = 2(d + 3)(2K 2 + (2M) 2 ). Since we know from Theorem 4 that K &#8804; 2M , we can consider that a STL &#8804; 24(d + 3)M 2 . Finally, we know from Lemma 20 that the minimizers of f belong to W M , meaning that argmin(f ) = argmin W M (f ) .</p><p>We have now verified the hypotheses of Theorem 11 (resp. <ref type="bibr">Theorem 10)</ref>, which gives us the desired bounds.</p><p>Corollary 14 (Proj-SGD for VI -Gaussian target). Consider the setting of Corollary 13, in the scenario that log p(&#8226;, x)is &#181; -strongly concave, that we use the g STL estimator, and that we take a constant stepsize &#947; t &#8801; &#947; &#8712; 0, min{</p><p>&#181; 2a STL , 2 &#181; } . Assume further that p(&#8226;|x) is Gaussian. Then, E &#8741;w T -w * &#8741; 2 2 &#8804; 1 -&#181;&#947; 2 T &#8741;w 0 -w * &#8741; 2 2 .</p><p>Proof. We use the same arguments as in the proof of Corollary 13, but this time we use a constant stepsize, which gives us the bound (13) from Theorem 10, where b = 4(d + 3)K 2 &#8741;w * -&#710;w&#8741; 2 according to Theorem 4. All we need to conclude is to verify that b = 0 . This comes from the assumption that the target is Gaussian, from which we deduce thatK = 0 (see again Theorem 4). </p><p>Proof. This is essentially proved in <ref type="bibr">[ 13,</ref><ref type="bibr">Theorem 13]</ref>. We provide a proof here for the sake of completeness, also because our entropy here enforces positive definiteness. So we need to compute</p><p>Because the objective here is separated in z and X , we can optimizae those variables separately. It is clear that the optimal z will be z = m . For the matrix component, it remains to solve argmin</p><p>Now remember that for a triangular matrix, being positive definite is equivalent to have positive diagonal elements (Lemma 19.3). Because X is a lower triangular matrix, we have-log det X = P d i=1 -log X ii and X ij = 0 when i &lt; j . So our problem becomes</p><p>This is a gain a separated optimization problem with respect to the variables X ij . So we deduce that when i &#8805; j we must take &#264;ij = C ij , and when i = j we need to take</p><p>This is the proximal operator of -log, which can be classically computed [55, Example 6.9], and gives the desired result. </p><p>Proof. Consider g = &#948;W M to be the indicator function of W M , which is equal to 0 on W M and +&#8734; elsewhere. Then proj W M = prox g , and we can compute it with [55, <ref type="bibr">Theorem 7.18]</ref>. For this, we need to write g(C) = F (&#955;(C)) , where &#955;(C) is the set of eigenvalues of C , and F is the indicator function of K := {z &#8712; R d : z i &#8805; 1/ &#8730; M} . In that way, we see that g is a symmetric spectral function in the sense of <ref type="bibr">[55,</ref><ref type="bibr">Definition 7</ref>.12], and the conclusion follows.</p><p>1. Algorithm 1 is equivalent to the Prox-SGD algorithm (Def. 6) applied to l and h (Eq. 16), using g energy <ref type="bibr">(5)</ref> as an estimator of &#8711;l and using lower triangular covariance factors.</p><p>2. Algorithm 2 is equivalent to the Proj-SGD algorithm (Def. 9) applied to the function f = l + h (Eq. 16) and the constraint W M (Eq. 3), and using g ent <ref type="bibr">(7)</ref> as an estimator of &#8711;f .</p><p>3. Algorithm 3 is equivalent to the Proj-SGD algorithm (Def. 9) applied to the function f = l + h (Eq. 16) and the constraint W M (Eq. 3), and using g STL <ref type="bibr">(9)</ref> as an estimator of &#8711;f .</p><p>Proof. In this proof, we fix an iteration t &#8712; N , consider u t &#8764; N (0, I) and we note w t = (m t , C t ).</p><p>1. From (Def. 6) we know that w t+1 = prox &#947; t h (w t -&#947; t g energy (ut )</p><p>) . According to Lemma 38, we have w t -&#947; t g energy (ut ) = (m t , C t ) -&#947; t (&#960;t , proj T d (&#960;t u &#8868; t )), where &#960; t = -&#8711; z log p(Ct u t + m t , x). Moreover, we know from Lemma 35 whatprox &#947; t h is: prox &#947; t h (m, C) = (m, R &#947; t (C)), where R &#947; t (C) ij = 1 2 (Cii + p C 2 ii + 4&#947; t ) if i = j, C ij if i &#824; = j. First, we see that with respect to the m variable the proximal operator is the identity, meaning that we indeed have m t+1 = m t -&#947; t &#960; t . Second, we see that with respect to the C variable we have C t+1 = R &#947; t (Ct -&#947; t proj T d (&#960;t u &#8868; t )). It is immediate to see by induction that since we initialize with C 0 &#8712; T d , then C t &#8712; T d as well. Therefore we can use the linearity of proj T d to write that C t+1 = R &#947; t proj T d (Ct -&#947; t &#960; t u &#8868; t ) . The conclusion follows after observing that (R &#947; t &#8226; proj T d ) (C)(i, j) = &#63729; &#63732; &#63730; &#63732; &#63731; C(i, j) if i &gt; j, 1 2 C(i, i) + p C(i, i) 2 + 4&#947; t if i = j, 0 if i &lt; j. 2. From (Def. 9) we know that w t+1 = proj W M (w t -&#947; t g ent (ut )) . According to Lemma 38, we have w t -&#947; t g ent (ut ) = (m t , C t ) -&#947; t (&#960;t , proj S d (&#960;t u &#8868; t -C -&#8868; t</p><p>)). Moreover, we know from Lemma 36 whatproj W M is:</p><p>where R(C) = U DU &#8868; , with UDU &#8868; being a SVD decomposition of C , and D(i, i) = max{D(i, i); M -1/2 } . First, we see that with respect to the m variable the projection is the identity, meaning that we indeed have m t+1 = m t -&#947; t &#960; t .</p><p>Second, we see that with respect to the C variable we have</p><p>It is immediate to see by induction that since we initialize with C 0 &#8712; S d , then C t &#8712; S d as well. Therefore we can use the linearity of proj S d to write that C t+1 = R proj S d (Ct -&#947; t (&#960;t u &#8868; t -C -&#8868; t</p><p>)) .</p><p>The conclusion follows after setting &#264;t+1 = C t -&#947; t (&#960;t u &#8868; t -C -&#8868; t ), and using the fact that proj S d (C) = (C + C &#8868; )/2 .</p><p>3. Use the same reasoning as in the previous item, with the appropriate computation for the STL estimator from Lemma 38.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>One could also build an estimator without holding w constant in this way. However, in the case of Gaussian variational families, this turns out to be mathematically equivalent to using a closed form entropy because log qw (Tw (u)) = -<ref type="bibr">1</ref> </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>&#8741;u&#8741;<ref type="bibr">2</ref> 2 -d 2 log(2&#960;) -log |C| , which has the same gradient (independent of u ) as h(w) .</p></note>
		</body>
		</text>
</TEI>
