<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Sample Average Approximation for Black-Box Variational Inference</title></titleStmt>
			<publicationStmt>
				<publisher>Association for Uncertainty in Artificial Intelligence</publisher>
				<date>06/15/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10535050</idno>
					<idno type="doi"></idno>
					
					<author>Javier Burroni</author><author>Justin Domke</author><author>Daniel Sheldon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Black-box variational inference (BBVI) is a general-purpose approximate inference approach that converts inference to a stochastic optimization problem. However, the difficulty of solving the BBVI optimization problem reliably and robustly using stochastic gradient methods has limited its applicability. We present a novel optimization approach for BBVI using the sample average approximation (SAA). SAA converts stochastic problems to deterministic ones by optimizing over a fixed random sample, which enables optimization tools such as quasi-Newton methods and line search that bypass the difficulties faced by stochastic gradient methods. We design an approach called "SAA for VI" that solves a sequence of SAA problems with increasing sample sizes to reliably and robustly solve BBVI problems without problem-specific tuning. We focus on quasi-Newton methods, which are well suited to problems with up to hundreds of latent variables. Our experiments show that SAA for VI simplifies the VI problem and achieves faster performance than existing methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>It is a long-standing research direction to develop robust inference methods that perform well on a wide range of real models. This is of immense practical interest in fields like astrophysics, epidemiology, political science, psychology, ecology, and others, where a scientist supplies a model and data, and the goal is to recover the posterior distribution of latent variables. However, inference is extremely challenging in general and formally intractable except for restricted cases, so approximations are needed.</p><p>Variational inference (VI) is one of the main approximate inference approaches. It poses inference as an optimization problem to find a distribution from a specified family that is as close as possible to the posterior by maximizing the evidence lower bound (ELBO) <ref type="bibr">(Wainwright and Jordan, 2008;</ref><ref type="bibr">Jaakkola and Jordan, 1997;</ref><ref type="bibr">Beal, 2003)</ref>, or, equivalently, minimizing the KL-divergence to the posterior.</p><p>In the quest to make VI broadly applicable and "automatic", recent work has focused on "black box" variational inference (BBVI) <ref type="bibr">(Ranganath et al., 2014;</ref><ref type="bibr">Titsias and L&#225;zaro-Gredilla, 2014;</ref><ref type="bibr">Kucukelbir et al., 2017;</ref><ref type="bibr">Yin and Zhou, 2018;</ref><ref type="bibr">Hoffman and Ma, 2020;</ref><ref type="bibr">Buchholz et al., 2018)</ref>. BBVI performs ELBO maximization using only "black box" access to the model in the form of evaluations of the log joint density or the gradient thereof. 
This allows VI to be applied to a wide range of models, especially when paired with recent modeling frameworks such as Stan <ref type="bibr">(Carpenter et al., 2017)</ref> that make it easy for users to specify models that are converted to routines for log-densities and gradients.</p><p>To achieve this generality, BBVI treats ELBO maximization as a stochastic optimization problem, which it solves via stochastic gradient descent (SGD) <ref type="bibr">(Wingate and Weber, 2013;</ref><ref type="bibr">Blei et al., 2017;</ref><ref type="bibr">Kucukelbir et al., 2017;</ref><ref type="bibr">Ranganath et al., 2014;</ref><ref type="bibr">Rezende et al., 2014;</ref><ref type="bibr">Kingma and Welling, 2013)</ref> or a variant such as Adam <ref type="bibr">(Kingma and Ba, 2015)</ref> or AdaGrad <ref type="bibr">(Duchi et al., 2011)</ref>. However, in practice, the difficulty of solving this stochastic optimization problem reliably and robustly has severely limited the applicability of BBVI <ref type="bibr">(Agrawal et al., 2020;</ref><ref type="bibr">Welandawe et al., 2022)</ref>. A particular challenge is selecting step size sequences that allow rapid progress and avoid suboptimality. This motivates the consideration of alternate stochastic optimization methods that can perform more reliably for BBVI problems.</p><p>In this paper, we propose an alternative optimization approach for BBVI based on the sample average approximation (SAA) <ref type="bibr">(Healy and Schruben, 1991;</ref><ref type="bibr">Robinson, 1996;</ref><ref type="bibr">Shapiro and Wardi, 1996;</ref><ref type="bibr">Kleywegt et al., 2002;</ref><ref type="bibr">Kim et al., 2015)</ref>. A key feature of SAA is that it draws a fixed random sample and then solves a deterministic optimization problem. 
This enables tools such as line search and second-order optimization, which are traditionally unavailable for BBVI but can substantially improve performance. We focus on the application of quasi-Newton methods with line search to BBVI with Gaussian approximating families. This approach is well suited to problems with up to several hundred latent variables, which covers a very large number of applied statistical models, such as those that appear in the Stan model library, many of which remain very challenging for BBVI. Quasi-Newton SAA can also scale to much larger models when using diagonal Gaussian approximating families.</p><p>Figure <ref type="figure">1</ref>: Top: ELBO improvement (nats) vs. running-time improvement (number of times faster) for SAA for VI compared to Adam, across 9 Stan models and 6 Bayesian logistic regression models using a dense-covariance Gaussian distribution. The bordered point indicates that the models "australian" and "ionosphere" share the same coordinates. Bottom: Optimization traces for the "electric" model. See Section 5 and Appendix B for details.</p><p>Figure <ref type="figure">1</ref> illustrates the speed and accuracy benefits of SAA compared to Adam when approximating the posterior distributions of 9 real Stan models and 6 Bayesian logistic regression models (see Table <ref type="table">2</ref>) using Gaussian distributions with dense covariance matrices. SAA is always comparable to or better than Adam in terms of solution quality, and, for 10 out of 15 models, either achieves a much better solution or achieves a comparable solution much faster. Notably, nearly a third of the models are failure cases for Adam, where SAA finds a solution that is hundreds of nats better.</p><p>To achieve this robustness, we design the SAA for VI algorithm, which applies SAA to BBVI in an efficient and automatic way whenever the approximating family is reparameterizable. 
To address the Monte Carlo error introduced by using a fixed random sample within SAA, we adapt techniques from the SAA literature to solve a sequence of problems with increasing sample sizes until a stopping criterion is reached <ref type="bibr">(Chen and Schmeiser, 2001)</ref>, and we develop a custom stopping criterion for BBVI as well as default schedules for sample sizes and optimization tolerances to achieve robust out-of-the-box performance. SAA for VI also leverages the GPU-friendly nature of the SAA objective to increase optimization efficiency.</p><p>Our empirical results demonstrate that SAA for VI on our benchmark is competitive with state-of-the-art BBVI optimization methods-including first-order methods (Adam and AdaGrad) as well as a prior second-order stochastic optimization algorithm for BBVI <ref type="bibr">(Liu and Owen, 2021)</ref>-while simplifying the variational inference process.</p><p>Concurrently with our work, <ref type="bibr">Giordano et al. (2023)</ref> proposed a sample average approximation algorithm for variational inference, motivated by the same challenges of stochastic gradient methods that limit the robustness and broad applicability of BBVI. We discuss the relationship between our method and theirs in Section 4.</p></div>
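To make the SAA idea above concrete, here is a minimal sketch (ours, not the authors' code; all names and the toy objective are illustrative): once a noise sample is drawn and frozen, the averaged objective is an ordinary deterministic function of the parameters, so standard deterministic optimizers apply directly.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stochastic objective: F(theta) = E[(theta - eps)^2] with eps ~ N(1, 1).
# The exact minimizer is theta* = E[eps] = 1.
rng = np.random.default_rng(0)

# SAA: draw ONE fixed noise sample, then optimize the deterministic average.
eps = rng.normal(loc=1.0, scale=1.0, size=1024)

def saa_objective(theta):
    # Deterministic in theta: the same eps is reused at every evaluation,
    # which is what makes line search and quasi-Newton methods applicable.
    return np.mean((theta - eps) ** 2)

result = minimize(saa_objective, x0=np.array([5.0]), method="L-BFGS-B")
theta_hat = result.x[0]   # the exact minimizer of the SAA problem is eps.mean()
```

Because the objective no longer changes between evaluations, the quasi-Newton optimizer can safely compare function values during line search, which is exactly what resampling-based SGD objectives prevent.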
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND</head><p>We are interested in approximating the posterior distribution of a latent variable given some observed data, i.e., p(Z | x), where Z is the latent variable and x is the observed data.</p><p>To achieve this, we will approximate the posterior with a distribution from an indexed family of approximations Q = {q_θ(Z) : θ ∈ Θ ⊆ R^d}, where θ is a vector of parameters that parameterize the approximation q_θ(Z), and d is the dimension of θ.</p><p>VI proposes to approximate the posterior distribution by finding a member of Q that is closest in Kullback-Leibler divergence to the true distribution. This is achieved by maximizing the evidence lower bound (ELBO), which is a function of the parameters: L(θ) = E_{Z∼q_θ}[ln p(Z, x) − ln q_θ(Z)]. (1)</p><p>The optimization problem can be formulated as: θ* = argmax_θ L(θ). (2) Under smoothness assumptions, black-box VI presents this problem as a smooth stochastic optimization problem (SOP) and suggests solving it using methods based on stochastic gradient descent (SGD). Specifically, it uses stochastic gradient ascent to maximize the ELBO. At every iteration t, samples z_1, . . . , z_n are drawn from q_{θ_t} and the sample mean of the function g_{θ_t}(Z) is computed, where g_{θ_t}(Z) is an R^d-valued random vector whose expectation equals the gradient of the ELBO. This estimate is then used, along with a step size γ_t ∈ R_+, to update the parameters according to: θ_{t+1} = θ_t + γ_t · (1/n) Σ_{i=1}^n g_{θ_t}(z_i). (3)</p><p>The function g_{θ_t} can be obtained using various methods, including the score function estimator <ref type="bibr">(Wingate and Weber, 2013;</ref><ref type="bibr">Ranganath et al., 2014)</ref> or, if the distribution is reparameterizable, the 'reparameterization trick' <ref type="bibr">(Kingma and Welling, 2013;</ref><ref type="bibr">Fu, 2006;</ref><ref type="bibr">Kingma and Welling, 2019;</ref><ref type="bibr">Rezende et al., 2014)</ref>, among others. 
A random variable Z comes from a reparameterizable distribution q_θ if there exist a C^1 function z_θ and a density q_base such that Z = z_θ(ε) for ε ∼ q_base. We refer to the values ε as noise. In that case, the stochastic optimization problem becomes θ* = argmax_θ E[ln p(z_θ(ε), x) − ln q_θ(z_θ(ε))], (4) where ε ∼ q_base. It then follows that, at every step t of the optimization, the update rule of Eq. (<ref type="formula">3</ref>) is θ_{t+1} = θ_t + γ_t · (1/n) Σ_{i=1}^n ∇_θ [ln p(z_θ(ε_i), x) − ln q_θ(z_θ(ε_i))], evaluated at θ = θ_t.</p><p>Despite its simplicity, the explanation above fails to convey the complexities of choosing hyperparameters, particularly the step size γ_t, also known as the learning rate. The user can opt to use a step size schedule γ = (γ_t)_{t∈N} ⊂ R_+ that meets the Robbins-Monro conditions (‖γ‖_1 = ∞ and ‖γ‖_2 &lt; ∞), which can lead to SGD converging to a critical point thanks to the use of unbiased estimators of the gradients <ref type="bibr">(Robbins and Monro, 1951;</ref><ref type="bibr">Ranganath et al., 2014;</ref><ref type="bibr">Jankowiak and Obermeyer, 2018)</ref>. However, the specific sequence of the schedule is not specified, and different schedules may affect the speed of convergence differently [cf. <ref type="bibr">Agrawal et al. (2020)</ref>]. Critically, the random nature of estimating the loss function and its gradient makes it impractical to use traditional line-search methods. Additionally, the choice of the number of samples n drawn at each iteration can affect the optimization process, as a larger n provides a more accurate gradient estimate but may increase the computational cost. Balancing this trade-off is an important aspect of algorithm design.</p><p>Moreover, controlling the variance of gradient estimates significantly influences the performance of the optimization algorithm, affecting stability and convergence properties, and further adding to the complexity of the problem. In this context, the choice of the gradient estimator g_{θ_t} is crucial. 
Instead of employing the naïve estimator, obtained by averaging the gradient of ln p(z_{θ_t}(ε), x) − ln q_{θ_t}(z_{θ_t}(ε)), one can consider alternative methods such as the sticking-the-landing estimator <ref type="bibr">(Roeder et al., 2017)</ref> or, when the entropy term H_θ = −E[ln q_θ(z_θ(ε))] is available in closed form, estimating the gradients of E[ln p(z_θ(ε), x)] + H_θ. Although all these estimators are unbiased, they exhibit different variance behaviors, which can impact the optimization process.</p><p>To reduce the variance of the gradient estimator, control variates can also be applied <ref type="bibr">(Ranganath et al., 2014;</ref><ref type="bibr">Geffner and Domke, 2018)</ref>. These choices contribute to the overall complexity of choosing hyperparameters, step size schedules, and the number of samples.</p></div>
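As an illustration of the stochastic-gradient pipeline described in this section, the following sketch (ours; it assumes a 1-D Gaussian approximating family and a toy unnormalized target, with the entropy term taken in closed form) implements a reparameterized gradient estimator and the fixed-step-size update rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized target: ln p(z, x) = -0.5 * (z - 3)^2 + const,
# i.e. the posterior is N(3, 1); its gradient is hand-coded below.
def naive_elbo_grad(theta, n=16):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n)        # fresh noise at every iteration (the SOP view)
    z = mu + sigma * eps            # reparameterization z_theta(eps)
    dlogp = -(z - 3.0)              # d/dz ln p(z, x)
    # Pathwise gradient of E[ln p(z_theta(eps), x)] plus the gradient (0, 1)
    # of the closed-form entropy H = log_sigma + const.
    g_mu = np.mean(dlogp)
    g_log_sigma = np.mean(dlogp * sigma * eps) + 1.0
    return np.array([g_mu, g_log_sigma])

# Plain stochastic gradient ascent with a fixed step size gamma_t = 0.05.
theta = np.array([0.0, 0.0])
for t in range(2000):
    theta = theta + 0.05 * naive_elbo_grad(theta)
# theta[0] approaches 3 and exp(theta[1]) approaches 1
```

Even in this one-dimensional example, the step size 0.05 and the per-iteration sample size n = 16 had to be chosen by hand, which previews the tuning burden the section describes.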
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">METHODS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">SAMPLE AVERAGE APPROXIMATION</head><p>The problem of ELBO maximization in the reparameterization setting of Eq. (<ref type="formula">4</ref>) is formulated as an SOP where the stochasticity comes from a fixed probability distribution, i.e., a probability distribution which does not depend on θ. Furthermore, the function inside the expectation is a smooth function of the parameters θ. Solutions to these problems can be approximated using the sample average approximation (SAA): a sample average over a fixed sample replaces the expectation, effectively transforming the SOP into a deterministic optimization problem.</p><p>We propose to use SAA for black-box VI. To use SAA, we take n i.i.d. samples ε = ε_1, . . . , ε_n from the distribution q_base and define the deterministic training objective function L_ε(θ) = (1/n) Σ_{i=1}^n v_θ(ε_i), which is a function of θ alone. Then, the optimization problem in Eq. (<ref type="formula">4</ref>) can be transformed into the deterministic optimization problem θ*_ε = argmax_θ L_ε(θ), (5) where the v_θ(ε_i) = ln p(z_θ(ε_i), x) − ln q_θ(z_θ(ε_i)) denote the log-weights, also known as log-importance ratios. Since the optimization is performed with the fixed set ε, we refer to it as the training noise.</p><p>We want to recover the optimal parameters θ* of L. In an unconstrained smooth optimization setting, we need to specify how to compute a search direction and a step size. For the search direction, we will use L-BFGS <ref type="bibr">(Broyden, 1970;</ref><ref type="bibr">Fletcher, 2013;</ref><ref type="bibr">Goldfarb, 1970;</ref><ref type="bibr">Shanno, 1970;</ref><ref type="bibr">Nocedal, 1980)</ref>.</p><p>In contrast to the SGD setting, deterministic optimization allows us to specify the step size using line search and ask for it to satisfy the strong Wolfe conditions <ref type="bibr">(Nocedal and Wright, 1999)</ref>. 
Specifically, for 0 &lt; c_1 &lt; c_2 &lt; 1, the step size γ must simultaneously satisfy the modified curvature (MC) and sufficient increase (SI) conditions, that is, ∇L_ε(θ + γr)^T r ≤ c_2 ∇L_ε(θ)^T r (MC), and L_ε(θ + γr) ≥ L_ε(θ) + c_1 γ ∇L_ε(θ)^T r (SI), where r is the search direction.</p><p>We will use L-BFGS with line search to find a local optimum of Eq. (<ref type="formula">5</ref>), and denote the process that does so by Opt(θ, n, ε, τ). Here, τ is the maximum number of iterations for which L-BFGS will run, and θ is an initial value of the parameters. Besides the arguments of L_ε(θ), we also need to specify the value of τ.</p><p>Sandwiching the optimal ELBO Critically, the training objective L_ε(θ) and the ELBO L(θ) may differ for a fixed θ. The ELBO, as defined in Eq. (<ref type="formula">1</ref>), is an expectation over the distribution q_θ, while the training objective is computed based on an average over a fixed sample ε. In contrast, the optimal ELBO refers to the value of the ELBO achieved by the maximizer of Eq. (<ref type="formula">2</ref>), denoted as θ*, and depends only on the target distribution and the approximating family.</p><p>During optimization with a fixed sample of training noise ε_n = ε_1, . . . , ε_n, one might wonder how much the learned parameters θ*_n and the distribution q_{θ*_n} depend on these noise samples, and, in particular, how this dependency translates into the tightness of the gap between the ELBO L(θ*_n) of the learned approximation and its upper bound, the optimal ELBO L(θ*). Fortunately, two results by <ref type="bibr">Mak et al. (1999)</ref> are relevant to our discussion. Note that until the noise variables ε_1, . . . , ε_n are realized, the quantity θ*_n and all functions of it are random. Let ε̂_{n+1} = ε̂_1, . . . , ε̂_{n+1} be a sample of size n + 1 taken i.i.d. from q_base. 
Assuming the deterministic optimization with fixed noise converges to a global optimum, it holds that: (i) the ELBO and training objective sandwich the optimal ELBO (in expectation), that is, E[L(θ*_n)] ≤ L(θ*) ≤ E[L_{ε_n}(θ*_n)], and (ii) the training objective converges monotonically to the optimal ELBO from above (in expectation), that is, E[L_{ε̂_{n+1}}(θ*_{n+1})] ≤ E[L_{ε_n}(θ*_n)].</p><p>In particular, these results mean that we can use standard statistical techniques to quantify the discrepancy between the ELBO L(θ*_n) and the training objective L_ε(θ*_n) by comparing the distribution of the log-weights v_1, . . . , v_n for a fresh sample of noise, referred to as testing noise, and for the training noise, a technique first used by <ref type="bibr">Mak et al. (1999)</ref>. Figure <ref type="figure">2</ref> displays the distribution of log-weights for a growing sample size. As the number of samples increases, the training objective value decreases and approaches that of the ELBO estimate, which in turn increases, indicating progress toward the ultimate goal of ELBO maximization while tightening the gap around the optimal ELBO. We adopt the classical approach of tightening this gap by solving a sequence of SAA approximations for an increasing sequence of sample sizes (n_t)_{t∈N} ⊆ N, which creates a sequence of solutions (θ*_{n_t})_{t∈N}; <ref type="bibr">Shapiro (2003)</ref> gives general convergence results for this approach. Algorithm 1 presents the optimization process, and we will describe it later in this section.</p><p>The algorithm initializes with guess θ_0, sample size n_0, and maximum optimizer iterations τ_0. In each iteration t, we double the sample size to tighten the gap around the optimal ELBO. We draw training noise ε_{n_t} = ε_1, . . . , ε_{n_t} from the base distribution q_base and then use the optimizer to find the maximizer θ*_t of the deterministic objective L_{ε_{n_t}}. 
If the optimizer reaches the iteration limit τ_t, we double its value.</p><p>When the optimizer Opt finishes in a small number of iterations, the parameters may remain almost unchanged, resulting in nearly identical log-weights. Consequently, any convergence test based on these log-weights might not be indicative. Though such behavior could signal convergence, it might be due to chance. To address this uncertainty, we require a minimum of VERY_SMALL_ITER iterations before considering convergence. However, if the optimizer finishes without reaching this number of iterations for three consecutive sample sizes, we stop the process.</p><p>Algorithm 1 (SAA for VI) takes as input an initial guess θ, an initial sample size n, and an iteration limit τ, and returns parameters θ*. Starting from t = 0 and count = 0, it loops while count &lt; 3: it draws training noise, runs the optimizer Opt(θ, n_t, ε, τ_t), records the number of iterations η used by the optimizer, doubles τ_t if the iteration limit was reached, updates count according to whether η reached VERY_SMALL_ITER, and returns the current parameters if count = 0 and converged?(θ, ε_{n_t}, t) holds. The test converged? (Algorithm 2) statistically compares the means of training and testing log-weights, returning True when they cannot be distinguished.</p><p>Stopping Algorithm 2 defines the stopping criterion for our optimization process, which involves computing log-weights. Specifically, given the training noise ε_{n_t} and the parameters θ_t, we compute the log-weights v_{θ_t}(ε_1), . . . , v_{θ_t}(ε_{n_t}), which we denote as v_{θ_t}(ε_{n_t}). We also compute a new set of log-weights using 10k fresh samples of testing noise, denoted by v_{θ_t}(ε̂_{10k}).</p><p>To decide whether to halt or continue the optimization process, we use a two-sided t-test to compare the means of the log-weights. We compare the mean log-weight calculated with the training noise, v_{θ_t}(ε_{n_t}), to the mean log-weight computed from the testing noise, v_{θ_t}(ε̂_{10k}). The null hypothesis asserts that these means are the same. If we cannot reject the null hypothesis, we terminate the optimization process. Although the assumptions required for the t-test (e.g., that the training log-weights are i.i.d.) 
might not strictly hold in all cases, we employ this statistical test as a heuristic for stopping the optimization. Our approach draws inspiration from the methodology outlined in <ref type="bibr">Mak et al. (1999)</ref>. Alternatively, statistical tests such as the Kolmogorov-Smirnov or the Cramér-von Mises test could be used to directly compare the log-weight distributions. In Appendix H, we evaluate these alternatives and show that the t-test is a reasonable choice.</p><p>Our optimization process terminates when the null hypothesis cannot be rejected at a significance level of 1%. Checking for convergence only when count = 0 avoids meaningless tests, as without optimizer updates, the distributions of training log-weights v_{θ_t}(ε_{n_t}) and testing log-weights v_{θ_t}(ε̂_{10k}) would be nearly identical. We also introduce two additional stopping conditions: the maximum number of iterations max_t and the threshold δ for the difference between the training objective L_ε(θ_t) and the estimated ELBO L_{ε̂_{10k}}(θ_t). In our experiments, we set max_t to ensure that the maximum sample size was n_max = 2^18, and δ to 0.01. In Appendix F, we provide a more detailed discussion of the hyperparameters used in our experiments.</p></div>
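A simplified sketch of the overall procedure described in this section (ours, not the authors' implementation; the toy target, 1-D Gaussian family, and fixed round limit are illustrative, and the full algorithm additionally manages the iteration cap τ and the count-based early exit):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def log_p(z):
    # Toy target: ln p(z, x) for a N(3, 1) posterior, up to an additive constant.
    return -0.5 * (z - 3.0) ** 2

def log_q(z, mu, log_sigma):
    # Log-density of the approximating family q_theta = N(mu, sigma^2).
    s = np.exp(log_sigma)
    return -0.5 * ((z - mu) / s) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

def log_weights(theta, eps):
    # v_theta(eps_i) = ln p(z_theta(eps_i), x) - ln q_theta(z_theta(eps_i))
    mu, log_sigma = theta
    z = mu + np.exp(log_sigma) * eps
    return log_p(z) - log_q(z, mu, log_sigma)

theta, n = np.array([0.0, 0.0]), 64
for t in range(12):                       # sequence of SAA problems
    eps_train = rng.normal(size=n)        # fixed training noise for this round
    objective = lambda th: -np.mean(log_weights(th, eps_train))  # deterministic
    theta = minimize(objective, theta, method="L-BFGS-B").x
    # Stopping rule: two-sided t-test comparing the mean of the training
    # log-weights against log-weights computed on fresh testing noise.
    v_train = log_weights(theta, eps_train)
    v_test = log_weights(theta, rng.normal(size=10_000))
    if ttest_ind(v_train, v_test, equal_var=False).pvalue > 0.01:
        break                             # cannot reject equal means: stop
    n *= 2                                # otherwise, tighten the gap
```

The sketch uses SciPy's L-BFGS-B with numerical gradients for brevity; in practice, the paper's setting supplies exact gradients of the deterministic objective via automatic differentiation.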
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RELATED WORK</head><p>In the existing literature, there are efforts to incorporate second-order information into stochastic optimization, which have been applied to VI. <ref type="bibr">Byrd et al. (2016)</ref> introduced a method that employs the L-BFGS update formula through subsampled Hessian-vector products, referred to as batched-L-BFGS or batched quasi-Newton. <ref type="bibr">Liu and Owen (2021)</ref> applied the method from <ref type="bibr">Byrd et al. (2016)</ref> to address the variational inference problem, with the optional inclusion of quasi-Monte Carlo (QMC) sampling to further decrease the variance of the gradient estimator. Both approaches involve a two-step algorithm: (1) updating the parameters at each iteration using L-BFGS's two-loop recursion, and (2) updating the displacement vector s and gradient-difference vector y of L-BFGS every B steps by employing the average of the parameters from the preceding B iterations. In the work of <ref type="bibr">Liu and Owen (2021)</ref>, each iteration involves drawing a fixed-size sample of noise from q_base to estimate the ELBO gradient and conduct the line search. The sample size is not extensively discussed in their work; however, the experiments were conducted with sample sizes of 128 or 256. These values are larger than those typically used in the literature, suggesting that the sample size could indeed be a relevant factor to consider. Our method deviates from the approach proposed by <ref type="bibr">Liu and Owen (2021)</ref> in two key ways. First, we execute a complete deterministic optimization using a fixed set of noise, effectively reducing uncertainty. Second, we seamlessly integrate the sample size consideration into the algorithm itself, consequently minimizing the need for user input. 
As we demonstrate in Section 5.1.2, these differences lead to significant improvements when handling complex target and approximating distributions.</p><p>An alternative approach to incorporating second-order information into the variational inference problem can be found in the work of <ref type="bibr">Zhang et al. (2022)</ref>. Their method employs L-BFGS to identify modes or poles of the posterior distribution. Subsequently, the data generated by L-BFGS is utilized to estimate the posterior covariance around the mode, which is then used to parameterize an approximating distribution. This approach more closely resembles the Laplace approximation than methods that seek approximations to a global optimizer of the ELBO from a fixed parametric family.</p><p>We share a common goal with <ref type="bibr">Welandawe et al. (2022)</ref>, who also drew inspiration from <ref type="bibr">Agrawal et al. (2020)</ref> to develop a system for variational inference that requires minimal user input. However, their method employs SGD for optimizing the ELBO and uses a heuristic schedule to update the step size γ_t during the optimization process. They initially use a fixed step size and incorporate tools to detect when the SGD process reaches stationarity, at which point they decrease the step size by a factor ρ. During the stationary regime, they calculate the average of the parameters and take it as the optimal parameters for the given step size, θ*_{γ_t}. They repeat the process of decreasing the step size until the symmetrized KL divergence between the current distribution and the optimal distribution q* (for the approximating family) falls below a threshold ξ. Notably, since the optimal distribution q* is not known, the authors estimate the KL divergence between q* and the current distribution q_{θ*_{γ_t}}. 
The authors observed that taking the average of the parameters in the stationary regime significantly improves the approximation quality compared to considering each parameter at every iteration.</p><p>In the machine learning literature, the application of sample average approximation has been relatively rare. Some early works include PEGASUS by <ref type="bibr">Ng and Jordan (2000)</ref>, in which the authors addressed partially observable Markov decision processes by replacing the value of a policy (an expectation) with the sample average of the value function applied to a finite number of states for optimization purposes. In a different context, <ref type="bibr">Sheldon et al. (2010)</ref> explicitly utilized the sample average approximation technique in a network design setting, where a na&#239;ve greedy approach was not applicable. More recently, <ref type="bibr">Balandat et al. (2020)</ref> adopted sample average approximation to optimize the acquisition function in Bayesian optimization. SAA was previously used for VI in a specialized capacity in several papers <ref type="bibr">(Giordano et al., 2018;</ref><ref type="bibr">Domke and Sheldon, 2018;</ref><ref type="bibr">Giordano et al., 2019;</ref><ref type="bibr">Domke and Sheldon, 2019;</ref><ref type="bibr">Giordano et al., 2022)</ref>; our work and the concurrent work of <ref type="bibr">Giordano et al. (2023)</ref> are the first to explore its general applicability.</p><p>As mentioned in the introduction, <ref type="bibr">Giordano et al. (2023)</ref> concurrently and independently developed a method based on the sample average approximation for black-box variational inference. The two papers employ the same basic algorithmic idea but have several differences in scope. Unlike <ref type="bibr">Giordano et al. (2023)</ref>, we focus substantially on the case where SAA with a fixed sample size has significant error and therefore one needs to solve a sequence of problems with increasing sample sizes. 
We introduce heuristics that guide the selection of sample sizes and the decision of when to halt the process. On the other hand, <ref type="bibr">Giordano et al. (2023)</ref> exploit the determinism of the SAA problem to develop techniques based on sensitivity analysis and the theory of "linear response covariances" <ref type="bibr">(Giordano et al., 2015,</ref><ref type="bibr">2018)</ref> to improve posterior covariance estimates of black-box VI and to estimate the Monte Carlo error of the SAA procedure, which are outside the scope of our work. They present a theoretical result indicating a failure mode for SAA when the number of samples is too small relative to the dimension of the latent variables. Specifically, for a Gaussian approximation with a dense covariance matrix, the sample size n must be at least the dimension d_Z of the latent space for the SAA problem to be bounded. Interestingly, although they conclude that this limitation prevents the use of SAA for VI with a dense Gaussian approximation, we show in the experiments section that, for interesting models, it is indeed feasible. Two reasons for this discrepancy are: (1) our SAA sample sizes can reach up to 2^18, unlike their usual size of 30, and (2) our largest model has 501 latent variables, whereas theirs have up to 15K. Thus, their theoretical result provides useful guidance on the limitations of SAA for VI, while our empirical work shows that SAA for VI can be practical up to quite large sample sizes. We have provided an addendum in Appendix E that uses their theoretical result to improve our method: when using a dense approximation, the sequence of SAA problems should begin with a sample size larger than d_Z; this makes SAA for VI even faster by avoiding wasted effort on small sample sizes.</p><p>Another related work is <ref type="bibr">Gianniotis et al. 
(2016)</ref>, who independently developed a variational inference approach that uses the reparameterization trick with Gaussian distributions and optimizes a Monte Carlo approximation to the ELBO, mirroring work in the machine learning community [cf. <ref type="bibr">Kucukelbir et al. 2017;</ref><ref type="bibr">Kingma and Welling 2013;</ref><ref type="bibr">Ranganath et al. 2014</ref>]. Like us, <ref type="bibr">Gianniotis et al. (2016)</ref> also optimize over a fixed set of noise samples and assess overfitting. However, there are several differences in focus. Most notably, our work is developed in the more general context of BBVI and applies to a wide range of models and approximating families, as long as they are reparameterizable.</p><p>We also frame our approach within the lens of stochastic optimization, which is consistent with our overall goal of using second-order information to solve the BBVI problem in a robust way that requires minimal user input.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>We now present experiments comparing SAA to other methods in terms of optimization quality and running time. We examine two types of models following the setup of <ref type="bibr">Burroni et al. (2023)</ref>. These include 11 models from the Stan model library<ref type="foot">foot_0</ref> <ref type="bibr">(Stan Development Team, 2021;</ref><ref type="bibr">Carpenter et al., 2017)</ref> as well as Bayesian logistic regression models applied to 6 UCI datasets <ref type="bibr">(Dua and Graff, 2017</ref>).<ref type="foot">foot_1</ref> Details of the datasets are in Appendix A. For each model p(Z, x), where Z is a d_Z-dimensional random vector, the approximating distribution q_θ can be either a diagonal Gaussian or a d_Z-dimensional multivariate Gaussian distribution. The former is a product of d_Z independent Gaussians, where the parameters µ_i and σ²_i &gt; 0 are specific to each Z_i. The latter has parameters µ and LL^T, where L ∈ R^{d_Z×d_Z} is a lower-triangular matrix whose diagonal elements are kept positive by applying the softplus transformation.<ref type="foot">3</ref> We run all our experiments on GPUs.</p><p>We conduct performance comparisons and an ablation study. We compare primarily to Adam with a fixed step size, which is commonly used for black-box VI optimization, and batched quasi-Newton, a newer method that introduces second-order information into the optimization process. We also compare to AdaGrad. For all baseline methods, we use the naïve gradient estimator described in Section 2. When using Gaussian approximating distributions, this estimator corresponds to the one obtained when the entropy term is computed in closed form. In the ablation study, we explore how our decisions affect the algorithm's performance.</p></div>
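The dense-Gaussian parameterization just described can be sketched as follows (an illustrative NumPy fragment of ours; function names are hypothetical): the covariance is LL^T with L lower triangular, the softplus transform keeps its diagonal positive, and samples come from the reparameterization z = µ + Lε.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def build_L(raw, d):
    """Map an unconstrained vector of length d*(d+1)//2 to a lower-triangular
    Cholesky factor with a strictly positive (softplus-transformed) diagonal."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = raw
    diag = np.arange(d)
    L[diag, diag] = softplus(L[diag, diag])
    return L

def sample_q(mu, L, n, rng):
    """Reparameterized samples z = mu + L @ eps with eps ~ N(0, I)."""
    eps = rng.standard_normal((n, len(mu)))
    return mu + eps @ L.T

d = 3
rng = np.random.default_rng(0)
raw = rng.standard_normal(d * (d + 1) // 2)  # unconstrained parameters for L
L = build_L(raw, d)
mu = np.zeros(d)
z = sample_q(mu, L, n=100_000, rng=rng)
# The empirical covariance of z approaches L @ L.T as n grows
```

The softplus on the diagonal makes the Cholesky factor unique and the parameterization unconstrained, so a quasi-Newton optimizer can work directly in raw parameter space.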
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">PERFORMANCE COMPARISON</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1">Adam</head><p>Adam (Kingma and Ba, 2015) is a standard default optimizer for BBVI. The choice of step-size is less critical with Adam than with SGD, but it is still a factor to consider. For each experiment (a combination of model and approximating family), we ran Adam with three different step-sizes (0.1, 0.01, and 0.001) and 20 repetitions of each combination. At each iteration, we estimated the gradient of the ELBO by taking 16 samples from q θ . For each model and approximating family, we selected, in hindsight, the step-size that provided the highest median ELBO across the 20 repetitions. (See Appendix B for more details on the Adam experiments.) For SAA for VI, we used the algorithm described in Section 3.2 with the default parameter values of Table <ref type="table">15</ref> in the appendix.</p><p>We conducted two comparisons. First, we assessed the median ELBO, obtained across 20 repetitions, at the end of the optimization process using both Adam and SAA for VI. We initially ran Adam for 40,000 iterations, but found that more iterations were needed for some models and increased the number for models such as election88, electric, irt, madelon, and radon; see details in the appendix.</p><p>In the second comparison, we focused on the time required to reach a specified ELBO. For each model and approximating distribution, we defined a "benchmark ELBO" as the smaller of the two median ELBO values achieved by Adam and SAA, respectively, across their 20 repetitions. In other words, this ELBO value was achieved in at least half of the runs by both optimizers. 
We then evaluated how long it took each method to reach an ELBO value within 1 nat of the benchmark ELBO.</p><p>Figure <ref type="figure">1</ref> and Table <ref type="table">1</ref> show summary results for Stan models and Bayesian logistic regression with dense Gaussian distributions; detailed ELBO comparisons and additional models appear in Table <ref type="table">3</ref> in the appendix. See also Figure <ref type="figure">3</ref>, which compares the final ELBO values obtained by all methods evaluated. Although Adam occasionally attains marginally superior median ELBO values for certain models (a consequence of the stopping criterion of SAA for VI), SAA for VI consistently achieves higher median ELBOs for complex models. Adam's performance was erratic for models like election88Exp, and it had a tendency to diverge, especially for the hepatitis model when optimized beyond 40,000 iterations. This divergence partially accounts for the pronounced disparity in median ELBOs between Adam and SAA for VI. It is possible that Adam could achieve higher ELBO values by searching over a finer step-size grid; however, it is exactly this type of difficult and time-intensive tuning that we seek to avoid with SAA. Table <ref type="table">4</ref> in the appendix lists the time each method takes to achieve the adjusted ELBO, along with the respective ratios. SAA for VI is almost always faster, often by factors of 10 to 100. For instance, optimizing the electric model using Adam takes about a minute, whereas SAA for VI accomplishes the same in under 2 seconds, making SAA more than 30 times faster. Note that for Adam we only counted the compute time of the best-performing of the three learning rates, making the comparison even more favorable for SAA for VI. Since GPUs allow for vectorized multi-sample model evaluation, wall-clock time in seconds serves as the most meaningful metric for comparing the compute time of both methods. 
Given these results, we conclude that SAA for VI is a faster alternative to Adam in these scenarios.</p><p>A natural question is whether, when moving to a larger sample size, we should warm-start the inner optimization with θ * t or instead draw a new set of parameters. Pasupathy (2010) provides an intuition for why a warm start is helpful: in principle, the optimization process for larger sample sizes begins from a point that is probably close to a solution. To empirically verify this intuition, we conducted an experiment comparing the performance of warm starts and freshly drawn parameters across different models and approximating distributions. For each combination of model and distribution, we ran the sequence of SAA problems until convergence, using either a warm start or newly sampled parameters at the beginning of each inner optimization. Specifically, at each iteration t , we initialized the process either with the previously computed optimal parameters θ * t-1 (warm start) or with a new random set of parameters (fresh start). Our results, presented in Table 12 in the appendix, show that using a warm start significantly reduces the total time taken to converge. For example, on the election88 dataset, using fresh samples takes 20× more time than using a warm start.</p></div>
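The warm-start scheme evaluated above can be sketched as follows, using a toy quadratic stand-in for the inner SAA objective (all names are illustrative; the real inner solver is a quasi-Newton method, not plain gradient descent):

```python
import numpy as np

def solve_saa(theta0, eps, steps=200, lr=0.1):
    """Minimize a toy deterministic SAA objective, the average of
    0.5 * ||theta - eps_i||^2 over a fixed noise sample, by gradient descent."""
    theta = theta0.copy()
    for _ in range(steps):
        grad = theta - eps.mean(axis=0)  # gradient of the averaged quadratic
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
d = 2
theta = rng.normal(size=d)            # initial parameters
for n in [8, 32, 128]:                # increasing sample sizes
    eps = rng.normal(size=(n, d))     # noise drawn once, then fixed
    theta = solve_saa(theta, eps)     # warm start: reuse the previous solution
    # a fresh start would instead call solve_saa(rng.normal(size=d), eps)
```

Each outer step hands the previous optimum to the next, larger SAA problem, which is the warm-start strategy compared against fresh restarts in the ablation.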
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">LIMITATIONS AND DIRECTIONS FOR EXTENSIONS</head><p>We presented the SAA for VI method under the assumption that the approximating family is reparameterizable. This assumption was explicit in the formulation of the stochastic optimization problem in Equation ( <ref type="formula">4</ref>) and the deterministic optimization problem in Equation ( <ref type="formula">5</ref>). While we focused on Gaussian approximating families, the SAA for VI algorithm can be applied to other reparameterizable families, such as normalizing flows (Tabak and Turner, 2013; <ref type="bibr">Rezende and Mohamed, 2015;</ref><ref type="bibr">Kingma et al., 2016;</ref><ref type="bibr">Papamakarios et al., 2021;</ref><ref type="bibr">Agrawal et al., 2020)</ref>. An interesting direction for future work would be extending SAA for VI to non-reparameterizable families. In this context, recent work by <ref type="bibr">Zimmermann et al. (2024)</ref> proposed optimizing a forward KL divergence objective using the sample average approximation while removing the reparameterizability restriction.</p><p>Two other limitations relate to scalability: (1) our method does not currently scale to models with very large numbers of latent variables unless diagonal Gaussian approximating families are used, and (2) SAA for VI using quasi-Newton methods does not support subsampling for models with large numbers of local latent variables. For (1), future work can consider extensions that enrich the variational family beyond diagonal Gaussians while retaining the scalability of SAA, such as hierarchical distributions <ref type="bibr">(Agrawal and Domke, 2021)</ref> or normal distributions with a "diagonal plus low-rank" covariance structure <ref type="bibr">(Tomczak et al., 2020)</ref>. 
Other tools that gradually increase the capacity of the covariance parameterization, such as the Householder flow <ref type="bibr">(Tomczak and Welling, 2017)</ref>, could also be considered. For (2), future work may consider alternative optimizers for the deterministic subproblem of SAA for VI that support data subsampling while still benefiting from optimizing over a fixed random sample, such as first-order methods that exploit a finite-sum structure <ref type="bibr">(Vaswani et al., 2019)</ref>.</p><p>Lastly, we used PyTorch's off-the-shelf implementation of L-BFGS in our experiments <ref type="bibr">(Paszke et al., 2019)</ref>. While it generally produces good results, we observed some failure cases due to issues while bracketing the step size, leading to NaNs or Infs and causing the optimization to fail. Such a failure triggers the need for a larger sample size, increasing computational cost. A more robust implementation could potentially recover from these failures and continue the optimization process.</p></div>
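To make the role of the fixed sample concrete, the following minimal sketch optimizes a fixed-sample (SAA) ELBO for a diagonal Gaussian against an unnormalized standard-normal target, using an off-the-shelf L-BFGS routine. It is a toy stand-in under stated assumptions, not our implementation (which uses PyTorch's L-BFGS and richer families):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 5, 64
eps = rng.normal(size=(n, d))        # noise is drawn once and then fixed

def neg_elbo(params):
    """Deterministic SAA negative ELBO for q = N(mu, diag(sigma^2))
    against the unnormalized target log p(z) = -0.5 * ||z||^2."""
    mu, log_sigma = params[:d], params[d:]
    sigma = np.exp(log_sigma)
    z = mu + eps * sigma                     # reparameterized samples
    log_p = -0.5 * np.sum(z ** 2, axis=1)    # target log-density (up to a constant)
    entropy = np.sum(log_sigma)              # Gaussian entropy up to a constant
    return -(log_p.mean() + entropy)

# because eps is fixed, the objective is deterministic and line search is valid
res = minimize(neg_elbo, np.zeros(2 * d), method="L-BFGS-B")
mu_hat, sigma_hat = res.x[:d], np.exp(res.x[d:])
```

Since the noise is frozen, every call to `neg_elbo` returns the same value for the same parameters, which is what lets quasi-Newton updates and line searches operate as on any smooth deterministic problem.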
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>In this paper, we introduced the SAA for VI algorithm, which solves variational inference problems effectively and accurately while significantly reducing the reliance on manual hyperparameter tuning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A DATASETS DESCRIPTION</head><p>We use the same datasets as <ref type="bibr">Burroni et al. (2023)</ref>. The table below, adapted from their paper, provides a summary of the datasets employed. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B DETAILED COMPARISON WITH ADAM</head><p>We provide further details on the experimental setup introduced in Section 5.1. We used the Adam optimizer with the default parameters from the torch.optim package in PyTorch <ref type="bibr">(Paszke et al., 2019)</ref>, with the exception of the step-size, which we varied across 0.1, 0.01, and 0.001. As approximating distributions, we used a Gaussian with a diagonal covariance matrix and a more expressive Gaussian with a dense covariance matrix. We show the median ELBO values achieved by Adam and SAA for VI in Table <ref type="table">3</ref>, and the running time in Table <ref type="table">4</ref>. Additionally, we provide the results disaggregated by step-size in Tables <ref type="table">5</ref> and <ref type="table">6</ref>. In all instances, we conducted 20 repetitions of the experiments, estimating the objective function with 16 samples from the variational approximation q θ t . Every 100 iterations, we estimated the ELBO using 10,000 fresh samples from q θ t . Although our initial experiments spanned 40,000 iterations, the dense approximation yielded unsatisfactory results for certain models, so we extended the number of iterations for these models. Specifically, the irt model was run for 200,000 iterations, while the madelon, election88, electric, and radon models were run for 400,000 iterations. Despite these extensions, only minor changes in the maximum achieved ELBO were observed. Notably, the hepatitis model diverged when executed beyond 40,000 iterations using the dense approximation. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 LARGER SCALE EXPERIMENT</head><p>To compare the performance of SAA for VI and Adam on a larger model, we used the stochastic volatility model from the Stan library <ref type="bibr">(Carpenter et al., 2017)</ref>. Following <ref type="bibr">Lai et al. (2022)</ref>, we modeled the exchange rates of 23 international currencies against the US dollar as stochastic volatilities. 4 To increase the complexity of the task, we employed daily data from 2021/01/01 to 2023/12/31, resulting in a model with 17,228 latent variables. In Figure <ref type="figure">4</ref>, we present the ELBO achieved by SAA for VI and Adam over time, showing that SAA for VI reaches an ELBO higher than the best ELBO achieved by Adam, several times faster. When using the diagonal covariance approximation, the speed improvement of SAA for VI is notably higher, reaching at least an order of magnitude in most cases. See Section 5.1 for more information. Table 6: Maximum ELBO achieved by Adam and SAA for VI with a Gaussian distribution with dense covariance matrix as approximating distribution: median across seeds. The table shows the median of the maximum ELBO achieved by Adam</p><p>and SAA for each model when using a Gaussian distribution with a dense covariance matrix as the approximating distribution. For each step-size used with Adam, we ran the algorithm 20 times and report the median of the maximum ELBO achieved.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C DETAILED COMPARISON WITH BATCHED QUASI-NEWTON</head><p>In this section, we provide further details regarding experiments conducted using the batched quasi-Newton method as described by <ref type="bibr">Liu and Owen (2021)</ref>. We compare the maximum ELBO attained by the batched quasi-Newton to that achieved by SAA for VI. This comparison is made for both the Gaussian distribution with a diagonal covariance matrix (Table <ref type="table">8</ref>) and the one with a dense covariance matrix (Table <ref type="table">9</ref>). While results in the diagonal scenario align closely with ours, the batched quasi-Newton method often converges to a suboptimal solution in the dense case.</p><p>Additionally, we report the wall-clock time for each experiment in Table <ref type="table">10</ref>. We executed each experiment for 40,000 iterations and performed 20 independent runs for each one. Our method incorporates a stopping criterion based on convergence. To ensure a fair comparison with batched quasi-Newton, we need to detect when the algorithm converges. To approximate this, we first calculate the highest ELBO for each of the 20 independent runs using both batched quasi-Newton and SAA for VI. Then, we compute the median ELBO value across the repetitions for each method. Finally, we determine the minimum median ELBO value between the two methods and calculate the total time taken until the algorithm reaches within 1 nat of this minimum median ELBO value. These results are presented in Table <ref type="table">10</ref>.</p><p>Similar to the experiments with Adam, this calculation does not account for the time spent on sample sizes that were not useful.</p></div>
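The time-to-target computation described above can be sketched as follows (the traces and values are hypothetical; function names are ours):

```python
import numpy as np

def benchmark_target(max_elbos_a, max_elbos_b):
    """Benchmark ELBO: the smaller of the two per-method medians of the
    per-run maximum ELBO, so both methods reach it in at least half the runs."""
    return min(np.median(max_elbos_a), np.median(max_elbos_b))

def time_to_reach(times, elbos, target, tol=1.0):
    """First recorded wall-clock time at which an ELBO trace comes within
    `tol` nats of the target; infinity if it never does."""
    for t, e in zip(times, elbos):
        if e >= target - tol:
            return t
    return float("inf")

# toy per-run maxima for two methods, and one run's (time, ELBO) trace
target = benchmark_target([-10.0, -10.2, -10.4], [-9.8, -10.1, -10.3])
t_a = time_to_reach([1, 2, 3], [-15.0, -12.0, -10.5], target)
```

Applying `time_to_reach` to every run and taking the median gives the per-method times compared in Table 10.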
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Diagonal Gaussian</head><p>Table <ref type="table">9</ref>: Final ELBO achieved by the batched quasi-Newton method for VI using a Gaussian distribution with a dense covariance matrix <ref type="bibr">(Liu and Owen, 2021)</ref>. The results for SAA for VI are included as a benchmark (column (v) of Table <ref type="table">3</ref>). The batched quasi-Newton method frequently converges to suboptimal solutions, indicated by ✗, especially in models from the Stan examples repository. In models like election88, the SAA for VI method demonstrates a significant performance advantage. The initial sample size for the batched quasi-Newton method was set to 16 and increased when necessary to improve the method's ELBO. Table <ref type="table">10</ref>: Comparison of running time between SAA for VI and batched quasi-Newton across various models and approximating distributions. The analysis for the approximation using a dense covariance matrix considers runs with a batch size of 128 for batched quasi-Newton. For models marked with ✗, indicating failure of batched quasi-Newton in the dense covariance matrix approximation, reports are limited to madelon and irt as they closely approach the maximum ELBO. The table also shows the running-time improvement of SAA for VI over batched quasi-Newton; values greater than 1 imply that SAA for VI is faster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D ADDITIONAL RESULTS FOR SAA FOR VI</head><p>Table <ref type="table">4</ref> shows the median time taken by SAA for VI to reach the maximum ELBO achieved by Adam. In this section, we present the total time taken by SAA for VI until its completion. Notably, for some models like election88, SAA reached an ELBO over 200 nats higher than Adam, clarifying the discrepancies between Table <ref type="table">11</ref> and <ref type="table">Table 4</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E ADDENDUM</head><p>As mentioned in the related work section, a result by <ref type="bibr">Giordano et al. (2023)</ref> demonstrates the futility of using a sample size smaller than the dimension of the latent space for the ELBO optimization problem. In this section, we provide a proof sketch of this result, adapted to our notation.</p><p>Theorem E.1 (Theorem 2 of <ref type="bibr">Giordano et al. (2023)</ref>). Let q θ be a Gaussian distribution with parameters θ = (µ, LL T ), where µ ∈ R d Z and L ∈ R d Z ×d Z is a lower-triangular matrix with positive diagonal elements. If we draw a sample of size n &lt; d Z from q base , denoted by ε 1 , . . . , ε n , then the optimization problem in Eq. ( <ref type="formula">5</ref>) is unbounded: its objective can be made arbitrarily large.</p><p>Proof. Since n &lt; d Z , there exists a nonzero vector v ∈ R d Z such that ⟨v, ε i ⟩ = 0 for all 1 ≤ i ≤ n . Without loss of generality, assume that the largest index ℓ with v ℓ ≠ 0 satisfies v ℓ = 1 . Define the lower-triangular matrix L λ whose ℓ-th row is λv T and whose remaining rows are those of a fixed lower-triangular matrix with positive diagonal, and write L 0 for L 1 ; since v is supported on indices at most ℓ and v ℓ = 1 , L λ is lower triangular with positive diagonal for λ &gt; 0 .</p><p>Then, we have (L λ ε i ) ℓ = λ⟨v, ε i ⟩ = 0 = (L 0 ε i ) ℓ for all 1 ≤ i ≤ n , so L λ ε i does not depend on λ and the sampled term of the objective is constant in λ , while the log-determinant term contributes log λ . Let θ λ = (0, L λ L T λ ). For λ &gt; 0 , we obtain an objective value of log λ + c,</p><p>where c is a constant independent of λ .</p><p>The result follows by letting λ → ∞ .</p><p>With this result in mind, we adapted the SAA for VI algorithm so that, in the case of a dense covariance matrix approximation, it draws a sample of size n set to twice the smallest power of two exceeding the latent space dimension d Z .</p><p>Tables <ref type="table">13</ref> and 14 present the experimental results alongside the previously computed results. As observed, starting with a larger sample size allows us to reduce the number of iterations required to achieve a given accuracy. Furthermore, this reduction is substantial. 
This outcome was anticipated because, when the problem was unbounded, the optimization process for smaller n typically terminated only when the maximum number of iterations was reached, meaning the entire computational budget was consumed. Table 14: Comparison of running time, in seconds, for batched quasi-Newton and SAA for VI across various datasets, using a Gaussian approximating distribution with a dense covariance matrix, showing the running-time improvement of SAA for VI over batched quasi-Newton. The minimum sample size n for SAA for VI is displayed. For models where the batched quasi-Newton method did not fully converge (✗), we only show results for mushrooms and irt, as the others diverged. Two settings are considered: one with a minimum n of 32 for all datasets (used in this paper [cf. Table 10]), and another with the minimum sample size set to the nearest power of 2 greater than twice d Z , the dimension of the latent space. As in Table 13, the results indicate that avoiding small sample sizes can significantly reduce the running time of SAA for VI.</p></div>
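The unboundedness argument of Theorem E.1 can be checked numerically: with n &lt; d Z, a direction v orthogonal to all fixed samples lets the fixed-sample ELBO grow like log λ. The sketch below assumes a standard-normal target and an identity base matrix for simplicity (names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 4                       # fewer samples than latent dimensions
eps = rng.normal(size=(n, d))

# v lies in the right null space of the sample matrix: <v, eps_i> = 0 for all i
_, _, vt = np.linalg.svd(eps)
v = vt[-1] / vt[-1][-1]           # scale so the last (largest-index) entry is 1

def saa_elbo(lam):
    """Fixed-sample ELBO of N(0, L L^T) against log p(z) = -0.5 * ||z||^2,
    where L is the identity with its last row replaced by lam * v."""
    L = np.eye(d)
    L[-1] = lam * v               # lower triangular; diagonal entry lam > 0
    z = eps @ L.T                 # last coordinate is lam * <v, eps_i> = 0
    return np.mean(-0.5 * np.sum(z ** 2, axis=1)) + np.log(np.abs(np.linalg.det(L)))

# the sampled term is constant, so the ELBO increases by log(10) per step
vals = [saa_elbo(l) for l in (1.0, 10.0, 100.0)]
```

The fixed sample cannot detect the exploding direction, so the log-determinant (entropy) term drives the objective to infinity, exactly as in the proof.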
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F HYPERPARAMETERS</head><p>As with any optimization algorithm, our implementation of the SAA for VI algorithm uses certain constants and hyperparameters. Table <ref type="table">15</ref> details the purpose of each such hyperparameter, along with the rationale behind our chosen values. We emphasize that SAA for VI performs well across many models without tuning these parameters (our experiments used a single setting): many can be considered constants, while others control tradeoffs between computation and precision in a straightforward way, such as tolerance parameters. While the current hyperparameter values are not tuned, we are open to the possibility of further enhancing the algorithm's performance through careful tuning.</p><p>The sequence of sample sizes is controlled by the first two hyperparameters. We tested a variety of exponentially increasing sequences and determined that the performance was largely unaffected by the specific choice. However, the initial sample size had a more pronounced effect on performance, since it can 'save work' by skipping smaller sample sizes when larger ones will be required anyway. This is not always predictable; our addendum, following the concurrent work of <ref type="bibr">Giordano et al. (2023)</ref>, refines SAA for VI by tuning this value based on the model and approximation family.</p><p>The remaining hyperparameters, listed last in the table, mainly dictate when to halt the process. For example, a user may deem being 1 nat away from the optimum as adequate, thus setting δ to 1 instead of 0.01. The α (significance level for the t-test) could also be adjusted depending on the desired balance between computation cost and approximation precision. Similar parameters are used in most implementations of other optimization algorithms (maximum iterations, absolute/relative tolerance, etc.) 
and tend to be less critical than parameters like step-sizes, since they affect the trade-off between computational time and numerical precision rather than the fundamental operation of the algorithm. </p></div>
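The t-test-based stopping rule governed by α (and ablated in Appendix H) can be sketched as follows; the significance level and sample sizes here are illustrative, not our defaults:

```python
import numpy as np
from scipy import stats

def should_continue(train_log_w, fresh_log_w, alpha=0.05):
    """Keep optimizing while the mean of the training log-weights differs
    significantly from the mean on fresh samples (a sign of overfitting
    to the fixed sample)."""
    _, p = stats.ttest_ind(train_log_w, fresh_log_w, equal_var=False)
    return bool(p < alpha)

rng = np.random.default_rng(0)
# clearly separated means: the test should flag a significant difference
overfit = should_continue(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
```

A smaller α makes the rule more conservative about declaring a difference, trading extra computation for tighter agreement between training and fresh estimates.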
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G ADDITIONAL EXPERIMENTS WITH ADAM AND ADAGRAD</head><p>This section provides supplementary experimental findings for Adam and AdaGrad. We further explore the performance of Adam with two additional sample sizes: 1 and 256. For AdaGrad, we kept the sample sizes consistent with those discussed in the main body of the text. The step-size search across 0.1, 0.01, and 0.001 remains unchanged in all experiments.</p><p>As the sample size increases, the maximum ELBO value in most models tends towards the one obtained using SAA for VI, as demonstrated in Table <ref type="table">16</ref>. (We do not show the results for Adam with n = 1 because those were of poorer quality than the results for n = 16 .) Despite this improvement, some models still exhibit significant disparities. It is important to note that Adam's computational cost continues to be higher than that of SAA for VI, as evidenced by Table <ref type="table">17</ref>. Note that the same instances of SAA for VI were used in all scenarios; that is, for each run of SAA for VI, Adam was run nine times. This repetition is not reflected in the presented numbers.</p><p>In AdaGrad's case, as shown in the tables, there are promising results for Bayesian logistic regression models. However, the same performance does not extend to the Stan models. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H STATISTICAL TEST ABLATION</head><p>Algorithm 2 employs a statistical test to decide whether to continue or stop training. Specifically, training continues as long as the means of the log-weights used for training and of a new set of log-weights (i.e., an estimate of the ELBO) are statistically different. An alternative approach would be to compare the distributions of the training and testing log-weights using tests designed for this task. To examine this, we conducted experiments similar to those in Section 5, comparing distributions with the two-sample Kolmogorov-Smirnov test (KS-test) and the two-sample Cramér-von Mises test (CvM-test). The findings are detailed in Table <ref type="table">18</ref>. Across all cases, the outcomes closely resemble those achieved with the t-test. Generally, the algorithm runs slightly longer when using the t-test than with the KS-test or the CvM-test. This difference is attributed to the greater statistical power gained from comparing means rather than distributions. When comparing distributions, the CvM-test yields marginally better results than the KS-test, owing to the CvM-test's higher statistical power <ref type="bibr">(Stephens, 1974)</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>congress, election88, election88Exp, electric, hepatitis, irt, mesquite, radon, and wells</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>a1a, australian, ionosphere, madelon, mushrooms, and sonar</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Following<ref type="bibr">Kucukelbir et al. (2017)</ref>, we transform the model p into one with unconstrained real-valued latent variables, using PyTorch's<ref type="bibr">(Paszke et al., 2019)</ref> constraints framework.</p></note>
		</body>
		</text>
</TEI>
