<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Amortized Inference Regularization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10080180</idno>
					<idno type="doi"></idno>
					<title level='j'>Proc. 32nd Annual Conference on Neural Information Processing Systems</title>

					<author>Rui Shu</author><author>Hung H Bui</author><author>Shengjia Zhao</author><author>Mykel J Kochenderfer</author><author>Stefano Ermon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The variational autoencoder (VAE) is a popular model for density estimation and representation learning. Canonically, the variational principle suggests preferring an expressive inference model so that the variational approximation is accurate. However, it is often overlooked that an overly-expressive inference model can be detrimental to the test set performance of both the amortized posterior approximator and, more importantly, the generative density estimator. In this paper, we leverage the fact that VAEs rely on amortized inference and propose techniques for amortized inference regularization (AIR) that control the smoothness of the inference model. We demonstrate that, by applying AIR, it is possible to improve VAE generalization on both inference and generative performance. Our paper challenges the belief that amortized inference is simply a mechanism for approximating maximum likelihood training and illustrates that regularization of the amortization family provides a new direction for understanding and improving generalization in VAEs.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Variational autoencoders are a class of generative models with widespread applications in density estimation, semi-supervised learning, and representation learning <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4]</ref>. A popular approach for the training of such models is to maximize the log-likelihood of the training data. However, maximum likelihood is often intractable due to the presence of latent variables. Variational Bayes resolves this issue by constructing a tractable lower bound of the log-likelihood and maximizing the lower bound instead. Classically, Variational Bayes introduces per-sample approximate proposal distributions that need to be optimized using a process called variational inference. However, per-sample optimization incurs a high computational cost. A key contribution of the variational autoencoding framework is the observation that the cost of variational inference can be amortized by using an amortized inference model that learns an efficient mapping from samples to proposal distributions. This perspective portrays amortized inference as a tool for efficiently approximating maximum likelihood training. Many techniques have since been proposed to expand the expressivity of the amortized inference model in order to better approximate maximum likelihood training <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8]</ref>.</p><p>In this paper, we challenge the conventional role that amortized inference plays in variational autoencoders. For datasets where the generative model is prone to overfitting, we show that having an amortized inference model actually provides a new and effective way to regularize maximum likelihood training. Rather than making the amortized inference model more expressive, we propose instead to restrict the capacity of the amortization family. 
Through amortized inference regularization (AIR), we show that it is possible to reduce the inference gap and increase the log-likelihood performance on the test set. We propose several techniques for AIR and provide extensive theoretical and empirical analyses of our proposed techniques when applied to the variational autoencoder and the importance-weighted autoencoder. By rethinking the role of the amortized inference model, amortized inference regularization provides a new direction for studying and improving the generalization performance of latent variable models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Notation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Variational Inference and the Evidence Lower Bound</head><p>Consider a joint distribution p θ (x, z) parameterized by θ, where x ∈ X is observed and z ∈ Z is latent. Given a uniform distribution p(x) over the dataset D = {x (i) }, maximum likelihood estimation performs model selection using the objective</p><p>max θ E p(x) ln p θ (x) = max θ E p(x) ln ∫ Z p θ (x, z) dz. (1)</p><p>However, marginalization of the latent variable is often intractable; to address this issue, it is common to employ the variational principle to maximize the following lower bound</p><p>ln p θ (x) ≥ E q(z) [ln p θ (x, z) − ln q(z)] = ln p θ (x) − D(q(z) ‖ p θ (z | x)), q ∈ Q, (2)</p><p>where D is the Kullback-Leibler divergence and Q is a variational family. This lower bound, commonly called the evidence lower bound (ELBO), converts log-likelihood estimation into a tractable optimization problem. Since the lower bound holds for any q ∈ Q, the variational family Q can be chosen to ensure that q(z) is easily computable, and the lower bound is optimized to select the best proposal distribution q * x (z) for each x ∈ D.</p></div>
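The tightness property of the ELBO can be checked numerically on a toy conjugate model where the evidence is available in closed form. The sketch below is illustrative only (the model p(z) = N(0, 1), p(x | z) = N(z, 1), the test point, and all names are assumptions, not taken from the paper): the bound is exact when the proposal q equals the true posterior p(z | x), and strictly loose for any other proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative assumption): p(z) = N(0,1), p(x|z) = N(z,1),
# hence p(x) = N(0,2) and the true posterior is p(z|x) = N(x/2, 1/2).
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo(x, q_mean, q_var, n_samples=100_000):
    # Monte Carlo estimate of E_q[ln p(x, z) - ln q(z)]
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    return float(np.mean(log_joint - log_normal(z, q_mean, q_var)))

x = 1.3
log_px = float(log_normal(x, 0.0, 2.0))  # exact evidence ln p(x)
tight = elbo(x, x / 2, 0.5)              # proposal = true posterior: bound is exact
loose = elbo(x, 0.0, 1.0)                # mismatched proposal: bound is strictly loose
```

With the exact posterior as proposal, the per-sample quantity ln p(x, z) − ln q(z) is constant and equals ln p(x), so the slack of the bound is exactly the KL term in Eq. (2).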
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Amortization and Variational</head><p>Autoencoders <ref type="bibr">[1,</ref><ref type="bibr">9]</ref> proposed to construct p(x | z) using a parametric function g &#952; &#8712; G(P) : Z &#8594; P, where P is some family of distributions over x, and G is a family of functions indexed by parameters &#952;. To expedite training, they observed that it is possible to amortize the computational cost of variational inference by framing the per-sample optimization process as a regression problem; rather than solving for the optimal proposal q * x (z) directly, they instead use a recognition model</p><p>x (z). The functions (f &#966; , g &#952; ) can be concisely represented as conditional distributions, where</p><p>The use of amortized inference yields the variational autoencoder, which is trained to maximize the variational autoencoder objective</p><p>We omit the dependency of (p(z), g) on &#952; and f on &#966; for notational simplicity. In addition to the typical presentation of the variational autoencoder objective (LHS), we also show an alternative formulation (RHS) that reveals the influence of the model capacities F, G and distribution family capacities Q, P on the objective function. In this paper, we use (q &#966; , f ) interchangeably, depending on the choice of emphasis. To highlight the relationship between the ELBO in Eq. ( <ref type="formula">2</ref>) and the standard variational autoencoder objective in Eq. ( <ref type="formula">5</ref>), we shall also refer to the latter as the amortized ELBO.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Amortized Inference Suboptimality</head><p>For a fixed generative model, the optimal unamortized and amortized inference models are</p><p>A notable consequence of using an amortization family to approximate variational inference is that Eq. ( <ref type="formula">5</ref>) is a lower bound of Eq. ( <ref type="formula">2</ref>). This naturally raises the question of whether the learned inference model can accurately approximate the mapping x &#8594; q * x (z). To address this question, <ref type="bibr">[10]</ref> defined the inference, approximation, and amortization gaps as</p><p>Studies have found that the inference gap is non-negligible <ref type="bibr">[11]</ref> and primarily attributable to the presence of a large amortization gap <ref type="bibr">[10]</ref>.</p><p>The amortization gap raises two critical considerations. On the one hand, we wish to reduce the training amortization gap &#8710; am (p train ). If the family F is too low in capacity, then it is unable to approximate x &#8594; q * x and will thus increase the amortization gap. Motivated by this perspective, <ref type="bibr">[5,</ref><ref type="bibr">12]</ref> proposed to reduce the training amortization gap by performing stochastic variational inference on top of amortized inference. In this paper, we take the opposing perspective that an over-expressive F hurts generalization (see Appendix A) and that restricting the capacity of F is a form of regularization that can prevent both the inference and generative models from overfitting to the training set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Amortized Inference Regularization in Variational Autoencoders</head><p>Many methods have been proposed to expand the variational and amortization families in order to better approximate maximum likelihood training <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8,</ref><ref type="bibr">13,</ref><ref type="bibr">14]</ref>. We argue, however, that achieving a better approximation to maximum likelihood training is not necessarily the best training objective, even if the end goal is test set density estimation. In general, it may be beneficial to regularize the maximum likelihood training objective.</p><p>Importantly, we observe that the evidence lower bound in Eq. ( <ref type="formula">2</ref>) admits a natural interpretation as implicitly regularizing maximum likelihood training</p><p>This formulation exposes the ELBO as a data-dependent regularized maximum likelihood objective. For infinite capacity Q, R(&#952; ; Q) is zero for all &#952; &#8712; &#920;, and the objective reduces to maximum likelihood. When Q is the set of Gaussian distributions (as is the case in the standard VAE), then</p><p>is Gaussian for all x &#8712; D. In other words, a Gaussian variational family regularizes the true posterior p &#952; (z | x) toward being Gaussian <ref type="bibr">[10]</ref>. Careful selection of the variational family to encourage p &#952; (z | x) to adopt certain properties (e.g. unimodality, fully-factorized posterior, etc.) 
can thus be considered a special case of posterior regularization <ref type="bibr">[15,</ref><ref type="bibr">16]</ref>.</p><p>Unlike traditional variational techniques, the variational autoencoder introduces an amortized inference model f &#8712; F and thus a new source of posterior regularization.</p><p>In contrast to unamortized variational inference, the introduction of the amortization family F forces the inference model to consider the global structure of how X maps to Q. We thus define amortized inference regularization as the strategy of restricting the inference model capacity F to satisfy certain desiderata. In this paper, we explore a special case of AIR where a candidate model f &#8712; F is penalized if it is not sufficiently smooth. We propose two models that encourage inference model smoothness and demonstrate that they can reduce the inference gap and increase log-likelihood on the test set.</p></div>
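The posterior-regularization reading of R(θ ; Q) can be illustrated numerically: when the true posterior is itself Gaussian, the best Gaussian proposal attains (near-)zero KL, whereas a bimodal posterior incurs a penalty no matter which Gaussian is chosen. The grid search and example posteriors below are illustrative assumptions:

```python
import numpy as np

grid = np.linspace(-8, 8, 4001)
dz = grid[1] - grid[0]

def normal_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def min_gauss_kl(p):
    # Approximate min_q KL(q || p) over a grid of Gaussian proposals q = N(m, s^2).
    best = np.inf
    for m in np.linspace(-3, 3, 61):
        for s in np.linspace(0.3, 3.0, 28):
            q = normal_pdf(grid, m, s)
            kl = float(np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dz)
            best = min(best, kl)
    return best

gaussian_post = normal_pdf(grid, 0.5, 1.2)
bimodal_post = 0.5 * normal_pdf(grid, -2, 0.5) + 0.5 * normal_pdf(grid, 2, 0.5)

r_gauss = min_gauss_kl(gaussian_post)    # ~0: no regularization pressure
r_bimodal = min_gauss_kl(bimodal_post)   # > 0: bimodality is penalized
```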
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Denoising Variational Autoencoder</head><p>In this section, we propose using random perturbation training for amortized inference regularization. The resulting model-the denoising variational autoencoder (DVAE)-modifies the variational autoencoder objective by injecting &#949; noise into the inference model</p><p>Note that the noise term only appears in the regularizer term. We consider the case of zero-mean isotropic Gaussian noise &#949; &#8764; N (0, &#963;I) and denote the denoising regularizer as R(&#952; ; &#963;). At this point, we note that the DVAE was first described in <ref type="bibr">[17]</ref>. However, our treatment of DVAE differs from <ref type="bibr">[17]</ref>'s in both theoretical analysis and underlying motivation. We found that <ref type="bibr">[17]</ref> incorrectly stated the tightness of the DVAE variational lower bound (see Appendix B). In contrast, our analysis demonstrates that the denoising objective smooths the inference model and necessarily lower bounds the original variational autoencoder objective (see Theorem 1 and Proposition 1).</p><p>We now show that 1) the optimal DVAE amortized inference model is a kernel regression model and that 2) the variance of the noise &#949; controls the smoothness of the optimal inference model. Lemma 1. For fixed (&#952;, &#963;, Q) and infinite capacity F, the inference model that optimizes the DVAE objective in Eq. ( <ref type="formula">13</ref>) is the kernel regression model</p><p>where w &#963; (x,</p><p>is the RBF kernel.</p><p>Lemma 1 shows that the optimal denoising inference model f * &#963; is dependent on the noise level &#963;. The output of f * &#963; (x) is the proposal distribution that minimizes the weighted Kullback-</p><p>, where the weighting w &#963; (x, x (i) ) depends on the distance xx (i) and the bandwidth &#963;. 
When σ &gt; 0, the amortized inference model forces neighboring points (x (i) , x (j) ) to have similar proposal distributions. Note that as σ → ∞, w σ (x, x (i) ) → 1/n, where n is the number of training samples. Controlling σ thus modulates the smoothness of f * σ (we say that f * σ is smooth if it maps similar inputs to similar outputs under some suitable measure of similarity). Intuitively, the denoising regularizer R(θ ; σ) approximates the true posteriors with a "σ-smoothed" inference model and penalizes generative models whose posteriors cannot easily be approximated by such an inference model. This intuition is formalized in Theorem 1. Theorem 1. Let Q be a minimal exponential family with corresponding natural parameter space Ω. With a slight abuse of notation, consider f ∈ F : X → Ω. Under the simplifying assumption that p θ (z | x (i) ) is contained within Q and parameterized by η (i) ∈ Ω, and that F has infinite capacity, then the optimal inference model in Lemma 1 returns f * σ (x) = Σ i w σ (x, x (i) ) η (i) ∈ Ω, and the Lipschitz constant of f * σ is bounded by O(1/σ 2 ).</p><p>We wish to address Theorem 1's assumption that the true posteriors lie in the variational family. Note that for sufficiently large exponential families, this assumption is likely to hold. But even in the case where the variational family is Gaussian (a relatively small exponential family), the small approximation gap observed in <ref type="bibr">[10]</ref> suggests that it is plausible that posterior regularization would encourage the true posteriors to be approximately Gaussian.</p><p>Given that σ modulates the smoothness of the inference model, it is natural to suspect that a larger choice of σ results in a stronger regularization. To formalize this notion of regularization strength, we introduce a way to partially order a set of regularizers {R i (θ)}. Definition 1. 
Suppose two regularizers R 1 (θ) and R 2 (θ) share the same minimum min θ R 1 (θ) = min θ R 2 (θ). We say that R 1 is stronger than R 2 if R 1 (θ) ≥ R 2 (θ) for all θ ∈ Θ.</p><p>Note that any two regularizers can be modified via scalar addition to share the same minimum. Furthermore, if R 1 is stronger than R 2 , then R 1 and R 2 share at least one minimizer. We now apply Definition 1 to characterize the regularization strength of R(θ ; σ) as σ increases. Definition 2. We say that F is closed under input translation if f ∈ F =⇒ f a ∈ F for all a ∈ X , where f a (x) = f (x + a).</p><p>Proposition 1. Consider the denoising regularizer R(θ ; σ). Suppose F is closed under input translation and that, for any θ ∈ Θ, there exists f ∈ F such that f (x) maps to the prior p θ (z) for all x ∈ X . Furthermore, assume that there exists θ ∈ Θ such that p θ (x, z) = p θ (z)p θ (x). Then R(θ ; σ 1 ) is stronger than R(θ ; σ 2 ) when σ 1 ≥ σ 2 ; i.e., min θ R(θ ; σ 1 ) = min θ R(θ ; σ 2 ) = 0 and R(θ ; σ 1 ) ≥ R(θ ; σ 2 ) for all θ ∈ Θ.</p><p>Lemma 1 and Proposition 1 show that as we increase σ, the optimal inference model is forced to become smoother and the regularization strength increases. Figure <ref type="figure">1</ref> is consistent with this analysis, showing the progression from under-regularized to over-regularized models as we increase σ.</p><p>It is worth noting that, in addition to adjusting the denoising regularizer strength via σ, it is also possible to adjust the strength by taking a convex combination of the VAE and DVAE objectives. In particular, we can define the partially denoising regularizer R(θ ; σ, α) as</p><p>R(θ ; σ, α) = α R(θ ; σ) + (1 − α) R(θ ; 0), α ∈ [0, 1],</p><p>where R(θ ; 0) denotes the noiseless (standard VAE) regularizer. Importantly, we note that R(θ ; σ, α) is still strictly non-negative and, when combined with the log-likelihood term, still yields a tractable variational lower bound.</p></div>
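The kernel weights of Lemma 1 are straightforward to compute, and their limiting behavior matches the smoothness analysis: a small bandwidth σ concentrates all weight on the nearest training point (no smoothing), while a large σ drives every weight toward 1/n. A minimal sketch (the data points and bandwidths are illustrative assumptions):

```python
import numpy as np

def rbf_weights(x, data, sigma):
    # w_sigma(x, x_i) = K(x, x_i) / sum_j K(x, x_j),
    # with K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) the RBF kernel.
    sq = np.sum((data - x) ** 2, axis=1)
    logk = -sq / (2 * sigma ** 2)
    logk -= logk.max()          # subtract the max for numerical stability
    w = np.exp(logk)
    return w / w.sum()

data = np.array([[0.0], [1.0], [4.0]])   # n = 3 "training points"
x = np.array([0.1])

w_small = rbf_weights(x, data, sigma=0.1)    # nearly all mass on the nearest point
w_large = rbf_weights(x, data, sigma=100.0)  # approaches the uniform weights 1/n
```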
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Weight-Normalized Amortized Inference</head><p>In addition to DVAE, we propose an alternative method that directly restricts F to the set of smooth functions. To do so, we consider the case where the inference model is a neural network encoder parameterized by weight matrices {W i } and leverage <ref type="bibr">[18]</ref>'s weight normalization technique, which proposes to reparameterize the columns w i of each weight matrix W as</p><p>where v i &#8712; R d , s i &#8712; R are trainable parameters. Since it is possible to modulate the smoothness of the encoder by capping the magnitude of s i , we introduce a new parameter u i &#8712; R and define</p><p>The norm w i is thus bounded by the hyperparameter H. We denote the weight-normalized regularizer as R(&#952; ; F H ), where F H is the amortization family induced by a H-weight-normalized encoder. Under similar assumptions as Proposition 1, it is easy to see that min &#952; R(&#952; ; F H ) = 0 for any H &#8805; 0 and that R(&#952; ; F H1 ) &#8805; R(&#952; ; F H2 ) for all &#952; &#8712; &#920; when H 1 &#8804; H 2 (since F H1 &#8838; F H2 ).</p><p>We refer to the resulting model as the weight-normalized inference VAE (WNI-VAE) and show in Table <ref type="table">1</ref> that weight-normalized amortized inference can achieve similar performance as DVAE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Experiments</head><p>We conducted experiments on statically binarized MNIST, statically binarized OMNIGLOT, and the Caltech 101 Silhouettes datasets. These datasets have a relatively small amount of training data and are thus susceptible to model overfitting. For each dataset, we used the same decoder architecture across all four models (VAE, DVAE (&#945; = 0.5), DVAE (&#945; = 1.0), WNI-VAE) and only modified the encoder, and trained all models using Adam <ref type="bibr">[19]</ref> (see Appendix E for more details). To approximate the log-likelihood, we proposed to use importance-weighted stochastic variational inference (IW-SVI), an extension of SVI <ref type="bibr">[20]</ref> which we describe in detail in Appendix C. Hyperparameter tuning of DVAE's &#963; and WNI-VAE's F H is described in Table <ref type="table">7</ref>.</p><p>Table <ref type="table">1</ref> shows the performance of VAE, DVAE, and WNI-VAE. Regularizing the inference model consistently improved the test set log-likelihood performance. On the MNIST and Caltech 101 Silhouettes datasets, the results also show a consistent reduction of the test set inference gap when the inference model is regularized. We observed differences in the performance of DVAE versus WNI-VAE on the Caltech 101 Silhouettes dataset, suggesting a difference in how denoising and weight normalization regularizes the inference model; an interesting consideration would thus be to combine DVAE and WNI. As a whole, Table <ref type="table">1</ref> demonstrates that AIR benefits the generative model.</p><p>The denoising and weight normalization regularizers have respective hyperparameters &#963; and H that control the regularization strength. In Figure <ref type="figure">1</ref>, we performed an ablation analysis of how adjusting where {z 1 . . . z k } are k samples from the proposal distribution q &#966; (z | x) to be used as importancesamples. 
Analysis by <ref type="bibr">[22]</ref> allows us to rewrite it as a regularized maximum likelihood objective</p><p>where f̃ k (or equivalently q̃ k ) is the unnormalized distribution</p><p>f̃ k (x, z 2 . . . z k )(z 1 ) = p θ (x, z 1 ) / ( (1/k) Σ i p θ (x, z i ) / q φ (z i | x) ),</p><p>and D(q ‖ p) = ∫ q(z) [ln q(z) − ln p(z)] dz is the Kullback-Leibler divergence extended to unnormalized distributions. For notational simplicity, we omit the dependency of f̃ k on (z 2 . . . z k ).</p><p>Importantly, <ref type="bibr">[22]</ref> showed that the IWAE with k importance samples drawn from the amortized inference model f is, in expectation, equivalent to a VAE with 1 importance sample drawn from the more expressive inference model f̃ k .</p></div>
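The reinterpretation above is consistent with the basic IWAE property that the k-sample bound tightens toward ln p(x) as k grows, which is what makes the implicit inference model f̃ k increasingly expressive. A NumPy sketch on a toy model (the model, proposal, and sample counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumption): p(z) = N(0,1), p(x|z) = N(z,1),
# with proposal q(z|x) = N(0,1) = p(z), so ln w = ln p(x|z).
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, k, n_rep=20000):
    z = rng.standard_normal((n_rep, k))   # z ~ q(z|x)
    logw = log_normal(x, z, 1.0)          # log importance weights
    # L_k = E[ ln (1/k) sum_i w_i ], computed stably via log-sum-exp
    m = logw.max(axis=1, keepdims=True)
    lk = m[:, 0] + np.log(np.mean(np.exp(logw - m), axis=1))
    return float(lk.mean())

x = 1.3
log_px = float(log_normal(x, 0.0, 2.0))   # exact evidence: p(x) = N(0, 2)
l1, l8, l64 = (iwae_bound(x, k) for k in (1, 8, 64))
```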
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Importance Sampling Attenuates Amortized Inference Regularization</head><p>We now consider the interaction between importance sampling and AIR. We introduce the regularizer R k (&#952; ; &#963;, F H ) as follows</p><p>which corresponds to a regularizer where weight normalization, denoising, and importance sampling are simultaneously applied. By adapting Theorem 1 from <ref type="bibr">[8]</ref>, we can show that Proposition 3. Consider the regularizer R k (&#952; ; &#963;, F H ). Under similar assumptions as Proposition 1, then R k1 is stronger than</p><p>A notable consequence of Proposition 3 is that as k increases, AIR exhibits a weaker regularizing effect on the posterior distributions {p &#952; (z | x (i) )}. Intuitively, this arises from the phenomenon that although AIR is applied to f , the subsequent importance-weighting procedure can still create a flexible fk . Our analysis thus predicts that AIR is less likely to cause underfitting of IWAE-k's generative model as k increases, which we demonstrate in Figure <ref type="figure">2</ref>. In the limit of infinite importance samples, we also predict AIR to have zero regularizing effect since f&#8734; (under some assumptions) can always approximate any posterior. However, for practically feasible values of k, we show in Tables <ref type="table">2</ref> and<ref type="table">3</ref> that AIR is a highly effective regularizer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experiments</head><p>Table <ref type="table">2</ref>: and Test set evaluation of the four models when trained with 8 importance samples. L 8 (x) denotes the amortized ELBO using 8 importance samples.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Overly Expressive Amortization Family Hurts Generalization</head><p>In the experiments by <ref type="bibr">[10]</ref>, they observed that an overly expressive amortization family increases the test set inference gap, but does not impact the test set log-likelihood. We show in Table <ref type="table">4</ref> that <ref type="bibr">[10]</ref>'s observation is not true in general, and that an overly expressive amortization family can in fact hurt test set log-likelihood. Details regarding the architectures are provided in Appendix E.</p><p>Table <ref type="table">4</ref>: Performance evaluation when an over-expressive amortization family is used (i.e. a larger encoder). Comparison is made against models that use a smaller encoder. The results show that using a large encoder consistently hurts generalization by over 1 nat. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Revisiting [17]'s Denoising Variational Autoencoder Analysis</head><p>In <ref type="bibr">[17]</ref>'s Lemma 1, they considered a joint distribution p &#952; (x, z). They introduced an auxiliary variable z into their inference model (here z takes on the role of the perturbed input x = x + &#949;. To avoid confusion, we stick to the notation used in their Lemma) and considered the inference model</p><p>They considered two ways to use this inference model. The first approach is to marginalize the auxiliary latent variable z . This defines the resulting inference model</p><p>This yields the lower bound</p><p>Next, they considered an alternative lower bound</p><p>[17]'s Lemma 1 claims that 1. L a and L b are valid lower bounds of ln p &#952; (x)</p><p>Using Lemma 1, <ref type="bibr">[17]</ref> motivated the denoising variational autoencoder by concluding that it provides a tighter bound than marginalization of the noise variable. Although statement 1 is correct, statement 2 is not. Their proof of statement 2 is presented as follows</p><p>We indicate the mistake with ? =; their proof of statement 2 relied on the assumption that Training parameters. Important training parameters are provided in Table <ref type="table">6</ref>. We used the Adam optimizer and exponentially decayed the initial learning rate according to the formula</p><p>where t &#8712; {0, . . . , T -1} is the current iteration and T is the total number of iterations. Early-stopping is applied according to IWAE-5000 evaluation on the validation set.</p><p>Table <ref type="table">6</ref>: Training parameters used for each dataset. The same architecture is used for all models, with minor modification for WNI-VAE (to account for the weight-normalization implementation). In all cases, we use a Bernoulli decoder and a Gaussian encoder. Notation: d300 denotes a dense layer with ELU activation and 300 output units. 
z64 denotes 1) a dense layer with 64 output units (represents the mean of z) and 2) a dense layer with softplus activation and 64 output units (represents the variance of z). x784 denotes a dense layer with 784 output units (represents the logits for x).</p><p>MNIST (Appendix A): encoder d1000-d1000-d1000-z64, decoder d300-d300-x784, learning rate 10^-3, T = 1.5 × 10^6, 100. MNIST: encoder d300-d300-z64, decoder d300-d300-x784, learning rate 10^-3, T = 1.5 × 10^6, 100. OMNIGLOT: encoder d200-d200-z64, decoder d200-d200-x784, learning rate 10^-3, T = 1.5 × 10^6, 100. CALTECH: encoder d500-z64, decoder d500-x784, learning rate 10^-4, T = 4 × 10^5, 10.</p><p>Regularization strength tuning. The denoising and weight normalization regularizers have hyperparameters σ and H respectively. See Table <ref type="table">7</ref> for hyperparameter search space details. We performed a basic grid search and tuned the regularization strength hyperparameters based on the validation set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Proofs</head><p>Remark. Some of the proofs mention the notion of an infinite capacity F, G or Q. To clarify, we say that F has infinite capacity if it is the set of all possible functions that map from X to Q. Analogously, G has infinite capacity if it is the set of all possible functions that map from Z to P. We say that Q has infinite capacity if it is the set of all possible distributions over the space Z. Lemma 1. For fixed (&#952;, &#963;, Q) and infinite capacity F, the inference model that optimizes the DVAE objective in Eq. ( <ref type="formula">13</ref>) is the kernel regression model</p><p>where w &#963; (x, x (i) ) = K&#963;(x,x (i) ) j K&#963;(x,x (j) ) and K &#963; (x, y) = exp -x-y 2&#963; 2</p><p>is the RBF kernel.</p><p>Proof. Define x = x + &#949; and p(x, x) = p(x)N (x | x, &#963;I). Rewrite the objective as</p><p>Recall that F has infinite capacity. This lower bound is tight since we can select</p><p>Reexpressing Eq. ( <ref type="formula">46</ref>) by p(x | x) yields Eq. ( <ref type="formula">14</ref>).</p><p>Theorem 1. Let Q be a minimal exponential family with corresponding natural parameter space &#8486;.</p><p>With a slight abuse of notation, consider f &#8712; F : X &#8594; &#8486;. Under the simplifying assumption that p &#952; (z | x (i) ) is contained within Q and parameterized by &#951; (i) &#8712; &#8486;, and that F has infinite capacity, then the optimal inference model in Lemma 1 returns f * &#963; (x) = &#951; &#8712; &#8486;, where</p><p>and Lipschitz constant of f * &#963; is bounded by O(1/&#963; 2 ).</p><p>Proof. Proof provided in two parts.</p><p>Part 1. 
The Kullback-Leibler divergence can be represented as a Bregman divergence associated with the strictly convex log-partition function A of the minimal exponential family as follows</p><p>Proposition 1 from <ref type="bibr">[33]</ref> shows that for any convex combination weights {w i } with Σ i w i = 1, the minimizer of a weighted average of Bregman divergences is the weighted average of their first arguments.</p><p>It thus follows that</p><p>Part 2. We will write the derivative ∇ x f * σ (x) in matrix form by the following notation</p><p>where we also suppose the input space X is n-dimensional, the latent parameter space Ω is d-dimensional, and there are m training examples. Then</p><p>Let ‖ • ‖ 1 be the induced 1-norm for matrices; then by the sub-multiplicative property</p><p>Since ‖M‖ 1 is a constant with respect to σ, we only have to bound ‖∇ x W σ (x) T ‖ 1 . To do this we study ∇ x w σ (x, x (i) ), where</p><p>Let | • | denote taking the element-wise absolute value, and let x ≤ * y denote that |x i | ≤ |y i | for all elements of the vectors. By the Cauchy-Schwarz inequality and</p><p>This gives us a bound on the matrix 1-norm</p><p>Because both Ω and X are convex sets, this implies the following Lipschitz property</p><p>Proposition 1. Consider the denoising regularizer R(θ ; σ). Suppose F is closed under input translation and that, for any θ ∈ Θ, there exists f ∈ F such that f (x) maps to the prior p θ (z) for all x ∈ X . Furthermore, assume that there exists θ ∈ Θ such that p θ (x, z) = p θ (z)p θ (x).</p><p>Proof. Proof is provided in two parts.</p><p>Part 1. Recall that R is always non-negative. 
Since there exists θ ∈ Θ such that p θ (x, z) = p θ (z)p θ (x), and f ∈ F such that f (x) = p θ (z), then min θ R(θ ; σ) = 0 for any choice of σ.</p><p>Part 2. Since F is closed under input translation,</p><p>It thus follows that R(θ ; σ 1 ) ≥ R(θ ; σ 2 ) for all θ ∈ Θ.</p><p>Proposition 2. Let P be an exponential family with corresponding mean parameter space M and sufficient statistic function T (•). With a slight abuse of notation, consider g ∈ G : Z → M. Define q(x, z) = p(x)q(z | x), where q(z | x) is a fixed inference model. Supposing G has infinite capacity, then the optimal generative model in Eq. (5) returns g * (z) = µ ∈ M, where</p><p>Proof. For a given inference model q(z | x), the optimal generator maximizes the objective max g∈G E p(x) E q(z|x) [ln g(z)(x)] = max g∈G E q(x,z) [ln g(z)(x)] (56)</p><p>= max g∈G E q(x,z) ln p g(z) (x) (57)</p><p>≤ E q(z) max µ∈M E q(x|z) ln p µ (x),</p><p>where p µ denotes the distribution p ∈ P with associated mean parameter µ. This inequality is tight since we can select g * ∈ G such that g * (z) = arg max µ∈M E q(x|z) ln p µ (x).</p><p>Recall that the maximum likelihood and maximum entropy solutions are equivalent for an exponential family. From the moment-matching condition of maximum entropy, it follows that g * (z) = arg max µ∈M E q(x|z) ln p µ (x) (61)</p><p>= E q(x|z) [T (x)]. (62)</p><p>Proposition 3. Consider the regularizer R k (θ ; σ, F H ). Under similar assumptions as Proposition 1, R k1 is stronger than R k2 when k 1 ≤ k 2 ; i.e., min θ R k1 (θ ; σ, F H ) = min θ R k2 (θ ; σ, F H ) = 0 and R k1 (θ ; σ, F H ) ≥ R k2 (θ ; σ, F H ) for all θ ∈ Θ.</p><p>Proof. 
Proof is provided in two parts.</p><p>Part 1. The relevant assumptions are that there exists &#952; &#8712; &#920; such that p &#952; (x, z) = p &#952; (z)p &#952; (x), and f &#8712; F H such that f (x) = p &#952; (z). Note that R k is always non-negative. It follows readily that min &#952; R k (&#952; ; &#963;, F H ) = 0 for any choice of k. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>32nd Conference on Neural Information Processing Systems (NIPS 2018), Montr&#233;al, Canada.</p></note>
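The Bregman-mean step used in the proof of Theorem 1 (the minimizer of a weighted average of KL divergences within an exponential family is the weighted average of the natural parameters) can be verified numerically. A sketch using the Bernoulli family, whose natural parameter is the logit (the particular weights and parameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def kl_bern(p, q):
    # KL( Bern(p) || Bern(q) ) in mean parameters
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

etas = np.array([-2.0, 0.5, 1.5])   # natural parameters of the target "posteriors"
w = np.array([0.2, 0.5, 0.3])       # kernel weights, summing to 1

# Brute-force minimization of sum_i w_i * KL(q_eta || p_eta_i) over a fine grid.
grid = np.linspace(-4, 4, 80001)
obj = sum(wi * kl_bern(sigmoid(grid), sigmoid(ei)) for wi, ei in zip(w, etas))
eta_star = float(grid[np.argmin(obj)])

weighted_mean = float(w @ etas)     # the predicted minimizer: sum_i w_i * eta_i
```

Setting the derivative of the weighted KL to zero gives −A''(η) Σ i w i (η i − η) = 0, so the grid minimizer should coincide with the weighted mean of the logits.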
		</body>
		</text>
</TEI>
