Title: Rate-Regularization and Generalization in Variational Autoencoders
Variational autoencoders (VAEs) optimize an objective that comprises a reconstruction loss (the distortion) and a KL term (the rate). The rate is an upper bound on the mutual information, and it is often interpreted as a regularizer that controls the degree of compression. Here we examine whether inclusion of the rate term also improves generalization. We perform rate-distortion analyses in which we control the strength of the rate term, the network capacity, and the difficulty of the generalization problem. Lowering the strength of the rate term paradoxically improves generalization in most settings, and reducing the mutual information typically leads to underfitting. Moreover, we show that generalization performance continues to improve even after the mutual information saturates, indicating that the gap in the bound (i.e., the KL divergence relative to the inference marginal) affects generalization. This suggests that the standard spherical Gaussian prior is not an inductive bias that typically improves generalization, prompting further work to understand what choices of priors improve generalization in VAEs.
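The relationship between the rate, the mutual information, and the gap in the bound can be checked numerically on a small discrete example. The sketch below is illustrative only and is not taken from the paper: the toy encoder `q_z_given_x`, the uniform data distribution, the two-state prior, and the weight name `beta` are all assumptions chosen for readability. It relies on the standard identity that the expected KL between the encoder and the prior equals the mutual information under the encoder plus the KL divergence from the inference marginal q(z) to the prior.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy setup (assumption): 3 equally likely data points, 2 latent states.
p_data = [1 / 3, 1 / 3, 1 / 3]
q_z_given_x = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]  # encoder q(z|x), one row per x
prior = [0.5, 0.5]                                   # prior p(z)

# Inference marginal: q(z) = sum_x p_data(x) * q(z|x)
q_z = [
    sum(px * row[z] for px, row in zip(p_data, q_z_given_x))
    for z in range(len(prior))
]

# Rate: expected KL from the encoder to the prior.
rate = sum(px * kl(row, prior) for px, row in zip(p_data, q_z_given_x))

# Mutual information between x and z under the encoder.
mutual_info = sum(px * kl(row, q_z) for px, row in zip(p_data, q_z_given_x))

# Gap in the bound: KL from the inference marginal to the prior.
gap = kl(q_z, prior)

def weighted_objective(distortion, rate, beta=1.0):
    """Distortion plus a weighted rate; `beta` is an assumed name for the weight."""
    return distortion + beta * rate

print(f"rate={rate:.4f}  mutual_info={mutual_info:.4f}  gap={gap:.4f}")
```

Because the gap is a KL divergence and therefore non-negative, the rate can never fall below the mutual information in this decomposition. Setting `beta` below 1 in `weighted_objective` corresponds to lowering the strength of the rate term.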
Authors:
; ; ; ; ;
Editors:
Banerjee, A.; Fukumizu, K.
Award ID(s):
1901117
Publication Date:
NSF-PAR ID:
10280434
Journal Name:
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
Volume:
130
Page Range or eLocation-ID:
3880--3888
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep latent-variable models learn representations of high-dimensional data in an unsupervised manner. A number of recent efforts have focused on learning representations that disentangle statistically independent axes of variation by introducing modifications to the standard objective function. These approaches generally assume a simple diagonal Gaussian prior and as a result are not able to reliably disentangle discrete factors of variation. We propose a two-level hierarchical objective to control the relative degree of statistical independence between blocks of variables and individual variables within blocks. We derive this objective as a generalization of the evidence lower bound, which allows us to explicitly represent the trade-offs between mutual information between data and representation, KL divergence between representation and prior, and coverage of the support of the empirical data distribution. Experiments on a variety of datasets demonstrate that our objective can not only disentangle discrete variables, but that doing so also improves disentanglement of other variables and, importantly, generalization even to unseen combinations of factors.
  2. Decision trees are important both as interpretable models amenable to high-stakes decision making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We take a different approach, and advocate studying the generalization performance of decision trees with respect to different generative regression models. This allows us to elicit their inductive bias, that is, the assumptions the algorithms make (or do not make) to generalize to new data, thereby guiding practitioners on when and how to apply these methods. In this paper, we focus on sparse additive generative models, which have both low statistical complexity and some nonparametric flexibility. We prove a sharp squared error generalization lower bound for a large class of decision tree algorithms fitted to sparse additive models with C component functions. This bound is surprisingly much worse than the minimax rate for estimating such sparse additive models. The inefficiency is due not to greediness, but to the loss in power for detecting global structure when we average responses solely over each leaf, an observation that suggests opportunities to improve tree-based algorithms, for example, by hierarchical shrinkage. To prove these bounds, we develop new technical machinery, establishing a novel connection between decision tree estimation and rate-distortion theory, a sub-field of information theory.
  3. The enormous size of modern deep neural networks makes it challenging to deploy those models in memory and communication limited scenarios. Thus, compressing a trained model without a significant loss in performance has become an increasingly important task. Tremendous advances have been made recently, where the main technical building blocks are pruning, quantization, and low-rank factorization. In this paper, we propose principled approaches to improve upon the common heuristics used in those building blocks, by studying the fundamental limit for model compression via rate-distortion theory. We prove a lower bound for the rate-distortion function for model compression and prove its achievability for linear models. Although this achievable compression scheme is intractable in practice, this analysis motivates a novel objective function for model compression, which can be used to improve classes of model compressors such as pruning or quantization. Theoretically, we prove that the proposed scheme is optimal for compressing one-hidden-layer ReLU neural networks. Empirically, we show that the proposed scheme improves upon the baseline in the compression-accuracy tradeoff.
  4. We generalize the information bottleneck (IB) and privacy funnel (PF) problems by introducing the notion of a sensitive attribute, which arises in a growing number of applications. In this generalization, we seek to construct representations of observations that are maximally (or minimally) informative about a target variable, while also satisfying constraints with respect to a variable corresponding to the sensitive attribute. In the Gaussian and discrete settings, we show that by suitably approximating the Kullback-Leibler (KL) divergence defining traditional Shannon mutual information, the generalized IB and PF problems can be formulated as semi-definite programs (SDPs), and thus efficiently solved, which is important in applications of high-dimensional inference. We validate our algorithms on synthetic data and demonstrate their use in imposing fairness in machine learning on real data as an illustrative application.
  5. We study social choice rules under the utilitarian distortion framework, with an additional metric assumption on the agents' costs over the alternatives. In this approach, these costs are given by an underlying metric on the set of all agents plus alternatives. Social choice rules have access to only the ordinal preferences of agents but not the latent cardinal costs that induce them. Distortion is then defined as the ratio between the social cost (typically the sum of agent costs) of the alternative chosen by the mechanism at hand, and that of the optimal alternative chosen by an omniscient algorithm. The worst-case distortion of a social choice rule is, therefore, a measure of how close it always gets to the optimal alternative without any knowledge of the underlying costs. Under this model, it has been conjectured that Ranked Pairs, the well-known weighted-tournament rule, achieves a distortion of at most 3 (Anshelevich et al. 2015). We disprove this conjecture by constructing a sequence of instances which shows that the worst-case distortion of Ranked Pairs is at least 5. Our lower bound on the worst-case distortion of Ranked Pairs matches a previously known upper bound for the Copeland rule, proving that in the worst case, the simpler Copeland rule is at least as good as Ranked Pairs. And as long as we are limited to (weighted or unweighted) tournament rules, we demonstrate that randomization cannot help achieve an expected worst-case distortion of less than 3. Using the concept of approximate majorization within the distortion framework, we prove that Copeland and Randomized Dictatorship achieve low constant factor fairness-ratios (5 and 3 respectively), which is a considerable generalization of similar results for the sum of costs and single largest cost objectives. In addition to all of the above, we outline several interesting directions for further research in this space.