Gaussian processes (GPs) offer a flexible class of priors for nonparametric Bayesian regression, but popular GP posterior inference methods are typically prohibitively slow or lack desirable finite-data guarantees on quality. We develop a scalable approach to approximate GP regression, with finite-data guarantees on the accuracy of our pointwise posterior mean and variance estimates. Our main contribution is a novel objective for approximate inference in the nonparametric setting: the preconditioned Fisher (pF) divergence. We show that unlike the Kullback–Leibler divergence (used in variational inference), the pF divergence bounds bounds the 2-Wasserstein distance, which in turn provides tight bounds on the pointwise error of mean and variance estimates. We demonstrate that, for sparse GP likelihood approximations, we can minimize the pF divergence bounds efficiently. Our experiments show that optimizing the pF divergence bounds has the same computational requirements as variational sparse GPs while providing comparable empirical performance—in addition to our novel finite-data quality guarantees.
more »
« less
Discriminant Analysis under f-Divergence Measures
In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence between the underlying statistical models (e.g., in binary hypothesis testing, the error probability is related to the total variation distance between the statistical models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (the divergence reduces by the data-processing inequality). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this paper focuses on Gaussian models where we investigate discriminant analysis under five f-divergence measures (Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, total variation, and χ2). We characterize the optimal design of the linear transformation of the data onto a lower-dimensional subspace for zero-mean Gaussian models and employ numerical algorithms to find the design for general Gaussian models with non-zero means. There are two key observations for zero-mean Gaussian models. First, projections are not necessarily along the largest modes of the covariance matrix of the data, and, in some situations, they can even be along the smallest modes. Secondly, under specific regimes, the optimal design of subspace projection is identical under all the f-divergence measures considered, rendering a degree of universality to the design, independent of the inference problem of interest.
more »
« less
- PAR ID:
- 10355565
- Date Published:
- Journal Name:
- Entropy
- Volume:
- 24
- Issue:
- 2
- ISSN:
- 1099-4300
- Page Range / eLocation ID:
- 188
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Krause, Andreas; Brunskill, Emma; Cho, Kyunghyun; Engelhardt; Barbara; Sabato, Sivan; Scarlett, Jonathan (Ed.)Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on the robust average-reward MDPs under the modelfree setting. We first theoretically characterize the structure of solutions to the robust averagereward Bellman equation, which is essential for our later convergence analysis. We then design two model-free algorithms, robust relative value iteration (RVI) TD and robust RVI Q-learning, and theoretically prove their convergence to the optimal solution. We provide several widely used uncertainty sets as examples, including those def ined by the contamination model, total variation, Chi-squared divergence, Kullback-Leibler (KL) divergence and Wasserstein distance.more » « less
-
null (Ed.)Transformation-based methods have been an attractive approach in non-parametric inference for problems such as unconditional and conditional density estimation due to their unique hierarchical structure that models the data as flexible transformation of a set of common latent variables. More recently, transformation-based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both nonparametric inference and variational inference lacks theoretical justification. We provide theoretical justification for the use of non-linear latent variable models (NL-LVMs) in non-parametric inference by showing that the support of the transformation induced prior in the space of densities is sufficiently large in the L1 sense. We also show that, when a Gaussian process (GP) prior is placed on the transformation function, the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the non-parametric setting, we use the NL-LVM to construct an implicit family of variational distributions, deemed GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback–Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit variational inference.more » « less
-
Summary Model selection is crucial both to high-dimensional learning and to inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work implicitly assumes that the models are correctly specified or have fixed dimensionality, yet both model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles under the misspecified generalized linear models presented in Lv & Liu (2014), and investigate the asymptotic expansion of the posterior model probability in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback–Leibler divergence, we suggest using the high-dimensional generalized Bayesian information criterion with prior probability for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of the new information criterion in ultrahigh dimensions under some mild regularity conditions. Our numerical studies demonstrate that the proposed method enjoys improved model selection consistency over its main competitors.more » « less
-
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿-divergences---Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.more » « less
An official website of the United States government

