Transformation-based methods have been an attractive approach in nonparametric inference for problems such as unconditional and conditional density estimation, owing to their hierarchical structure, which models the data as a flexible transformation of a set of common latent variables. More recently, transformation-based models have been used in variational inference (VI) to construct flexible implicit families of variational distributions. However, their use in both nonparametric inference and variational inference lacks theoretical justification. We provide theoretical justification for the use of nonlinear latent variable models (NL-LVMs) in nonparametric inference by showing that the support of the transformation-induced prior in the space of densities is sufficiently large in the L1 sense. We also show that, when a Gaussian process (GP) prior is placed on the transformation function, the posterior concentrates at the optimal rate up to a logarithmic factor. Adopting the flexibility demonstrated in the nonparametric setting, we use the NL-LVM to construct an implicit family of variational distributions, termed GP-IVI. We delineate sufficient conditions under which GP-IVI achieves optimal risk bounds and approximates the true posterior in the sense of the Kullback–Leibler divergence. To the best of our knowledge, this is the first work on providing theoretical guarantees for implicit variational …
This content will become publicly available on February 1, 2023
Discriminant Analysis under f-Divergence Measures
In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence between the underlying statistical models (e.g., in binary hypothesis testing, the error probability is related to the total variation distance between the statistical models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (the divergence reduces by the data-processing inequality). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this paper focuses on Gaussian models where we investigate discriminant analysis under five f-divergence measures (Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, total variation, and χ²). We characterize the optimal design of the linear transformation of the data onto a lower-dimensional subspace for zero-mean Gaussian models and employ numerical algorithms to find the design for general Gaussian models with non-zero means. There are two key observations for zero-mean Gaussian models. First, projections are not necessarily along the largest modes of the covariance matrix of the data, and, in some situations, they can even be along the smallest modes. Secondly, under specific regimes, the …
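
The first observation above can be illustrated with a small sketch. It rests on assumptions that are not taken from the paper: the two zero-mean Gaussian models are taken to be N(0, diag(eigenvalues)) and N(0, I), and the projection is restricted to coordinate axes. The helper names `kl_zero_mean_diag` and `best_axis_subset` are hypothetical. Under these assumptions each axis contributes 0.5 (λ − 1 − ln λ) to the Kullback–Leibler divergence, so an axis with a very small variance can be more informative than the axis with the largest variance.

```python
import math

def kl_zero_mean_diag(variances):
    """KL( N(0, diag(variances)) || N(0, I) ) in nats (assumed toy setup)."""
    return 0.5 * sum(v - 1.0 - math.log(v) for v in variances)

def best_axis_subset(variances, r):
    """Indices of the r axes with the largest per-axis KL contribution."""
    contrib = [(0.5 * (v - 1.0 - math.log(v)), i) for i, v in enumerate(variances)]
    return sorted(i for _, i in sorted(contrib, reverse=True)[:r])

# Illustrative eigenvalues: one large mode, one near 1, one very small mode.
eigenvalues = [3.0, 1.2, 0.01]

full = kl_zero_mean_diag(eigenvalues)
chosen = best_axis_subset(eigenvalues, 1)
reduced = kl_zero_mean_diag([eigenvalues[i] for i in chosen])
largest_mode = kl_zero_mean_diag([max(eigenvalues)])

print("full KL:", full)
print("best single axis:", chosen, "KL:", reduced)
print("KL along largest-variance axis:", largest_mode)
```

In this toy case the selected axis is the one with the smallest variance, and the reduced divergence never exceeds the full one, in line with the data-processing inequality.
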
 Publication Date:
 NSFPAR ID:
 10355565
 Journal Name:
 Entropy
 Volume:
 24
 Issue:
 2
 Page Range or eLocationID:
 188
 ISSN:
 1099-4300
 Sponsoring Org:
 National Science Foundation
More Like this


Summary Model selection is crucial both to high-dimensional learning and to inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work implicitly assumes that the models are correctly specified or have fixed dimensionality, yet both model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles under the misspecified generalized linear models presented in Lv & Liu (2014), and investigate the asymptotic expansion of the posterior model probability in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback–Leibler divergence, we suggest using the high-dimensional generalized Bayesian information criterion with prior probability for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of the new information criterion in ultra-high dimensions under some mild regularity conditions. Our numerical studies demonstrate that the proposed method enjoys improved model selection consistency over its main competitors.

Gaussian processes (GPs) offer a flexible class of priors for nonparametric Bayesian regression, but popular GP posterior inference methods are typically prohibitively slow or lack desirable finite-data guarantees on quality. We develop a scalable approach to approximate GP regression, with finite-data guarantees on the accuracy of our pointwise posterior mean and variance estimates. Our main contribution is a novel objective for approximate inference in the nonparametric setting: the preconditioned Fisher (pF) divergence. We show that unlike the Kullback–Leibler divergence (used in variational inference), the pF divergence bounds the 2-Wasserstein distance, which in turn provides tight bounds on the pointwise error of mean and variance estimates. We demonstrate that, for sparse GP likelihood approximations, we can minimize the pF divergence efficiently. Our experiments show that optimizing the pF divergence has the same computational requirements as variational sparse GPs while providing comparable empirical performance, in addition to our novel finite-data quality guarantees.
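
The contrast drawn above between the Kullback–Leibler divergence and the 2-Wasserstein distance can be seen in a univariate Gaussian toy case. This sketch does not implement the pF divergence or any GP approximation; it only uses the standard closed forms for two one-dimensional Gaussians, and the function names and the numbers are illustrative assumptions. For such Gaussians the 2-Wasserstein distance is sqrt((m1 − m2)² + (s1 − s2)²), so it directly upper-bounds the error in the mean and in the standard deviation, whereas the KL divergence can be small even when the means are far apart.

```python
import math

def w2_gauss_1d(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def kl_gauss_1d(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Hypothetical "exact" and "approximate" pointwise posteriors (mean, std).
exact = (0.0, 100.0)
approx = (10.0, 100.0)

w2 = w2_gauss_1d(*approx, *exact)
kl = kl_gauss_1d(*approx, *exact)
mean_error = abs(approx[0] - exact[0])

print("mean error:", mean_error)
print("2-Wasserstein:", w2)   # controls the mean error
print("KL:", kl)              # small despite the large mean error
```
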

Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿-divergences: Kullback–Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth rate are minimax rate-optimal, achieving the parametric convergence rate.
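
For reference, the four divergences named above can be written in the common form D_f(P‖Q) = Σ q_i f(p_i / q_i). The sketch below evaluates them exactly for small discrete distributions; it is not the neural estimator studied in the paper. The generator conventions (for example, squared Hellinger without a factor of 1/2, total variation with a factor of 1/2) are assumptions, since normalizations differ between authors, and the code assumes every q_i is strictly positive.

```python
import math

# Generator functions f for D_f(P||Q) = sum_i q_i * f(p_i / q_i).
# Normalization conventions here are assumptions; other texts may differ.
GENERATORS = {
    "kl": lambda t: t * math.log(t) if t > 0 else 0.0,
    "chi_squared": lambda t: (t - 1.0) ** 2,
    "squared_hellinger": lambda t: (math.sqrt(t) - 1.0) ** 2,
    "total_variation": lambda t: 0.5 * abs(t - 1.0),
}

def f_divergence(p, q, name):
    """Exact f-divergence between discrete distributions p and q (all q_i > 0)."""
    f = GENERATORS[name]
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]
values = {name: f_divergence(p, q, name) for name in GENERATORS}

for name, value in values.items():
    print(name, value)
```
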

For the inverse problem in physical models, one measures the solution and infers the model parameters using information from the collected data. Oftentimes, these data are inadequate and render the inverse problem ill-posed. We study the ill-posedness in the context of optical imaging, which is a medical imaging technique that uses light to probe (bio)tissue structure. Depending on the intensity of the light, the forward problem can be described by different types of equations. High-energy light scatters very little, and one uses the radiative transfer equation (RTE) as the model; low-energy light scatters frequently, so the diffusion equation (DE) suffices as a good approximation. A multiscale approximation links the hyperbolic-type RTE with the parabolic-type DE. The inverse problems for the two equations have a multiscale passage as well, so one expects that as the energy of the photons diminishes, the inverse problem changes from well- to ill-posed. We study this stability deterioration using Bayesian inference. In particular, we use the Kullback–Leibler divergence between the prior distribution and the posterior distribution based on the RTE to prove that the information gain from the measurement vanishes as the energy of the photons decreases, so that the inverse problem is ill-posed …
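
The idea of measuring information gain by the Kullback–Leibler divergence between posterior and prior can be sketched in a scalar toy model. This is not the RTE or DE setting of the paper. It assumes a Gaussian prior θ ~ N(0, τ²) and a single observation y = a·θ + noise with noise variance σ², where the hypothetical sensitivity parameter `a` stands in for how strongly the measurement depends on the unknown. The direction KL(posterior ‖ prior) is also an assumption. As `a` shrinks, the posterior collapses back onto the prior and the divergence tends to zero.

```python
import math

def kl_posterior_prior(a, y, tau=1.0, sigma=1.0):
    """KL(posterior || prior) for theta ~ N(0, tau^2), y = a*theta + N(0, sigma^2)."""
    # Conjugate Gaussian update for the scalar linear model.
    post_var = 1.0 / (1.0 / tau**2 + a**2 / sigma**2)
    post_mean = post_var * a * y / sigma**2
    # Closed-form KL between N(post_mean, post_var) and N(0, tau^2).
    return (0.5 * math.log(tau**2 / post_var)
            + (post_var + post_mean**2) / (2.0 * tau**2) - 0.5)

# Decreasing sensitivity of the measurement to the parameter (illustrative values).
sensitivities = (1.0, 0.1, 0.001)
gains = [kl_posterior_prior(a, y=1.0) for a in sensitivities]

for a, g in zip(sensitivities, gains):
    print("sensitivity", a, "information gain (nats)", g)
```
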