Title: Covariances, robustness and variational Bayes
Mean-field Variational Bayes (MFVB) is an approximate Bayesian posterior inference technique that is increasingly popular due to its fast runtimes on large-scale data sets. However, even when MFVB provides accurate posterior means for certain parameters, it often mis-estimates variances and covariances. Furthermore, prior robustness measures have remained undeveloped for MFVB. By deriving a simple formula for the effect of infinitesimal model perturbations on MFVB posterior means, we provide both improved covariance estimates and local robustness measures for MFVB, thus greatly expanding the practical usefulness of MFVB posterior approximations. The estimates for MFVB posterior covariances rely on a result from the classical Bayesian robustness literature that relates derivatives of posterior expectations to posterior covariances and includes the Laplace approximation as a special case. Our key condition is that the MFVB approximation provides good estimates of a select subset of posterior means, an assumption that has been shown to hold in many practical settings. In our experiments, we demonstrate that our methods are simple, general, and fast, providing accurate posterior uncertainty estimates and robustness measures with runtimes that can be an order of magnitude faster than MCMC.
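The classical result mentioned in the abstract, that a derivative of a posterior expectation equals a posterior covariance, can be checked numerically on a toy example. The sketch below is not the paper's MFVB procedure. It only illustrates the identity d/dt E_t[f] at t = 0 equals Cov(f, g) for a density tilted as p_t(θ) ∝ p(θ) exp(t g(θ)). The grid, the Gaussian target, and the names `tilted_mean`, `f`, and `g` are all assumptions made for illustration.

```python
import numpy as np


def tilted_mean(theta, log_p, f, g, t):
    """E[f(theta)] under p_t(theta) proportional to p(theta) * exp(t * g(theta))."""
    log_w = log_p + t * g(theta)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    w /= w.sum()
    return float(np.sum(w * f(theta)))


# Hypothetical toy "posterior": N(mean=1, variance=0.5), discretized on a grid.
theta = np.linspace(-6.0, 6.0, 2001)
log_p = -0.5 * (theta - 1.0) ** 2 / 0.5


def f(th):
    return th  # quantity whose posterior mean we track


def g(th):
    return th ** 2  # direction of the log-density perturbation


# Posterior covariance Cov(f, g) computed directly from the grid weights.
w = np.exp(log_p - log_p.max())
w /= w.sum()
cov_fg = float(np.sum(w * f(theta) * g(theta))
               - np.sum(w * f(theta)) * np.sum(w * g(theta)))

# Sensitivity of the posterior mean to the perturbation, by central differences.
eps = 1e-4
sensitivity = (tilted_mean(theta, log_p, f, g, eps)
               - tilted_mean(theta, log_p, f, g, -eps)) / (2 * eps)

print(f"Cov(f, g) = {cov_fg:.6f}, d/dt E_t[f] = {sensitivity:.6f}")
```

For a Gaussian with mean μ and variance σ², Cov(θ, θ²) = 2μσ², so both numbers should be close to 1 here. According to the abstract, the paper applies the same kind of relationship to the means of an MFVB approximation, where the derivative is taken through the variational optimum rather than by finite differences.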
Award ID(s):
1750286
NSF-PAR ID:
10085570
Journal Name:
Journal of machine learning research
Volume:
19
Issue:
51
ISSN:
1532-4435
Page Range / eLocation ID:
1 - 49
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Upcoming galaxy surveys will allow us to probe the growth of the cosmic large-scale structure with improved sensitivity compared to current missions, and will also map larger areas of the sky. This means that in addition to the increased precision in observations, future surveys will also access the ultralarge-scale regime, where commonly neglected effects such as lensing, redshift-space distortions, and relativistic corrections become important for calculating correlation functions of galaxy positions. At the same time, several approximations usually made in these calculations such as the Limber approximation break down at those scales. The need to abandon these approximations and simplifying assumptions at large scales creates severe issues for parameter estimation methods. On the one hand, exact calculations of theoretical angular power spectra become computationally expensive, and the need to perform them thousands of times to reconstruct posterior probability distributions for cosmological parameters makes the approach unfeasible. On the other hand, neglecting relativistic effects and relying on approximations may significantly bias the estimates of cosmological parameters. In this work, we quantify this bias and investigate how an incomplete modelling of various effects on ultralarge scales could lead to false detections of new physics beyond the standard ΛCDM model. Furthermore, we propose a simple debiasing method that allows us to recover true cosmologies without running the full parameter estimation pipeline with exact theoretical calculations. This method can therefore provide a fast way of obtaining accurate values of cosmological parameters and estimates of exact posterior probability distributions from ultralarge-scale observations.

     
  2. Abstract

    Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this article, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies on the data sets analyzed. Here, “likelihood” of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method “phylogenetic topographer” (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a nonblocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.

     
  3. Abstract

    In electronic health records (EHRs) data analysis, nonparametric regression and classification using International Classification of Disease (ICD) codes as covariates remain understudied. Automated methods have been developed over the years for predicting biomedical responses using EHRs, but relatively less attention has been paid to developing patient similarity measures that use ICD codes and chronic conditions, where a chronic condition is defined as a set of ICD codes. We address this problem by first developing a string kernel function for measuring the similarity between a pair of primary chronic conditions, represented as subsets of ICD codes. Second, we extend this similarity measure to a family of covariance functions on subsets of chronic conditions. This family is used in developing Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are tuning free, so they are ideal for automated prediction of biomedical responses depending on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method provides better sensitivity and specificity than its competitors in classifying different primary cancer sites and estimates the marginal associations between chronic conditions and primary cancer sites.

     
  4. This paper studies a remote sensing system where multiple wireless sensors generate possibly noisy information updates of various surveillance fields and deliver these updates to a control center over a wireless network. The control center needs a sufficient number of recently generated information updates to have an accurate estimate of the current system status, which is critical for the control center to make appropriate control decisions. The goal of this work is then to design the optimal policy for scheduling the transmissions of information updates. Through Brownian approximation, we demonstrate that the control center’s ability to make accurate real-time estimates depends on the averages and temporal variances of the delivery processes. We then formulate a constrained optimization problem to find the optimal means and variances. We also develop a simple online scheduling policy that employs the optimal means and variances to achieve the optimal system-wide performance. Simulation results show that our scheduling policy enjoys fast convergence speed and better performance when compared to other state-of-the-art policies.
  5. Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany. 