

Title: Stein’s method meets computational statistics: a review of some recent developments
Stein’s method compares probability distributions through the study of a class of linear operators called Stein operators. While mainly studied in probability and used to underpin theoretical statistics, Stein’s method has led to significant advances in computational statistics in recent years. The goal of this survey is to bring together some of these recent developments, and in doing so, to stimulate further research into the successful field of Stein’s method and statistics. The topics we discuss include tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, parameter estimation and goodness-of-fit testing.
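For a flavor of the central object, here is a minimal numerical sketch (ours, not from the survey) of the property a Stein operator encodes: for the standard normal target, the Langevin Stein operator has mean zero under the target and generally nonzero mean under any other distribution, which is what makes it usable for comparing distributions.

```python
# A minimal sketch (ours, not the survey's): Monte Carlo check of the Stein
# identity for the standard normal target p = N(0, 1), whose Langevin Stein
# operator is (A f)(x) = f'(x) - x f(x) because (log p)'(x) = -x.
import numpy as np

rng = np.random.default_rng(0)

def stein_op(f, fprime, x):
    """Langevin Stein operator of N(0, 1) applied to a test function f."""
    return fprime(x) - x * f(x)

x = rng.standard_normal(1_000_000)         # samples from the target N(0, 1)
print(stein_op(np.sin, np.cos, x).mean())  # ~ 0: the Stein identity holds

y = 1.5 * rng.standard_normal(1_000_000)   # samples from a *different* law
print(stein_op(np.sin, np.cos, y).mean())  # clearly nonzero (about -0.41)
```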
Award ID(s):
1846421
NSF-PAR ID:
10440556
Author(s) / Creator(s):
Date Published:
Journal Name:
Statistical Science
ISSN:
0883-4237
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transition path theory computes statistics from ensembles of reactive trajectories. A common strategy for sampling reactive trajectories is to control the branching and pruning of trajectories so as to enhance the sampling of low probability segments. However, it can be challenging to apply transition path theory to data from such methods because determining whether configurations and trajectory segments are part of reactive trajectories requires looking backward and forward in time. Here, we show how this issue can be overcome efficiently by introducing simple data structures. We illustrate the approach in the context of nonequilibrium umbrella sampling, but the strategy is general and can be used to obtain transition path theory statistics from other methods that sample segments of unbiased trajectories. 
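    As a rough illustration of the bookkeeping involved (a minimal sketch, not the paper's actual data structures): for a single discretized trajectory, whether a frame belongs to a reactive segment can be decided with one forward and one backward sweep.

```python
# A minimal sketch (not the paper's data structures): mark the frames of one
# discretized trajectory that lie on reactive segments, i.e. frames outside
# both basins whose last visited basin is the reactant A and whose next
# visited basin is the product B. One forward and one backward sweep suffice.
def reactive_mask(in_A, in_B):
    """in_A, in_B: per-frame booleans saying whether the frame is in A or B."""
    n = len(in_A)
    came_from_A = [False] * n
    last = None
    for t in range(n):                    # forward sweep: last basin visited
        if in_A[t]:
            last = 'A'
        elif in_B[t]:
            last = 'B'
        came_from_A[t] = (last == 'A')
    mask = [False] * n
    nxt = None
    for t in range(n - 1, -1, -1):        # backward sweep: next basin visited
        if in_B[t]:
            nxt = 'B'
        elif in_A[t]:
            nxt = 'A'
        mask[t] = (not in_A[t] and not in_B[t]
                   and came_from_A[t] and nxt == 'B')
    return mask

# toy trajectory A A - - B: the two frames between the basins are reactive
print(reactive_mask([1, 1, 0, 0, 0], [0, 0, 0, 0, 1]))
# [False, False, True, True, False]
```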
  2. Summary

    This paper provides a framework for testing multiple null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using general results from the multiple testing literature, we develop under weak assumptions a procedure that (i) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (ii) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Our procedure improves upon classical methods by incorporating information about the joint dependence structure of the test statistics when determining which null hypotheses to reject, leading to gains in power. An important point of departure from prior work is that we exploit observed, baseline covariates to obtain further gains in power. The precise way in which we incorporate these covariates is based on recent results from the statistics literature in order to ensure that inferences are typically more powerful in large samples.
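    For intuition, here is a minimal sketch of the single-step max-statistic mechanism that stepdown procedures of this kind refine; the critical value comes from an approximation of the joint null distribution of the test statistics, which is where the dependence structure enters. The balance property and the covariate adjustment of the paper's actual procedure are not reproduced.

```python
# A minimal sketch of the single-step "max-statistic" idea: reject H_j when
# |T_j| exceeds the 1 - alpha quantile of max_j |T_j| under the null,
# approximated here by permuting treatment labels. The joint dependence of
# the statistics enters through that max.
import numpy as np

rng = np.random.default_rng(1)

def maxT_rejections(y, d, alpha=0.05, n_perm=2000):
    """y: (n, k) outcomes for k hypotheses; d: (n,) 0/1 treatment indicator."""
    def tstats(dd):
        diff = y[dd == 1].mean(0) - y[dd == 0].mean(0)
        se = np.sqrt(y[dd == 1].var(0, ddof=1) / (dd == 1).sum()
                     + y[dd == 0].var(0, ddof=1) / (dd == 0).sum())
        return np.abs(diff / se)
    t_obs = tstats(d)
    max_null = np.array([tstats(rng.permutation(d)).max()
                         for _ in range(n_perm)])
    return t_obs > np.quantile(max_null, 1 - alpha)

n, k = 200, 5
d = rng.integers(0, 2, n)
y = rng.standard_normal((n, k))
y[:, 0] += d                  # only hypothesis 0 carries a real effect
print(maxT_rejections(y, d))  # typically [True, False, False, False, False]
```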

     
  3. Summary

    Motivated by the statistical inference problem in population genetics, we present a new sequential importance sampling with resampling strategy. The idea of resampling is key to the recent surge of popularity of sequential Monte Carlo methods in the statistics and engineering communities, but existing resampling techniques do not work well for coalescent-based inference problems in population genetics. We develop a new method called ‘stopping-time resampling’, which allows us to compare partially simulated samples at different stages to terminate unpromising partial samples and to multiply promising samples early on. To illustrate the idea, we first apply the new method to approximate the solution of a Dirichlet problem and the likelihood function of a non-Markovian process. Then we focus on its application in population genetics. All our examples show that the new resampling method can significantly improve the computational efficiency of existing sequential importance sampling methods.
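    Below is a minimal sketch of generic sequential importance sampling with multinomial resampling, the mechanism this paper modifies; the stopping-time rule for comparing particles at different stages is not reproduced, and the toy target (Rosenbluth estimation of self-avoiding-walk counts) is our choice, not from the paper.

```python
# A minimal sketch of sequential importance sampling with multinomial
# resampling: promising partial samples are multiplied, unpromising ones
# dropped, and the running normalizing-constant estimate is banked at each
# resampling. Toy target: counting self-avoiding walks on Z^2 (ours).
import numpy as np

rng = np.random.default_rng(2)

def log_mean_exp(lw):
    m = lw.max()
    return m + np.log(np.mean(np.exp(lw - m)))

def saw_log_count(L, n=2000):
    """Estimate log(number of self-avoiding walks of length L on Z^2)."""
    walks = [[(0, 0)] for _ in range(n)]
    logw = np.zeros(n)
    log_z = 0.0
    for _ in range(L):
        for i, w in enumerate(walks):
            x, y = w[-1]
            nbrs = [q for q in ((x+1, y), (x-1, y), (x, y+1), (x, y-1))
                    if q not in w]
            if nbrs:
                logw[i] += np.log(len(nbrs))     # Rosenbluth weight update
                w.append(nbrs[rng.integers(len(nbrs))])
            else:
                logw[i] = -np.inf                # trapped: kill the particle
        probs = np.exp(logw - logw.max())
        probs /= probs.sum()
        if 1.0 / (probs ** 2).sum() < n / 2:     # effective sample size low?
            log_z += log_mean_exp(logw)          # bank the running estimate
            idx = rng.choice(n, n, p=probs)      # multiply promising walks,
            walks = [list(walks[j]) for j in idx]  # drop unpromising ones
            logw[:] = 0.0
    return log_z + log_mean_exp(logw)

print(np.exp(saw_log_count(5)))  # the exact count for L = 5 is 284
```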

     
  4.
    Bayesian methods have been widely used in the last two decades to infer statistical properties of spatially variable coefficients in partial differential equations from measurements of the solutions of these equations. Yet, in many cases the number of variables used to parameterize these coefficients is large, and obtaining meaningful statistics of their probability distributions is difficult using simple sampling methods such as the basic Metropolis–Hastings algorithm—in particular, if the inverse problem is ill-conditioned or ill-posed. As a consequence, many advanced sampling methods have been described in the literature that converge faster than Metropolis–Hastings, for example, by exploiting hierarchies of statistical models or hierarchies of discretizations of the underlying differential equation. At the same time, it remains difficult for the reader of the literature to quantify the advantages of these algorithms because there is no commonly used benchmark. This paper presents a benchmark Bayesian inverse problem—namely, the determination of a spatially variable coefficient, discretized by 64 values, in a Poisson equation, based on point measurements of the solution—that fills the gap between widely used simple test cases (such as superpositions of Gaussians) and real applications that are difficult for developers of sampling algorithms to replicate. We provide a complete description of the test case and an open-source implementation that can serve as the basis for further experiments. We have also computed 2 × 10^11 samples, at a cost of some 30 CPU years, of the posterior probability distribution, from which we have generated detailed and accurate statistics against which other sampling algorithms can be tested.
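    For reference, here is a minimal sketch of the baseline sampler named above, random-walk Metropolis–Hastings, applied to a stand-in two-dimensional target; the benchmark's actual 64-parameter Poisson-equation posterior lives in the paper's open-source implementation.

```python
# A minimal sketch of plain random-walk Metropolis-Hastings. The
# log-posterior below is a stand-in banana-shaped density, NOT the
# 64-parameter Poisson-equation posterior from the benchmark.
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):               # stand-in target, not the benchmark's
    return -0.5 * (theta[0] ** 2 + (theta[1] - theta[0] ** 2) ** 2)

def rw_metropolis(log_p, theta0, n_steps, step=0.5):
    theta = np.asarray(theta0, dtype=float)
    lp = log_p(theta)
    chain = np.empty((n_steps, theta.size))
    accepted = 0
    for t in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # MH accept/reject step
            theta, lp = prop, lp_prop
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_steps

chain, acc = rw_metropolis(log_post, [0.0, 0.0], 50_000)
print(f"acceptance rate {acc:.2f}; posterior mean {chain.mean(0)}")
```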
  5. Abstract

    Searching for patterns in data is important because it can lead to the discovery of sequence segments that play a functional role. The complexity of the pattern statistics used in data analysis, and the need for the sampling distribution of those statistics for inference, make efficient computation methods paramount. This article gives an overview of the main methods used to compute distributions of statistics of overlapping pattern occurrences, specifically, generating functions, correlation functions, the Goulden-Jackson cluster method, recursive equations, and Markov chain embedding. The underlying data sequence is assumed to be higher-order Markovian, which includes sparse Markov models and variable length Markov chains as special cases. Also considered are recent developments that extend the computational capabilities of the Markov chain-based method through an algorithm for minimizing the size of the chain's state space, as well as improved data modeling capabilities through sparse Markov models. An application to computing a distribution used as a test statistic in sequence alignment illustrates the usefulness of the methodology.

    This article is categorized under:

    Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition

    Data: Types and Structure > Categorical Data

    Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms
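    As a minimal worked instance of the Markov chain embedding method covered in the abstract above (our toy, assuming i.i.d. bits rather than a higher-order Markov sequence): the exact distribution of the number of possibly overlapping occurrences of a word follows from dynamic programming over pairs of automaton state and running count.

```python
# A minimal instance of Markov chain embedding (ours; i.i.d. bits only):
# exact law of the number of possibly overlapping occurrences of a word,
# via dynamic programming over (automaton state, running count). KMP
# failure links supply the automaton transitions.
import numpy as np

def pattern_count_dist(word, n, p1=0.5):
    """Exact law of N = #occurrences of `word` in n i.i.d. bits, P('1') = p1."""
    m = len(word)
    fail = [0] * m                         # KMP failure function
    for i in range(1, m):
        j = fail[i - 1]
        while j and word[i] != word[j]:
            j = fail[j - 1]
        fail[i] = j + 1 if word[i] == word[j] else 0

    def step(state, c):                    # automaton transition on symbol c
        j = state
        while j and word[j] != c:
            j = fail[j - 1]
        return j + 1 if word[j] == c else 0

    max_occ = max(n - m + 1, 0)
    dist = np.zeros((m, max_occ + 1))      # dist[state, count] = probability
    dist[0, 0] = 1.0
    for _ in range(n):
        new = np.zeros_like(dist)
        for s in range(m):
            for k in range(max_occ + 1):
                if dist[s, k] == 0.0:
                    continue
                for c, pr in (('0', 1.0 - p1), ('1', p1)):
                    t = step(s, c)
                    if t == m:             # full match: count it and follow
                        new[fail[m - 1], k + 1] += pr * dist[s, k]  # overlap
                    else:
                        new[t, k] += pr * dist[s, k]
        dist = new
    return dist.sum(axis=0)                # marginal distribution of N

print(pattern_count_dist("11", 4))  # [0.5, 0.3125, 0.125, 0.0625]
```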

     