
Title: On F-modelling-based empirical Bayes estimation of variances

We consider the problem of empirical Bayes estimation of multiple variances when provided with sample variances. Assuming an arbitrary prior on the variances, we derive different versions of the Bayes estimators using different loss functions. For one particular loss function, the resulting Bayes estimator relies on the marginal cumulative distribution function of the sample variances only. When replacing it with the empirical distribution function, we obtain an empirical Bayes version called the $F$-modelling-based empirical Bayes estimator of variances. We provide theoretical properties of this estimator, and further demonstrate its advantages through extensive simulations and real data analysis.
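The key substitution described above — replacing the marginal cumulative distribution function of the sample variances with the empirical distribution function — can be illustrated with a small simulation. The sketch below is a generic illustration of that plug-in step under an assumed model (inverse-gamma prior, scaled chi-squared sampling), not the paper's specific estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate p units: true variances from a prior, sample variances
# S2_i ~ sigma_i^2 * chi2(nu) / nu given sigma_i^2 (assumed model for the demo).
p, nu = 5000, 10
sigma2 = 1.0 / rng.gamma(shape=3.0, scale=1.0, size=p)   # inverse-gamma prior
s2 = sigma2 * rng.chisquare(nu, size=p) / nu

def ecdf(samples):
    """Return the empirical CDF of `samples` as a callable."""
    x = np.sort(samples)
    return lambda t: np.searchsorted(x, t, side="right") / x.size

F_hat = ecdf(s2)

# By Glivenko-Cantelli, the ECDF approximates the true marginal CDF F,
# estimated here by an independent large Monte Carlo draw from the same model.
sigma2_mc = 1.0 / rng.gamma(shape=3.0, scale=1.0, size=200_000)
s2_mc = sigma2_mc * rng.chisquare(nu, size=200_000) / nu
F_mc = ecdf(s2_mc)

grid = np.linspace(0.05, 3.0, 50)
max_gap = np.max(np.abs(F_hat(grid) - F_mc(grid)))
print(f"max |F_hat - F| on grid: {max_gap:.3f}")
```

With p = 5000 the uniform gap between the ECDF and the marginal CDF is already small, which is what justifies plugging the ECDF into a CDF-based Bayes estimator.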

Publisher / Repository: Oxford University Press
Size: p. 69–81
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method (EB-PCA) that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer–Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs, and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical ‘spiked’ models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.
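    The Kiefer–Wolfowitz NPMLE mentioned above can be sketched in its simplest form: a normal-means problem where the prior is fitted non-parametrically on a fixed grid of atoms by EM on the mixture weights (a standard discretization). The two-point prior, grid, and iteration count below are assumptions for the demo, not EB-PCA itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy normal-means problem: theta_i from a two-point prior, y_i ~ N(theta_i, 1).
n = 2000
theta = rng.choice([-2.0, 2.0], size=n)
y = theta + rng.standard_normal(n)

# Kiefer-Wolfowitz NPMLE restricted to a grid of atoms, fitted by EM on weights.
grid = np.linspace(-4, 4, 81)
w = np.full(grid.size, 1.0 / grid.size)
lik = np.exp(-0.5 * (y[:, None] - grid[None, :]) ** 2)  # N(y; a, 1) up to const

for _ in range(200):
    post = w * lik                        # n x K responsibilities
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                 # EM update of the atom weights

# Posterior-mean denoising under the fitted prior.
post = w * lik
post /= post.sum(axis=1, keepdims=True)
theta_hat = post @ grid

mse_raw = np.mean((y - theta) ** 2)
mse_eb = np.mean((theta_hat - theta) ** 2)
print(f"MSE(raw) = {mse_raw:.2f}, MSE(EB) = {mse_eb:.2f}")
```

The empirical Bayes posterior mean beats the raw observations substantially when the prior has strong structure, which is the situation where EB-PCA improves over plain PCA.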

  2.
    Pillai & Meng (Pillai & Meng 2016 Ann. Stat. 44, 2089–2097; p. 2091) speculated that ‘the dependence among [random variables, rvs] can be overwhelmed by the heaviness of their marginal tails …’. We give examples of statistical models that support this speculation. While under natural conditions the sample correlation of regularly varying (RV) rvs converges to a generally random limit, this limit is zero when the rvs are the reciprocals of powers greater than one of arbitrarily (but imperfectly) positively or negatively correlated normals. Surprisingly, the sample correlation of these RV rvs multiplied by the sample size has a limiting distribution on the negative half-line. We show that the asymptotic scaling of Taylor’s Law (a power-law variance function) for RV rvs is, up to a constant, the same for independent and identically distributed observations as for reciprocals of powers greater than one of arbitrarily (but imperfectly) positively correlated normals, whether those powers are the same or different. The correlations and heterogeneity do not affect the asymptotic scaling. We analyse the sample kurtosis of heavy-tailed data similarly. We show that the least-squares estimator of the slope in a linear model with heavy-tailed predictor and noise unexpectedly converges much faster than when they have finite variances.
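    The construction above is easy to simulate: take strongly correlated normals, then pass each through the transform 1/|z|^power with power > 1, which produces regularly varying variables with tail index 1/power < 1 (infinite mean). The correlation rho = 0.5 and power = 2 below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated standard normals (rho = 0.5, an illustrative choice).
n, rho = 1_000_000, 0.5
z = rng.standard_normal(n)
e = rng.standard_normal(n)
w = rho * z + np.sqrt(1 - rho**2) * e

# Reciprocals of powers greater than one of the normals: regularly
# varying with tail index 1/power < 1, so the mean is infinite.
power = 2.0
x = 1.0 / np.abs(z) ** power
y = 1.0 / np.abs(w) ** power

r = np.corrcoef(x, y)[0, 1]
print(f"sample correlation of heavy-tailed transforms: {r:.4f}")
```

Despite the substantial dependence of the underlying normals, the heavy tails dominate: the sample correlation of the transformed variables is typically near zero, consistent with the zero limit stated above (any single run is random, so the value fluctuates).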
  3. Abstract

    The genetic effective population size, Ne, can be estimated from the average gametic disequilibrium between pairs of loci, but such estimates require evaluation of assumptions and currently have few methods to estimate confidence intervals. speed‐ne is a suite of matlab computer code functions to estimate Ne from gametic disequilibrium with a graphical user interface and a rich set of outputs that aid in understanding data patterns and comparing multiple estimators. speed‐ne includes functions to either generate or input simulated genotype data to facilitate comparative studies of Ne estimators under various population genetic scenarios. speed‐ne was validated with data simulated under both time‐forward and time‐backward coalescent models of genetic drift. Three classes of Ne estimators were compared with simulated data to examine several general questions: what are the impacts of microsatellite null alleles, how should missing data be treated, and does disequilibrium contributed by reduced recombination among some loci in a sample impact the estimates. Estimators differed greatly in precision in the scenarios examined, and a widely employed estimator exhibited the largest variances among replicate data sets. speed‐ne implements several jackknife approaches to estimate confidence intervals, and simulated data showed that jackknifing over loci and jackknifing over individuals provided ~95% confidence interval coverage for some estimators and should be useful for empirical studies. speed‐ne provides an open‐source, extensible tool for estimation of Ne from empirical genotype data and to conduct simulations of both microsatellite and single nucleotide polymorphism (SNP) data types to develop expectations and to compare Ne estimators.
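    The delete-one jackknife used above for confidence intervals can be sketched generically: drop one unit (e.g., one locus) at a time, recompute the statistic, and form pseudo-values. The per-locus data below are synthetic placeholders, not output of speed‐ne.

```python
import numpy as np

rng = np.random.default_rng(3)

def jackknife_ci(values, stat, z=1.96):
    """Delete-one jackknife ~95% CI for `stat` applied to `values`
    (e.g., one per-locus estimate per entry)."""
    n = len(values)
    full = stat(values)
    loo = np.array([stat(np.delete(values, i)) for i in range(n)])
    pseudo = n * full - (n - 1) * loo            # jackknife pseudo-values
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    return est, (est - z * se, est + z * se)

# Hypothetical per-locus statistics; the true mean here is 0.10.
per_locus = rng.normal(loc=0.10, scale=0.03, size=50)
est, (lo, hi) = jackknife_ci(per_locus, np.mean)
print(f"estimate {est:.3f}, ~95% CI ({lo:.3f}, {hi:.3f})")
```

Jackknifing over loci versus over individuals is the same recipe with a different deletion unit; coverage then depends on how independent the deleted units are.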

  4. Abstract

    We propose a new procedure for inference on optimal treatment regimes in the model‐free setting, which does not require specifying an outcome regression model. Existing model‐free estimators for optimal treatment regimes are usually not suitable for the purpose of inference, because they either have nonstandard asymptotic distributions or do not necessarily guarantee consistent estimation of the parameter indexing the Bayes rule due to the use of surrogate loss. We first study a smoothed robust estimator that directly targets the parameter corresponding to the Bayes decision rule for optimal treatment regime estimation. This estimator is shown to have an asymptotic normal distribution. Furthermore, we verify that a resampling procedure provides asymptotically accurate inference for both the parameter indexing the optimal treatment regime and the optimal value function. A new algorithm is developed to calculate the proposed estimator with substantially improved speed and stability. Numerical results demonstrate the satisfactory performance of the new methods.
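    The smoothing idea above can be sketched in a minimal form: replace the hard indicator 1{x'β > 0} in an inverse-probability-weighted value estimate with a normal CDF Φ(x'β/h), then maximize over the rule's direction. The data-generating model, bandwidth h, and grid search below are assumptions for the demo, not the paper's algorithm.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

# Toy data: treatment helps iff x1 > x2 (true rule direction (1, -1)).
n = 5000
X = rng.standard_normal((n, 2))
A = rng.integers(0, 2, size=n)                  # randomized, P(A=1) = 1/2
benefit = np.where(X[:, 0] > X[:, 1], 1.0, -1.0)
Y = 1.0 + A * benefit + 0.5 * rng.standard_normal(n)

Phi = np.vectorize(lambda u: 0.5 * (1 + erf(u / sqrt(2))))
h = 0.1  # smoothing bandwidth (an assumption for this sketch)

def smoothed_value(theta):
    """Smoothed IPW estimate of the value of the rule 1{X beta > 0},
    with the indicator replaced by Phi(X beta / h)."""
    beta = np.array([np.cos(theta), np.sin(theta)])
    d = Phi(X @ beta / h)                  # smoothed treatment decision
    w = A * d + (1 - A) * (1 - d)          # agreement with the rule
    return np.mean(Y * w / 0.5)            # randomization probability 1/2

# Grid search over the direction (the rule is identified up to scale).
grid = np.linspace(-np.pi, np.pi, 181)
theta_hat = grid[np.argmax([smoothed_value(t) for t in grid])]

v_hat = smoothed_value(theta_hat)
v_never = np.mean(Y * (1 - A) / 0.5)       # IPW value of never treating
print(f"value(est rule) = {v_hat:.2f}, value(never treat) = {v_never:.2f}")
```

Because the smoothed criterion is differentiable, the resulting estimator admits a standard asymptotic normal analysis, which is what makes inference tractable here.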

  5. Summary

    We construct empirical Bayes intervals for a large number p of means. The existing intervals in the literature assume that the variances σi² are either equal or unequal but known. When the variances are unequal and unknown, the suggestion is typically to replace them by unbiased estimators Si². However, when p is large, there is an advantage in ‘borrowing strength’ across the p coordinates. We derive double-shrinkage intervals for means on the basis of our empirical Bayes estimators that shrink both the means and the variances. Analytical and simulation studies and application to a real data set show that, compared with the t-intervals, our intervals have higher coverage probabilities while yielding shorter lengths on average. The double-shrinkage intervals are on average shorter than the intervals from shrinking the means alone and are always no longer than the intervals from shrinking the variances alone. Also, the intervals are explicitly defined and can be computed immediately.
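    The double-shrinkage idea above — shrink each mean toward a common center and each variance estimate toward a pooled value — can be sketched with simple linear shrinkage. The shrinkage weights a and b below are fixed illustrative constants, not the paper's data-driven empirical Bayes weights.

```python
import numpy as np

rng = np.random.default_rng(5)

# p coordinates: means mu_i, unequal unknown variances sigma_i^2,
# observed mean ybar_i and unbiased variance estimate S2_i with nu d.f.
p, nu = 500, 9
mu = rng.normal(0.0, 1.0, size=p)
sigma2 = rng.uniform(0.5, 2.0, size=p)
ybar = mu + rng.normal(0.0, np.sqrt(sigma2 / (nu + 1)))
s2 = sigma2 * rng.chisquare(nu, size=p) / nu

# Illustrative double shrinkage (weights a, b are assumptions for the demo).
a, b = 0.7, 0.5
mu_shrunk = a * ybar + (1 - a) * ybar.mean()    # shrink the means
s2_shrunk = b * s2 + (1 - b) * s2.mean()        # shrink the variances

# Linear shrinkage toward a constant strictly reduces the spread
# of the variance estimates, stabilizing the resulting intervals.
print(np.var(s2_shrunk) < np.var(s2))  # → True
```

Intervals built from mu_shrunk and s2_shrunk are centered at shrunken means with stabilized widths; choosing a and b from the data is exactly where the empirical Bayes machinery enters.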
