skip to main content

Title: Empirical Bayes PCA in High Dimensions

When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer–Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical ‘spiked’ models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Medium: X Size: p. 853-878
["p. 853-878"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Many variables of interest in agricultural or economical surveys have skewed distributions and can equal zero. Our data are measures of sheet and rill erosion called Revised Universal Soil Loss Equation2 (RUSLE2). Small area estimates of mean RUSLE2 erosion are of interest. We use a zero‐inflated lognormal mixed effects model for small area estimation. The model combines a unit‐level lognormal model for the positive RUSLE2 responses with a unit‐level logistic mixed effects model for the binary indicator that the response is nonzero. In the Conservation Effects Assessment Project (CEAP) data, counties with a higher probability of nonzero responses also tend to have a higher mean among the positive RUSLE2 values. We capture this property of the data through an assumption that the pair of random effects for a county are correlated. We develop empirical Bayes (EB) small area predictors and a bootstrap estimator of the mean squared error (MSE). In simulations, the proposed predictor is superior to simpler alternatives. We then apply the method to construct EB predictors of mean RUSLE2 erosion for South Dakota counties. To obtain auxiliary variables for the population of cropland in South Dakota, we integrate a satellite‐derived land cover map with a geographic database of soil properties. We provide an R Shiny application calledviscover(available at to visualize the overlay operations required to construct the covariates. On the basis of bootstrap estimates of the mean square error, we conclude that the EB predictors of mean RUSLE2 erosion are superior to direct estimators.

    more » « less
  2. Abstract

    We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.

    more » « less
  3. Principal Component Analysis (PCA) is a standard dimensionality reduction technique, but it treats all samples uniformly, making it suboptimal for heterogeneous data that are increasingly common in modern settings. This paper proposes a PCA variant for samples with heterogeneous noise levels, i.e., heteroscedastic noise, that naturally arise when some of the data come from higher quality sources than others. The technique handles heteroscedasticity by incorporating it in the statistical model of a probabilistic PCA. The resulting optimization problem is an interesting nonconvex problem related to but not seemingly solved by singular value decomposition, and this paper derives an expectation maximization (EM) algorithm. Numerical experiments illustrate the benefits of using the proposed method to combine samples with heteroscedastic noise in a single analysis, as well as benefits of careful initialization for the EM algorithm. Index Terms— Principal component analysis, heterogeneous data, maximum likelihood estimation, latent factors 
    more » « less

    In astronomy, spectroscopy consists of observing an astrophysical source and extracting its spectrum of electromagnetic radiation. Once extracted, a model is fit to the spectra to measure the observables, leading to an understanding of the underlying physics of the emission mechanism. One crucial, and often overlooked, aspect of this model is the background emission, which contains foreground and background astrophysical sources, intervening atmospheric emission, and artefacts related to the instrument such as noise. This paper proposes an algorithmic approach to constructing a background model for SITELLE observations using statistical tools and supervised machine learning algorithms. SITELLE is an imaging Fourier transform spectrometer located at the Canada-France-Hawaii Telescope, which produces a three-dimensional data cube containing the position of the emission (two dimensions) and the spectrum of the emission. SITELLE has a wide field of view (11 arcmin × 11 arcmin), which makes the background emission particularly challenging to model. We apply a segmentation algorithm implemented in photutils to divide the data cube into background and source spaxels. After applying a principal component analysis (PCA) on the background spaxels, we train an artificial neural network to interpolate from the background to the source spaxels in the PCA coefficient space, which allows us to generate a local background model over the entire data cube. We highlight the performance of this methodology by applying it to SITELLE observations obtained of a Star-formation, Ionized Gas and Nebular Abundances Legacy Survey galaxy, NGC 4449, and the Perseus galaxy cluster of galaxies, NGC 1275. We discuss the physical interpretation of the principal components and noise reduction in the resulting PCA-based reconstructions. Additionally, we compare the fit results using our new background modelling approach with standard methods used in the literature and find that our method better captures the emission from H ii regions in NGC 4449 and the faint emission regions in NGC 1275. These methods also demonstrate that the background does change as a function of the position of the data cube. While the approach is applied explicitly to SITELLE data in this study, we argue that it can be readily adapted to any integral field unit style data, enabling the user to obtain more robust measurements on the flux of the emission lines.

    more » « less
  5. Summary

    We consider the problem of empirical Bayes estimation of multiple variances when provided with sample variances. Assuming an arbitrary prior on the variances, we derive different versions of the Bayes estimators using different loss functions. For one particular loss function, the resulting Bayes estimator relies on the marginal cumulative distribution function of the sample variances only. When replacing it with the empirical distribution function, we obtain an empirical Bayes version called the $F$-modelling-based empirical Bayes estimator of variances. We provide theoretical properties of this estimator, and further demonstrate its advantages through extensive simulations and real data analysis.

    more » « less