skip to main content


Title: Empirical Bayes PCA in High Dimensions
Abstract

When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer–Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical ‘spiked’ models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.

 
more » « less
Award ID(s):
1916198
NSF-PAR ID:
10398633
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
84
Issue:
3
ISSN:
1369-7412
Format(s):
Medium: X Size: p. 853-878
Size(s):
["p. 853-878"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Many variables of interest in agricultural or economical surveys have skewed distributions and can equal zero. Our data are measures of sheet and rill erosion called Revised Universal Soil Loss Equation2 (RUSLE2). Small area estimates of mean RUSLE2 erosion are of interest. We use a zero‐inflated lognormal mixed effects model for small area estimation. The model combines a unit‐level lognormal model for the positive RUSLE2 responses with a unit‐level logistic mixed effects model for the binary indicator that the response is nonzero. In the Conservation Effects Assessment Project (CEAP) data, counties with a higher probability of nonzero responses also tend to have a higher mean among the positive RUSLE2 values. We capture this property of the data through an assumption that the pair of random effects for a county are correlated. We develop empirical Bayes (EB) small area predictors and a bootstrap estimator of the mean squared error (MSE). In simulations, the proposed predictor is superior to simpler alternatives. We then apply the method to construct EB predictors of mean RUSLE2 erosion for South Dakota counties. To obtain auxiliary variables for the population of cropland in South Dakota, we integrate a satellite‐derived land cover map with a geographic database of soil properties. We provide an R Shiny application calledviscover(available athttps://lyux.shinyapps.io/viscover/) to visualize the overlay operations required to construct the covariates. On the basis of bootstrap estimates of the mean square error, we conclude that the EB predictors of mean RUSLE2 erosion are superior to direct estimators.

     
    more » « less
  2. Abstract

    We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.

     
    more » « less
  3. Principal Component Analysis (PCA) is a standard dimensionality reduction technique, but it treats all samples uniformly, making it suboptimal for heterogeneous data that are increasingly common in modern settings. This paper proposes a PCA variant for samples with heterogeneous noise levels, i.e., heteroscedastic noise, that naturally arise when some of the data come from higher quality sources than others. The technique handles heteroscedasticity by incorporating it in the statistical model of a probabilistic PCA. The resulting optimization problem is an interesting nonconvex problem related to but not seemingly solved by singular value decomposition, and this paper derives an expectation maximization (EM) algorithm. Numerical experiments illustrate the benefits of using the proposed method to combine samples with heteroscedastic noise in a single analysis, as well as benefits of careful initialization for the EM algorithm. Index Terms— Principal component analysis, heterogeneous data, maximum likelihood estimation, latent factors 
    more » « less
  4. Summary

    We consider the problem of empirical Bayes estimation of multiple variances when provided with sample variances. Assuming an arbitrary prior on the variances, we derive different versions of the Bayes estimators using different loss functions. For one particular loss function, the resulting Bayes estimator relies on the marginal cumulative distribution function of the sample variances only. When replacing it with the empirical distribution function, we obtain an empirical Bayes version called the $F$-modelling-based empirical Bayes estimator of variances. We provide theoretical properties of this estimator, and further demonstrate its advantages through extensive simulations and real data analysis.

     
    more » « less
  5. It is well-known that morphological features in the brain undergo changes due to traumatic events and associated disorders such as post-traumatic stress disorder (PTSD). However, existing approaches typically offer group-level comparisons, and there are limited predictive approaches for modeling behavioral outcomes based on brain shape features that can account for heterogeneity in PTSD, which is of paramount interest. We propose a comprehensive shape analysis framework representing brain sub-structures, such as the hippocampus, amygdala, and putamen, as parameterized surfaces and quantifying their shape differences using an elastic shape metric. Under this metric, we compute shape summaries (mean, covariance, PCA) of brain sub-structures and represent individual brain shapes by their principal scores under a shape-PCA basis. These representations are rich enough to allow visualizations of full 3D structures and help understand localized changes. In order to validate the elastic shape analysis, we use the principal components (PCs) to reconstruct the brain structures and perform further evaluation by performing a regression analysis to model PTSD and trauma severity using the brain shapes representedviaPCs and in conjunction with auxiliary exposure variables. We apply our method to data from the Grady Trauma Project (GTP), where the goal is to predict clinical measures of PTSD. The framework seamlessly integrates accurate morphological features and other clinical covariates to yield superior predictive performance when modeling PTSD outcomes. Compared to vertex-wise analysis and other widely applied shape analysis methods, the elastic shape analysis approach results in considerably higher reconstruction accuracy for the brain shape and reveals significantly greater predictive power. It also helps identify local deformations in brain shapes associated with PTSD severity.

     
    more » « less