Title: ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization
Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction that is useful for various data science problems. However, many applications involve heterogeneous data that varies in quality due to noise characteristics associated with different sources of the data. Methods that deal with such mixed datasets are known as heteroscedastic methods. Current methods like HePPCAT make Gaussian assumptions about the basis coefficients that may not hold in practice. Other methods such as Weighted PCA (WPCA) assume the noise variances are known, which may be difficult to obtain in practice. This paper develops a PCA method that can estimate the sample-wise noise variances and use this information in the model to improve the estimate of the subspace basis associated with the low-rank structure of the data. This is done without distributional assumptions on the low-rank component and without assuming the noise variances are known. Simulations show the effectiveness of accounting for such heteroscedasticity in the data and the benefits of using such a method with all of the data versus retaining only good data, and compare the method against other PCA methods established in the literature, such as PCA, Robust PCA (RPCA), and HePPCAT. Code available at
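To make the idea of sample-wise heteroscedastic PCA concrete, the following is a minimal NumPy sketch. It is not the ALPCAH algorithm described in the abstract. It is a simple alternating heuristic, assumed here for illustration only, that estimates each sample's noise variance from its residual off the current subspace and then refits the subspace with inverse-variance weights (the weighting that WPCA would use if the variances were known). All function names, dimensions, and noise levels are hypothetical choices.

```python
import numpy as np

def top_k_subspace(Y, k, weights=None):
    """Top-k left singular vectors of Y (d x n), with optional per-sample weights."""
    if weights is not None:
        Y = Y * np.sqrt(weights)  # scale column i by sqrt(w_i)
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k]

def alternating_hetero_pca(Y, k, n_iter=10):
    """Illustrative heuristic (not ALPCAH): alternate variance and subspace updates."""
    d, _ = Y.shape
    U = top_k_subspace(Y, k)  # initialize with ordinary PCA
    for _ in range(n_iter):
        R = Y - U @ (U.T @ Y)  # residual of each sample off the subspace
        # Per-sample variance estimate, floored to avoid division by zero.
        v = np.maximum((R ** 2).sum(axis=0) / (d - k), 1e-12)
        U = top_k_subspace(Y, k, weights=1.0 / v)  # inverse-variance weighting
    return U, v

def subspace_error(U, U0):
    """Frobenius distance between the two orthogonal projectors."""
    return np.linalg.norm(U @ U.T - U0 @ U0.T)

# Synthetic data (assumed setup): a few clean samples and many noisy ones.
rng = np.random.default_rng(0)
d, k = 50, 3
n_good, n_bad = 100, 400
sigma_good, sigma_bad = 0.1, 2.0

U0, _ = np.linalg.qr(rng.standard_normal((d, k)))  # true subspace basis
Z = rng.standard_normal((k, n_good + n_bad))       # basis coefficients
sigma = np.concatenate([np.full(n_good, sigma_good), np.full(n_bad, sigma_bad)])
Y = U0 @ Z + rng.standard_normal((d, n_good + n_bad)) * sigma

U_pca = top_k_subspace(Y, k)
U_alt, v_hat = alternating_hetero_pca(Y, k)
err_pca = subspace_error(U_pca, U0)
err_alt = subspace_error(U_alt, U0)

print("PCA subspace error:        ", err_pca)
print("Alternating subspace error:", err_alt)
print("Mean estimated variance (clean, noisy):",
      v_hat[:n_good].mean(), v_hat[n_good:].mean())
```

In this sketch the weights are the reciprocals of the estimated variances, so noisy samples still contribute to the fit but are down-weighted rather than discarded, which mirrors the abstract's point about using all of the data instead of retaining only the good samples.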
Journal Name:
International Conference on Sampling Theory and Applications
Page Range / eLocation ID:
1 to 6
New Haven, CT, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. The effectiveness of supervised learning techniques has made them ubiquitous in research and practice. In high-dimensional settings, supervised learning commonly relies on dimensionality reduction to improve performance and identify the most important factors in predicting outcomes. However, the economic importance of learning has made it a natural target for adversarial manipulation of training data, which we term poisoning attacks. Prior approaches to dealing with robust supervised learning rely on strong assumptions about the nature of the feature matrix, such as feature independence and sub-Gaussian noise with low variance. We propose an integrated method for robust regression that relaxes these assumptions, assuming only that the feature matrix can be well approximated by a low-rank matrix. Our techniques integrate improved robust low-rank matrix approximation and robust principal component regression, and yield strong performance guarantees. Moreover, we experimentally show that our methods significantly outperform state-of-the-art robust regression both in running time and prediction error.
  2. Low-rank matrix recovery problems involving high-dimensional and heterogeneous data appear in applications throughout statistics and machine learning. The contribution of this paper is to establish the fundamental limits of recovery for a broad class of these problems. In particular, we study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and underlying factors. These results are based on a novel reduction from the low-rank matrix tensor product model (with homogeneous noise) to a rank-one model with heteroskedastic noise. As an application of our main result, we show that recently proposed methods based on applying principal component analysis (PCA) to weighted combinations of the data are optimal in some settings but sub-optimal in others. We also provide numerical results comparing our asymptotic formulas with the performance of methods based on weighted PCA, gradient descent, and approximate message passing.
  3. We provide a non-asymptotic analysis of the spiked Wishart and Wigner matrix models with a generative neural network prior. Spiked random matrices have the form of a rank-one signal plus noise and have been used as models for high dimensional Principal Component Analysis (PCA), community detection and synchronization over groups. Depending on the prior imposed on the spike, these models can display a statistical-computational gap between the information theoretically optimal reconstruction error that can be achieved with unbounded computational resources and the sub-optimal performances of currently known polynomial time algorithms. These gaps are believed to be fundamental, as in the emblematic case of Sparse PCA. In stark contrast to such cases, we show that there is no statistical-computational gap under a generative network prior, in which the spike lies on the range of a generative neural network. Specifically, we analyze a gradient descent method for minimizing a nonlinear least squares objective over the range of an expansive-Gaussian neural network and show that it can recover in polynomial time an estimate of the underlying spike with a rate-optimal sample complexity and dependence on the noise level.
  4. Dette, Holger ; Lee, Stephen ; Pensky, Marianna (Ed.)
    Quantum state tomography, which aims to estimate quantum states that are described by density matrices, plays an important role in quantum science and quantum technology. This paper examines eigenspace estimation and the reconstruction of a large low-rank density matrix based on Pauli measurements. Both ordinary principal component analysis (PCA) and iterative thresholding sparse PCA (ITSPCA) estimators of the eigenspace are studied, and their respective convergence rates are established. In particular, we show that the ITSPCA estimator is rate-optimal. We present the reconstruction of the large low-rank density matrix and obtain its optimal convergence rate by using the ITSPCA estimator. A numerical study is carried out to investigate the finite sample performance of the proposed estimators.
  5. An alternating optimization algorithm is presented and analyzed for identifying low-rank signal components, known in factor analysis terminology as common factors, that are correlated across two multiple-input multiple-output (MIMO) channels. The additive noise model at each of the MIMO channels consists of white uncorrelated noises of unequal variances plus a low-rank structured interference that is not correlated across the two channels. The low-rank components at each channel represent uncommon or channel-specific factors. 