skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, December 13 until 2:00 AM ET on Saturday, December 14 due to maintenance. We apologize for the inconvenience.


Title: High-Dimensional Principal Component Analysis with Heterogeneous Missingness
Abstract

We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.

 
more » « less
Award ID(s):
2015366
PAR ID:
10382831
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
84
Issue:
5
ISSN:
1369-7412
Format(s):
Medium: X Size: p. 2000-2031
Size(s):
p. 2000-2031
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a novel method for matrix completion under general non- uniform missing structures. By controlling an upper bound of a novel balancing error, we construct weights that can actively adjust for the non-uniformity in the empirical risk without explicitly modeling the observation probabilities, and can be computed efficiently via convex optimization. The recovered matrix based on the proposed weighted empirical risk enjoys appealing theoretical guarantees. In particular, the proposed method achieves stronger guarantee than existing work in terms of the scaling with respect to the observation probabilities, under asymptotically heterogeneous missing settings (where entry-wise observation probabilities can be of different orders). These settings can be regarded as a better theoretical model of missing patterns with highly varying probabilities. We also provide a new minimax lower bound under a class of heterogeneous settings. Numerical experiments are also provided to demonstrate the effectiveness of the proposed method. 
    more » « less
  2. Matrix completion is a well-known approach for recommender systems. It predicts the values of the missing entries in a sparse user-item interaction matrix, based on the low-rank structure of the rating matrix. However, existing matrix completion methods do not take node polysemy and side information of social relationships into consideration, which can otherwise further improve the performance. In this paper, we propose a novel matrix completion method that employs both users’ friendships and rating entries to predict the missing values in a user-item matrix. Our approach adopts a graph-based modeling where nodes are users and items, and two types of edges are considered: user friendships and user-item interactions. Polysemy-aware node features are extracted from this heterogeneous graph through a graph convolution network by considering the multifaceted factors for edge formation, which are then connected to a hybrid loss function with two heads: (1) a social-homophily head to address node polysemy, and (2) an error head for user-item rating regression. The latter is formulated on all matrix entries to combat the sensitivity of negative sampling of the vast majority of missing entries during training, with a smart technique to reduce the time complexity. Extensive experiments over real datasets verify that our model outperforms the state-of-the-art matrix completion methods by a significant margin. 
    more » « less
  3. null (Ed.)
    Matrix completion, the problem of completing missing entries in a data matrix with low-dimensional structure (such as rank), has seen many fruitful approaches and analyses. Tensor completion is the tensor analog that attempts to impute missing tensor entries from similar low-rank type assumptions. In this paper, we study the tensor completion problem when the sampling pattern is deterministic and possibly non-uniform. We first propose an efficient weighted Higher Order Singular Value Decomposition (HOSVD) algorithm for the recovery of the underlying low-rank tensor from noisy observations and then derive the error bounds under a properly weighted metric. Additionally, the efficiency and accuracy of our algorithm are both tested using synthetic and real datasets in numerical simulations. 
    more » « less
  4. Abstract In this paper, we study the nonlinear inverse problem of estimating the spectrum of a system matrix, that drives a finite-dimensional affine dynamical system, from partial observations of a single trajectory data. In the noiseless case, we prove an annihilating polynomial of the system matrix, whose roots are a subset of the spectrum, can be uniquely determined from data. We then study which eigenvalues of the system matrix can be recovered and derive various sufficient and necessary conditions to characterize the relationship between the recoverability of each eigenvalue and the observation locations. We propose various reconstruction algorithms with theoretical guarantees, generalizing the classical Prony method, ESPRIT, and matrix pencil method. We test the algorithms over a variety of examples with applications to graph signal processing, disease modeling and a real-human motion dataset. The numerical results validate our theoretical results and demonstrate the effectiveness of the proposed algorithms. 
    more » « less
  5. We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is possible to accurately estimate this “learnability” even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a d-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with O(sqrt(d)) samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. For comparison, even if the labels are noiseless linear functions of the data, a sample size linear in the dimension, d, is required to learn any function correlated with the underlying model. Our estimation approach also applies to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. In this setting we give a consistent estimator of the fraction of explainable variance that uses o(d) samples. Finally, our techniques also extend to the setting of binary classification, where we obtain analogous results under the logistic model, for estimating the classification accuracy of the best linear classifier. We demonstrate the practical viability of our approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected. 
    more » « less