In this paper, we propose a novel method for matrix completion under general non- uniform missing structures. By controlling an upper bound of a novel balancing error, we construct weights that can actively adjust for the non-uniformity in the empirical risk without explicitly modeling the observation probabilities, and can be computed efficiently via convex optimization. The recovered matrix based on the proposed weighted empirical risk enjoys appealing theoretical guarantees. In particular, the proposed method achieves stronger guarantee than existing work in terms of the scaling with respect to the observation probabilities, under asymptotically heterogeneous missing settings (where entry-wise observation probabilities can be of different orders). These settings can be regarded as a better theoretical model of missing patterns with highly varying probabilities. We also provide a new minimax lower bound under a class of heterogeneous settings. Numerical experiments are also provided to demonstrate the effectiveness of the proposed method.
more »
« less
High-Dimensional Principal Component Analysis with Heterogeneous Missingness
Abstract We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
more »
« less
- Award ID(s):
- 2015366
- PAR ID:
- 10382831
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Journal of the Royal Statistical Society Series B: Statistical Methodology
- Volume:
- 84
- Issue:
- 5
- ISSN:
- 1369-7412
- Format(s):
- Medium: X Size: p. 2000-2031
- Size(s):
- p. 2000-2031
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Matrix completion is a well-known approach for recommender systems. It predicts the values of the missing entries in a sparse user-item interaction matrix, based on the low-rank structure of the rating matrix. However, existing matrix completion methods do not take node polysemy and side information of social relationships into consideration, which can otherwise further improve the performance. In this paper, we propose a novel matrix completion method that employs both users’ friendships and rating entries to predict the missing values in a user-item matrix. Our approach adopts a graph-based modeling where nodes are users and items, and two types of edges are considered: user friendships and user-item interactions. Polysemy-aware node features are extracted from this heterogeneous graph through a graph convolution network by considering the multifaceted factors for edge formation, which are then connected to a hybrid loss function with two heads: (1) a social-homophily head to address node polysemy, and (2) an error head for user-item rating regression. The latter is formulated on all matrix entries to combat the sensitivity of negative sampling of the vast majority of missing entries during training, with a smart technique to reduce the time complexity. Extensive experiments over real datasets verify that our model outperforms the state-of-the-art matrix completion methods by a significant margin.more » « less
-
Meka, Raghu (Ed.)Matrix completion tackles the task of predicting missing values in a low-rank matrix based on a sparse set of observed entries. It is often assumed that the observation pattern is generated uniformly at random or has a very specific structure tuned to a given algorithm. There is still a gap in our understanding when it comes to arbitrary sampling patterns. Given an arbitrary sampling pattern, we introduce a matrix completion algorithm based on network flows in the bipartite graph induced by the observation pattern. For additive matrices, we show that the electrical flow is optimal, and we establish error upper bounds customized to each entry as a function of the observation set, along with matching minimax lower bounds. Our results show that the minimax squared error for recovery of a particular entry in the matrix is proportional to the effective resistance of the corresponding edge in the graph. Furthermore, we show that the electrical flow estimator is equivalent to the least squares estimator. We apply our estimator to the two-way fixed effects model and show that it enables us to accurately infer individual causal effects and the unit-specific and time-specific confounders. For rank-1 matrices, we use edge-disjoint paths to form an estimator that achieves minimax optimal estimation when the sampling is sufficiently dense. Our discovery introduces a new family of estimators parametrized by network flows, which provide a fine-grained and intuitive understanding of the impact of the given sampling pattern on the difficulty of estimation at an entry-specific level. This graph-based approach allows us to quantify the inherent complexity of matrix completion for individual entries, rather than relying solely on global measures of performance.more » « less
-
null (Ed.)Matrix completion, the problem of completing missing entries in a data matrix with low-dimensional structure (such as rank), has seen many fruitful approaches and analyses. Tensor completion is the tensor analog that attempts to impute missing tensor entries from similar low-rank type assumptions. In this paper, we study the tensor completion problem when the sampling pattern is deterministic and possibly non-uniform. We first propose an efficient weighted Higher Order Singular Value Decomposition (HOSVD) algorithm for the recovery of the underlying low-rank tensor from noisy observations and then derive the error bounds under a properly weighted metric. Additionally, the efficiency and accuracy of our algorithm are both tested using synthetic and real datasets in numerical simulations.more » « less
-
Abstract In this paper, we study the nonlinear inverse problem of estimating the spectrum of a system matrix, that drives a finite-dimensional affine dynamical system, from partial observations of a single trajectory data. In the noiseless case, we prove an annihilating polynomial of the system matrix, whose roots are a subset of the spectrum, can be uniquely determined from data. We then study which eigenvalues of the system matrix can be recovered and derive various sufficient and necessary conditions to characterize the relationship between the recoverability of each eigenvalue and the observation locations. We propose various reconstruction algorithms with theoretical guarantees, generalizing the classical Prony method, ESPRIT, and matrix pencil method. We test the algorithms over a variety of examples with applications to graph signal processing, disease modeling and a real-human motion dataset. The numerical results validate our theoretical results and demonstrate the effectiveness of the proposed algorithms.more » « less
An official website of the United States government
