Abstract: We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
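The project-impute-update loop described in this abstract can be sketched in a few lines of NumPy. The sketch below is only an illustration under simplifying assumptions: it initializes with a zero-filled SVD rather than the OPW estimator, the function name and defaults are invented here, and it is not claimed to match the authors' implementation of primePCA.

```python
import numpy as np

def prime_pca_style_sketch(Y, mask, k, n_iter=50):
    """Illustrative project-impute-update iteration (not the authors' code).

    Y    : (n, d) data matrix; values at unobserved entries are ignored
    mask : (n, d) boolean array, True where an entry is observed
    k    : dimension of the principal subspace to estimate
    Returns a (d, k) orthonormal estimate of the leading right singular space.
    """
    n, d = Y.shape
    Y0 = np.where(mask, Y, 0.0)

    # Crude initializer: top-k right singular vectors of the zero-filled matrix
    # (the paper instead starts from the observed-proportion weighted estimator).
    _, _, Vt = np.linalg.svd(Y0, full_matrices=False)
    V = Vt[:k].T

    for _ in range(n_iter):
        Y_imp = np.where(mask, Y, 0.0)
        for i in range(n):
            obs = mask[i]
            # Project the observed entries of row i onto the current subspace...
            coef, *_ = np.linalg.lstsq(V[obs], Y[i, obs], rcond=None)
            # ...and use the fit to impute the missing entries of that row.
            Y_imp[i, ~obs] = V[~obs] @ coef
        # Update the estimate with the leading right singular space
        # of the imputed matrix.
        _, _, Vt = np.linalg.svd(Y_imp, full_matrices=False)
        V = Vt[:k].T
    return V
```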
Matrix Completion with Model-free Weighting
In this paper, we propose a novel method for matrix completion under general non-uniform missing structures. By controlling an upper bound on a novel balancing error, we construct weights that can actively adjust for the non-uniformity in the empirical risk without explicitly modeling the observation probabilities, and that can be computed efficiently via convex optimization. The recovered matrix based on the proposed weighted empirical risk enjoys appealing theoretical guarantees. In particular, the proposed method achieves a stronger guarantee than existing work in terms of the scaling with respect to the observation probabilities, under asymptotically heterogeneous missing settings (where entry-wise observation probabilities can be of different orders). These settings can be regarded as a better theoretical model of missing patterns with highly varying probabilities. We also provide a new minimax lower bound under a class of heterogeneous settings. Numerical experiments demonstrate the effectiveness of the proposed method.
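For concreteness, a weighted empirical-risk formulation of matrix completion of the general kind described above can be solved, for a given weight matrix, by proximal gradient descent on a nuclear-norm-regularized objective. The sketch below is a generic solver with assumed names and choices; it takes the weights W as given, whereas constructing W (via the paper's convex balancing program) is the paper's contribution and is not reproduced here.

```python
import numpy as np

def weighted_nuclear_norm_completion(Y, mask, W, lam, step, n_iter=200):
    """Proximal gradient for
        min_M  0.5 * sum_{(i,j) observed} W_ij * (M_ij - Y_ij)^2 + lam * ||M||_*
    W are nonnegative weights on the observed entries (taken as given here).
    For convergence, step should be at most 1 / max(W over observed entries).
    """
    Yobs = np.where(mask, Y, 0.0)
    M = np.zeros_like(Yobs)
    for _ in range(n_iter):
        # Gradient of the weighted squared loss, supported on observed entries.
        grad = np.where(mask, W * (M - Yobs), 0.0)
        A = M - step * grad
        # Proximal step for the nuclear norm: soft-threshold the singular values.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        M = (U * np.maximum(s - step * lam, 0.0)) @ Vt
    return M
```

Setting W to the reciprocal of estimated entry-wise observation frequencies would recover a familiar inverse-propensity-style weighting; the point of the model-free construction is that W can instead be chosen to control a balancing error without estimating those probabilities.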
- Award ID(s): 1711952
- PAR ID: 10303652
- Date Published:
- Journal Name: Proceedings of Machine Learning Research
- ISSN: 2640-3498
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss ℓ can control the test error under all Moreau envelopes of the loss ℓ. We use our finite-sample bound to directly recover the “optimistic rate” of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal ℓ2-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory. (The standard definition of the Moreau envelope is recalled after this list.)
- Multi-modal data are prevalent in many scientific fields. In this study, we consider parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative. (A sketch of the imputation-free covariance step appears after this list.)
- Domain adaptation addresses the challenge where the distribution of target inference data differs from that of the source training data. Recently, data privacy has become a significant constraint, limiting access to the source domain. To mitigate this issue, Source-Free Domain Adaptation (SFDA) methods bypass source domain data by generating source-like data or pseudo-labeling the unlabeled target domain. However, these approaches often lack theoretical grounding. In this work, we provide a theoretical analysis of the SFDA problem, focusing on the general empirical risk of the unlabeled target domain. Our analysis offers a comprehensive understanding of how representativeness, generalization, and variety contribute to controlling the upper bound of target domain empirical risk in SFDA settings. We further explore how to balance this trade-off from three perspectives: sample selection, semantic domain alignment, and a progressive learning framework. These insights inform the design of novel algorithms. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on three benchmark datasets (Office-Home, DomainNet, and VisDA-C), yielding relative improvements of 3.2%, 9.1%, and 7.5%, respectively, over the representative SFDA method, SHOT.
- One of the first steps in applications of statistical network analysis is frequently to produce summary charts of important features of the network. Many of these features take the form of sequences of graph statistics counting the number of realized events in the network; examples include degree distributions, edgewise shared partner distributions, and more. We provide conditions under which the empirical distributions of sequences of graph statistics are consistent in the L-infinity norm in settings where edges in the network are dependent. We accomplish this task by deriving concentration inequalities that bound probabilities of deviations of graph statistics from the expected value under weak dependence conditions. We apply our concentration inequalities to empirical distributions of sequences of graph statistics and derive non-asymptotic bounds on the L-infinity error which hold with high probability. Our non-asymptotic results are then extended to demonstrate uniform convergence almost surely in selected examples. We illustrate the theoretical results through examples, simulation studies, and an application. (A sketch computing degree and edgewise shared partner distributions appears after this list.)
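As a pointer for the first related record above: the Moreau envelope of a loss ℓ referred to there is, in its standard form (the paper's exact normalization may differ),

```latex
\ell_\lambda(y,\hat y) \;=\; \inf_{u \in \mathbb{R}} \Big\{ \ell(y,u) + \tfrac{1}{2\lambda}\,(u-\hat y)^2 \Big\}, \qquad \lambda > 0,
```

which never exceeds ℓ (take u = ŷ) and, for a convex loss, increases back toward ℓ as λ decreases to 0.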
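The second related record's first step, estimating covariance and cross-covariance matrices from incomplete data without imputation, can be illustrated with a simple pairwise-complete estimator; the function below is a generic stand-in, not necessarily the estimator used in that paper.

```python
import numpy as np

def pairwise_complete_covariance(X):
    """Covariance estimate that, for each pair (j, k), uses only the samples
    where both variables are observed (missing values encoded as NaN).
    No imputation is performed; entries with fewer than two complete pairs
    are left as NaN. Note the result need not be positive semidefinite.
    """
    n, p = X.shape
    obs = ~np.isnan(X)
    S = np.full((p, p), np.nan)
    for j in range(p):
        for k in range(j, p):
            both = obs[:, j] & obs[:, k]
            if both.sum() >= 2:
                xj = X[both, j] - X[both, j].mean()
                xk = X[both, k] - X[both, k].mean()
                S[j, k] = S[k, j] = (xj @ xk) / (both.sum() - 1)
    return S
```

The cross-covariance between predictors and responses can be estimated in the same pairwise fashion before being passed to the penalized second step.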
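Finally, two of the graph statistics named in the last related record, degree distributions and edgewise shared partner distributions, are straightforward to compute empirically; the small sketch below uses networkx, with function names and the random-graph example chosen here purely for illustration.

```python
import networkx as nx
from collections import Counter

def degree_distribution(G):
    """Empirical degree distribution: fraction of nodes with each degree."""
    counts = Counter(d for _, d in G.degree())
    n = G.number_of_nodes()
    return {d: c / n for d, c in sorted(counts.items())}

def esp_distribution(G):
    """Empirical edgewise shared partner distribution: for each edge {u, v},
    count the common neighbours of u and v, then report the fraction of
    edges with each count."""
    counts = Counter(len(list(nx.common_neighbors(G, u, v))) for u, v in G.edges())
    m = G.number_of_edges()
    return {k: c / m for k, c in sorted(counts.items())}

# Illustration on a small random graph.
G = nx.erdos_renyi_graph(200, 0.05, seed=0)
print(degree_distribution(G))
print(esp_distribution(G))
```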