Title: Principal Component Regression with Semirandom Observations via Matrix Completion
Principal Component Regression (PCR) is a popular method for prediction from data, and is one way to address the so-called multi-collinearity problem in regression. It was shown recently that algorithms for PCR such as hard singular value thresholding (HSVT) are also quite robust, in that they can handle data with missing or noisy covariates. However, such spectral approaches require strong distributional assumptions on which entries are observed. Specifically, every covariate is assumed to be observed with probability (exactly) p, for some value of p. Our goal in this work is to weaken this requirement, and as a step towards this, we study a "semi-random" model. In this model, every covariate is revealed with probability p, and then an adversary comes in and reveals additional covariates. While the model seems intuitively easier, it is well known that algorithms such as HSVT perform poorly in this setting. Our approach is based on studying the closely related problem of Noisy Matrix Completion in a semi-random setting. By considering a new semidefinite programming relaxation, we develop new guarantees for matrix completion, which is our core technical contribution.
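As a point of reference for the spectral approach discussed above, here is a minimal numpy sketch of HSVT-based PCR under uniformly missing covariates: zero-fill, rescale by the estimated observation probability, hard-threshold the singular values, and regress on the denoised matrix. This is the baseline the abstract says breaks down under semi-random reveals, not the paper's SDP-based method; the function and variable names are illustrative.

```python
import numpy as np

def hsvt_pcr(X_obs, mask, y, rank):
    """Hard singular value thresholding (HSVT) based PCR on partially observed covariates.

    X_obs : (n, d) covariates; unobserved entries may hold any placeholder value
    mask  : (n, d) boolean array, True where an entry was observed
    y     : (n,) responses
    rank  : number of principal components retained
    """
    p_hat = mask.mean()                               # estimated observation probability p
    X_filled = np.where(mask, X_obs, 0.0) / p_hat     # zero-fill and rescale (unbiased if reveals are uniform)
    U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
    s[rank:] = 0.0                                    # hard singular value thresholding
    X_hat = (U * s) @ Vt                              # low-rank denoised covariate matrix
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # min-norm least squares on the rank-k matrix = PCR estimate
    return beta, X_hat
```

Under the semi-random model, the adversary's extra reveals make the effective observation probability non-uniform, so the single 1/p̂ rescaling above is exactly the step that fails.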
Award ID(s):
2008688
PAR ID:
10255472
Author(s) / Creator(s):
; ;
Editor(s):
Banerjee, Arindam; Fukumizu, Kenji
Date Published:
Journal Name:
International Conference on Artificial Intelligence and Statistics (AISTATS)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary: Covariate adjustment can improve precision in analysing randomized experiments. With fully observed data, regression adjustment and propensity score weighting are asymptotically equivalent in improving efficiency over unadjusted analysis. When some outcomes are missing, we consider combining these two adjustment methods with the inverse probability of observation weighting for handling missing outcomes, and show that the equivalence between the two methods breaks down. Regression adjustment no longer ensures efficiency gain over unadjusted analysis unless the true outcome model is linear in covariates or the outcomes are missing completely at random. Propensity score weighting, in contrast, still guarantees an efficiency gain over unadjusted analysis, and including more covariates in adjustment never harms asymptotic efficiency. Moreover, we establish the value of using partially observed covariates to secure additional efficiency by the missingness indicator method, which imputes all missing covariates by zero and uses the union of the completed covariates and corresponding missingness indicators as the new, fully observed covariates. Based on these findings, we recommend using regression adjustment in combination with the missingness indicator method if the linear outcome model or missing-completely-at-random assumption is plausible, and using propensity score weighting with the missingness indicator method otherwise.
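For concreteness, a small numpy sketch of the missingness indicator method described above, together with an inverse-probability-of-observation-weighted difference in means for missing outcomes. The observation probability is treated as known here for simplicity (it would typically be estimated), and all names are illustrative.

```python
import numpy as np

def missingness_indicator_covariates(X):
    """Impute missing covariates (NaN) by zero and append their missingness
    indicators, yielding a new, fully observed covariate matrix."""
    miss = np.isnan(X)
    X_imp = np.where(miss, 0.0, X)
    return np.hstack([X_imp, miss.astype(float)])

def ipw_difference_in_means(y, treat, observed, obs_prob):
    """Unadjusted treatment-effect estimate, weighting observed outcomes by the
    inverse probability of observation (illustrative; obs_prob assumed known)."""
    w = observed.astype(float) / obs_prob       # inverse probability of observation weights
    y = np.where(observed, y, 0.0)              # missing outcomes contribute zero
    mu1 = np.sum(w * treat * y) / np.sum(w * treat)
    mu0 = np.sum(w * (1 - treat) * y) / np.sum(w * (1 - treat))
    return mu1 - mu0
```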
  2. We consider the well-studied problem of completing a rank-$$r$$, $$\mu$$-incoherent matrix from incomplete observations. We focus on this problem in the semi-random setting where each entry is independently revealed with probability at least $$p$$. Whereas multiple nearly-linear time algorithms have been established in the more specialized fully-random setting where each entry is revealed with probability exactly $$p$$, the only known nearly-linear time algorithm in the semi-random setting is due to [CG18], whose sample complexity has a polynomial dependence on the inverse accuracy and condition number and thus cannot achieve high-accuracy recovery. Our main result is the first high-accuracy nearly-linear time algorithm for solving semi-random matrix completion, and an extension to the noisy observation setting. Our result builds upon the recent short-flat decomposition framework of [KLLST23a, KLLST23b] and leverages fast algorithms for flow problems on graphs to solve adaptive reweighting subproblems efficiently.
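The semi-random observation model used throughout this line of work (including the main paper above) is easy to state in code: each entry is revealed independently with probability p, and an adversary may only add further observations on top of those reveals. A small illustrative sketch, with names chosen for exposition:

```python
import numpy as np

def semi_random_mask(n, d, p, adversary=None, seed=0):
    """Observation mask for the semi-random model: every entry is revealed with
    probability p, then an arbitrary adversary may reveal additional entries."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n, d)) < p        # fully-random reveals, probability exactly p
    if adversary is not None:
        mask |= adversary(mask)          # the adversary can only ADD observations
    return mask

def reveal_first_column(mask):
    """Example adversary: additionally reveal every entry in the first column."""
    extra = np.zeros_like(mask)
    extra[:, 0] = True
    return extra

# Example usage: mask = semi_random_mask(100, 50, p=0.2, adversary=reveal_first_column)
```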
  3. Low-rank matrix recovery is a fundamental problem in machine learning with numerous applications. In practice, the problem can be solved by convex optimization, namely nuclear norm minimization, or by non-convex optimization, as it is well known that for low-rank matrix problems like matrix sensing and matrix completion, all local optima of the natural non-convex objectives are also globally optimal under certain ideal assumptions. In this paper, we study new approaches for matrix sensing in a semi-random model where an adversary can add any number of arbitrary sensing matrices. More precisely, the problem is to recover a low-rank matrix $$X^\star$$ from linear measurements $$b_i = \langle A_i, X^\star \rangle$$, where an unknown subset of the sensing matrices satisfies the Restricted Isometry Property (RIP) and the rest of the $$A_i$$'s are chosen adversarially. It is known that in the semi-random model, existing non-convex objectives can have bad local optima. To fix this, we present a descent-style algorithm that provably recovers the ground-truth matrix $$X^\star$$. For the closely related problem of semi-random matrix completion, prior work [CG18] showed that all bad local optima can be eliminated by reweighting the input data. However, the analogous approach for matrix sensing requires reweighting a set of matrices to satisfy RIP, which is a condition that is NP-hard to check. Instead, we build on the framework proposed in [KLL$^+$23] for semi-random sparse linear regression, where the algorithm in each iteration reweights the input based on the current solution, and then takes a weighted gradient step that is guaranteed to work well locally. Our analysis crucially exploits the connection between sparsity in vector problems and low-rankness in matrix problems, which may have other applications in obtaining robust algorithms for sparse and low-rank problems.
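As a contrast to the reweighted algorithm described in that abstract, the following is a sketch of the "natural non-convex objective" it refers to: plain factored gradient descent on $$\sum_i (\langle A_i, UU^\top \rangle - b_i)^2$$. This is the baseline that can get trapped in bad local optima when some sensing matrices are adversarial; the step size, initialization, and names are illustrative assumptions, not the paper's method.

```python
import numpy as np

def factored_matrix_sensing(A, b, r, steps=500, lr=0.01, seed=0):
    """Plain non-convex matrix sensing: gradient descent on U for
    f(U) = (1/m) * sum_i (<A_i, U U^T> - b_i)^2, returning X ~ U U^T.

    A : (m, n, n) sensing matrices, b : (m,) measurements, r : target rank.
    """
    rng = np.random.default_rng(seed)
    m, n, _ = A.shape
    U = 0.01 * rng.standard_normal((n, r))            # small random initialization
    for _ in range(steps):
        X = U @ U.T
        residuals = np.einsum('kij,ij->k', A, X) - b  # <A_i, U U^T> - b_i
        G = 2.0 * np.einsum('k,kij->ij', residuals, A) / m
        U -= lr * (G + G.T) @ U                       # gradient of the factored objective wrt U
    return U @ U.T
```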
  4. In this work, we propose a new Gaussian process (GP) regression framework that enforces physical constraints in a probabilistic manner. Specifically, we focus on inequality and monotonicity constraints. This GP model is trained by the quantum-inspired Hamiltonian Monte Carlo (QHMC) algorithm, which is an efficient way to sample from a broad class of distributions by allowing a particle to have a random mass matrix with a probability distribution. Integrating QHMC into inequality- and monotonicity-constrained GP regression in the probabilistic sense, our approach enhances the accuracy and reduces the variance of the resulting GP model. Additionally, the probabilistic aspect of the method leads to reduced computational expenses and execution time. Further, we present an adaptive learning algorithm that guides the selection of constraint locations. The accuracy and efficiency of the method are demonstrated in estimating the hyperparameters of high-dimensional GP models under noisy conditions, reconstructing the sparsely observed state of a steady-state heat transport problem, and learning a conservative tracer distribution from sparse tracer concentration measurements.
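A generic sketch of the quantum-inspired HMC idea mentioned above: ordinary leapfrog HMC, except that the (diagonal) mass matrix is itself redrawn from a distribution, e.g. with log-normal entries, before every trajectory. The GP likelihood, the probabilistic inequality/monotonicity constraints, and the adaptive selection of constraint locations from the abstract are not shown; all parameter choices and names here are illustrative assumptions.

```python
import numpy as np

def qhmc_sample(log_prob, grad_log_prob, x0, n_samples=1000,
                step=0.05, n_leapfrog=20, mass_sigma=1.0, seed=0):
    """Quantum-inspired HMC sketch: resample a random diagonal mass matrix
    (log-normal entries) before each leapfrog trajectory."""
    rng = np.random.default_rng(seed)
    x, samples = np.asarray(x0, dtype=float), []
    for _ in range(n_samples):
        mass = np.exp(rng.normal(0.0, mass_sigma, size=x.shape))  # random mass matrix
        p = rng.normal(0.0, np.sqrt(mass))                        # momentum ~ N(0, mass)
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step * grad_log_prob(x_new)                # leapfrog integration
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new / mass
            p_new += step * grad_log_prob(x_new)
        x_new += step * p_new / mass
        p_new += 0.5 * step * grad_log_prob(x_new)
        h_old = -log_prob(x) + 0.5 * np.sum(p ** 2 / mass)        # Metropolis correction
        h_new = -log_prob(x_new) + 0.5 * np.sum(p_new ** 2 / mass)
        if rng.random() < np.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)
```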
  5. We propose methods for estimating correspondence between two point sets in the presence of outliers in both the source and target sets. The proposed algorithms expand upon the theory of the regression without correspondence problem to estimate transformation coefficients using unordered multisets of covariates and responses. Previous theoretical analysis of the problem has been done in a setting where the responses are a complete permutation of the regressed covariates. This paper expands the problem setting by analyzing the cases where only a subset of the responses is a permutation of the regressed covariates, in addition to some covariates possibly being adversarial outliers. We term this problem robust regression without correspondence and provide several algorithms based on random sample consensus for exact and approximate recovery in a noiseless and noisy one-dimensional setting, as well as an approximation algorithm for multiple dimensions. The theoretical guarantees of the algorithms are verified in simulated data. We demonstrate an important computational neuroscience application of the proposed framework by showing its effectiveness in a Caenorhabditis elegans neuron matching problem, where the presence of outliers in both the source and target nematodes is a natural tendency.
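A loose sketch of the random-sample-consensus idea described above, for the simplest noiseless one-dimensional case (y ≈ a·x with an unknown partial correspondence and outliers on both sides): hypothesize a coefficient from one covariate/response pair, then score it by how many responses land near some predicted value. The tolerance, trial count, and greedy (non one-to-one) matching are illustrative simplifications, not the paper's algorithms.

```python
import numpy as np

def ransac_regression_without_correspondence(x, y, trials=2000, tol=1e-2, seed=0):
    """Robust 1-D regression without correspondence via random sample consensus."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_a, best_inliers = None, -1
    for _ in range(trials):
        i, j = rng.integers(len(x)), rng.integers(len(y))
        if abs(x[i]) < 1e-12:
            continue
        a = y[j] / x[i]                      # candidate coefficient from one hypothesized match
        preds = a * x                        # predicted responses under this candidate
        # score: responses within tol of *some* prediction (greedy matching)
        inliers = np.sum(np.min(np.abs(y[:, None] - preds[None, :]), axis=1) < tol)
        if inliers > best_inliers:
            best_a, best_inliers = a, inliers
    return best_a, best_inliers
```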