This content will become publicly available on June 14, 2026

Title: Beyond Sin-Squared Error: Linear-Time Entrywise Uncertainty Quantification for Streaming PCA
We propose a novel statistical inference framework for streaming principal component analysis (PCA) using Oja's algorithm, enabling the construction of confidence intervals for individual entries of the estimated eigenvector. Most existing works on streaming PCA focus on providing sharp sin-squared error guarantees. Recently, there has been some interest in uncertainty quantification for the sin-squared error. However, uncertainty quantification and sharp error guarantees for entries of the estimated eigenvector in the streaming setting remain largely unexplored. We derive a sharp Bernstein-type concentration bound for elements of the estimated vector matching the optimal error rate up to logarithmic factors. We also establish a Central Limit Theorem for a suitably centered and scaled subset of the entries. To efficiently estimate the coordinate-wise variance, we introduce a provably consistent subsampling algorithm that leverages the median-of-means approach, empirically achieving similar accuracy to multiplier bootstrap methods while being significantly more computationally efficient. Numerical experiments demonstrate its effectiveness in providing reliable uncertainty estimates with a fraction of the computational cost of existing methods.
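For readers unfamiliar with the estimator the abstract builds on, the following is a minimal sketch of the textbook form of Oja's update for the top eigenvector, written in plain Python. It is not the authors' code and does not implement their variance-estimation procedure. The function names `oja_top_eigenvector` and `sample_stream`, the constant step size, and the synthetic diagonal-covariance data are all assumptions made for illustration.

```python
import math
import random


def oja_top_eigenvector(stream, dim, step_size, seed=0):
    """Single pass of Oja's update; stores only one length-`dim` vector.

    Hypothetical helper for illustration: the constant `step_size` is an
    assumption, not a schedule taken from the paper.
    """
    rng = random.Random(seed)
    # Random unit-norm initial vector.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for x in stream:
        # Oja step: w <- w + eta * x * (x^T w), costing O(dim) per sample.
        proj = sum(xi * wi for xi, wi in zip(x, w))
        w = [wi + step_size * proj * xi for wi, xi in zip(w, x)]
        # Renormalize so the iterate stays on the unit sphere.
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def sample_stream(n, dim, top_std, seed=1):
    """Yield n synthetic samples with covariance diag(top_std**2, 1, ..., 1).

    Assumed toy data: the top eigenvector is the first coordinate axis.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] *= top_std
        yield x


if __name__ == "__main__":
    w_hat = oja_top_eigenvector(sample_stream(5000, 5, 3.0), dim=5, step_size=0.005)
    # The sign of the estimate is arbitrary, so compare |w_hat[0]| with 1.
    print([round(v, 3) for v in w_hat])
```

Each sample is touched once and only the current iterate is kept, which is the single-pass, O(d)-space regime the abstract refers to. Entrywise confidence intervals would additionally require a variance estimate for each coordinate, which this sketch does not attempt.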
Award ID(s):
2505865
PAR ID:
10631436
Author(s) / Creator(s):
; ;
Publisher / Repository:
https://doi.org/10.48550/arXiv.2506.12655
Date Published:
ISSN:
2506.12655
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Oja's algorithm for Streaming Principal Component Analysis (PCA) for n data points in a d-dimensional space achieves the same sin-squared error O(r_eff/n) as the offline algorithm in O(d) space and O(nd) time and a single pass through the data points. Here r_eff is the effective rank (the ratio of the trace to the principal eigenvalue of the population covariance matrix Σ). Under this computational budget, we consider the problem of sparse PCA, where the principal eigenvector of Σ is s-sparse, and r_eff can be large. In this setting, to our knowledge, there are no known single-pass algorithms that achieve the minimax error bound in O(d) space and O(nd) time without either requiring strong initialization conditions or assuming further structure (e.g., spiked) of the covariance matrix. We show that a simple single-pass procedure that thresholds the output of Oja's algorithm (the Oja vector) can achieve the minimax error bound under some regularity conditions in O(d) space and O(nd) time. We present a nontrivial and novel analysis of the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices on a random initial vector. This is completely different from previous analyses of Oja's algorithm and matrix products, which have been carried out when r_eff is bounded.
  2. We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix K ∈ R^{n×n} corresponding to n points x_1, …, x_n ∈ R^d. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to 1+ϵ relative error in time sublinear in n and linear in d for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to 1+ϵ relative error in time subquadratic in n and linear in d. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.
  3. Oja’s rule [Oja, Journal of mathematical biology 1982] is a well-known biologically-plausible algorithm using a Hebbian-type synaptic update rule to solve streaming principal component analysis (PCA). Computational neuroscientists have known that this biological version of Oja’s rule converges to the top eigenvector of the covariance matrix of the input in the limit. However, prior to this work, proving any convergence rate guarantee was an open problem. In this work, we give the first convergence rate analysis for the biological version of Oja’s rule in solving streaming PCA. Moreover, our convergence rate matches the information-theoretic lower bound up to logarithmic factors and outperforms the state-of-the-art upper bound for streaming PCA. Furthermore, we develop a novel framework inspired by ordinary differential equations (ODE) to analyze general stochastic dynamics. The framework abandons the traditional step-by-step analysis and instead analyzes a stochastic dynamic in one shot by giving a closed-form solution to the entire dynamic. The one-shot framework allows us to apply stopping time and martingale techniques to obtain flexible and precise control over the dynamic. We believe that this general framework is powerful and should lead to effective yet simple analysis for a large class of problems with stochastic dynamics.
  4. Abernethy, Jacob; Agarwal, Agarwal (Ed.)
    Motivated by problems in controlled experiments, we study the discrepancy of random matrices with continuous entries where the number of columns $$n$$ is much larger than the number of rows $$m$$. Our first result shows that if $$\omega(1) = m = o(n)$$, a matrix with i.i.d. standard Gaussian entries has discrepancy $$\Theta(\sqrt{n} \, 2^{-n/m})$$ with high probability. This provides sharp guarantees for Gaussian discrepancy in a regime that had not been considered before in the existing literature. Our results also apply to a more general family of random matrices with continuous i.i.d. entries, assuming that $$m = O(n/\log{n})$$. The proof is non-constructive and is an application of the second moment method. Our second result is algorithmic and applies to random matrices whose entries are i.i.d. and have a Lipschitz density. We present a randomized polynomial-time algorithm that achieves discrepancy $$e^{-\Omega(\log^2(n)/m)}$$ with high probability, provided that $$m = O(\sqrt{\log{n}})$$. In the one-dimensional case, this matches the best known algorithmic guarantees due to Karmarkar–Karp. For higher dimensions $$2 \leq m = O(\sqrt{\log{n}})$$, this establishes the first efficient algorithm achieving discrepancy smaller than $$O( \sqrt{m} )$$. 
  5. The accuracy of many downstream machine learning algorithms is tied to the training data having uncorrelated features. Because modern data are often streaming in nature, geographically distributed, and high-dimensional, it is paramount to apply both uncorrelated feature learning and dimensionality reduction techniques in this scenario. Principal Component Analysis (PCA) is a state-of-the-art tool that simultaneously yields uncorrelated features and reduces data dimensions by projecting data onto the eigenvectors of the population covariance matrix. This paper introduces a novel algorithm called Consensus-DIstributEd Generalized Oja (C-DIEGO), which is based on Oja's method, to estimate the dominant eigenvector of a population covariance matrix in a distributed, streaming setting. The algorithm considers a distributed network of arbitrarily connected nodes without a central coordinator and assumes data samples continuously arrive at the individual nodes in a streaming manner. The paper establishes that C-DIEGO can achieve an order-optimal convergence rate if nodes in the network are allowed enough consensus rounds per algorithmic iteration. The paper also reports numerical results that showcase the efficacy of the proposed algorithm.