Principal component analysis (PCA) is a key dimensionality reduction tool used across data science. Many applications, however, involve heterogeneous data whose quality varies across samples because different sources contribute different noise levels; methods that handle such data are known as heteroscedastic methods. Existing approaches have limitations: HePPCAT places Gaussian assumptions on the basis coefficients that may not hold in practice, while Weighted PCA (WPCA) assumes the noise variances are known, which may be difficult to ensure in practice. This paper develops a PCA method that estimates the sample-wise noise variances and uses them in the model to improve the estimate of the subspace basis associated with the low-rank structure of the data. This is done without distributional assumptions on the low-rank component and without assuming the noise variances are known. Simulations show the effectiveness of accounting for such heteroscedasticity in the data, the benefits of using such a method with all of the data versus retaining only the good data, and comparisons against other PCA methods established in the literature, such as PCA, Robust PCA (RPCA), and HePPCAT. Code available at https://github.com/javiersc1/ALPCAH.
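As a rough illustration of the idea only (this is not the ALPCAH algorithm itself, just a generic alternating heuristic under synthetic assumptions), the sketch below estimates per-sample noise variances from residuals off the current subspace and refits the basis with inverse-variance-weighted PCA. All dimensions, noise levels, and variable names are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heteroscedastic data: n1 "good" and n2 "noisy" samples in R^D,
# all lying near the same rank-k subspace (columns of Y are samples).
D, k, n1, n2 = 50, 3, 50, 450
U_true = np.linalg.qr(rng.standard_normal((D, k)))[0]
noise_std = np.concatenate([np.full(n1, 0.1), np.full(n2, 2.0)])
Y = U_true @ rng.standard_normal((k, n1 + n2)) \
    + rng.standard_normal((D, n1 + n2)) * noise_std

def weighted_pca(Y, w, k):
    """Top-k eigenvectors of the sample-weighted covariance."""
    C = (Y * w) @ Y.T / w.sum()
    return np.linalg.eigh(C)[1][:, -k:]

def subspace_angle(U_hat):
    """Largest principal angle (radians) between U_hat and the truth."""
    s = np.linalg.svd(U_true.T @ U_hat, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))

# Plain PCA treats every sample equally.
U = weighted_pca(Y, np.ones(n1 + n2), k)
print("plain PCA angle      :", subspace_angle(U))

# Alternate: estimate each sample's noise variance from its residual off the
# current subspace, then refit with inverse-variance weights.
for _ in range(10):
    resid = Y - U @ (U.T @ Y)
    var_est = np.maximum((resid ** 2).sum(axis=0) / (D - k), 1e-8)
    U = weighted_pca(Y, 1.0 / var_est, k)
print("reweighted PCA angle :", subspace_angle(U))
```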
Common population codes produce extremely nonlinear neural manifolds
Populations of neurons represent sensory, motor, and cognitive variables via patterns of activity distributed across the population. The size of the population used to encode a variable is typically much greater than the dimension of the variable itself, and thus, the corresponding neural population activity occupies lower-dimensional subsets of the full set of possible activity states. Given population activity data with such lower-dimensional structure, a fundamental question asks how close the low-dimensional data lie to a linear subspace. The linearity or nonlinearity of the low-dimensional structure reflects important computational features of the encoding, such as robustness and generalizability. Moreover, identifying such linear structure underlies common data analysis methods such as Principal Component Analysis (PCA). Here, we show that for data drawn from many common population codes the resulting point clouds and manifolds are exceedingly nonlinear, with the dimension of the best-fitting linear subspace growing at least exponentially with the true dimension of the data. Consequently, linear methods like PCA fail dramatically at identifying the true underlying structure, even in the limit of arbitrarily many data points and no noise.
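A minimal sketch of the phenomenon under assumed parameters (200 neurons with narrow Gaussian tuning over a single circular stimulus; the tuning width and population size are illustrative choices, not taken from the paper): even though the latent dimension is 1, PCA needs many components to capture most of the variance of the noise-free response manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1D circular stimulus encoded by N neurons with narrow Gaussian tuning
# curves: the latent dimension is 1, but the response manifold is curved.
N, T, width = 200, 5000, 0.1
centers = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
theta = rng.uniform(0.0, 2 * np.pi, T)

# Wrapped (circular) distance between each stimulus and each preferred value.
d = np.angle(np.exp(1j * (theta[:, None] - centers[None, :])))
X = np.exp(-d ** 2 / (2 * width ** 2))        # T x N population responses

# PCA spectrum of the noise-free responses: how many linear dimensions are
# needed to capture most of the variance of this 1D manifold?
Xc = X - X.mean(axis=0)
svals = np.linalg.svd(Xc, compute_uv=False)
frac = np.cumsum(svals ** 2) / np.sum(svals ** 2)
print("components for 95% variance:", int(np.searchsorted(frac, 0.95)) + 1)
```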
- Award ID(s): 1934568
- PAR ID: 10555433
- Publisher / Repository: National Academy of Sciences
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 120
- Issue: 39
- ISSN: 0027-8424
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method (EB-PCA) that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer–Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs, and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical 'spiked' models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.
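The toy sketch below is not EB-PCA (there is no NPMLE prior estimation or AMP refinement); it only illustrates, in a rank-one spiked model, how a prior-aware (posterior-mean) denoising step can sharpen a raw sample principal component. The ±1 prior, the signal strength, and the plug-in alignment estimate are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-one "spiked" model: Y = (s/n) u v^T + W, with W_ij ~ N(0, 1/n) and a
# +/-1 (Rademacher) prior on the entries of u.
n, s = 1000, 2.5
u = rng.choice([-1.0, 1.0], size=n)
v = rng.standard_normal(n)
Y = (s / n) * np.outer(u, v) + rng.standard_normal((n, n)) / np.sqrt(n)

# Sample PC, rescaled so its entries are O(1), with the sign ambiguity fixed.
u_hat = np.linalg.svd(Y, full_matrices=False)[0][:, 0] * np.sqrt(n)
u_hat *= np.sign(u_hat @ u)

def corr(a, b):
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Treat u_hat as "alpha * u + Gaussian noise" and apply the posterior-mean
# denoiser for a +/-1 prior (a tanh). In EB-PCA the alignment alpha is
# estimated from the singular values via random matrix theory and the prior
# is learned by NPMLE; here we cheat and use the truth, purely to illustrate.
alpha = corr(u_hat, u)
u_eb = np.tanh(alpha * u_hat / (1.0 - alpha ** 2))

print("alignment of raw sample PC :", corr(u_hat, u))
print("alignment after denoising  :", corr(u_eb, u))
```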
Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. This work provides the first convergence guarantee for arbitrary update rules of projection matrices, generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including common ones such as LION and Adam. Inspired by this theoretical understanding, the authors propose Online Subspace Descent, a new family of subspace descent optimizers that do not rely on SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates it with online PCA. This approach is flexible and introduces minimal overhead to training. Experiments show that for pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across settings, narrowing the gap with full-rank baselines.
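A toy sketch of the general idea (not the paper's exact update rule, optimizer, or experimental setup): optimizer moments are kept in a small subspace, and the projection matrix is refreshed with an online-PCA (Oja-style) step on each fresh gradient rather than a periodic SVD. The least-squares objective, dimensions, and learning rates below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: fit W (d x c) by least squares, keeping Adam-style moments
# only in an r-dimensional subspace of the d gradient rows.
d, c, r = 256, 10, 8
lr, beta1, beta2, eps, eta = 1e-2, 0.9, 0.999, 1e-8, 1e-3
A = rng.standard_normal((1024, d))
Y = A @ rng.standard_normal((d, c)) + 0.01 * rng.standard_normal((1024, c))
W = np.zeros((d, c))

P = np.linalg.qr(rng.standard_normal((d, r)))[0]   # projection matrix (d x r)
m = np.zeros((r, c))                               # first moment, in subspace
v = np.zeros((r, c))                               # second moment, in subspace

for t in range(1, 501):
    G = A.T @ (A @ W - Y) / len(A)                 # full gradient (d x c)

    # Online-PCA (Oja-style) refresh of the projection, no SVD involved.
    # (Moments are not remapped when P moves; the real method treats this
    # more carefully than this sketch does.)
    P = np.linalg.qr(P + eta * G @ (G.T @ P))[0]

    # Adam-style update computed on the projected gradient, mapped back.
    g = P.T @ G
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    step = (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    W -= lr * (P @ step)

print("final loss:", 0.5 * np.mean((A @ W - Y) ** 2))
```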
It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parametrizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., a small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with O(d) sample complexity in a large ambient dimension d. Our main results characterize a hierarchical property, the merged-staircase property, that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost-orthogonal functions.
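For intuition, a staircase-type example on binary inputs of the kind studied in this line of work (an illustration of the flavor of the property, not its formal statement) is

$$ f(x) = x_1 + x_1 x_2 + x_1 x_2 x_3, \qquad x \in \{\pm 1\}^d. $$

Each monomial extends an earlier one by a single new coordinate, so the relevant coordinates can be picked up one at a time during training; by contrast, an isolated high-degree monomial such as $x_1 x_2 x_3$ on its own lacks this incremental structure.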
Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $\mathcal{O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the Krylov subspace associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.
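A rough sketch of the core mechanism, not the paper's algorithm or analysis: build a Krylov basis from the gradient and Hessian-vector products, then take a cubic regularized Newton step restricted to that subspace. The quadratic toy objective, the subspace size, the regularization constant, and the bisection solver for the cubic subproblem are all my own assumed choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy convex objective f(x) = 0.5 x^T A x - b^T x in dimension d, minimized
# with cubic regularized Newton steps restricted to an m-dimensional Krylov
# subspace built from the gradient and Hessian-vector products.
d, m, M = 500, 10, 1.0                       # M: cubic regularization constant
A = rng.standard_normal((d, d)); A = A @ A.T / d + np.eye(d)
b = rng.standard_normal(d)
grad = lambda x: A @ x - b
hvp = lambda p: A @ p                        # Hessian-vector product

def krylov_basis(g, m):
    """Orthonormal basis of span{g, Hg, ..., H^(m-1) g}."""
    Q = np.zeros((d, m))
    Q[:, 0] = g / np.linalg.norm(g)
    for j in range(1, m):
        w = hvp(Q[:, j - 1])
        w -= Q[:, :j] @ (Q[:, :j].T @ w)     # Gram-Schmidt against previous
        if np.linalg.norm(w) < 1e-12:
            return Q[:, :j]                  # Krylov space exhausted early
        Q[:, j] = w / np.linalg.norm(w)
    return Q

def cubic_step(g_m, H_m, M, iters=60):
    """Minimize g^T s + 0.5 s^T H s + (M/6)||s||^3 in the subspace. The
    minimizer satisfies (H + lam I) s = -g with lam = (M/2)||s||; find lam
    by bisection (the mismatch is monotone in lam for convex H)."""
    lo, hi = 0.0, 1e8
    I = np.eye(len(g_m))
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        s = np.linalg.solve(H_m + lam * I, -g_m)
        lo, hi = (lam, hi) if 0.5 * M * np.linalg.norm(s) > lam else (lo, lam)
    return s

x = np.zeros(d)
for it in range(15):
    g = grad(x)
    Q = krylov_basis(g, m)
    H_m = Q.T @ (A @ Q)                      # projected Hessian (at most m x m)
    x = x + Q @ cubic_step(Q.T @ g, H_m, M)
    print(f"iter {it:2d}  ||grad|| = {np.linalg.norm(grad(x)):.3e}")
```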