When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer–Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical ‘spiked’ models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.
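For orientation, the sample PCA baseline that this abstract compares against can be sketched in a few lines. This is ordinary PCA computed from a singular value decomposition, not the EB-PCA procedure itself; the function name `sample_pcs` and its return layout are illustrative assumptions rather than anything defined by the paper.

```python
import numpy as np

def sample_pcs(X, k):
    """Top-k sample principal components of X (n samples x d features).

    Plain PCA via the SVD; hypothetical helper, not the EB-PCA method.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    # Thin SVD: Xc = U diag(s) Vt; rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    directions = Vt[:k].T                   # d x k, orthonormal columns
    scores = U[:, :k] * s[:k]               # n x k, sample projections
    explained_var = s[:k] ** 2 / (X.shape[0] - 1)
    return directions, scores, explained_var
```

When the dimension d is comparable to the sample size n, the `directions` returned here are the noisy estimates that the abstract proposes to refine.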
In this paper, we address the problem of estimating transport surplus (a.k.a. matching affinity) in high-dimensional optimal transport problems. Classical optimal transport theory specifies the matching affinity and determines the optimal joint distribution. In contrast, we study the inverse problem of estimating matching affinity based on the observation of the joint distribution, using an entropic regularization of the problem. To accommodate high dimensionality of the data, we propose a novel method that incorporates a nuclear norm regularization that effectively enforces a rank constraint on the affinity matrix. The low-rank matrix estimated in this way reveals the main factors that are relevant for matching.
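The abstract does not spell out an optimization algorithm, so the following is only a sketch of one standard building block for nuclear norm regularization: the proximal operator of the nuclear norm, which soft-thresholds the singular values of a matrix. A proximal gradient scheme would alternate this step with a gradient step on the entropy-regularized fitting term; that outer loop, the function name `svt`, and the threshold `tau` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def svt(A, tau):
    """Singular value soft-thresholding.

    Returns argmin_B 0.5 * ||B - A||_F^2 + tau * ||B||_*, i.e. the proximal
    operator of tau times the nuclear norm evaluated at A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)     # shrink singular values toward 0
    return (U * s_shrunk) @ Vt              # rebuild with shrunken spectrum
```

Singular values below `tau` are set exactly to zero, which is how a nuclear norm penalty yields a low-rank affinity matrix in practice.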
- Award ID(s): 1716489
- Publication Date:
- NSF-PAR ID: 10116279
- Journal Name: Information and Inference: A Journal of the IMA
- Volume: 8
- Issue: 4
- Page Range or eLocation-ID: p. 677-689
- ISSN: 2049-8772
- Publisher: Oxford University Press
- Sponsoring Org: National Science Foundation
More Like this
-
Selecting the optimal Markowitz portfolio depends on estimating the covariance matrix of the returns of N assets from T periods of historical data. Problematically, N is typically of the same order as T, which makes the sample covariance matrix estimator perform poorly, both empirically and theoretically. While various other general-purpose covariance matrix estimators have been introduced in the financial economics and statistics literature for dealing with the high dimensionality of this problem, we here propose an estimator that exploits the fact that assets are typically positively dependent. This is achieved by imposing that the joint distribution of returns be multivariate totally positive of order 2 (MTP2). This constraint on the covariance matrix not only enforces positive dependence among the assets but also regularizes the covariance matrix, leading to desirable statistical properties such as sparsity. Based on stock market data spanning 30 years, we show that estimating the covariance matrix under MTP2 outperforms previous state-of-the-art methods including shrinkage estimators and factor models.
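To make the role of the covariance estimate concrete, here is a minimal sketch of how a covariance matrix feeds into one common Markowitz-type portfolio, the fully invested global minimum-variance portfolio with weights proportional to Σ⁻¹1. The abstract does not say which portfolio variant is evaluated, and the MTP2 estimator itself is not implemented here; `min_variance_weights` is a hypothetical helper that accepts any positive definite covariance estimate.

```python
import numpy as np

def min_variance_weights(cov):
    """Global minimum-variance weights w = inv(cov) @ 1 / (1' inv(cov) 1).

    Assumes cov is symmetric positive definite; weights sum to 1 and
    short positions are allowed.
    """
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)          # avoids forming the inverse
    return x / x.sum()
```

Because the weights depend on the inverse of the covariance matrix, estimation error in the covariance is amplified when N is of the same order as T, which is the regime the abstract targets.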
-
Photoacoustic computed tomography (PACT) is an emerging computed imaging modality that exploits optical contrast and ultrasonic detection principles to form images of the photoacoustically induced initial pressure distribution within tissue. The PACT reconstruction problem corresponds to a time-domain inverse source problem, where the initial pressure distribution is recovered from the measurements recorded on an aperture outside the support of the source. A major challenge in transcranial PACT brain imaging is to compensate for aberrations in the measured acoustic data that are induced by propagation of the photoacoustic wavefields through the skull. To properly account for these effects, previously proposed image reconstruction methods for transcranial PACT require knowledge of the spatial distribution of the elastic parameters of the skull. However, estimating the spatial distribution of these parameters prior to the PACT experiment remains challenging. To circumvent this issue, in this work a method to jointly reconstruct the initial pressure distribution and a low-dimensional representation of the elastic parameters of the skull is developed and investigated. The joint reconstruction (JR) problem is solved by use of a proximal optimization method that allows constraints and non-smooth regularization terms. The proposed method is evaluated by use of large-scale three-dimensional (3D) computer-simulation studies that…
-
The Generative Moment-Matching Network (GMMN) is a deep generative model that uses maximum mean discrepancy as the objective for learning model parameters. However, this model can only generate samples; it cannot infer the latent code from samples for downstream tasks. In this paper, we propose a novel Joint Generative Moment-Matching Network (JGMMN), which learns the structural latent code for unsupervised inference. Specifically, JGMMN has a generation network for the generation task and an inference network for the inference task. We first reformulate this model as a problem of matching two joint distributions. To solve this problem, we propose to use the Joint Maximum Mean Discrepancy (JMMD) as the objective to learn these two networks simultaneously. Furthermore, we propose a novel multi-modal regularization to enforce consistency between the sample distribution and the inferred latent code distribution. Finally, extensive experiments on both synthetic and real-world datasets verify the effectiveness and correctness of our proposed JGMMN.
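The maximum mean discrepancy objective mentioned above can be illustrated with a small sketch. It computes the biased (V-statistic) estimate of the squared MMD between two samples using a Gaussian kernel. The kernel choice, the `bandwidth` parameter, and the function names are assumptions made for illustration; the abstract does not specify them, and the joint variant (JMMD) is not shown.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 h^2))."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    sq = np.maximum(sq, 0.0)                # guard against tiny negatives
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples X (n x d), Y (m x d)."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_kernel(X, Y, bandwidth).mean())
```

In a moment-matching network, `X` would hold generated samples and `Y` real samples, and the generator would be trained to drive this quantity toward zero.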
-
Low-rank matrix recovery problems involving high-dimensional and heterogeneous data appear in applications throughout statistics and machine learning. The contribution of this paper is to establish the fundamental limits of recovery for a broad class of these problems. In particular, we study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and the underlying factors. These results are based on a novel reduction from the low-rank matrix tensor product model (with homogeneous noise) to a rank-one model with heteroskedastic noise. As an application of our main result, we show that recently proposed methods based on applying principal component analysis (PCA) to weighted combinations of the data are optimal in some settings but sub-optimal in others. We also provide numerical results comparing our asymptotic formulas with the performance of methods based on weighted PCA, gradient descent, and approximate message passing.