There is increasing interest in modeling the relationship between two sets of high-dimensional measurements that are potentially highly correlated. Canonical correlation analysis (CCA) is a classical tool that explores the dependency between two multivariate random variables and extracts canonical pairs of highly correlated linear combinations. Driven by applications in genomics, text mining, and imaging research, among others, many recent studies generalize CCA to high-dimensional settings. However, most of them either rely on strong assumptions on the covariance matrices or do not produce nested solutions. We propose a new sparse CCA (SCCA) method that recasts high-dimensional CCA as an iterative penalized least squares problem. Thanks to this formulation, our method directly estimates the sparse CCA directions with efficient algorithms. Therefore, in contrast to some existing methods, the new SCCA does not impose any sparsity assumptions on the covariance matrices. The proposed SCCA is also very flexible in the sense that it can easily be combined with properly chosen penalty functions to perform structured variable selection and incorporate prior information. Moreover, SCCA produces nested solutions, which is a great convenience in practice. Theoretical results show that SCCA can consistently estimate the true canonical pairs with overwhelming probability in ultra-high dimensions. Numerical results also demonstrate its competitive performance.
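The iterative penalized least squares formulation lends itself to a short alternating algorithm: fix one canonical direction, regress the resulting scores on the other data block with a sparsity-inducing penalty, renormalize, and repeat. Below is a minimal sketch of that recipe for a single canonical pair; the function name, SVD initialization, lasso penalty levels, and stopping rule are illustrative assumptions, not the authors' exact algorithm.

```python
# A minimal sketch of the iterative penalized least squares idea for one
# sparse canonical pair.  Illustration only: the SVD initialization, fixed
# penalty levels, and stopping rule are assumptions made for readability.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_cca_pair(X, Y, alpha_u=0.05, alpha_v=0.05, n_iter=100, tol=1e-6):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # initialize the Y-direction from the leading right singular vector of X'Y
    _, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    v = Vt[0]
    v = v / (np.linalg.norm(Y @ v) + 1e-12)
    for _ in range(n_iter):
        v_old = v.copy()
        # penalized least squares: regress the current Y-score on X (l1 penalty)
        u = Lasso(alpha=alpha_u, fit_intercept=False).fit(X, Y @ v).coef_
        u = u / (np.linalg.norm(X @ u) + 1e-12)
        # penalized least squares: regress the current X-score on Y (l1 penalty)
        v = Lasso(alpha=alpha_v, fit_intercept=False).fit(Y, X @ u).coef_
        v = v / (np.linalg.norm(Y @ v) + 1e-12)
        if np.linalg.norm(v - v_old) < tol:
            break
    rho = float((X @ u) @ (Y @ v))  # sample correlation of the two scores
    return u, v, rho
```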
Canonical correlation analysis in high dimensions with structured regularization
Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate computational strategies that avoid excessive computation when applying regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
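For reference, the ridge-type RCCA criterion referred to above is commonly written as below; the structured variants introduced in the article, including GRCCA, can be viewed as replacing the λI ridge terms with penalty matrices that encode the grouping of the variables (the exact penalties are specified in the article). The sample covariance blocks and penalty parameters in this display are standard notation assumed for illustration.

```latex
% ridge-regularized CCA: maximize cross-covariance under regularized
% variance constraints on each block
\max_{u,\,v}\; u^{\top} \widehat{\Sigma}_{xy}\, v
\quad \text{subject to} \quad
u^{\top}\!\left(\widehat{\Sigma}_{xx} + \lambda_x I\right) u = 1,
\qquad
v^{\top}\!\left(\widehat{\Sigma}_{yy} + \lambda_y I\right) v = 1 .
```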
- Award ID(s): 2013736
- PAR ID: 10324406
- Date Published:
- Journal Name: Statistical Modelling
- ISSN: 1471-082X
- Page Range / eLocation ID: 1471082X2110410
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Quantifying dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute or estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.
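In its one-dimensional form, the quantity described above can be written as a supremum of mutual information over unit-vector projections of the two variables (the k-dimensional version replaces the unit spheres with Stiefel manifolds); the notation here is assumed for illustration and may differ from the paper's.

```latex
% max-sliced mutual information over one-dimensional projections
\mathrm{mSMI}(X;Y) \;=\; \sup_{\theta \in \mathbb{S}^{d_x-1},\; \phi \in \mathbb{S}^{d_y-1}}
I\!\left(\theta^{\top} X;\ \phi^{\top} Y\right).
```

For jointly Gaussian variables the supremum is attained at the leading canonical pair, which is the sense in which mSMI reduces back to CCA.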
Recent work has shown that repetition coding followed by interleaving induces signal structure that can be exploited to separate multiple co-channel user transmissions, without need for pilots or coordination/synchronization between the users. This is accomplished via a statistical learning technique known as canonical correlation analysis (CCA), which works even when the channels are time-varying. Previous analysis has established that it is possible to identify the user signals up to complex scaling in the noiseless case. This letter goes one important step further to show that CCA in fact yields the linear MMSE estimate of the user signals up to complex scaling, without using any explicit training. Instead, CCA relies only on the repetition and interleaving structure. This is particularly appealing in asynchronous ad-hoc and unlicensed setups, where tight user coordination is not practical.
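A toy numerical illustration of the underlying principle, that shared signals observed through two different unknown mixtures can be recovered by CCA up to scaling without any pilots, is sketched below. It is not the letter's signal model: there is no repetition coding, interleaving, or time variation, and the number of users, mixing matrices, noise level, and user powers are arbitrary choices.

```python
# Toy two-view illustration, not the letter's repetition/interleaving model:
# the same user symbol streams appear in two views through unknown mixtures,
# and classical CCA recovers each stream up to sign and scale.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 5000
S = rng.choice([-1.0, 1.0], size=(n, 2)) * np.array([2.0, 1.0])  # 2 users, distinct powers
A1 = np.linalg.qr(rng.normal(size=(6, 2)))[0]  # unknown mixing, view 1 (orthonormal columns)
A2 = np.linalg.qr(rng.normal(size=(6, 2)))[0]  # unknown mixing, view 2 (orthonormal columns)
Y1 = S @ A1.T + 0.5 * rng.normal(size=(n, 6))  # noisy observations, view 1
Y2 = S @ A2.T + 0.5 * rng.normal(size=(n, 6))  # noisy observations, view 2

Z1, Z2 = CCA(n_components=2).fit_transform(Y1, Y2)  # canonical scores of each view
# each canonical score should align (up to sign/scale) with exactly one user stream
print(np.round(np.corrcoef(S.T, Z1.T)[:2, 2:], 2))
```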
Multiview analysis aims to extract common information from data entities across different domains (e.g., acoustic, visual, text). Canonical correlation analysis (CCA) is one of the classic tools for this problem, which estimates the shared latent information via linear transformations of the different views of the data. CCA has also been generalized to the nonlinear regime, where kernel methods and neural networks are introduced to replace the linear transforms. While the theoretical aspects of linear CCA are relatively well understood, nonlinear multiview analysis is still largely intuition-driven. In this work, our interest lies in the identifiability of shared latent information under a nonlinear multiview analysis framework. We propose a model identification criterion for learning latent information from multiview data under a reasonable data-generating model. We show that minimizing this criterion leads to identification of the latent shared information up to certain indeterminacies. We also propose a neural network-based implementation and an efficient algorithm to realize the criterion. Our analysis is backed by experiments on both synthetic and real data.
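A common template for such criteria, shown here only as a generic deep CCA-style matching objective rather than the paper's exact criterion (which adds further conditions to obtain identifiability), is:

```latex
% generic nonlinear multiview matching objective with whitening constraints
\min_{f,\,g}\; \mathbb{E}\,\bigl\| f(X) - g(Y) \bigr\|_2^2
\quad \text{subject to} \quad
\mathbb{E}\bigl[f(X) f(X)^{\top}\bigr] = \mathbb{E}\bigl[g(Y) g(Y)^{\top}\bigr] = I,
\quad
\mathbb{E}\bigl[f(X)\bigr] = \mathbb{E}\bigl[g(Y)\bigr] = 0 .
```

Under the whitening constraints, minimizing the matching term is equivalent to maximizing the total correlation between the learned representations, and restricting f and g to linear maps recovers classical CCA.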
Canonical correlation analysis (CCA) has been essential in unsupervised multimodal/multiview latent representation learning and data fusion. Classic CCA extracts shared information from multiple modalities of data using linear transformations. In recent years, deep neural network-based nonlinear feature extractors were combined with CCA, giving rise to new variants known as the "DeepCCA" line of work. These approaches were shown to have enhanced performance in many applications. However, theoretical support for DeepCCA is often lacking. To address this challenge, the recent work of Lyu and Fu (2020) showed that, under a reasonable postnonlinear generative model, a carefully designed DeepCCA criterion provably removes unknown distortions in data generation and identifies the shared information across modalities. Nonetheless, a critical assumption used by Lyu and Fu (2020) for the identifiability analysis was that unlimited data is available, which is unrealistic. This brief paper puts forth a finite-sample analysis of the DeepCCA method of Lyu and Fu (2020). The main result is that the finite-sample version of the method can still estimate the shared information with guaranteed accuracy when the number of samples is sufficiently large. Our analytical approach is a nontrivial integration of statistical learning, numerical differentiation, and robust system identification, which may be of interest beyond the scope of DeepCCA and benefit other unsupervised learning paradigms.