Title: Sparse semiparametric canonical correlation analysis for data of mixed types
Summary: Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients.
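As a rough illustration of what "rank-based estimator without estimation of marginal transformation functions" means, the sketch below covers only the simplest case of two continuous variables under a Gaussian copula, where the latent correlation r and Kendall's tau satisfy the classical relation r = sin(pi/2 * tau). It is not the paper's estimator for truncated (zero-inflated) or binary variables, which requires different bridge functions. The function names and the simulated data are assumptions made for this example.

```python
import math
import random


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return total / (n * (n - 1) / 2)


def latent_correlation(x, y):
    """Continuous-continuous bridge function: r = sin(pi/2 * tau)."""
    return math.sin(math.pi / 2 * kendall_tau(x, y))


def simulate(n, rho, seed=0):
    """Draw latent Gaussian pairs with correlation rho, then apply
    monotone transformations that the estimator never needs to know."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in z1]
    x = [math.exp(a) for a in z1]   # hypothetical marginal transformation
    y = [b ** 3 for b in z2]        # hypothetical marginal transformation
    return z1, z2, x, y


z1, z2, x, y = simulate(n=400, rho=0.6)
print("estimate from transformed data:", latent_correlation(x, y))
print("estimate from latent data:     ", latent_correlation(z1, z2))
```

Because the estimate depends on the data only through ranks, the two printed values coincide: strictly increasing transformations of the margins leave Kendall's tau unchanged.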
Award ID(s):
1712943 1934904
NSF-PAR ID:
10160912
Journal Name:
Biometrika
ISSN:
0006-3444
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Gaussian process (GP) models have been extended to emulate expensive computer simulations with both qualitative/categorical and quantitative/continuous variables. Latent variable (LV) GP models, which have been recently developed to map each qualitative variable to some underlying numerical LVs, have strong physics‐based justification and have achieved promising performance. Two versions use LVs in Cartesian (LV‐Car) space and hyperspherical (LV‐sph) space, respectively. Despite their success, the effects of these different LV structures are still poorly understood. This article illuminates this issue with two contributions. First, we develop a theorem on the effect of the ranks of the qualitative factor correlation matrices of mixed‐variable GP models, from which we conclude that the LV‐sph model restricts the interactions between the input variables and thus restricts the types of response surface data with which the model can be consistent. Second, following a rank‐based perspective as in the theorem, we propose a new alternative model named LV‐mix that combines the LV‐based correlation structures from both LV‐Car and LV‐sph models to achieve better model flexibility than either. Through extensive case studies, we show that LV‐mix achieves higher average accuracy than the two existing models.

     
  2. The goal of the crime forecasting problem is to predict different types of crimes for each geographical region (like a neighborhood or census tract) in the near future. Since nearby regions usually have similar socioeconomic characteristics which indicate similar crime patterns, recent state-of-the-art solutions constructed a distance-based region graph and utilized Graph Neural Network (GNN) techniques for crime forecasting, because the GNN techniques could effectively exploit the latent relationships between neighboring region nodes in the graph if the edges reveal high dependency or correlation. However, this distance-based pre-defined graph cannot fully capture crime correlation between regions that are far from each other but share similar crime patterns. Hence, to make a more accurate crime prediction, the main challenge is to learn a better graph that reveals the dependencies between regions in crime occurrences and meanwhile captures the temporal patterns from historical crime records. To address these challenges, we propose an end-to-end graph convolutional recurrent network called HAGEN with several novel designs for crime prediction. Specifically, our framework could jointly capture the crime correlation between regions and the temporal crime dynamics by combining an adaptive region graph learning module with the Diffusion Convolution Gated Recurrent Unit (DCGRU). Based on the homophily assumption of GNN (i.e., graph convolution works better where neighboring nodes share the same label), we propose a homophily-aware constraint to regularize the optimization of the region graph so that neighboring region nodes on the learned graph share similar crime patterns, thus fitting the mechanism of diffusion convolution. Empirical experiments and comprehensive analysis on two real-world datasets showcase the effectiveness of HAGEN.
  3. Abstract

    The data-driven approach is emerging as a promising method for the topological design of the multiscale structure with greater efficiency. However, existing data-driven methods mostly focus on a single class of unit cells without considering multiple classes to accommodate spatially varying desired properties. The key challenge is the lack of inherent ordering or “distance” measure between different classes of unit cells in meeting a range of properties. To overcome this hurdle, we extend the newly developed latent-variable Gaussian process (LVGP) to create a multi-response LVGP (MRLVGP) for the unit cell libraries of metamaterials, taking both qualitative unit cell concepts and quantitative unit cell design variables as mixed-variable inputs. The MRLVGP embeds the mixed variables into a continuous design space based on their collective effect on the responses, providing substantial insights into the interplay between different geometrical classes and unit cell materials. With this model, we can easily obtain a continuous and differentiable transition between different unit cell concepts that can render gradient information for multiscale topology optimization. While the proposed approach has a broader impact on the concurrent topological and material design of engineered systems, we demonstrate its benefits through multiscale topology optimization with aperiodic unit cells. Design examples reveal that considering multiple unit cell types can lead to improved performance due to the consistent load-transfer paths for micro- and macrostructures.

     
  4. Abstract

    Copula is a popular method for modeling the dependence among marginal distributions in multivariate censored data. As many copula models are available, it is essential to check if the chosen copula model fits the data well for analysis. Existing approaches to testing the fitness of copula models are mainly for complete or right-censored data. No formal goodness-of-fit (GOF) test exists for interval-censored or recurrent events data. We develop a general GOF test for copula-based survival models using the information ratio (IR) to address this research gap. It can be applied to any copula family with a parametric form, such as the frequently used Archimedean, Gaussian, and D-vine families. The test statistic is easy to calculate, and the test procedure is straightforward to implement. We establish the asymptotic properties of the test statistic. The simulation results show that the proposed test controls the type-I error well and achieves adequate power when the dependence strength is moderate to high. Finally, we apply our method to test various copula models in analyzing multiple real datasets. Our method consistently separates different copula models for all these datasets in terms of model fitness.

     
  5. Abstract

    The joint analysis of spatial and temporal processes poses computational challenges due to the data's high dimensionality. Furthermore, such data are commonly non-Gaussian. In this paper, we introduce a copula-based spatiotemporal model for analyzing spatiotemporal data and propose a semiparametric estimator. The proposed algorithm is computationally simple, since it models the marginal distribution and the spatiotemporal dependence separately. Instead of assuming a parametric distribution, the proposed method models the marginal distributions nonparametrically and thus offers more flexibility. The method also provides a convenient way to construct both point and interval predictions at new times and locations, based on the estimated conditional quantiles. Through a simulation study and an analysis of wind speeds observed along the border between Oregon and Washington, we show that our method produces more accurate point and interval predictions for skewed data than those based on normality assumptions.

     