Abstract: One key challenge in single-cell data clustering is combining clustering results from data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and to produce an integrated result based on the notion of the Wasserstein barycenter. However, the exact barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to obtain; moreover, it may not be a GMM with a reasonable number of components. We therefore propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods.
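Because each GMM is summarized by its component weights, means, and covariances, the MAW distance reduces to a small discrete optimal-transport problem whose cost matrix holds pairwise squared 2-Wasserstein distances between Gaussian components, which is why the complexity is independent of data size. Below is a minimal sketch of that computation, assuming the closed-form Gaussian W2 cost and using the POT package's `ot.emd2` for the transport step; function and variable names are illustrative, not from the paper.

```python
# Sketch: MAW^2 between two GMMs = discrete optimal transport between the
# component weights, with pairwise costs given by closed-form squared
# 2-Wasserstein distances between Gaussian components (Bures metric).
import numpy as np
from scipy.linalg import sqrtm
import ot  # POT: Python Optimal Transport

def gaussian_w2_sq(m1, S1, m2, S2):
    """Closed-form squared W2 distance between N(m1, S1) and N(m2, S2)."""
    S2_half = np.real(sqrtm(S2))
    cross = np.real(sqrtm(S2_half @ S1 @ S2_half))
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures)

def maw_sq(w1, means1, covs1, w2, means2, covs2):
    """MAW^2: optimal transport over components with Gaussian W2^2 costs."""
    C = np.array([[gaussian_w2_sq(m1, S1, m2, S2)
                   for m2, S2 in zip(means2, covs2)]
                  for m1, S1 in zip(means1, covs1)])
    return ot.emd2(np.asarray(w1, float), np.asarray(w2, float), C)

# Example: two 2-component GMMs in R^2.
w = np.array([0.5, 0.5])
means_a = [np.zeros(2), np.ones(2)]
covs_a = [np.eye(2), 2.0 * np.eye(2)]
means_b = [np.full(2, 0.5), np.full(2, 1.5)]
covs_b = [np.eye(2), np.eye(2)]
print(maw_sq(w, means_a, covs_a, w, means_b, covs_b))
```

Note that the linear program above involves only a K1 x K2 cost matrix over mixture components, so its size never depends on the number of cells in the underlying data sets.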
Tensor envelope mixture model for simultaneous clustering and multiway dimension reduction
Abstract: In the form of multidimensional arrays, tensor data have become increasingly prevalent in modern scientific studies and biomedical applications such as computational biology, brain imaging analysis, and process monitoring systems. These data are intrinsically heterogeneous, with complex dependencies and structure. Ad hoc dimension reduction methods on tensor data may therefore lack statistical efficiency and can obscure essential findings. Model-based clustering is a cornerstone of multivariate statistics and unsupervised learning; however, existing methods and algorithms are not designed for tensor-variate samples. In this article, we propose a tensor envelope mixture model (TEMM) for simultaneous clustering and multiway dimension reduction of tensor data. TEMM incorporates tensor-structure-preserving dimension reduction into mixture modeling and drastically reduces the number of free parameters and estimation variability. An expectation-maximization-type algorithm is developed to obtain likelihood-based estimators of the cluster means and covariances, which are jointly parameterized and constrained onto a series of lower-dimensional subspaces known as the tensor envelopes. We demonstrate the encouraging empirical performance of the proposed method in extensive simulation studies and a real data application, in comparison with existing vector and tensor clustering methods.
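To make the "reduce and cluster jointly" idea concrete, here is a deliberately simplified sketch: mode-wise projections (fixed and random here, whereas TEMM estimates the envelope bases jointly within its EM-type algorithm) compress each tensor sample, and a standard Gaussian mixture is fit to the vectorized cores. All dimensions and names are illustrative assumptions, not the paper's estimator.

```python
# Simplified sketch of multiway reduction + mixture clustering; TEMM instead
# estimates the projection subspaces (tensor envelopes) jointly with the
# mixture parameters inside an EM-type algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, p1, p2, u1, u2 = 200, 20, 15, 3, 2
X = rng.normal(size=(n, p1, p2))                 # tensor-variate samples
U1 = np.linalg.qr(rng.normal(size=(p1, u1)))[0]  # mode-1 basis (fixed here)
U2 = np.linalg.qr(rng.normal(size=(p2, u2)))[0]  # mode-2 basis (fixed here)

# Tucker-style multilinear reduction: core_i = U1' X_i U2 for each sample.
cores = np.einsum('pi,npq,qj->nij', U1, X, U2).reshape(n, -1)

# Model-based clustering in the reduced space.
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(cores)
```

Working with u1 x u2 cores rather than p1 x p2 tensors is what drives the reduction in free parameters; the envelope constraint in TEMM additionally ties the projections to the mixture likelihood.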
- PAR ID: 10397033
- Publisher / Repository: Oxford University Press
- Journal Name: Biometrics
- Volume: 78
- Issue: 3
- ISSN: 0006-341X
- Pages: 1067-1079
- Sponsoring Org: National Science Foundation
More Like this
This paper studies the prediction task of tensor-on-tensor regression, in which both covariates and responses are multidimensional arrays (i.e., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focus on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employ black-box deep learning algorithms that fail to exploit the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin by summarizing and extracting useful predictive information (represented by the "factor tensor") from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as the input to a temporal convolutional neural network. The proposed methods effectively handle nonlinearity between complex data structures and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our methods exploit the underlying latent factor structure to enhance prediction and, at the same time, drastically reduce the data dimensionality, which speeds up computation. The empirical performance of our methods is demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods.
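A hedged sketch of the two-stage idea follows: a truncated HOSVD stands in for the paper's tensor factor model to produce low-dimensional factor tensors, which then feed a small temporal convolutional stack. Ranks, shapes, and the network are illustrative assumptions, not FATTNN's actual architecture.

```python
# Stage 1: compress tensor covariates to factor tensors (HOSVD stand-in).
# Stage 2: run a minimal (non-causal) temporal convolution over the factors.
import numpy as np
import torch
import torch.nn as nn

def hosvd_factors(X, r1, r2):
    """Truncated HOSVD of samples X with shape (T, p1, p2) -> (T, r1, r2)."""
    T, p1, p2 = X.shape
    U1 = np.linalg.svd(X.transpose(1, 0, 2).reshape(p1, -1),
                       full_matrices=False)[0][:, :r1]
    U2 = np.linalg.svd(X.transpose(2, 0, 1).reshape(p2, -1),
                       full_matrices=False)[0][:, :r2]
    return np.einsum('pi,tpq,qj->tij', U1, X, U2)

T, p1, p2, r1, r2 = 50, 30, 20, 4, 3
series = np.random.default_rng(1).normal(size=(T, p1, p2))
F = hosvd_factors(series, r1, r2).reshape(T, r1 * r2)  # (time, features)

tcn = nn.Sequential(             # factor channels -> scalar response path
    nn.Conv1d(r1 * r2, 16, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=1),
)
pred = tcn(torch.tensor(F.T[None], dtype=torch.float32))  # shape (1, 1, T)
```

The network sees r1*r2 = 12 channels instead of p1*p2 = 600 raw entries per time step, which illustrates where the computational savings come from.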
Imaging data-based prognostic models focus on using an asset's degradation images to predict its time to failure (TTF). Most image-based prognostic models have two common limitations. First, they require degradation images to be complete (i.e., images are observed continuously and regularly over time). Second, they usually employ an unsupervised dimension reduction method to extract low-dimensional features and then use those features for TTF prediction. Because unsupervised dimension reduction is conducted on the degradation images without the involvement of TTFs, there is no guarantee that the extracted features are effective for failure time prediction. To address these challenges, this article develops a supervised tensor dimension reduction-based prognostic model. The model first proposes a supervised dimension reduction method for tensor data: it uses historical TTFs to guide the detection of a tensor subspace for extracting low-dimensional features from high-dimensional, incomplete degradation imaging data. Next, the extracted features are used to construct a prognostic model based on (log)-location-scale regression. An optimization algorithm for parameter estimation is proposed, and analytical solutions are discussed. Simulated data and a real-world data set are used to validate the performance of the proposed model. History: Bianca Maria Colosimo served as the senior editor for this article. Funding: This work was supported by the National Science Foundation [2229245]. Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://github.com/czhou9/Code-and-Data-for-IJDS and in the e-companion to this article (available at https://doi.org/10.1287/ijds.2022.x022).
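As a concrete illustration of the second stage only, here is a minimal sketch assuming a lognormal location-scale model, log(TTF) = x'beta + sigma*eps with standard normal eps, so the location parameters fit by least squares on log failure times. The features are simulated stand-ins for what the paper extracts from degradation images via its TTF-supervised tensor subspace.

```python
# Minimal (log)-location-scale regression sketch: lognormal TTFs, so
# ordinary least squares on log(TTF) estimates the location parameters.
import numpy as np

rng = np.random.default_rng(2)
n, d = 120, 5
Z = rng.normal(size=(n, d))                   # stand-in low-dim features
ttf = np.exp(Z @ rng.normal(size=d) + 0.3 * rng.normal(size=n))

Xd = np.column_stack([np.ones(n), Z])         # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Xd, np.log(ttf), rcond=None)
resid = np.log(ttf) - Xd @ beta_hat
sigma_hat = resid.std(ddof=Xd.shape[1])       # scale estimate

# Median-TTF prediction for a new unit under the fitted lognormal model.
z_new = np.concatenate([[1.0], rng.normal(size=d)])
ttf_median = np.exp(z_new @ beta_hat)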
Summary: With the rapid development of techniques to measure brain activity and structure, statistical methods for analyzing modern brain-imaging data play an important role in the advancement of science. Imaging data that measure brain function are usually multivariate high-density longitudinal data and are heterogeneous across both imaging sources and subjects, which leads to various statistical and computational challenges. In this article, we propose a group-based method to cluster a collection of multivariate high-density longitudinal data via a Bayesian mixture of smoothing splines. Our method assumes each multivariate high-density longitudinal trajectory is a mixture of multiple components with different mixing weights. Time-independent covariates are assumed to be associated with the mixture components and are incorporated via the logistic weights of a mixture-of-experts model. We formulate this approach under a fully Bayesian framework using Gibbs sampling, where the number of components is selected based on a deviance information criterion. The proposed method is compared to existing methods via simulation studies and is applied to a study on functional near-infrared spectroscopy that aims to understand infant emotional reactivity and recovery from stress. The results reveal distinct patterns of brain activity, as well as associations between these patterns and selected covariates.
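The model's two building blocks can be sketched directly, leaving the Gibbs sampler aside: a spline basis represents each component's mean curve, and covariate-dependent softmax ("logistic") weights give the mixture-of-experts mixing probabilities. The cubic truncated-power basis and all parameter values below are illustrative assumptions.

```python
# Building blocks only: spline design matrix + mixture-of-experts weights.
import numpy as np

def spline_basis(t, knots):
    """Cubic truncated-power spline basis evaluated at times t."""
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def moe_weights(x, Gamma):
    """Mixture-of-experts weights: softmax of linear scores in covariates x."""
    scores = Gamma @ x                        # one row of Gamma per component
    e = np.exp(scores - scores.max())         # stabilized softmax
    return e / e.sum()

t = np.linspace(0.0, 1.0, 100)
B = spline_basis(t, knots=np.linspace(0.1, 0.9, 8))   # (100, 12) design
rng = np.random.default_rng(3)
Gamma = rng.normal(size=(3, 2))               # 3 components, 2 covariates
w = moe_weights(np.array([1.0, 0.5]), Gamma)  # one subject's mixing weights
curve = B @ rng.normal(size=B.shape[1])       # one component's mean curve
```

In the full model, each subject's trajectory is generated from one of the component curves, with component membership governed by that subject's covariate-driven weights w.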
Tackling High-Dimensional Tensor Clustering: In the paper "Jointly Modeling and Clustering Tensors in High Dimensions," Cai, Zhang, and Sun address the challenge of jointly modeling and clustering tensors by introducing a high-dimensional tensor mixture model with heterogeneous covariances. The proposed mixture model exploits the intrinsic structure of tensor data. The authors develop a computationally efficient high-dimensional expectation conditional maximization (HECM) algorithm and show that the HECM iterates, with an appropriate initialization, converge geometrically to a neighborhood that is within statistical precision of the true parameter. The theoretical analysis is nontrivial because of the dual nonconvexity arising from both the expectation-maximization-type estimation and the nonconvex objective function in the M-step. They also study the convergence rate of the algorithm when the number of clusters is overspecified and when the signal-to-noise ratio diminishes with the sample size. The efficacy of the proposed method is demonstrated through numerical experiments and a real-world medical data application.
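For readers unfamiliar with the ECM idea, the sketch below shows one generic iteration for a Gaussian mixture, with the M-step split into two conditional-maximization blocks; the paper's HECM additionally imposes tensor structure and high-dimensional regularization, which are omitted here.

```python
# Generic ECM iteration for a Gaussian mixture: E-step, then two
# conditional M-steps (weights/means, then covariances given new means).
import numpy as np
from scipy.stats import multivariate_normal

def ecm_step(X, pis, mus, covs):
    n, K = X.shape[0], len(pis)
    # E-step: posterior responsibilities under current parameters.
    R = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                         for k in range(K)])
    R /= R.sum(axis=1, keepdims=True)
    Nk = R.sum(axis=0)
    # CM-step 1: update weights and means, covariances held fixed.
    pis = Nk / n
    mus = [(R[:, k] @ X) / Nk[k] for k in range(K)]
    # CM-step 2: update covariances given the new means (ridge for stability).
    covs = [((R[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
            + 1e-6 * np.eye(X.shape[1]) for k in range(K)]
    return pis, mus, covs

# One iteration on toy data from a random start.
X = np.random.default_rng(4).normal(size=(300, 2))
pis, mus, covs = ecm_step(X, np.array([0.5, 0.5]),
                          [X[0], X[1]], [np.eye(2)] * 2)
```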