Title: An analysis of classical multidimensional scaling with applications to clustering
Abstract: Classical multidimensional scaling is a widely used dimension reduction technique. Yet few theoretical results characterizing its statistical performance exist. This paper provides a theoretical framework for analyzing the quality of embedded samples produced by classical multidimensional scaling. This lays a foundation for various downstream statistical analyses, and we focus on clustering noisy data. Our results provide scaling conditions on the signal-to-noise ratio under which classical multidimensional scaling followed by a distance-based clustering algorithm can recover the cluster labels of all samples. Simulation studies confirm these scaling conditions are sharp. Applications to the cancer gene-expression data, the single-cell RNA sequencing data and the natural language data lend strong support to the methodology and theory.
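As a rough illustration of the pipeline the abstract describes (classical multidimensional scaling followed by a distance-based clustering step), here is a minimal NumPy sketch. It is not the authors' code. The toy data, the helper name `pairwise_sq_dists`, and the sign-based split standing in for a real clustering algorithm are all assumptions made for this example.

```python
import numpy as np


def classical_mds(sq_dists: np.ndarray, n_components: int) -> np.ndarray:
    """Embed n points from an (n, n) matrix of squared pairwise distances."""
    n = sq_dists.shape[0]
    # Centering matrix J = I - (1/n) * 1 1^T
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered Gram matrix B = -1/2 * J D^2 J
    gram = -0.5 * centering @ sq_dists @ centering
    # eigh handles symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues caused by noise or rounding
    scales = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scales


def pairwise_sq_dists(points: np.ndarray) -> np.ndarray:
    """Hypothetical helper: squared Euclidean distances between rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sum(diffs ** 2, axis=-1)


# Toy data (an assumption for illustration): two noisy clusters in 50 dimensions
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.zeros((2, 50))
centers[1, 0] = 10.0
points = centers[labels] + rng.normal(scale=0.5, size=(40, 50))

embedding = classical_mds(pairwise_sq_dists(points), n_components=2)

# Stand-in for a distance-based clustering algorithm: with two balanced,
# well-separated clusters, the sign of the first coordinate splits them.
predicted = (embedding[:, 0] > 0).astype(int)
```

The predicted labels are only defined up to a swap of the two cluster names, because the sign of each eigenvector is arbitrary. In practice the final step would be replaced by a proper clustering routine applied to `embedding`.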
Award ID(s):
2131292 2136198
PAR ID:
10374348
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Information and Inference: A Journal of the IMA
Volume:
12
Issue:
1
ISSN:
2049-8772
Page Range / eLocation ID:
p. 72-112
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplementary materials for this article are available online. 
  2. In the form of multidimensional arrays, tensor data have become increasingly prevalent in modern scientific studies and biomedical applications such as computational biology, brain imaging analysis, and process monitoring systems. These data are intrinsically heterogeneous with complex dependencies and structure. Therefore, ad‐hoc dimension reduction methods on tensor data may lack statistical efficiency and can obscure essential findings. Model‐based clustering is a cornerstone of multivariate statistics and unsupervised learning; however, existing methods and algorithms are not designed for tensor‐variate samples. In this article, we propose a tensor envelope mixture model (TEMM) for simultaneous clustering and multiway dimension reduction of tensor data. TEMM incorporates tensor‐structure‐preserving dimension reduction into mixture modeling and drastically reduces the number of free parameters and estimative variability. An expectation‐maximization‐type algorithm is developed to obtain likelihood‐based estimators of the cluster means and covariances, which are jointly parameterized and constrained onto a series of lower dimensional subspaces known as the tensor envelopes. We demonstrate the encouraging empirical performance of the proposed method in extensive simulation studies and a real data application in comparison with existing vector and tensor clustering methods.
  3. After decades, the theoretical study of core-collapse supernova explosions is moving from parameterized, spherically symmetric models to increasingly realistic multidimensional simulations. However, obtaining nucleosynthesis yields based on such multidimensional core-collapse supernova simulations is not straightforward. Frequently, tracer particles are employed. Tracer particles may be tracked in situ during the simulation, but often they are reconstructed in a post-processing step based on the information saved during the hydrodynamic simulation. Reconstruction can be done in a number of ways, and here we compare the approaches of backward and forward integration of the equations of motion to the results based on inline particle trajectories. We find that both methods agree reasonably well with the inline results for isotopes for which a large number of particles contribute. However, for rarer isotopes that are produced only by a small number of particle trajectories, deviations can be large. For our setup, we find that backward integration leads to better agreement with the inline particles by more accurately reproducing the conditions following freeze-out from nuclear statistical equilibrium, because the establishment of nuclear statistical equilibrium erases the need for detailed trajectories at earlier times. Based on our results, if inline tracers are unavailable, we recommend backward reconstruction to the point when nuclear statistical equilibrium was last applied, with an interval between simulation snapshots of at most 1 ms for nucleosynthesis post-processing.
  4. Zhang, Shihua (Ed.)
    Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. To address these challenges, we developed a novel analysis framework, Single-Cell Path Metrics Profiling (scPMP), using power-weighted path metrics, which measure distances between cells in a data-driven way. Unlike Euclidean distance and other commonly used distance metrics, path metrics are density sensitive and respect the underlying data geometry. By combining path metrics with multidimensional scaling, a low dimensional embedding of the data is obtained which preserves both the global data geometry and cluster structure. We evaluate the method both for clustering quality and geometric fidelity, and it outperforms current scRNAseq clustering algorithms on a wide range of benchmarking data sets.
  5. Morphometric analyses of male genitalia are routinely used to distinguish genera and species in beetles, butterflies, and flies, but are rarely used in ants, where most morphometric analyses focus on the external morphology of the worker caste. In this work, we performed linear morphometric analysis of the male genitalia to distinguish Monomorium and Syllophopsis in Madagascar. For 80 specimens, we measured 10 morphometric characters, especially on the paramere, volsella, and penisvalvae. Three datasets were made from linear measurements: mean (raw data), the ratios of characters (ratio data), and the Removal of Allometric Variance (RAV data). The following quantitative methods were applied to these datasets: hierarchical clustering (Ward’s method), unconstrained ordination methods including Principal Component Analysis (PCA), Non-Metric Multidimensional Scaling analyses (NMDS), Linear Discriminant Analysis (LDA), and Conditional Inference Trees (CITs). The results from statistical analysis show that the ratios proved to be the most effective approach for genus-level differentiation. However, the RAV method exhibited overlap between the genera. Meanwhile, the raw data facilitated more nuanced distinctions at the species level compared with the ratios and RAV approaches. The CITs revealed that the ratios of denticle length of the valviceps (SeL) to the paramere height (PaH) effectively distinguished between genera and identified key variables for species-level differentiation. Overall, this study shows that linear morphometric analysis of male genitalia is a useful data source for taxonomic delimitation. 