skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Monitoring the shape of weather, soundscapes, and dynamical systems: a new statistic for dimension-driven data analysis on large datasets
Dimensionality-reduction methods are a fundamental tool in the analysis of large datasets. These algorithms work on the assumption that the "intrinsic dimension" of the data is generally much smaller than the ambient dimension in which it is collected. Alongside their usual purpose of mapping data into a smaller-dimensional space with minimal information loss, dimensionality-reduction techniques implicitly or explicitly provide information about the dimension of the dataset.In this paper, we propose a new statistic that we call the kappa-profile for analysis of large datasets. The kappa-profile arises from a dimensionality-reduction optimization problem: namely that of finding a projection that optimally preserves the secants between points in the dataset. From this optimal projection we extract kappa, the norm of the shortest projected secant from among the set of all normalized secants. This kappa can be computed for any dimension k; thus the tuple of kappa values (indexed by dimension) becomes a kappa-profile. Algorithms such as the Secant-Avoidance Projection algorithm and the Hierarchical Secant-Avoidance Projection algorithm provide a computationally feasible means of estimating the kappa-profile for large datasets, and thus a method of understanding and monitoring their behavior. As we demonstrate in this paper, the kappa-profile serves as a useful statistic in several representative settings: weather data, soundscape data, and dynamical systems data.  more » « less
Award ID(s):
1633830
PAR ID:
10099074
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2018 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
1045 to 1051
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A fundamental question in many data analysis settings is the problem of discerning the “natural” dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of Secant-Avoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, one may infer the dimension of the manifold of origination. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lower-dimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with T points has O(T 2 ) secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. That is, we propose a hierarchical secant-based dimensionality-reduction method, which can be employed for data sets where explicitly calculating all secants is not feasible. 
    more » « less
  2. A fundamental question in many data analysis settings is the problem of discerning the ``natural'' dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of Secant-Avoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, one may infer the dimension of the manifold of origination. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lower-dimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with $$n$$ points has $n(n-1)/2$ secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. That is, we propose a hierarchical secant-based dimensionality-reduction method, which can be employed for data sets where explicitly calculating all secants is not feasible. 
    more » « less
  3. Dimensionality-reduction techniques are a fundamental tool for extracting useful information from high-dimensional data sets. Because secant sets encode manifold geometry, they are a useful tool for designing meaningful data-reduction algorithms. In one such approach, the goal is to construct a projection that maximally avoids secant directions and hence ensures that distinct data points are not mapped too close together in the reduced space. This type of algorithm is based on a mathematical framework inspired by the constructive proof of Whitney's embedding theorem from differential topology. Computing all (unit) secants for a set of points is by nature computationally expensive, thus opening the door for exploitation of GPU architecture for achieving fast versions of these algorithms. We present a polynomial-time data-reduction algorithm that produces a meaningful low-dimensional representation of a data set by iteratively constructing improved projections within the framework described above. Key to our algorithm design and implementation is the use of GPUs which, among other things, minimizes the computational time required for the calculation of all secant lines. One goal of this report is to share ideas with GPU experts and to discuss a class of mathematical algorithms that may be of interest to the broader GPU community. 
    more » « less
  4. Abstract Principal Component Analysis (PCA) has long been a cornerstone in dimensionality reduction for high-dimensional data, including single-cell RNA sequencing (scRNA-seq). However, PCA’s performance typically degrades with increasing data size, can be sensitive to outliers, and assumes linearity. Recently, Random Projection (RP) methods have emerged as promising alternatives, addressing some of these limitations. This study systematically and comprehensively evaluates PCA and RP approaches, including Singular Value Decomposition (SVD) and randomized SVD, alongside Sparse and Gaussian Random Projection algorithms, with a focus on computational efficiency and downstream analysis effectiveness. We benchmark performance using multiple scRNA-seq datasets including labeled and unlabeled publicly available datasets. We apply Hierarchical Clustering and Spherical K-Means clustering algorithms to assess downstream clustering quality. For labeled datasets, clustering accuracy is measured using the Hungarian algorithm and Mutual Information. For unlabeled datasets, the Dunn Index and Gap Statistic capture cluster separation. Across both dataset types, the Within-Cluster Sum of Squares (WCSS) metric is used to assess variability. Additionally, locality preservation is examined, with RP outperforming PCA in several of the evaluated metrics. Our results demonstrate that RP not only surpasses PCA in computational speed but also rivals and, in some cases, exceeds PCA in preserving data variability and clustering quality. By providing a thorough benchmarking of PCA and RP methods, this work offers valuable insights into selecting optimal dimensionality reduction techniques, balancing computational performance, scalability, and the quality of downstream analyses. 
    more » « less
  5. Principal component analysis (PCA) is one of the most frequently used dimensionality reduction methods for high-dimensional datasets, especially single-cell RNA sequencing (scRNA-seq). Despite its popularity, PCA faces challenges, particularly related to its performance degrading as the dataset size increases. Additionally, PCA is sensitive to outliers and assumes linearity. Random projection (RP) methods have emerged as a promising alternative to address several of PCA’s limitations. In this study, we conduct a systematic and comprehensive evaluation of PCA and RP methods, including singular value decomposition (SVD) and randomized SVD approaches, against multiple RP methods including sparse random projection, Gaussian random projection, and we introduce a Matching Sparsity Random Projection algorithm that adaptively calibrates projection matrix density according to input data sparsity patterns, emphasizing both computational scalability and effectiveness in downstream analytical tasks. We evaluated these methods on multiple publicly available scRNA-seq datasets that include both labeled and unlabeled scenarios. Clustering performance is assessed using Hierarchical Clustering and Spherical K-Means algorithms, with labeled datasets evaluated through Hungarian algorithm accuracy and Mutual Information metrics. For unlabeled datasets, we used the Dunn Index and Gap Statistic to quantify cluster separation quality. Across both dataset types, the Within-Cluster Sum of Squares metric is used to assess variability. Moreover, locality preservation is examined, with RP methods, including our adaptive sparsity approach, outperforming PCA in several of the evaluated metrics. Our experimental results show that RP methods not only deliver substantial computational speed improvements over PCA but also rival, and in some cases, exceed PCA in preserving data variability and clustering quality. Through this comprehensive methodological comparison, our work provides critical guidance for selecting appropriate dimensionality reduction strategies that effectively balance computational demands, scalability requirements, and analytical quality in downstream analyses. 
    more » « less