A fundamental question in many data analysis settings is the problem of discerning the “natural” dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of SecantAvoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, one may infer the dimension of the manifold of origination. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lowerdimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with T points has O(T 2 ) secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. That is, we propose a hierarchical secantbased dimensionalityreduction method, which can be employed for data sets where explicitly calculating all secants is not feasible.
more »
« less
Too many secants: a hierarchical approach to secantbased dimensionality reduction on large data sets
A fundamental question in many data analysis settings is the problem of discerning the ``natural'' dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of SecantAvoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, one may infer the dimension of the manifold of origination. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lowerdimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with $n$ points has $n(n1)/2$ secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. That is, we propose a hierarchical secantbased dimensionalityreduction method, which can be employed for data sets where explicitly calculating all secants is not feasible.
more »
« less
 Award ID(s):
 1633830
 NSFPAR ID:
 10064960
 Date Published:
 Journal Name:
 2018 IEEE High Performance Extreme Computing Conference (HPEC ‘18)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Dimensionalityreduction techniques are a fundamental tool for extracting useful information from highdimensional data sets. Because secant sets encode manifold geometry, they are a useful tool for designing meaningful datareduction algorithms. In one such approach, the goal is to construct a projection that maximally avoids secant directions and hence ensures that distinct data points are not mapped too close together in the reduced space. This type of algorithm is based on a mathematical framework inspired by the constructive proof of Whitney's embedding theorem from differential topology. Computing all (unit) secants for a set of points is by nature computationally expensive, thus opening the door for exploitation of GPU architecture for achieving fast versions of these algorithms. We present a polynomialtime datareduction algorithm that produces a meaningful lowdimensional representation of a data set by iteratively constructing improved projections within the framework described above. Key to our algorithm design and implementation is the use of GPUs which, among other things, minimizes the computational time required for the calculation of all secant lines. One goal of this report is to share ideas with GPU experts and to discuss a class of mathematical algorithms that may be of interest to the broader GPU community.more » « less

Dimensionalityreduction methods are a fundamental tool in the analysis of large datasets. These algorithms work on the assumption that the "intrinsic dimension" of the data is generally much smaller than the ambient dimension in which it is collected. Alongside their usual purpose of mapping data into a smallerdimensional space with minimal information loss, dimensionalityreduction techniques implicitly or explicitly provide information about the dimension of the dataset.In this paper, we propose a new statistic that we call the kappaprofile for analysis of large datasets. The kappaprofile arises from a dimensionalityreduction optimization problem: namely that of finding a projection that optimally preserves the secants between points in the dataset. From this optimal projection we extract kappa, the norm of the shortest projected secant from among the set of all normalized secants. This kappa can be computed for any dimension k; thus the tuple of kappa values (indexed by dimension) becomes a kappaprofile. Algorithms such as the SecantAvoidance Projection algorithm and the Hierarchical SecantAvoidance Projection algorithm provide a computationally feasible means of estimating the kappaprofile for large datasets, and thus a method of understanding and monitoring their behavior. As we demonstrate in this paper, the kappaprofile serves as a useful statistic in several representative settings: weather data, soundscape data, and dynamical systems data.more » « less

null (Ed.)Graphs model realworld circumstances in many applications where they may constantly change to capture the dynamic behavior of the phenomena. Topological persistence which provides a set of birth and death pairs for the topological features is one instrument for analyzing such changing graph data. However, standard persistent homology defined over a growing space cannot always capture such a dynamic process unless shrinking with deletions is also allowed. Hence, zigzag persistence which incorporates both insertions and deletions of simplices is more appropriate in such a setting. Unlike standard persistence which admits nearly lineartime algorithms for graphs, such results for the zigzag version improving the general O(m^ω) time complexity are not known, where ω < 2.37286 is the matrix multiplication exponent. In this paper, we propose algorithms for zigzag persistence on graphs which run in nearlinear time. Specifically, given a filtration with m additions and deletions on a graph with n vertices and edges, the algorithm for 0dimension runs in O(mlog² n+mlog m) time and the algorithm for 1dimension runs in O(mlog⁴ n) time. The algorithm for 0dimension draws upon another algorithm designed originally for pairing critical points of Morse functions on 2manifolds. The algorithm for 1dimension pairs a negative edge with the earliest positive edge so that a 1cycle containing both edges resides in all intermediate graphs. Both algorithms achieve the claimed time complexity via dynamic graph data structures proposed by Holm et al. In the end, using Alexander duality, we extend the algorithm for 0dimension to compute the (p1)dimensional zigzag persistence for ℝ^pembedded complexes in O(mlog² n+mlog m+nlog n) time.more » « less

In this paper, we propose new techniques for solving geometric optimization problems involving interpoint distances of a point set in the plane. Given a set P of n points in the plane and an integer 1 ≤ k ≤ binom(n,2), the distance selection problem is to find the kth smallest interpoint distance among all pairs of points of P. The previously best deterministic algorithm solves the problem in O(n^{4/3} log² n) time [Katz and Sharir, 1997]. In this paper, we improve their algorithm to O(n^{4/3} log n) time. Using similar techniques, we also give improved algorithms on both the twosided and the onesided discrete Fréchet distance with shortcuts problem for two point sets in the plane. For the twosided problem (resp., onesided problem), we improve the previous work [Avraham, Filtser, Kaplan, Katz, and Sharir, 2015] by a factor of roughly log²(m+n) (resp., (m+n)^ε), where m and n are the sizes of the two input point sets, respectively. Other problems whose solutions can be improved by our techniques include the reverse shortest path problems for unitdisk graphs. Our techniques are quite general and we believe they will find many other applications in future.more » « less