Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Abstract Persistent homology is a computationally intensive and yet extremely powerful tool for Topological Data Analysis. Applying the tool on potentially infinite sequence of data objects is a challenging task. For this reason, persistent homology and data stream mining have long been two important but disjoint areas of data science. The first computational model, that was recently introduced to bridge the gap between the two areas, is useful for detecting steady or gradual changes in data streams, such as certain genomic modifications during the evolution of species. However, that model is not suitable for applications that encounter abrupt changes of extremely short duration. This paper presents another model for computing persistent homology on streaming data that addresses the shortcoming of the previous work. The model is validated on the important real‐world application of network anomaly detection. It is shown that in addition to detecting the occurrence of anomalies or attacks in computer networks, the proposed model is able to visually identify several types of traffic. Moreover, the model can accurately detect abrupt changes of extremely short as well as longer duration in the network traffic. These capabilities are not achievable by the previous model or by traditional data mining techniques.more » « lessFree, publiclyaccessible full text available July 1, 2024

Topological Data Analysis (TDA) utilizes concepts from topology to analyze data. In general, TDA considers objects similar based on a topological invariant. Topological invariants are properties of the topological space that are homeomorphic; resilient to deformation in the space. The EulerPoincaré Characteristic is a classic topological invariant that represents the alternating sum of the vertices, edges, faces, and higherorder cells of a closed surface. Tracking the Euler characteristic over a topological filtration produces an Euler Characteristic Curve (ECC). This study introduces a computational technique to determine the ECC of R2 or R3 data; the technique generalizes to higher dimensions. This technique separates landscapes of lowerorder homologies utilizing triangulations of the space.more » « less

Persistent Homology (PH) is computationally expensive and is thus generally employed with strict limits on the (i) maximum connectivity distance and (ii) dimensions of homology groups to compute (unless working with trivially small data sets). As a result, most studies with PH only work with H0 and H1 homology groups. This paper examines the identification and isolation of regions of data sets where high dimensional topological features are suspected to be located. These regions are analyzed with PH to characterize the high dimensional homology groups contained in that region. Since only the region around a suspected topological feature is analyzed, it is possible to identify high dimension homologies piecewise and then assemble the results into a scalable characterization of the original data set.more » « less

Topological Data Analysis is a machine learning method that summarizes the topological features of a space. Persistent Homology (PH) can identify these topological features as they persist within a point cloud; persisting in respect to the connectedness of the point cloud at increasing distances. The utility of PH is apparent in several fields including bioinformatics, network security, and object classification. However, the memory complexity of PH limits the application to relatively small point clouds for lowdimensional topological feature identification. For this reason, numerous approaches to optimize and approximate the PH have been introduced for providing results over large point clouds. One solution, Partitioned Persistent Homology (PPH), has shown favorable approximation on a single node with significant performance improvement. However, the singlenode approach is limited by the available system memory, leading to the need for a distributed approach for additional (especially memory) resources. This paper studies a distributed version of PPH for use with large point clouds over a highperformance compute cluster. Experimental results of the distributed algorithm against previous studies is presented along with scalability of the distributed library.more » « less

An emerging method for data analysis is called Topological Data Analysis (TDA). TDA is based in the mathematical field of topology and examines the properties of spaces under continuous deformation. One of the key tools used for TDA is called persistent homology which considers the connectivity of points in a ddimensional point cloud at different spatial resolutions to identify topological properties (holes, loops, and voids) in the space. Persistent homology then classifies the topological features by their persistence through the range of spatial connectivity. Unfortunately the memory and runtime complexity of computing persistent homology is exponential and current tools can only process a few thousand points in R3. Fortunately, the use of data reduction techniques enables persistent homology to be applied to much larger point clouds. Techniques to reduce the data range from random sampling of points to clustering the data and using the cluster centroids as the reduced data. While several data reduction approaches appear to preserve the large topological features present in the original point cloud, no systematic study comparing the efficacy of different data clustering techniques in preserving the persistent homology results has been performed. This paper explores the question of topology preserving data reductions and describes formally when and how topological features can be mischaracterized or lost by data reduction techniques. The paper also performs an experimental assessment of data reduction techniques and resilient effects on the persistent homology. In particular, data reduction by random selection is compared to cluster centroids extracted from different data clustering algorithms.more » « less

This paper introduces a framework to compute persistent homology, a principal tool in Topological Data Analysis, on potentially unbounded and evolving data streams. The framework is organized into online and offline components. The online element maintains a summary of the data that preserves the topological structure of the stream. The offline component computes the persistence intervals from the data captured by the summary. The framework is applied to the detection of horizontal or reticulate genomic exchanges during the evolution of species that cannot be identified by phylogenetic inference or traditional data mining. The method effectively detects reticulate evolution that occurs through reassortment and recombination in large streams of genomic sequences of Influenza and HIV viruses.more » « less

Persistent homology is a method of data analysis that is based in the mathematical field of topology. Unfortunately, the runtime and memory complexities associated with computing persistent homology inhibit general use for the analysis of big data. For example, the best tools currently available to compute persistent homology can process only a few thousand data points in R^3. Several studies have proposed using sampling or data reduction methods to attack this limit. While these approaches enable the computation of persistent homology on much larger data sets, the methods are approximate. Furthermore, while they largely preserve the results of large topological features, they generally miss reporting information about the small topological features that are present in the data set. While this abstraction is useful in many cases, there are data analysis needs where the smaller features are also significant (e.g., brain artery analysis). This paper explores a combination of data reduction and data partitioning to compute persistent homology on big data that enables the identification of both large and small topological features from the input data set. To reduce the approximation errors that typically accompany data reduction for persistent homology, the described method also includes a mechanism of ``upscaling'' the data circumscribing the large topological features that are computed from the sampled data. The designed experimental method provides significant results for improving the scale at which persistent homology can be performedmore » « less

Persistent homology is used for computing topological features of a space at different spatial resolutions. It is one of the main tools from computational topology that is applied to the problems of data analysis. Despite several attempts to reduce its complexity, persistent homology remains expensive in both time and space. These limits are such that the largest data sets to which the method can be applied have the number of points of the order of thousands in R^3. This paper explores a technique intended to reduce the number of data points while preserving the salient topological features of the data. The proposed technique enables the computation of persistent homology on a reduced version of the original input data without affecting significant components of the output. Since the run time of persistent homology is exponential in the number of data points, the proposed data reduction method facilitates the computation in a fraction of the time required for the original data. Moreover, the data reduction method can be combined with any existing technique that simplifies the computation of persistent homology. The data reduction is performed by creating small groups of \emph{similar} data points, called nanoclusters, and then replacing the points within each nanocluster with its cluster center. The persistence homology of the reduced data differs from that of the original data by an amount bounded by the radius of the nanoclusters. The theoretical analysis is backed by experimental results showing that persistent homology is preserved by the proposed data reduction technique.more » « less