<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Data Reduction and Feature Isolation for Computing Persistent Homology on High Dimensional Data</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/15/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10350942</idno>
					<idno type="doi">10.1109/BigData52589.2021.9671839</idno>
					<title level='j'>Workshop on Applications of Topological Data Analysis to "Big Data"</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Rishi R. Verma</author><author>Nicholas O. Malott</author><author>Philip A. Wilsey</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Persistent Homology (PH) is computationally expensive and is thus generally employed with strict limits on the (i) maximum connectivity distance and (ii) dimensions of homology groups to compute (unless working with trivially small data sets). As a result, most studies with PH only work with H0 and H1 homology groups. This paper examines the identification and isolation of regions of data sets where high dimensional topological features are suspected to be located. These regions are analyzed with PH to characterize the high dimensional homology groups contained in that region. Since only the region around a suspected topological feature is analyzed, it is possible to identify high dimension homologies piecewise and then assemble the results into a scalable characterization of the original data set.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Persistent Homology (PH) is one of the principal components of Topological Data Analysis (TDA) <ref type="bibr">[1]</ref>. PH provides a scalar representation of topological structures encoded in a point cloud -connected components, loops, voids, and so on -as they persist over multiple spatial resolutions <ref type="bibr">[2]</ref>- <ref type="bibr">[6]</ref>. The output of PH is captured through persistence intervals that characterize the persistence, or spatial lifetime, of these topological structures. The resulting persistence intervals can be used for various machine learning applications <ref type="bibr">[7]</ref>- <ref type="bibr">[18]</ref>.</p><p>The computation of PH suffers from exponential memory and time complexities that prevent its general application to high dimensional data. For example, computing the H 2 homology groups of a point cloud in R 3 is only feasible with a few thousand points. Consequently, work with PH is often limited to the H 0 and H 1 homology groups (although in some cases H 2 homology groups are studied). However, topological features in dimensions above H 2 are generally omitted due to the associated time and space complexity. This work explores techniques to enable the use of computational PH to characterize higher dimensional homology groups.</p><p>The method presented in this paper computes PH on dimensional reductions of the data to locate candidate regions where high dimensional homologies may be located. The candidate regions of the original space are then isolated for Support for this work was provided in part by the National Science Foundation under grant <ref type="bibr">IIS-1909096.</ref> high dimension PH analysis. The technique is motivated by previous studies of dimensionality reduction on PH <ref type="bibr">[19]</ref>- <ref type="bibr">[22]</ref>. The principal idea of this paper is that, under projection, higher dimensional topological features will be present in the lower dimensional data. Thus, computing PH on the dimensionality reduced data will help identify candidate regions of the original data to examine for high dimensional homologies. In this work, dimensionality reduction and PH are used to locate the vertices of representative cycles of topological features in the reduced dimension. The dimensionality reduced vertices are then re-associated to the corresponding vertices in the original data set. The vertex set is then used to extract a region of the original data around the cycle's center for further PH analysis in the high dimensional space.</p><p>The remainder of this paper is organized as follows. Section II presents some background and related work. Section III reviews the technical approach used to locate and analyze topological features in high dimensional data. Section IV presents some early experimental results. Finally, Section V contains some concluding remarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND AND RELATED WORK</head><p>This section presents background information and studies related to this work. Additional background information on PH can be found in <ref type="bibr">[2]</ref>, <ref type="bibr">[4]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[24]</ref>.</p><p>Techniques to combat the exponential memory and time complexity of PH have led to advancements in complex construction <ref type="bibr">[25]</ref>- <ref type="bibr">[27]</ref>, complex representation <ref type="bibr">[28]</ref>- <ref type="bibr">[30]</ref>, boundary matrix reduction <ref type="bibr">[30]</ref>- <ref type="bibr">[33]</ref>, and approximate algorithms <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[34]</ref>- <ref type="bibr">[36]</ref>. A majority of these are evaluated with respect to low dimensional topological feature identification. Few studies are directly aimed at analyzing high dimensional homology groups; both the application and understanding of higher dimensional homology groups are limited.</p><p>Several studies have examined the application of dimensional reduction techniques to the computation of PH <ref type="bibr">[19]</ref>- <ref type="bibr">[21]</ref>. The studies by <ref type="bibr">[19]</ref>, <ref type="bibr">[21]</ref> have largely focused on the theoretical issues and limits of dimensionality reduction for the computation of PH. Ramamurthy et al <ref type="bibr">[20]</ref> provide the first empirical study of random projection on persistence intervals and Betti numbers. Their experiments consider high dimensional source data in 3 studies, namely: 10, 000 points in 50-dimensions (synthetic), 15, 000 points in 25-dimensions (image patches), and 4, 000 points in 100-dimensions (audio clips of wheezing patients). The dimensional reduction step  is performed using random projection and the data is reduced to dimension 30. They examined the impact of dimensional reduction on the computed Betti numbers (the number of kdimensional features, using PH). Unfortunately, the data in these studies contained Betti numbers only at &#946; 0 , &#946; 1 , and &#946; 2 (&#946; 2 results were reported only from the synthetic data).</p><p>This study provides evidence that homology groups H 0 , H 1 , and H 2 are preserved through random projection; however the preservation of homology groups above H 2 was not explored.</p><p>Cycles and Co-cycles: In this paper, the identification of high dimensional homology groups is achieved by extracting representative cycles from a PH computation. For each topological feature, the PH computation can track the boundary, or the vertices, that circumscribe that feature <ref type="bibr">[37]</ref>. There may exist multiple possible cycles that can generate the same feature, so a single representative cycle is chosen. The representative cycles provide candidate regions for analysis in the original space. Features identified in the projected space are utilized to gather their corresponding high dimensional neighborhood (region) that is then used to compute high dimensional homology groups (generally from a memory-bound approximated sampling of the region).</p><p>Unfortunately, most existing tools to compute PH utilize cohomology instead of homology <ref type="bibr">[38]</ref>. Cohomology reports representative co-cycles instead of cycles. A new technique called Involuted Homology <ref type="bibr">[39]</ref> can reconstruct the homology cycles after computing the cohomology. While not widely used, involuted homology is available in a tool called ripserer <ref type="bibr">[40]</ref>. This tool is used in the studies of this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Size Limitations when Computing PH:</head><p>The approximation of high dimensional homology groups developed in this study is motivated primarily by the limits of memory. In general, computing the H 2 homology groups of data with PH is constrained to several thousand points on modern computing hardware. Computation of higher dimensional homology groups is even more restricted. Table <ref type="table">I</ref> presents the limitations of the computation of H d homology groups using synthetic d-Spheres in R d+1 with ripserer on an AMD Ryzen 7 with 128GB of RAM. These limits are similar for other tools (e.g., ripser <ref type="bibr">[30]</ref>). While the exact limits depend on each specific data set, these are sufficient for the early assessments computed for this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. TECHNICAL APPROACH</head><p>This section describes the approach to characterizing high dimensional homology groups of an input point cloud. The outPIs &#8592; ripserer(reducedRegion) 18: end for approach utilizes the output of PH on a dimensionally-reduced approximation to identify candidate regions of the space where high dimensional features may lie. These areas are then reassociated to their points in the original point cloud to form a region that is evaluated for higher dimensional homologies. Algorithm 1 contains the pseudo-code for the approach. The algorithm can be summarized in 3 main processing steps.</p><p>Step 1: Project Data to R 3 and Compute PH This step locates the geometric center of regions in the data where topological features are suspected to exist (Lines 2-11). Since the projection methods being used in this study are stochastic, the algorithm takes multiple trials to collect the persistence intervals in the low dimension space (this study used Gaussian projection to a fixed dimension of 3 with 8 trials). The algorithm only collects significant H 1 and H 2 persistence intervals from the projected data (Lines 5-10). In this study, the kneed algorithm <ref type="bibr">[41]</ref> with the default sensitivity of 1.0 is used on the length (&#491; death&#491; birth ) of the persistence intervals to determine significance. Furthermore, the kneed algorithm and the definition of significance was evaluated separately on the sets of H 1 and H 2 persistence intervals.</p><p>For each significant persistent interval found, the indices of the representative cycle vertices are used to collect the corresponding vertices in the high dimensional data (Line 7). The index is sufficient to re-associate vertices back to the high dimensional space because Gaussian projection performs a 1to-1 mapping of high to low vectors. The final component of this step is to compute the geometric center of the high dimension vertices associated to the persistence interval (Line 8). Note that all geometric centers are collected regardless of the dimension of the persistence interval that generated them.</p><p>A quick evaluation to analyze each dimension separately was performed early in this study, but no significant advantage or discriminating conclusion could be drawn from their separation.</p><p>Step 2: Cluster the Centers and Extract Centroids In this step, the geometric centers, gathered from the representative cycles in Step 1, are now examined and clustered to isolate the key regions where topological features are suspected to exist. In particular, the geometric centers are clustered and the centroids of the clusters captured (Line: 12). This approach requires a clustering algorithm that operates without expecting a known target number of clusters. In this study, the MeanShift algorithm was used. While not scalable, the number of representative cycles to cluster is relatively small and the MeanShift algorithm performs extremely well experimentally at locating the correct number of centers for the test data examined in Section IV. The clustering step removes redundant centers collected from the persistence intervals of the multiple trials.</p><p>Step 3: Define High Dimension Region &amp; Compute PH Each centroid from Step 2 is used to define the center of a candidate region for more detailed analysis in the higher dimensional space. For each centroid, points from the original point cloud that are within a specified distance of the centroid are collected into a region of points of interest (Line: 15). In this study, this distance is defined independently for each cluster. In particular, the mean (&#181;) plus standard deviation (&#963;) of the distances from the vertices of the representative cycles in the cluster to the cluster centroid is used (Line: 14). This region of points is a subset of the original point cloud that can be studied with PH. However, the size of this region can easily exceed the capacity of existing PH tools, so prior to analysis, the region is sampled (Line: 16). The k-means++ algorithm is used to sample the data as it has been shown to provide an accurate approximation for computing PH for large data sets <ref type="bibr">[35]</ref>, <ref type="bibr">[42]</ref>. The sampled region is then analyzed using standard PH tools (Line: 17).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. PRELIMINARY RESULTS</head><p>Synthetic data sets are used to evaluate the technique described in this paper in order to generate known high dimensional features. The synthetic data is composed of multiple unit d-Spheres of different dimensions embedded in R 5 through R 7 . Each unit d-Sphere is composed of 5, 000 points and was generated using Muller's method <ref type="bibr">[43]</ref> to produce a uniform random set of points on the surface of the sphere. These spheres were all constructed about the origin of the coordinate axis in R d+1 . One test was performed on data embedded in R 8 . However the computation of PH was unable to clearly distinguish features in this dimension (with this algorithm or using a standard PH computation on one unit 8-Sphere). As a result, testing was halted at R 7 .</p><p>Multiple d-Spheres were embedded in a common space; when the space was of a dimension higher than the original  sphere, the additional coordinates were set to 0.0. The d-Spheres were placed adjacent to each other on the first coordinate axis at multiples of 2 so that they were pairwise non-intersecting. Thus, the t spheres were positioned so that their respective centers were at 2i, 0, &#8226; &#8226; &#8226; , 0 for 0 &#8804; i &lt; t.</p><p>In this test suite, there were 9 sets of synthetic test data, three with 2 d-Spheres, two with 3 d-Spheres and four with 4 d-Spheres. Due to space considerations, not all results are shown; however, the results are consistent across all tests.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Identifying Centers of Candidate regions</head><p>The first step of the proposed approach (Algorithm 1, Lines: 2-11) analyzes the projected data to locate centers of regions of the data that the approach determines are potential candidates where a significant topological feature may exist. This step operates without any knowledge of the number or dimension of any homological features present in the data. Given the synthetic test data described above, it is known (to us, but not to the algorithm) that there are significant topological features along the first coordinate axis at 2i, 0, &#8226; &#8226; &#8226; , 0 for 0 &#8804; i &lt; t. Thus, the ability of the algorithm to correctly identify these feature centers provides a preliminary indicator of the suitability of the approach.</p><p>Table <ref type="table">II</ref> shows the computed region centers for 4 of the test data sets. The bold strings are the names of each data set, which describes the dimensions of the d-Spheres and their embeddings. For example, the first row of Table II examines 4d5d5d4din5d. This name indicates an embedding in R 5 of: (i) a unit 4-Sphere at 0, 0, 0, 0, 0 , (ii) a unit 5-Sphere at 2, 0, 0, 0, 0 , (iii) a unit 5-Sphere at 4, 0, 0, 0, 0 , and (iv) a unit 4-Sphere at 6, 0, 0, 0, 0 . Each row following the test data set name provides the Cartesian coordinates of the geometric centroid of all cluster centroids reported by Algorithm 1 (Line: 12). For each d-Sphere, the method reports the correct number of clusters and a good approximation of their geometric center.</p><p>While not the exact centers of the test data, the computed centroids are extremely close to the correct values.</p></div></body>
		</text>
</TEI>
