Title: A Comparative Study of Machine Learning Methods for Persistence Diagrams
Many and varied methods currently exist for featurization, which is the process of mapping persistence diagrams to Euclidean space, with the goal of maximally preserving structure. However, to our knowledge, there are presently no methodical comparisons of existing approaches, nor a standardized collection of test data sets. This paper provides a comparative study of several such methods. In particular, we review, evaluate, and compare the stable multi-scale kernel, persistence landscapes, persistence images, the ring of algebraic functions, template functions, and adaptive template systems. Using these approaches for feature extraction, we apply and compare popular machine learning methods on five data sets: MNIST, Shape retrieval of non-rigid 3D Human Models (SHREC14), extracts from the Protein Classification Benchmark Collection (Protein), MPEG7 shape matching, and the HAM10000 skin lesion data set. These data sets are commonly used in the above methods for featurization, and we use them to evaluate predictive utility in real-world applications.
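To make the idea of featurization concrete, the following is a minimal sketch of one of the listed methods, a persistence image, written with NumPy. It is not the implementation used in the paper. The function name `persistence_image`, the parameters `resolution`, `sigma`, and `extent`, and the linear persistence weighting are all assumptions chosen for illustration, and the Gaussian is evaluated at pixel centers rather than integrated over each pixel, which is a simplification.

```python
import numpy as np


def persistence_image(diagram, resolution=20, sigma=0.1,
                      extent=(0.0, 1.0, 0.0, 1.0)):
    """Map a persistence diagram to a fixed-length vector (illustrative sketch).

    diagram: iterable of (birth, death) pairs.
    extent:  (birth_min, birth_max, pers_min, pers_max) of the pixel grid
             (hypothetical parameter, chosen here for illustration).
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    birth = diagram[:, 0]
    pers = diagram[:, 1] - diagram[:, 0]  # persistence = death - birth

    b_min, b_max, p_min, p_max = extent
    # Pixel centers in (birth, persistence) coordinates.
    bs = np.linspace(b_min, b_max, resolution)
    ps = np.linspace(p_min, p_max, resolution)
    grid_b, grid_p = np.meshgrid(bs, ps, indexing="ij")

    image = np.zeros((resolution, resolution))
    for b, p in zip(birth, pers):
        # Assumed weighting: linear in persistence, so points on the
        # diagonal (zero persistence) contribute nothing.
        sq_dist = (grid_b - b) ** 2 + (grid_p - p) ** 2
        image += p * np.exp(-sq_dist / (2.0 * sigma ** 2))

    # Flatten so the result can be passed to any Euclidean-space classifier.
    return image.ravel()


if __name__ == "__main__":
    toy_diagram = [(0.2, 0.7), (0.1, 0.3)]
    features = persistence_image(toy_diagram, resolution=10)
    print(features.shape)
```

Every diagram, whatever its number of points, is mapped to a vector of the same length (`resolution ** 2`), which is what allows standard machine learning methods to be applied afterwards.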
Award ID(s):
1943758 2006661 2415445
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Frontiers in Artificial Intelligence
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As the field of Topological Data Analysis continues to show success in theory and in applications, there has been increasing interest in using tools from this field with methods for machine learning. Using persistent homology, specifically persistence diagrams, as inputs to machine learning techniques requires some mathematical creativity. The space of persistence diagrams lacks properties that are desirable for machine learning, so kernel methods and vectorization methods have been developed. One such featurization of persistence diagrams by Perea, Munch and Khasawneh uses continuous, compactly supported functions, referred to as "template functions," which results in a stable vector representation of the persistence diagram. In this paper, we provide a method of adaptively partitioning persistence diagrams to improve these featurizations based on localized information in the diagrams. Additionally, we provide a framework to adaptively select parameters required for the template functions in order to best utilize the partitioning method. We present results for application to example data sets comparing classification results between template function featurizations with and without partitioning, in addition to other methods from the literature.
  2. The emergence of data-intensive scientific discovery and machine learning has dramatically changed the way in which scientists and engineers approach materials design. Nevertheless, for designing macromolecules or polymers, one limitation is the lack of appropriate methods or standards for converting systems into chemically informed, machine-readable representations. This featurization process is critical to building predictive models that can guide polymer discovery. Although standard molecular featurization techniques have been deployed on homopolymers, such approaches capture neither the multiscale nature nor topological complexity of copolymers, and they have limited application to systems that cannot be characterized by a single repeat unit. Herein, we present, evaluate, and analyze a series of featurization strategies suitable for copolymer systems. These strategies are systematically examined in diverse prediction tasks sourced from four distinct datasets that enable understanding of how featurization can impact copolymer property prediction. Based on this comparative analysis, we suggest directly encoding polymer size in polymer representations when possible, adopting topological descriptors or convolutional neural networks when the precise polymer sequence is known, and using chemically informed unit representations when developing extrapolative models. These results provide guidance and future directions regarding polymer featurization for copolymer design by machine learning. 
  3. Abstract

    Objectives

    Increased use of three‐dimensional (3D) imaging data has led to a need for methods capable of capturing rich shape descriptions. Semi‐landmarks have been demonstrated to increase shape information but placement in 3D can be time consuming, computationally expensive, or may introduce artifacts. This study implements and compares three strategies to more densely sample a 3D image surface.

    Materials and methods

    Three dense sampling strategies (patch, patch‐thin‐plate spline (TPS), and pseudo‐landmark sampling) are implemented to analyze skulls from three species of great apes. To evaluate the shape information added by each strategy, the semi‐ or pseudo‐landmarks are used to estimate a transform between an individual and the population average template. The average mean root squared error between the transformed mesh and the template is used to quantify the success of the transform.


    Results

    The landmark sets generated by each method result in estimates of the template that, on average, were comparable to or exceeded the accuracy of using manual landmarks alone. The patch method demonstrates the most sensitivity to noise and missing data, resulting in outliers with large deviations in the mean shape estimates. Patch‐TPS and pseudo‐landmarking provide more robust performance in the presence of noise and variability in the dataset.


    Discussion

    Each landmarking strategy was capable of producing shape estimations of the population average templates that were generally comparable to manual landmarks alone while greatly increasing the density of the shape information. This study highlights the potential trade‐offs between correspondence of the semi‐landmark points, consistent point spacing, sample coverage, repeatability, and computational time.

  4. Abstract

    As neuroimaging data increase in complexity and related analytical problems follow suit, more researchers are drawn to collaborative frameworks that leverage data sets from multiple data‐collection sites to balance out the complexity with an increased sample size. Although centralized data‐collection approaches have dominated the collaborative scene, a number of decentralized approaches (those that avoid gathering data at a shared central store) have grown in popularity. We expect the prevalence of decentralized approaches to continue as privacy risks and communication overhead become increasingly important for researchers. In this article, we develop, implement and evaluate a decentralized version of one such widely used tool: dynamic functional network connectivity. Our resulting algorithm, decentralized dynamic functional network connectivity (ddFNC), synthesizes a new, decentralized group independent component analysis algorithm (dgICA) with algorithms for decentralized k‐means clustering. We compare both individual decentralized components and the full resulting decentralized analysis pipeline against centralized counterparts on the same data, and show that both provide comparable performance. Additionally, we perform several experiments which evaluate the communication overhead and convergence behavior of various decentralization strategies and decentralized clustering algorithms. Our analysis indicates that ddFNC is a fine candidate for facilitating decentralized collaboration between neuroimaging researchers, and stands ready for the inclusion of privacy‐enabling modifications, such as differential privacy.

  5. A critical step in data analysis for many different types of experiments is the identification of features with theoretically defined shapes in N-dimensional datasets; examples of this process include finding peaks in multi-dimensional molecular spectra or emitters in fluorescence microscopy images. Identifying such features involves determining if the overall shape of the data is consistent with an expected shape; however, it is generally unclear how to quantitatively make this determination. In practice, many analysis methods employ subjective, heuristic approaches, which complicates the validation of any ensuing results, especially as the amount and dimensionality of the data increase. Here, we present a probabilistic solution to this problem by using Bayes’ rule to calculate the probability that the data have any one of several potential shapes. This probabilistic approach may be used to objectively compare how well different theories describe a dataset, identify changes between datasets and detect features within data using a corollary method called Bayesian Inference-based Template Search; several proof-of-principle examples are provided. Altogether, this mathematical framework serves as an automated ‘engine’ capable of computationally executing analysis decisions currently made by visual inspection across the sciences.
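The use of Bayes’ rule described in item 5, computing the probability that the data have each of several candidate shapes, can be sketched generically as follows. This is not the method from that paper: the function name `shape_posteriors` and its inputs are hypothetical, and the sketch assumes that a log evidence (log marginal likelihood) has already been computed for each candidate shape by some other means.

```python
import numpy as np


def shape_posteriors(log_evidences, priors=None):
    """Posterior probability of each candidate shape via Bayes' rule.

    log_evidences: log p(data | shape_k) for each candidate shape
                   (assumed to be computed elsewhere).
    priors:        p(shape_k); defaults to a uniform prior (an assumption).
    """
    log_evidences = np.asarray(log_evidences, dtype=float)
    if priors is None:
        priors = np.full(log_evidences.shape, 1.0 / log_evidences.size)
    priors = np.asarray(priors, dtype=float)

    # log of the numerator of Bayes' rule: p(data | shape_k) * p(shape_k)
    log_joint = log_evidences + np.log(priors)
    # Subtract the maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max()
    unnormalized = np.exp(log_joint)
    # Dividing by the sum applies the normalizing denominator of Bayes' rule.
    return unnormalized / unnormalized.sum()


if __name__ == "__main__":
    # Three hypothetical candidate shapes with made-up log evidences.
    print(shape_posteriors([-1000.0, -1001.0, -1005.0]))
```

Working in log space keeps the calculation stable when the evidences are very small numbers, which is common for large datasets.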