skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Topology Applied to Machine Learning: From Global to Local
Through the use of examples, we explain one way in which applied topology has evolved since the birth of persistent homology in the early 2000s. The first applications of topology to data emphasized the global shape of a dataset, such as the three-circle model for 3 × 3 pixel patches from natural images, or the configuration space of the cyclo-octane molecule, which is a sphere with a Klein bottle attached via two circles of singularity. In these studies of global shape, short persistent homology bars are disregarded as sampling noise. More recently, however, persistent homology has been used to address questions about the local geometry of data. For instance, how can local geometry be vectorized for use in machine learning problems? Persistent homology and its vectorization methods, including persistence landscapes and persistence images, provide popular techniques for incorporating both local geometry and global topology into machine learning. Our meta-hypothesis is that the short bars are as important as the long bars for many machine learning tasks. In defense of this claim, we survey applications of persistent homology to shape recognition, agent-based modeling, materials science, archaeology, and biology. Additionally, we survey work connecting persistent homology to geometric features of spaces, including curvature and fractal dimension, and various methods that have been used to incorporate persistent homology into machine learning.  more » « less
Award ID(s):
1934725
PAR ID:
10228407
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Frontiers in Artificial Intelligence
Volume:
4
ISSN:
2624-8212
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Protein‐ligand binding is a fundamental biological process that is paramount to many other biological processes, such as signal transduction, metabolic pathways, enzyme construction, cell secretion, and gene expression. Accurate prediction of protein‐ligand binding affinities is vital to rational drug design and the understanding of protein‐ligand binding and binding induced function. Existing binding affinity prediction methods are inundated with geometric detail and involve excessively high dimensions, which undermines their predictive power for massive binding data. Topology provides the ultimate level of abstraction and thus incurs too much reduction in geometric information. Persistent homology embeds geometric information into topological invariants and bridges the gap between complex geometry and abstract topology. However, it oversimplifies biological information. This work introduces element specific persistent homology (ESPH) or multicomponent persistent homology to retain crucial biological information during topological simplification. The combination of ESPH and machine learning gives rise to a powerful paradigm for macromolecular analysis. Tests on 2 large data sets indicate that the proposed topology‐based machine‐learning paradigm outperforms other existing methods in protein‐ligand binding affinity predictions. ESPH reveals protein‐ligand binding mechanism that can not be attained from other conventional techniques. The present approach reveals that protein‐ligand hydrophobic interactions are extended to 40Å  away from the binding site, which has a significant ramification to drug and protein design. 
    more » « less
  2. As the field of Topological Data Analysis continues to show success in theory and in applications, there has been increasing interest in using tools from this field with methods for machine learning. Using persistent homology, specifically persistence diagrams, as inputs to machine learning techniques requires some mathematical creativity. The space of persistence diagrams does not have the desirable properties for machine learning, thus methods such as kernel methods and vectorization methods have been developed. One such featurization of persistence diagrams by Perea, Munch and Khasawneh uses continuous, compactly supported functions, referred to as "template functions," which results in a stable vector representation of the persistence diagram. In this paper, we provide a method of adaptively partitioning persistence diagrams to improve these featurizations based on localized information in the diagrams. Additionally, we provide a framework to adaptively select parameters required for the template functions in order to best utilize the partitioning method. We present results for application to example data sets comparing classification results between template function featurizations with and without partitioning, in addition to other methods from the literature. 
    more » « less
  3. Abstract Topological data analysis (TDA) is a tool from data science and mathematics that is beginning to make waves in environmental science. In this work, we seek to provide an intuitive and understandable introduction to a tool from TDA that is particularly useful for the analysis of imagery, namely, persistent homology. We briefly discuss the theoretical background but focus primarily on understanding the output of this tool and discussing what information it can glean. To this end, we frame our discussion around a guiding example of classifying satellite images from the sugar, fish, flower, and gravel dataset produced for the study of mesoscale organization of clouds by Rasp et al. We demonstrate how persistent homology and its vectorization, persistence landscapes, can be used in a workflow with a simple machine learning algorithm to obtain good results, and we explore in detail how we can explain this behavior in terms of image-level features. One of the core strengths of persistent homology is how interpretable it can be, so throughout this paper we discuss not just the patterns we find but why those results are to be expected given what we know about the theory of persistent homology. Our goal is that readers of this paper will leave with a better understanding of TDA and persistent homology, will be able to identify problems and datasets of their own for which persistent homology could be helpful, and will gain an understanding of the results they obtain from applying the included GitHub example code. Significance StatementInformation such as the geometric structure and texture of image data can greatly support the inference of the physical state of an observed Earth system, for example, in remote sensing to determine whether wildfires are active or to identify local climate zones. Persistent homology is a branch of topological data analysis that allows one to extract such information in an interpretable way—unlike black-box methods like deep neural networks. The purpose of this paper is to explain in an intuitive manner what persistent homology is and how researchers in environmental science can use it to create interpretable models. We demonstrate the approach to identify certain cloud patterns from satellite imagery and find that the resulting model is indeed interpretable. 
    more » « less
  4. Abstract Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B-factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural network for protein thermal fluctuation analysis and B-factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information in complex macromolecules. 
    more » « less
  5. Abstract A central challenge in topological data analysis is the interpretation of barcodes. The classical algebraic-topological approach to interpreting homology classes is to build maps to spaces whose homology carries semantics we understand and then to appeal to functoriality. However, we often lack such maps in real data; instead, we must rely on a cross-dissimilarity measure between our observations of a system and a reference. In this paper, we develop a pair of computational homological algebra approaches for relating persistent homology classes and barcodes:persistent extension, which enumerates potential relations between homology classes from two complexes built on the same vertex set, and the method ofanalogous bars, which utilizes persistent extension and the witness complex built from a cross-dissimilarity measure to provide relations across systems. We provide an implementation of these methods and demonstrate their use in comparing homology classes between two samples from the same metric space and determining whether topology is maintained or destroyed under clustering and dimensionality reduction. 
    more » « less