skip to main content

Title: The shape of things to come: Topological data analysis and biology, from molecules to organisms

Shape is data and data is shape. Biologists are accustomed to thinking about how the shape of biomolecules, cells, tissues, and organisms arise from the effects of genetics, development, and the environment. Less often do we consider that data itself has shape and structure, or that it is possible to measure the shape of data and analyze it. Here, we review applications of topological data analysis (TDA) to biology in a way accessible to biologists and applied mathematicians alike. TDA uses principles from algebraic topology to comprehensively measure shape in data sets. Using a function that relates the similarity of data points to each other, we can monitor the evolution of topological features—connected components, loops, and voids. This evolution, a topological signature, concisely summarizes large, complex data sets. We first provide a TDA primer for biologists before exploring the use of TDA across biological sub‐disciplines, spanning structural biology, molecular biology, evolution, and development. We end by comparing and contrasting different TDA approaches and the potential for their use in biology. The vision of TDA, that data are shape and shape is data, will be relevant as biology transitions into a data‐driven era where the meaningful interpretation of large data more » sets is a limiting factor.

« less
 ;  ;  ;  ;  
Award ID(s):
Publication Date:
Journal Name:
Developmental Dynamics
Page Range or eLocation-ID:
p. 816-833
Wiley Blackwell (John Wiley & Sons)
Sponsoring Org:
National Science Foundation
More Like this
  1. Chen, Tsu-Wei ; Long, Stephen P (Ed.)
    Abstract Shape plays a fundamental role in biology. Traditional phenotypic analysis methods measure some features but fail to measure the information embedded in shape comprehensively. To extract, compare and analyse this information embedded in a robust and concise way, we turn to topological data analysis (TDA), specifically the Euler characteristic transform. TDA measures shape comprehensively using mathematical representations based on algebraic topology features. To study its use, we compute both traditional and topological shape descriptors to quantify the morphology of 3121 barley seeds scanned with X-ray computed tomography (CT) technology at 127 μm resolution. The Euler characteristic transform measures shape by analysing topological features of an object at thresholds across a number of directional axes. A Kruskal–Wallis analysis of the information encoded by the topological signature reveals that the Euler characteristic transform picks up successfully the shape of the crease and bottom of the seeds. Moreover, while traditional shape descriptors can cluster the seeds based on their accession, topological shape descriptors can cluster them further based on their panicle. We then successfully train a support vector machine to classify 28 different accessions of barley based exclusively on the shape of their grains. We observe that combining both traditional and topologicalmore »descriptors classifies barley seeds better than using just traditional descriptors alone. This improvement suggests that TDA is thus a powerful complement to traditional morphometrics to comprehensively describe a multitude of ‘hidden’ shape nuances which are otherwise not detected.« less
  2. Abstract

    Large‐scale digitization projects such as#ScanAllFishesandoVertare generating high‐resolution microCT scans of vertebrates by the thousands. Data from these projects are shared with the community using aggregate 3D specimen repositories like MorphoSource through various open licenses. We anticipate an explosion of quantitative research in organismal biology with the convergence of available data and the methodologies to analyse them.

    Though the data are available, the road from a series of images to analysis is fraught with challenges for most biologists. It involves tedious tasks of data format conversions, preserving spatial scale of the data accurately, 3D visualization and segmentations, and acquiring measurements and annotations. When scientists use commercial software with proprietary formats, a roadblock for data exchange, collaboration and reproducibility is erected that hurts the efforts of the scientific community to broaden participation in research.

    We developed SlicerMorph as an extension of 3D Slicer, a biomedical visualization and analysis ecosystem with extensive visualization and segmentation capabilities built on proven python‐scriptable open‐source libraries such as Visualization Toolkit and Insight Toolkit. In addition to the core functionalities of Slicer, SlicerMorph provides users with modules to conveniently retrieve open‐access 3D models or import users own 3D volumes, to annotate 3D curve and patch‐based landmarks, generate landmark templates,more »conduct geometric morphometric analyses of 3D organismal form using both landmark‐driven and landmark‐free approaches, and create 3D animations from their results. We highlight how these individual modules can be tied together to establish complete workflow(s) from image sequence to morphospace. Our software development efforts were supplemented with short courses and workshops that cover the fundamentals of 3D imaging and morphometric analyses as it applies to study of organismal form and shape in evolutionary biology.

    Our goal is to establish a community of organismal biologists centred around Slicer and SlicerMorph to facilitate easy exchange of data and results and collaborations using 3D specimens. Our proposition to our colleagues is that using a common open platform supported by a large user and developer community ensures the longevity and sustainability of the tools beyond the initial development effort.

    « less
  3. Abstract Background

    Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognizes the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from the predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. This representatives of the entire data facilitates better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes.


    The topology relevant curated data that we obtain provides an improvement in shallow learning as well as deep learning based supervised classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able tomore »comprehend gene expression levels and classify cohorts accordingly.


    In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes.

    « less
  4. Abstract In developmental biology as well as in other biological systems, emerging structure and organization can be captured using time-series data of protein locations. In analyzing this time-dependent data, it is a common challenge not only to determine whether topological features emerge, but also to identify the timing of their formation. For instance, in most cells, actin filaments interact with myosin motor proteins and organize into polymer networks and higher-order structures. Ring channels are examples of such structures that maintain constant diameters over time and play key roles in processes such as cell division, development, and wound healing. Given the limitations in studying interactions of actin with myosin in vivo, we generate time-series data of protein polymer interactions in cells using complex agent-based models. Since the data has a filamentous structure, we propose sampling along the actin filaments and analyzing the topological structure of the resulting point cloud at each time. Building on existing tools from persistent homology, we develop a topological data analysis (TDA) method that assesses effective ring generation in this dynamic data. This method connects topological features through time in a path that corresponds to emergence of organization in the data. In this work, we also proposemore »methods for assessing whether the topological features of interest are significant and thus whether they contribute to the formation of an emerging hole (ring channel) in the simulated protein interactions. In particular, we use the MEDYAN simulation platform to show that this technique can distinguish between the actin cytoskeleton organization resulting from distinct motor protein binding parameters.« less
  5. Topological data analysis (TDA) combines concepts from algebraic topology, machine learning, statistics, and data science which allow us to study data in terms of their latent shape properties. Despite the use of TDA in a broad range of applications, from neuroscience to power systems to finance, the utility of TDA in Earth science applications is yet untapped. The current study aims to offer a new approach for analyzing multi-resolution Earth science datasets using the concept of data shape and associated intrinsic topological data characteristics. In particular, we develop a new topological approach to quantitatively compare two maps of geophysical variables at different spatial resolutions. We illustrate the proposed methodology by applying TDA to aerosol optical depth (AOD) datasets from the Goddard Earth Observing System, Version 5 (GEOS-5) model over the Middle East. Our results show that, contrary to the existing approaches, TDA allows for systematic and reliable comparison of spatial patterns from different observational and model datasets without regridding the datasets into common grids.