skip to main content


Title: Measuring hidden phenotype: quantifying the shape of barley seeds using the Euler characteristic transform
Abstract Shape plays a fundamental role in biology. Traditional phenotypic analysis methods measure some features but fail to measure the information embedded in shape comprehensively. To extract, compare and analyse this information embedded in a robust and concise way, we turn to topological data analysis (TDA), specifically the Euler characteristic transform. TDA measures shape comprehensively using mathematical representations based on algebraic topology features. To study its use, we compute both traditional and topological shape descriptors to quantify the morphology of 3121 barley seeds scanned with X-ray computed tomography (CT) technology at 127 μm resolution. The Euler characteristic transform measures shape by analysing topological features of an object at thresholds across a number of directional axes. A Kruskal–Wallis analysis of the information encoded by the topological signature reveals that the Euler characteristic transform picks up successfully the shape of the crease and bottom of the seeds. Moreover, while traditional shape descriptors can cluster the seeds based on their accession, topological shape descriptors can cluster them further based on their panicle. We then successfully train a support vector machine to classify 28 different accessions of barley based exclusively on the shape of their grains. We observe that combining both traditional and topological descriptors classifies barley seeds better than using just traditional descriptors alone. This improvement suggests that TDA is thus a powerful complement to traditional morphometrics to comprehensively describe a multitude of ‘hidden’ shape nuances which are otherwise not detected.  more » « less
Award ID(s):
2018432 2046256
NSF-PAR ID:
10358277
Author(s) / Creator(s):
; ; ; ; ; ;
Editor(s):
Chen, Tsu-Wei; Long, Stephen P
Date Published:
Journal Name:
in silico Plants
Volume:
4
Issue:
1
ISSN:
2517-5025
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Shape is data and data is shape. Biologists are accustomed to thinking about how the shape of biomolecules, cells, tissues, and organisms arise from the effects of genetics, development, and the environment. Less often do we consider that data itself has shape and structure, or that it is possible to measure the shape of data and analyze it. Here, we review applications of topological data analysis (TDA) to biology in a way accessible to biologists and applied mathematicians alike. TDA uses principles from algebraic topology to comprehensively measure shape in data sets. Using a function that relates the similarity of data points to each other, we can monitor the evolution of topological features—connected components, loops, and voids. This evolution, a topological signature, concisely summarizes large, complex data sets. We first provide a TDA primer for biologists before exploring the use of TDA across biological sub‐disciplines, spanning structural biology, molecular biology, evolution, and development. We end by comparing and contrasting different TDA approaches and the potential for their use in biology. The vision of TDA, that data are shape and shape is data, will be relevant as biology transitions into a data‐driven era where the meaningful interpretation of large data sets is a limiting factor.

     
    more » « less
  2. null (Ed.)
    Abstract This paper introduces the use of topological data analysis (TDA) as an unsupervised machine learning tool to uncover classification criteria in complex inorganic crystal chemistries. Using the apatite chemistry as a template, we track through the use of persistent homology the topological connectivity of input crystal chemistry descriptors on defining similarity between different stoichiometries of apatites. It is shown that TDA automatically identifies a hierarchical classification scheme within apatites based on the commonality of the number of discrete coordination polyhedra that constitute the structural building units common among the compounds. This information is presented in the form of a visualization scheme of a barcode of homology classifications, where the persistence of similarity between compounds is tracked. Unlike traditional perspectives of structure maps, this new “Materials Barcode” schema serves as an automated exploratory machine learning tool that can uncover structural associations from crystal chemistry databases, as well as to achieve a more nuanced insight into what defines similarity among homologous compounds. 
    more » « less
  3. Abstract Background

    Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognizes the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from the predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. This representatives of the entire data facilitates better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes.

    Results

    The topology relevant curated data that we obtain provides an improvement in shallow learning as well as deep learning based supervised classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly.

    Conclusions

    In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes.

     
    more » « less
  4. Context. Machine-learning methods for predicting solar flares typically employ physics-based features that have been carefully cho- sen by experts in order to capture the salient features of the photospheric magnetic fields of the Sun. Aims. Though the sophistication and complexity of these models have grown over time, there has been little evolution in the choice of feature sets, or any systematic study of whether the additional model complexity leads to higher predictive skill. Methods. This study compares the relative prediction performance of four different machine-learning based flare prediction models with increasing degrees of complexity. It evaluates three different feature sets as input to each model: a “traditional” physics-based feature set, a novel “shape-based” feature set derived from topological data analysis (TDA) of the solar magnetic field, and a com- bination of these two sets. A systematic hyperparameter tuning framework is employed in order to assure fair comparisons of the models across different feature sets. Finally, principal component analysis is used to study the effects of dimensionality reduction on these feature sets. Results. It is shown that simpler models with fewer free parameters perform better than the more complicated models on the canonical 24-h flare forecasting problem. In other words, more complex machine-learning architectures do not necessarily guarantee better prediction performance. In addition, it is found that shape-based feature sets contain just as much useful information as physics-based feature sets for the purpose of flare prediction, and that the dimension of these feature sets – particularly the shape-based one – can be greatly reduced without impacting predictive accuracy. 
    more » « less
  5. Topological Data Analysis (TDA) utilizes concepts from topology to analyze data. In general, TDA considers objects similar based on a topological invariant. Topological invariants are properties of the topological space that are homeomorphic; resilient to deformation in the space. The Euler-Poincaré Characteristic is a classic topological invariant that represents the alternating sum of the vertices, edges, faces, and higherorder cells of a closed surface. Tracking the Euler characteristic over a topological filtration produces an Euler Characteristic Curve (ECC). This study introduces a computational technique to determine the ECC of R2 or R3 data; the technique generalizes to higher dimensions. This technique separates landscapes of lowerorder homologies utilizing triangulations of the space. 
    more » « less