Title: Persistent Homology on Streaming Data
This paper introduces a framework for computing persistent homology, a principal tool in Topological Data Analysis, on potentially unbounded and evolving data streams. The framework is organized into online and offline components. The online component maintains a summary of the data that preserves the topological structure of the stream. The offline component computes the persistence intervals from the data captured by the summary. The framework is applied to the detection of horizontal, or reticulate, genomic exchanges during the evolution of species, which cannot be identified by phylogenetic inference or traditional data mining. The method effectively detects reticulate evolution that occurs through reassortment and recombination in large streams of genomic sequences of Influenza and HIV viruses.
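As a rough illustration of the online/offline split described in the abstract, the Python sketch below pairs a simple centroid-based stream summary with an offline computation of 0-dimensional persistence intervals. It is not the paper's algorithm: the class `OnlineSummary`, the function `h0_intervals`, the `radius` parameter, and the nearest-centroid summary are all assumptions made for illustration, and only connected components (H0) are tracked.

```python
import math
from itertools import combinations


class OnlineSummary:
    """Hypothetical online component: a bounded set of running centroids."""

    def __init__(self, radius):
        self.radius = radius      # assumed absorption radius for a centroid
        self.centroids = []       # each entry is [mean_point, count]

    def update(self, point):
        """Absorb `point` into the nearest centroid, or open a new one."""
        best, best_d = None, math.inf
        for c in self.centroids:
            d = math.dist(c[0], point)
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= self.radius:
            n = best[1] + 1
            # Incremental mean update, so raw points never need to be stored.
            best[0] = tuple(m + (x - m) / n for m, x in zip(best[0], point))
            best[1] = n
        else:
            self.centroids.append([tuple(point), 1])


def h0_intervals(points):
    """Offline component: H0 persistence intervals of a Vietoris-Rips
    filtration (an edge appears at scale equal to the pairwise distance)."""
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    intervals = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            intervals.append((0.0, d))   # one component dies at scale d
    intervals.append((0.0, math.inf))    # the last component never dies
    return intervals
```

Feeding points to `summary.update(...)` one at a time and then calling `h0_intervals([c[0] for c in summary.centroids])` keeps memory proportional to the number of centroids rather than the length of the stream. Higher-dimensional features (loops, voids) would need a full persistent homology library instead of this union-find shortcut.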
Award ID(s):
1440420 1909096
Journal Name:
8th Workshop on Data Mining in Biomedical Informatics and Healthcare
Page Range / eLocation ID:
636 to 643
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts. 

  2. Hybridization has long been recognized as a fundamental evolutionary process in plants but, until recently, our understanding of its phylogenetic distribution and biological significance across deep evolutionary scales has been largely obscure. Over the past decade, genomic and phylogenomic datasets have revealed, perhaps not surprisingly, that hybridization, often associated with polyploidy, has been common throughout the evolutionary history of plants, particularly in various lineages of flowering plants. However, phylogenomic studies have also highlighted the challenges of disentangling signals of ancient hybridization from other sources of genomic conflict (in particular, incomplete lineage sorting). Here, we provide a critical review of ancient hybridization in vascular plants, outlining well‐documented cases of ancient hybridization across plant phylogeny, as well as the challenges unique to documenting ancient versus recent hybridization. We provide a definition for ancient hybridization, which, to our knowledge, has not been explicitly attempted before. Further documenting the extent of deep reticulation in plants should remain an important research focus, especially because published examples likely represent the tip of the iceberg in terms of the total extent of ancient hybridization. However, future research should increasingly explore the macroevolutionary significance of this process, in terms of its impact on evolutionary trajectories (e.g. how does hybridization influence trait evolution or the generation of biodiversity over long time scales?), as well as how life history and ecological factors shape, or have shaped, the frequency of hybridization across geologic time and plant phylogeny. Finally, we consider the implications of ubiquitous ancient hybridization for how we conceptualize, analyze, and classify plant phylogeny.

    Networks, as opposed to bifurcating trees, represent more accurate representations of evolutionary history in many cases, although our ability to infer, visualize, and use networks for comparative analyses is highly limited. Developing improved methods for the generation, visualization, and use of networks represents a critical future direction for plant biology. Current classification systems also do not generally allow for the recognition of reticulate lineages, and our classifications themselves are largely based on evidence from the chloroplast genome. Updating plant classification to better reflect nuclear phylogenies, as well as considering whether and how to recognize hybridization in classification systems, will represent an important challenge for the plant systematics community.

  3. Summary

    Many plant leaves have two layers of photosynthetic tissue: the palisade and spongy mesophyll. Whereas palisade mesophyll consists of tightly packed columnar cells, the structure of spongy mesophyll is not well characterized and often treated as a random assemblage of irregularly shaped cells.

    Using micro‐computed tomography imaging, topological analysis, and a comparative physiological framework, we examined the structure of the spongy mesophyll in 40 species from 30 genera with laminar leaves and reticulate venation.

    A spectrum of spongy mesophyll diversity encompassed two dominant phenotypes: first, an ordered, honeycomblike tissue structure that emerged from the spatial coordination of multilobed cells, conforming to the physical principles of Euler’s law; and second, a less‐ordered, isotropic network of cells. Phenotypic variation was associated with transitions in cell size, cell packing density, mesophyll surface‐area‐to‐volume ratio, vein density, and maximum photosynthetic rate.

    These results show that simple principles may govern the organization and scaling of the spongy mesophyll in many plants and demonstrate the presence of structural patterns associated with leaf function. This improved understanding of mesophyll anatomy provides new opportunities for spatially explicit analyses of leaf development, physiology, and biomechanics.

  4. Abstract

    Persistent homology is a computationally intensive yet extremely powerful tool for Topological Data Analysis. Applying the tool to a potentially infinite sequence of data objects is a challenging task. For this reason, persistent homology and data stream mining have long been two important but disjoint areas of data science. The first computational model, which was recently introduced to bridge the gap between the two areas, is useful for detecting steady or gradual changes in data streams, such as certain genomic modifications during the evolution of species. However, that model is not suitable for applications that encounter abrupt changes of extremely short duration. This paper presents another model for computing persistent homology on streaming data that addresses this shortcoming of the previous work. The model is validated on the important real‐world application of network anomaly detection. It is shown that, in addition to detecting the occurrence of anomalies or attacks in computer networks, the proposed model is able to visually identify several types of traffic. Moreover, the model can accurately detect abrupt changes of extremely short as well as longer duration in the network traffic. These capabilities are not achievable by the previous model or by traditional data mining techniques.
  5. Abstract

    The role of hybridization and subsequent introgression has been demonstrated in an increasing number of species. Recently, Fontaine et al. (Science, 347, 2015, 1258524) conducted a phylogenomic analysis of six members of the Anopheles gambiae species complex. Their analysis revealed a reticulate evolutionary history and pointed to extensive introgression on all four autosomal arms. The study further highlighted the complex evolutionary signals that the co‐occurrence of incomplete lineage sorting (ILS) and introgression can give rise to in phylogenomic analyses. While tree‐based methodologies were used in the study, phylogenetic networks provide a more natural model to capture reticulate evolutionary histories. In this work, we reanalyse the Anopheles data using a recently devised framework that combines the multispecies coalescent with phylogenetic networks. This framework allows us to capture ILS and introgression simultaneously, and forms the basis for statistical methods for inferring reticulate evolutionary histories. The new analysis reveals a phylogenetic network with multiple hybridization events, some of which differ from those reported in the original study. To elucidate the extent and patterns of introgression across the genome, we devise a new method that quantifies the use of reticulation branches in the phylogenetic network by each genomic region. Applying the method to the mosquito data set reveals the evolutionary history of all the chromosomes. This study highlights the utility of ‘network thinking’ and the new insights it can uncover, in particular in phylogenomic analyses of large data sets with extensive gene tree incongruence.
