Analyzing sequential data is crucial in many domains, particularly due to the abundance of data collected from the Internet of Things paradigm. Time series classification, the task of categorizing sequential data, has gained prominence, with machine learning approaches demonstrating remarkable performance on public benchmark datasets. However, progress has primarily been in designing architectures for learning representations from raw data at fixed (or ideal) time scales, which can fail to generalize to longer sequences. This work introduces a \textit{compositional representation learning} approach trained on statistically coherent components extracted from sequential data. Based on a multi-scale change space, an unsupervised approach is proposed to segment the sequential data into chunks with similar statistical properties. A sequence-based encoder model is trained in a multi-task setting to learn compositional representations from these temporal components for time series classification. We demonstrate its effectiveness through extensive experiments on publicly available time series classification benchmarks. Evaluating the coherence of segmented components shows its competitive performance on the unsupervised segmentation task. 
                            Allotaxonometry and rank-turbulence divergence: a universal instrument for comparing complex systems
                        
                    
    
            Abstract Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Here, we introduce ‘allotaxonometry’ along with ‘rank-turbulence divergence’ (RTD), a tunable instrument for comparing any two ranked lists of components. We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution. We explore the performance of rank-turbulence divergence, which we view as an instrument of ‘type calculus’, for a series of distinct settings including language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles. We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry.
- Award ID(s): 2242829
- PAR ID: 10514102
- Publisher / Repository: Springer Nature
- Date Published:
- Journal Name: EPJ Data Science
- Volume: 12
- Issue: 1
- ISSN: 2193-1127
- Page Range / eLocation ID: 37
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            Abstract Postmating reproductive isolation can help maintain species boundaries when premating barriers to reproduction are incomplete. The strength and identity of postmating reproductive barriers are highly variable among diverging species, leading to questions about their genetic basis and evolutionary drivers. These questions have been tackled in model systems but are less often addressed with broader phylogenetic resolution. In this study we analyse patterns of genetic divergence alongside direct measures of postmating reproductive barriers in an overlooked group of sympatric species within the model monkeyflower genus, Mimulus. Within this Mimulus brevipes species group, we find substantial divergence among species, including a cryptic genetic lineage. However, rampant gene discordance and ancient signals of introgression suggest a complex history of divergence. In addition, we find multiple strong postmating barriers, including postmating prezygotic isolation, hybrid seed inviability and hybrid male sterility. M. brevipes and M. fremontii have substantial but incomplete postmating isolation. For all other tested species pairs, we find essentially complete postmating isolation. Hybrid seed inviability appears linked to differences in seed size, providing a window into possible developmental mechanisms underlying this reproductive barrier. While geographic proximity and incomplete mating isolation may have allowed gene flow within this group in the distant past, strong postmating reproductive barriers today have likely played a key role in preventing ongoing introgression. By producing foundational information about reproductive isolation and genomic divergence in this understudied group, we add new diversity and phylogenetic resolution to our understanding of the mechanisms of plant speciation. Abstract Hybrid seed inviability and other postmating reproductive barriers isolate species in Mimulus section Eunanus. 
Variation in seed size may help explain hybrid seed failure. Whole-genome sequencing indicates a complex history of divergence, including signals of ancient introgression and cryptic diversity.
- 
            Summary Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.
- 
            Abstract Geographic isolation is the primary driver of speciation in many vertebrate lineages. This trend is exemplified by North American darters, a clade of freshwater fishes where nearly all sister species pairs are allopatric and separated by millions of years of divergence. One of the only exceptions is the Lake Waccamaw endemic Etheostoma perlongum and its riverine sister species Etheostoma maculaticeps, which have no physical barriers to gene flow. Here we show that lacustrine speciation of E. perlongum is characterized by morphological and ecological divergence likely facilitated by a large chromosomal inversion. While E. perlongum is phylogenetically nested within the geographically widespread E. maculaticeps, there is a sharp genetic and morphological break coinciding with the lake–river boundary in the Waccamaw River system. Despite recent divergence, an active hybrid zone, and ongoing gene flow, analyses using a de novo reference genome reveal a 9 Mb chromosomal inversion with elevated divergence between E. perlongum and E. maculaticeps. This region exhibits striking synteny with known inversion supergenes in two distantly related fish lineages, suggesting deep evolutionary convergence of genomic architecture. Our results illustrate that rapid, ecological speciation with gene flow is possible even in lineages where geographic isolation is the dominant mechanism of speciation.
- 
            Abstract Dated, geo‐referenced museum specimens are a rich data source for reconstructing species' distribution and abundance patterns. However, museum records are potentially biased towards over‐representation of rare species, and it is unclear whether museum records can be used to estimate relative abundance in the field. We assembled 17 coupled field and museum datasets to quantitatively compare relative abundance estimates with the Dirichlet distribution. Collectively, these datasets comprise 73,039 museum records and 1,405,316 field observations of 2,240 species. Although museum records of rare species overestimated relative abundance by 1‐fold to over 100‐fold (median study = 9.0), the relative abundance of species estimated from museum occurrence records was strongly correlated with relative abundance estimated from standardized field surveys (r² range of 0.10–0.91, median study = 0.43). These analyses provide a justification for estimating species relative abundance with carefully curated museum occurrence records, which may allow for the detection of temporal or spatial shifts in the rank ordering of common and rare species.
 An official website of the United States government