Abstract Population ecology and biogeography applications often necessitate the transfer of models across spatial and/or temporal dimensions to make predictions outside the bounds of the data used for model fitting. However, ecological data are often spatiotemporally unbalanced such that the spatial or the temporal dimension tends to contain more data than the other. This unbalance frequently leads model transfers to become substitutions, which are predictions to a different dimension than the predictive model was built on. Despite the prevalence of substitutions in ecology, studies validating their performance and their underlying assumptions are scarce. Here, we present a case study demonstrating both space‐for‐time and time‐for‐space substitutions (TFSS) using emperor penguins (Aptenodytes forsteri) as the focal species. Using an abundance‐based species distribution model (aSDM) of adult emperor penguins in attendance during spring across 50 colonies, we predict long‐term annual fluctuations in fledgling abundance and breeding success at a single colony, Pointe Géologie. Subsequently, we construct statistical models from time series of extended counts on Pointe Géologie to predict average colony abundance distribution across 50 colonies. Our analysis reveals that the distance to nearest open water (NOW) exhibits the strongest association with both temporal and spatial data. Space‐for‐time substitution performance of the aSDM, as measured by the Pearson correlation coefficient, was 0.63 and 0.56 when predicting breeding success and fledgling abundance time series, respectively. Linear regression of fledgling abundance on NOW yields similar TFSS performance when predicting the abundance distribution of emperor penguin colonies, with a correlation coefficient of 0.58. We posit that such space–time equivalence arises because: (1) emperor penguin colonies conform to their existing fundamental niche; (2) there is not yet any environmental novelty when comparing the spatial versus temporal variation of distance to the nearest open water; and (3) models of more specific components of life histories, such as fledgling abundance, rather than total population abundance, are more transferable. Identifying these conditions empirically can enhance the qualitative validation of substitutions in cases where direct validation data are lacking.
Allotaxonometry and rank-turbulence divergence: a universal instrument for comparing complex systems
Abstract Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: Populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Here, we introduce ‘allotaxonometry’ along with ‘rank-turbulence divergence’ (RTD), a tunable instrument for comparing any two ranked lists of components. We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution. We explore the performance of rank-turbulence divergence, which we view as an instrument of ‘type calculus’, for a series of distinct settings including: Language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles. We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry.
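As a rough illustration of what a tunable, rank-based comparison of two lists might look like, the following Python sketch scores each component by how far its rank moves between two systems. The formula is an assumption about the general shape of such a divergence and is not stated in the abstract: it omits any normalization factor and any rule for components missing from one list, and the function name `rank_turbulence_contributions` and the toy word lists are hypothetical.

```python
from typing import Dict


def rank_turbulence_contributions(
    ranks1: Dict[str, float],
    ranks2: Dict[str, float],
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Unnormalized per-component contributions for components in both lists.

    Assumed form (not taken from the abstract):
        (alpha + 1) / alpha * |r1**-alpha - r2**-alpha| ** (1 / (alpha + 1))
    Small alpha weights changes among rare components more heavily;
    large alpha emphasizes changes near the top of the rankings.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive in this sketch")
    prefactor = (alpha + 1.0) / alpha
    exponent = 1.0 / (alpha + 1.0)
    shared = ranks1.keys() & ranks2.keys()  # components missing from one list are skipped
    return {
        t: prefactor * abs(ranks1[t] ** -alpha - ranks2[t] ** -alpha) ** exponent
        for t in shared
    }


# Hypothetical toy rankings (rank 1 = most frequent).
books = {"the": 1, "of": 2, "whale": 3}
tweets = {"the": 1, "whale": 2, "of": 3}

contributions = rank_turbulence_contributions(books, tweets, alpha=1.0)
# Order components by how much they contribute, largest first.
ordered = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
total = sum(contributions.values())
```

The ordered list corresponds loosely to the abstract's "ordered list of components according to divergence contribution"; varying `alpha` is the tuning knob in this sketch.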
- Award ID(s):
- 2242829
- PAR ID:
- 10514102
- Publisher / Repository:
- Springer Nature
- Date Published:
- Journal Name:
- EPJ Data Science
- Volume:
- 12
- Issue:
- 1
- ISSN:
- 2193-1127
- Page Range / eLocation ID:
- 37
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Analyzing sequential data is crucial in many domains, particularly due to the abundance of data collected from the Internet of Things paradigm. Time series classification, the task of categorizing sequential data, has gained prominence, with machine learning approaches demonstrating remarkable performance on public benchmark datasets. However, progress has primarily been in designing architectures for learning representations from raw data at fixed (or ideal) time scales, which can fail to generalize to longer sequences. This work introduces a compositional representation learning approach trained on statistically coherent components extracted from sequential data. Based on a multi-scale change space, an unsupervised approach is proposed to segment the sequential data into chunks with similar statistical properties. A sequence-based encoder model is trained in a multi-task setting to learn compositional representations from these temporal components for time series classification. We demonstrate its effectiveness through extensive experiments on publicly available time series classification benchmarks. Evaluating the coherence of segmented components shows its competitive performance on the unsupervised segmentation task.
-
Abstract Postmating reproductive isolation can help maintain species boundaries when premating barriers to reproduction are incomplete. The strength and identity of postmating reproductive barriers are highly variable among diverging species, leading to questions about their genetic basis and evolutionary drivers. These questions have been tackled in model systems but are less often addressed with broader phylogenetic resolution. In this study we analyse patterns of genetic divergence alongside direct measures of postmating reproductive barriers in an overlooked group of sympatric species within the model monkeyflower genus, Mimulus. Within this Mimulus brevipes species group, we find substantial divergence among species, including a cryptic genetic lineage. However, rampant gene discordance and ancient signals of introgression suggest a complex history of divergence. In addition, we find multiple strong postmating barriers, including postmating prezygotic isolation, hybrid seed inviability and hybrid male sterility. M. brevipes and M. fremontii have substantial but incomplete postmating isolation. For all other tested species pairs, we find essentially complete postmating isolation. Hybrid seed inviability appears linked to differences in seed size, providing a window into possible developmental mechanisms underlying this reproductive barrier. While geographic proximity and incomplete mating isolation may have allowed gene flow within this group in the distant past, strong postmating reproductive barriers today have likely played a key role in preventing ongoing introgression. By producing foundational information about reproductive isolation and genomic divergence in this understudied group, we add new diversity and phylogenetic resolution to our understanding of the mechanisms of plant speciation. Abstract Hybrid seed inviability and other postmating reproductive barriers isolate species in Mimulus section Eunanus. Variation in seed size may help explain hybrid seed failure. Whole-genome sequencing indicates a complex history of divergence, including signals of ancient introgression and cryptic diversity.
-
Summary Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.
-
Abstract Geographic isolation is the primary driver of speciation in many vertebrate lineages. This trend is exemplified by North American darters, a clade of freshwater fishes where nearly all sister species pairs are allopatric and separated by millions of years of divergence. One of the only exceptions is the Lake Waccamaw endemic Etheostoma perlongum and its riverine sister species Etheostoma maculaticeps, which have no physical barriers to gene flow. Here we show that lacustrine speciation of E. perlongum is characterized by morphological and ecological divergence likely facilitated by a large chromosomal inversion. While E. perlongum is phylogenetically nested within the geographically widespread E. maculaticeps, there is a sharp genetic and morphological break coinciding with the lake–river boundary in the Waccamaw River system. Despite recent divergence, an active hybrid zone, and ongoing gene flow, analyses using a de novo reference genome reveal a 9 Mb chromosomal inversion with elevated divergence between E. perlongum and E. maculaticeps. This region exhibits striking synteny with known inversion supergenes in two distantly related fish lineages, suggesting deep evolutionary convergence of genomic architecture. Our results illustrate that rapid, ecological speciation with gene flow is possible even in lineages where geographic isolation is the dominant mechanism of speciation.

