Abstract Structured population models are among the most widely used tools in ecology and evolution. Integral projection models (IPMs) use continuous representations of how survival, reproduction and growth change as functions of state variables such as size, requiring fewer parameters to be estimated than projection matrix models (PPMs). Yet almost all published IPMs make an important assumption: that size-dependent growth transitions are, or can be transformed to be, normally distributed. In fact, many organisms exhibit highly skewed size transitions. Small individuals can grow more than they can shrink, and large individuals may often shrink more dramatically than they can grow. The implications of such skew for inference from IPMs have not been explored, nor have general methods been developed to incorporate skewed size transitions into IPMs or to deal with other aspects of real growth rates, including bounds on possible growth or shrinkage.
Here, we develop a flexible approach to modelling skewed growth data using a modified beta regression model. We propose that sizes first be converted to a (0,1) interval by estimating size-dependent minimum and maximum sizes through quantile regression. Transformed data can then be modelled using beta regression with widely available statistical tools. We demonstrate the utility of this approach using demographic data for a long-lived plant, gorgonians and an epiphytic lichen. Specifically, we compare inferences of population parameters from discrete PPMs to those from IPMs that either assume normality or incorporate skew using beta regression or, alternatively, a skewed normal model.
The beta and skewed normal distributions accurately capture the mean, variance and skew of real growth distributions. Incorporating skewed growth into IPMs decreases population growth and estimated life span relative to IPMs that assume normally distributed growth, and more closely approximates the parameters of PPMs that do not assume a particular growth distribution. A bounded distribution, such as the beta, also avoids the eviction problem caused by predicting some growth outside the modelled size range.
Incorporating biologically relevant skew in growth data has important consequences for inference from IPMs. The approaches we outline here are flexible and easy to implement with existing statistical tools.
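The two-step recipe in this abstract (size-dependent bounds from quantile regression, then beta regression on the rescaled sizes) can be sketched in a few lines. The sketch below is illustrative only: the data are simulated, the variable names (z, z1) and the 1%/99% quantile levels are assumptions, and the beta regression is fitted by direct maximum likelihood rather than by the authors' own workflow.

```python
# Minimal sketch of the quantile-regression + beta-regression idea.
# Assumptions: simulated sizes, 1%/99% quantiles as size-dependent bounds,
# beta regression fitted by maximum likelihood (mean-precision form).
import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
z = rng.uniform(1.0, 10.0, 400)                 # size at time t (simulated)
z1 = z + rng.gamma(2.0, 0.4, 400) - 0.3         # right-skewed growth to t+1

X = sm.add_constant(z)

# Step 1: size-dependent lower/upper growth bounds via quantile regression
lo = sm.QuantReg(z1, X).fit(q=0.01).predict(X)
hi = sm.QuantReg(z1, X).fit(q=0.99).predict(X)

# Step 2: squeeze observed sizes onto the open interval (0, 1)
eps = 1e-3
y = np.clip((z1 - lo) / (hi - lo), eps, 1 - eps)

# Step 3: beta regression with mu = expit(b0 + b1*z) and precision phi = exp(g)
def nll(theta):
    b0, b1, g = theta
    mu, phi = expit(b0 + b1 * z), np.exp(g)
    return -beta_dist.logpdf(y, mu * phi, (1 - mu) * phi).sum()

fit = minimize(nll, x0=np.array([0.0, 0.0, 1.0]), method="Nelder-Mead")
print(fit.x)  # intercept, slope on size, log-precision
```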
This content will become publicly available on May 27, 2026
Evaluating methods for addressing skewness in clustering: a focus on generalized hyperbolic mixture models
In model-based clustering, the population is assumed to be a combination of sub-populations. Typically, each sub-population is modeled by a mixture model component, distributed according to a known probability distribution. Each component is considered a cluster. Two primary approaches have been used in the literature when clusters are skewed: (1) transforming the data within each cluster and applying a mixture of symmetric distributions to the transformed data, and (2) directly modeling each cluster using a skewed distribution. Among skewed distributions, the generalized hyperbolic distribution is notably flexible and includes many other known distributions as special or limiting cases. This paper achieves two goals. First, it extends the flexibility of transformation-based methods as outlined in approach (1) by employing a flexible symmetric generalized hyperbolic distribution to model each transformed cluster. This innovation results in the introduction of two new models, each derived from distinct within-cluster data transformations. Second, the paper benchmarks the approaches listed in (1) and (2) for handling skewness using both simulated and real data. The findings highlight the necessity of both approaches in varying contexts.
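A deliberately simplified illustration of the transformation idea behind approach (1) is sketched below: symmetrize skewed data, then fit a mixture of symmetric components. It is not the paper's method; it uses a single global Yeo-Johnson transform instead of within-cluster transformations, and Gaussian rather than generalized hyperbolic components, purely to show why skew matters for symmetric mixtures.

```python
# Simplified sketch: Gaussian mixture on raw skewed data vs. on
# power-transformed data. Global transform and Gaussian components are
# stand-ins for the paper's within-cluster transforms and GH components.
import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# two right-skewed clusters in 2-D
c1 = rng.gamma(2.0, 1.0, size=(300, 2))
c2 = rng.gamma(2.0, 1.0, size=(300, 2)) + [6.0, 6.0]
X = np.vstack([c1, c2])
truth = np.repeat([0, 1], 300)

# symmetric mixture fitted directly to the skewed data
raw_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# symmetrize first, then fit the symmetric mixture
Xt = PowerTransformer(method="yeo-johnson").fit_transform(X)
trans_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Xt)

print(adjusted_rand_score(truth, raw_labels),
      adjusted_rand_score(truth, trans_labels))
```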
- Award ID(s): 2209974
- PAR ID: 10621649
- Publisher / Repository: Taylor & Francis
- Date Published:
- Journal Name: Journal of Statistical Computation and Simulation
- ISSN: 0094-9655
- Page Range / eLocation ID: 1 to 16
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Context. The growing set of gravitational-wave sources is being used to measure the properties of the underlying astrophysical populations of compact objects, black holes, and neutron stars. Most of the detected systems are black hole binaries. While much has been learned about black holes by analyzing the latest LIGO-Virgo-KAGRA (LVK) catalog, GWTC-3, a measurement of the astrophysical distribution of the black hole spin orientations remains elusive. This is usually probed by measuring the cosine of the tilt angle (cos τ) between each black hole spin and the orbital angular momentum, with cos τ = +1 being perfect alignment. Aims. The LVK Collaboration has modeled the cos τ distribution as a mixture of an isotropic component and a Gaussian component with mean fixed at +1 and width measured from the data. We want to verify if the data require the existence of such a peak at cos τ = +1. Methods. We used various alternative models for the astrophysical tilt distribution and measured their parameters using the LVK GWTC-3 catalog. Results. We find that (a) augmenting the LVK model, such that the mean μ of the Gaussian is not fixed at +1, returns results that strongly depend on priors. If we allow μ > +1, then the resulting astrophysical cos τ distribution peaks at +1 and looks linear, rather than Gaussian. If we constrain −1 ≤ μ ≤ +1, the Gaussian component peaks at μ = 0.48 (+0.46/−0.99; median and 90% symmetric credible interval). Two other two-component mixture models yield cos τ distributions that either have a broad peak centered at 0.19 (+0.22/−0.18) or a plateau that spans the range [−0.5, +1], without a clear peak at +1. (b) All of the models we considered agree as to there being no excess of black hole tilts at around −1. (c) While yielding quite different posteriors, the models considered in this work have Bayesian evidences that are the same within error bars. Conclusions. We conclude that the current dataset is not sufficiently informative to draw any model-independent conclusions on the astrophysical distribution of spin tilts, except that there is no excess of spins with negatively aligned tilts.
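The mixture described in this abstract, an isotropic (uniform in cos τ) component plus a truncated Gaussian on [−1, 1], can be evaluated directly. In the sketch below the parameter names (zeta, mu, sigma) and the example values other than the quoted 0.48 median are illustrative assumptions, not the LVK parameterization.

```python
# Sketch of a spin-tilt mixture density: isotropic + truncated Gaussian.
# Parameter names and example values are illustrative assumptions.
import numpy as np
from scipy.stats import truncnorm

def cos_tilt_density(x, zeta, mu, sigma):
    """Mixture density for cos(tau) on [-1, 1]."""
    a, b = (-1.0 - mu) / sigma, (1.0 - mu) / sigma   # standardized bounds
    gaussian = truncnorm.pdf(x, a, b, loc=mu, scale=sigma)
    isotropic = 0.5 * np.ones_like(x)                # uniform on [-1, 1]
    return zeta * isotropic + (1.0 - zeta) * gaussian

x = np.linspace(-1.0, 1.0, 201)
p_pinned = cos_tilt_density(x, zeta=0.5, mu=1.0, sigma=0.5)   # mean fixed at +1
p_free = cos_tilt_density(x, zeta=0.5, mu=0.48, sigma=0.5)    # free-mean variant
```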
-
Abstract Identification of clusters of co‐expressed genes in transcriptomic data is a difficult task. Most algorithms used for this purpose can be classified into two broad categories: distance‐based or model‐based approaches. Distance‐based approaches typically utilize a distance function between pairs of data objects and group similar objects together into clusters. Model‐based approaches are based on using the mixture‐modeling framework. Compared to distance‐based approaches, model‐based approaches offer better interpretability because each cluster can be explicitly characterized in terms of the proposed model. However, these models present a particular difficulty in identifying a correct multivariate distribution that a mixture can be based upon. In this manuscript, we review some of the approaches used to select a distribution for the needed mixture model first. Then, we propose avoiding this problem altogether by using a nonparametric MSL (maximum smoothed likelihood) algorithm. This algorithm was proposed earlier in statistical literature but has not been, to the best of our knowledge, applied to transcriptomics data. The salient feature of this approach is that it avoids explicit specification of distributions of individual biological samples altogether, thus making the task of a practitioner easier. We performed both a simulation study and an application of the proposed algorithm to two different real datasets. When used on a real dataset, the algorithm produces a large number of biologically meaningful clusters and performs at least as well as several other mixture‐based algorithms commonly used for RNA‐seq data clustering. Our results also show that this algorithm is capable of uncovering clustering solutions that may go unnoticed by several other model‐based clustering algorithms. Our code is publicly available on Github at https://github.com/Matematikoi/non_parametric_clustering
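The following is a toy 1-D nonparametric EM in the same spirit as the smoothed-likelihood idea in this abstract: each component density is a weighted kernel density estimate rather than a member of a parametric family, so no within-cluster distribution has to be specified. It is a rough sketch inspired by that idea, not the MSL algorithm from the manuscript.

```python
# Toy nonparametric EM: component densities are weighted KDEs, so no
# parametric family is assumed for any cluster. Illustrative only.
import numpy as np
from scipy.stats import gaussian_kde

def np_em(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    resp = rng.dirichlet(np.ones(k), size=n)      # random soft assignments
    for _ in range(n_iter):
        # M-step: mixing weights and a weighted KDE for each component
        pi = resp.mean(axis=0)
        dens = np.column_stack([
            gaussian_kde(x, weights=resp[:, j])(x) for j in range(k)
        ])
        # E-step: update responsibilities from the current density estimates
        joint = pi * dens
        resp = joint / joint.sum(axis=1, keepdims=True)
    return pi, resp

# two skewed, overlapping groups
rng = np.random.default_rng(1)
x = np.concatenate([rng.gamma(2.0, 1.0, 300), 6 + rng.gamma(3.0, 0.8, 200)])
pi, resp = np_em(x, k=2)
labels = resp.argmax(axis=1)
```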
-
Abstract The observation of gravitational waves from multiple compact binary coalescences by the LIGO–Virgo–KAGRA detector networks has enabled us to infer the underlying distribution of compact binaries across a wide range of masses, spins, and redshifts. In light of the new features found in the mass spectrum of binary black holes and the uncertainty regarding binary formation models, nonparametric population inference has become increasingly popular. In this work, we develop a data-driven clustering framework that can identify features in the component mass distribution of compact binaries simultaneously with those in the corresponding redshift distribution, from gravitational-wave data in the presence of significant measurement uncertainties, while making very few assumptions about the functional form of these distributions. Our generalized model is capable of inferring correlations among various population properties, such as the redshift evolution of the shape of the mass distribution itself, in contrast to most existing nonparametric inference schemes. We test our model on simulated data and demonstrate the accuracy with which it can reconstruct the underlying distributions of component masses and redshifts. We also reanalyze public LIGO–Virgo–KAGRA data from events in GWTC-3 using our model and compare our results with those from some alternative parametric and nonparametric population inference approaches. Finally, we investigate the potential presence of correlations between mass and redshift in the population of binary black holes in GWTC-3 (those observed by the LIGO–Virgo–KAGRA detector network in their first three observing runs), without making any assumptions about the specific nature of these correlations.
-
Zhang, Aidong; Rangwala, Huzefa (Ed.) Zero-inflated, heavy-tailed spatiotemporal data is common across science and engineering, from climate science to meteorology and seismology. A central modeling objective in such settings is to forecast the intensity, frequency, and timing of extreme and non-extreme events; yet in the context of deep learning, this objective presents several key challenges. First, a deep learning framework applied to such data must unify a mixture of distributions characterizing the zero events, moderate events, and extreme events. Second, the framework must be capable of enforcing parameter constraints across each component of the mixture distribution. Finally, the framework must be flexible enough to accommodate for any changes in the threshold used to define an extreme event after training. To address these challenges, we propose Deep Extreme Mixture Model (DEMM), fusing a deep learning-based hurdle model with extreme value theory to enable point and distribution prediction of zero-inflated, heavy-tailed spatiotemporal variables. The framework enables users to dynamically set a threshold for defining extreme events at inference-time without the need for retraining. We present an extensive experimental analysis applying DEMM to precipitation forecasting, and observe significant improvements in point and distribution prediction. All code is available at https://github.com/andrewmcdonald27/DeepExtremeMixtureModel
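An illustrative sampler for the kind of zero-inflated, heavy-tailed variable this abstract describes is sketched below: a point mass at zero (the hurdle), a gamma bulk for moderate events, and a generalized Pareto tail above a threshold u. The distributions and parameter values are assumptions chosen for illustration; DEMM itself predicts such mixture components with a deep network, which is not reproduced here.

```python
# Standalone sketch of a hurdle + extreme-value mixture: zero / bulk / tail.
# All parameter values are illustrative assumptions, not DEMM's.
import numpy as np
from scipy.stats import gamma, genpareto

def sample_hurdle_evt(n, p_zero=0.6, p_tail=0.05, u=10.0,
                      bulk_shape=2.0, bulk_scale=2.0, tail_shape=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    nonzero = rng.random(n) >= p_zero              # hurdle: is the event nonzero?
    tail = nonzero & (rng.random(n) < p_tail)      # which nonzero events are extreme?
    bulk = nonzero & ~tail
    # moderate events: gamma bulk truncated to (0, u] via inverse-CDF sampling
    cdf_u = gamma.cdf(u, bulk_shape, scale=bulk_scale)
    x[bulk] = gamma.ppf(rng.uniform(0.0, cdf_u, size=bulk.sum()),
                        bulk_shape, scale=bulk_scale)
    # extreme events: threshold plus a generalized Pareto excess
    x[tail] = u + genpareto.rvs(tail_shape, scale=2.0, size=tail.sum(),
                                random_state=rng)
    return x

x = sample_hurdle_evt(5000)
print((x == 0).mean(), x.max())
```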