Title: Overcoming the pitfalls of categorizing continuous variables in ecology, evolution and behaviour
Many variables in biological research, from body size to life-history timing to environmental characteristics, are measured continuously (e.g. body mass in kilograms) but analysed as categories (e.g. large versus small), which can lower statistical power and change interpretation. We conducted a mini-review of 72 recent publications in six popular ecology, evolution and behaviour journals to quantify the prevalence of categorization. We then summarized commonly categorized metrics and simulated a dataset to demonstrate the drawbacks of categorization using common variables and realistic examples. We show that categorizing continuous variables is common (31% of publications reviewed). We also underscore that predictor variables can and should be collected and analysed continuously. Finally, we provide recommendations on how to keep variables continuous throughout the entire scientific process. Together, these pieces comprise an actionable guide to increasing statistical power and facilitating large synthesis studies by simply leaving continuous variables alone. Overcoming the pitfalls of categorizing continuous variables will allow ecologists, ethologists and evolutionary biologists to continue making trustworthy conclusions about natural processes, along with predictions about their responses to climate change and other environmental contexts.
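The power loss described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulated dataset or analysis: the sample size, slope, noise level, median-split rule and the Fisher z-test are all assumptions chosen for illustration, and the function names are hypothetical. It compares how often an effect is detected when a predictor is kept continuous versus split into "large" and "small" at its median.

```python
import math
import random
from statistics import NormalDist, mean, median


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_significant(x, y, alpha=0.05):
    """Approximate two-sided test of zero correlation (Fisher z-transform)."""
    z = math.atanh(pearson_r(x, y)) * math.sqrt(len(x) - 3)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def estimate_power(n=50, slope=0.3, n_sims=2000, seed=1):
    """Return (power with continuous predictor, power after a median split).

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    hits_continuous = hits_categorized = 0
    for _ in range(n_sims):
        # Continuous predictor (e.g. standardized body mass) and a noisy response.
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [slope * xi + rng.gauss(0, 1) for xi in x]
        # Categorize the same predictor as "large" (1) versus "small" (0).
        cut = median(x)
        x_split = [1.0 if xi > cut else 0.0 for xi in x]
        hits_continuous += is_significant(x, y)
        hits_categorized += is_significant(x_split, y)
    return hits_continuous / n_sims, hits_categorized / n_sims


if __name__ == "__main__":
    power_continuous, power_categorized = estimate_power()
    print(f"continuous predictor: {power_continuous:.2f}")
    print(f"median-split predictor: {power_categorized:.2f}")
```

Both analyses see exactly the same simulated individuals, so any gap between the two detection rates comes only from discarding the within-category variation in the predictor.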
Award ID(s):
2052497
PAR ID:
10637709
Author(s) / Creator(s):
;
Publisher / Repository:
Proceedings of the Royal Society B: Biological Sciences
Date Published:
Journal Name:
Proceedings of the Royal Society B: Biological Sciences
Volume:
291
Issue:
2032
ISSN:
0962-8452
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper focuses on the empirical derivation of regret bounds for mobile systems that can vary their locations within a spatiotemporally varying environment in order to maximize performance. In particular, the paper focuses on an airborne wind energy system, where the replacement of towers with tethers and a lifting body allows the system to adjust its altitude continuously, with the goal of operating at the altitude that maximizes net power production. While prior publications have proposed control strategies for this problem, often with favorable results based on simulations that use real wind data, they lack any theoretical or statistical performance guarantees. In the present work, we make use of a very large synthetic data set, identified through parameters from real wind data, to derive probabilistic bounds on the difference between optimal and actual performance, termed regret. The results are presented for a variety of control strategies, including a maximum probability of improvement, upper confidence bound, greedy, and constant altitude approaches.
  2. Past research has demonstrated that treatment effects frequently vary across sites (e.g., schools) and that such variation can be explained by site-level or individual-level variables (e.g., school size or gender). The purpose of this study is to develop a statistical framework and tools for the effective and efficient design of multisite randomized trials (MRTs) probing moderated treatment effects. The framework considers three core facets of such designs: (a) Level 1 and Level 2 moderators, (b) random and nonrandomly varying slopes (coefficients) of the treatment variable and its interaction terms with the moderators, and (c) binary and continuous moderators. We validate the formulas for calculating statistical power and the minimum detectable effect size difference with simulations, probe its sensitivity to model assumptions, execute the formulas in accessible software, demonstrate an application, and provide suggestions in designing MRTs probing moderated treatment effects. 
  3. We consider a supervised classification problem of categorizing e-commerce products based on just the words in the title. If done in real-time, the categorization can greatly benefit sellers by enabling them to offer immediate feedback. We present a deterministic algorithm by constructing weighted word co-occurrence graphs from the listing/item titles. We empirically evaluate this algorithm on two publicly available product listing datasets, Etsy and Amazon. Our method’s accuracy is comparable to that of a supervised classifier constructed using the fastText library. The inference time of our model is up to 2.9× faster than the fastText classifier and has small training times. The training and inference of our model scales well for big datasets performing large-scale classification on millions of listings. We perform a detailed analysis and provide insights into our method and the product categorization task. 
  4. Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we study differentially private tests of independence between a categorical and a continuous variable. We take as our starting point traditional nonparametric tests, which require no distributional assumption (e.g., normality) about the data distribution. We present private analogues of the Kruskal-Wallis, Mann-Whitney, and Wilcoxon signed-rank tests, as well as the parametric one-sample t-test. These tests use novel test statistics developed specifically for the private setting. We compare our tests to prior work, both on parametric and nonparametric tests. We find that in all cases our new nonparametric tests achieve large improvements in statistical power, even when the assumptions of parametric tests are met. 
  5. The prevalence and intensity of parasites in wild hosts varies across space and is a key determinant of infection risk in humans, domestic animals and threatened wildlife. Because the immune system serves as the primary barrier to infection, replication and transmission following exposure, we here consider the environmental drivers of immunity. Spatial variation in parasite pressure, abiotic and biotic conditions, and anthropogenic factors can all shape immunity across spatial scales. Identifying the most important spatial drivers of immunity could help pre‐empt infectious disease risks, especially in the context of how large‐scale factors such as urbanization affect defence by changing environmental conditions. We provide a synthesis of how to apply macroecological approaches to the study of ecoimmunology (i.e. macroimmunology). We first review spatial factors that could generate spatial variation in defence, highlighting the need for large‐scale studies that can differentiate competing environmental predictors of immunity and detailing contexts where this approach might be favoured over small‐scale experimental studies. We next conduct a systematic review of the literature to assess the frequency of spatial studies and to classify them according to taxa, immune measures, spatial replication and extent, and statistical methods. We review 210 ecoimmunology studies sampling multiple host populations. We show that whereas spatial approaches are relatively common, spatial replication is generally low and unlikely to provide sufficient environmental variation or power to differentiate competing spatial hypotheses. We also highlight statistical biases in macroimmunology, in that few studies characterize and account for spatial dependence statistically, potentially affecting inferences for the relationships between environmental conditions and immune defence. We use these findings to describe tools from geostatistics and spatial modelling that can improve inference about the associations between environmental and immunological variation. In particular, we emphasize exploratory tools that can guide spatial sampling and highlight the need for greater use of mixed‐effects models that account for spatial variability while also allowing researchers to account for both individual‐ and habitat‐level covariates. We finally discuss future research priorities for macroimmunology, including focusing on latitudinal gradients, range expansions and urbanization as being especially amenable to large‐scale spatial approaches. Methodologically, we highlight critical opportunities posed by assessing spatial variation in host tolerance, using metagenomics to quantify spatial variation in parasite pressure, coupling large‐scale field studies with small‐scale field experiments and longitudinal approaches, and applying statistical tools from macroecology and meta‐analysis to identify generalizable spatial patterns. Such work will facilitate scaling ecoimmunology from individual‐ to habitat‐level insights about the drivers of immune defence and help predict where environmental change may most alter infectious disease risk.