Biotic specialization holds information about the assembly, evolution, and stability of biological communities. Partner availabilities can play an important role in enabling species interactions, where uneven partner availabilities can bias estimates of biotic specialization when using phylogenetic diversity indices. It is therefore important to account for partner availability when characterizing biotic specialization using phylogenies. We developed an index, phylogenetic structure of specialization (PSS), that avoids bias from uneven partner availabilities by uncoupling the null models for interaction frequency and phylogenetic distance. We incorporate the deviation between observed and random interaction frequencies as weights into the calculation of partner phylogenetic α‐diversity. To calculate the PSS index, we then compare observed partner phylogenetic α‐diversity to a null distribution generated by randomizing phylogenetic distances among the same number of partners. PSS quantifies the phylogenetic structure (i.e., clustered, overdispersed, or random) of the partners of a focal species. We show with simulations that the PSS index is not correlated with network properties, which allows comparisons across multiple systems. We also implemented PSS on empirical networks of host–parasite, avian seed‐dispersal, lichenized fungi–cyanobacteria, and hummingbird pollination interactions. Across these systems, a large proportion of taxa interact with phylogenetically random partners according to PSS, sometimes to a larger extent than detected with an existing method that does not account for partner availability. We also found that many taxa interact with phylogenetically clustered partners, while taxa with overdispersed partners were rare. We argue that species with phylogenetically overdispersed partners have often been misinterpreted as generalists when they should be considered specialists. 
Our results highlight the important role of randomness in shaping interaction networks, even in highly intimate symbioses, and provide a much‐needed quantitative framework to assess the role that evolutionary history and symbiotic specialization play in shaping patterns of biodiversity. PSS is available as an R package at https://github.com/cjpardodelahoz/pss.
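As a rough illustration of the two-step logic in this abstract (interaction-frequency weights first, then a null model that randomizes phylogenetic distances among the same number of partners), the Python sketch below computes a weighted mean pairwise distance and a standardized effect size. It is not the PSS package's implementation. The function names, the use of weighted mean pairwise distance as the α‐diversity measure, the label-shuffling null, and the example weights and distances are all assumptions made for illustration.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def weighted_mpd(partners, weights, pair_distance):
    """Weighted mean pairwise phylogenetic distance among a set of partners."""
    num = den = 0.0
    for a, b in combinations(partners, 2):
        w = weights[a] * weights[b]
        num += w * pair_distance(a, b)
        den += w
    return num / den if den else 0.0


def pss_like_score(partners, weights, dist, all_taxa, n_null=199, seed=0):
    """Standardized effect size of observed diversity against a distance-shuffling null.

    Negative values suggest clustered partners, positive values overdispersed ones.
    (Hypothetical helper, not the published index.)
    """
    rng = random.Random(seed)
    observed = weighted_mpd(partners, weights, lambda a, b: dist[frozenset((a, b))])
    null = []
    for _ in range(n_null):
        # Shuffle taxon labels on the distance matrix; partner count and weights stay fixed.
        shuffled = list(all_taxa)
        rng.shuffle(shuffled)
        relabel = dict(zip(all_taxa, shuffled))
        null.append(weighted_mpd(
            partners, weights,
            lambda a, b: dist[frozenset((relabel[a], relabel[b]))]))
    sd = pstdev(null)
    return (observed - mean(null)) / sd if sd else 0.0


# Toy phylogeny: two clades, short distances within a clade and long ones between clades.
taxa = ["A", "B", "C", "D", "E", "F"]
clade = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
dist = {frozenset(p): (2.0 if clade[p[0]] == clade[p[1]] else 10.0)
        for p in combinations(taxa, 2)}
# Assumed weights standing in for the observed-vs-random interaction frequency deviation.
weights = {"A": 1.5, "B": 1.0, "C": 0.5}
score = pss_like_score(["A", "B", "C"], weights, dist, taxa)
```

Here the focal species interacts only with partners from one clade, so the score comes out negative (clustered) under these assumptions.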
Clustering data is a challenging problem in unsupervised learning where there is no gold standard. Results depend on several factors, such as the selection of a clustering method, measures of dissimilarity, parameters, and the determination of the number of reliable groupings. Stability has become a valuable surrogate for performance and robustness that can give an investigator insight into the quality of a clustering and guidance on subsequent cluster prioritization. This work develops a framework for stability measurements that is based on resampling and out-of-bag (OB) estimation. Bootstrapping methods for cluster stability can be prone to overfitting in a setting that is analogous to poor delineation of test and training sets in supervised learning. Stability that relies on OB items from a resampling overcomes these issues and does not depend on a reference clustering for comparisons. Furthermore, OB stability can provide estimates at the level of the item, the cluster, and an overall summary, which has good interpretive value. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data and stability estimates on clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network.
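The resampling idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the bootcluster API or the paper's estimator: the tiny 1-D k-means, the function names, and the item-level score (how consistently an item is co-clustered with other items when both are out of bag) are assumptions chosen to keep the example self-contained.

```python
import random


def nearest(value, centers):
    """Index of the center closest to value."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means with centers initialized evenly across the data range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (2 * i + 1) / (2 * k) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def oob_item_stability(data, k, n_boot=50, seed=0):
    """Item-level stability computed only from items left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(data)
    pair_counts = {}  # (i, j) -> [times co-clustered, times both out of bag]
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(in_bag))
        centers = kmeans_1d([data[i] for i in in_bag], k)  # fit on in-bag items only
        labels = {i: nearest(data[i], centers) for i in oob}  # assign held-out items
        for a in range(len(oob)):
            for b in range(a + 1, len(oob)):
                rec = pair_counts.setdefault((oob[a], oob[b]), [0, 0])
                rec[0] += labels[oob[a]] == labels[oob[b]]
                rec[1] += 1
    scores = [[] for _ in range(n)]
    for (i, j), (same, total) in pair_counts.items():
        p = same / total
        agreement = max(p, 1 - p)  # 1.0 means always together or always apart
        scores[i].append(agreement)
        scores[j].append(agreement)
    return [sum(s) / len(s) if s else float("nan") for s in scores]


# Two well-separated groups should yield item stabilities close to 1.
data = [0.1 * i for i in range(10)] + [10 + 0.1 * i for i in range(10)]
stability = oob_item_stability(data, k=2)
```

Because the clustering is fit on in-bag items and scored on held-out items, no reference clustering is needed. Averaging the item scores within a cluster or over all items would give cluster-level and overall summaries in the same spirit.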
- NSF-PAR ID:
- 10379307
- Publisher / Repository:
- Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name:
- Statistical Analysis and Data Mining: The ASA Data Science Journal
- Volume:
- 15
- Issue:
- 6
- ISSN:
- 1932-1864
- Page Range / eLocation ID:
- p. 781-796
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract https://github.com/cjpardodelahoz/pss . -
Abstract Recent advances in hail trajectory modeling regularly produce datasets containing millions of hail trajectories. Because hail growth within a storm cannot be entirely separated from the structure of the trajectories producing it, a method to condense the multidimensionality of the trajectory information into a discrete number of features analyzable by humans is necessary. This article presents a three-dimensional trajectory clustering technique that is designed to group trajectories that have similar updraft-relative structures and orientations. The new technique is an application of a two-dimensional method common in the data mining field. Hail trajectories (or “parent” trajectories) are partitioned into segments before they are clustered using a modified version of the density-based spatial clustering of applications with noise (DBSCAN) method. Parent trajectories with segments that are members of at least two common clusters are then grouped into parent trajectory clusters before output. This multistep method has several advantages. Hail trajectories with structural similarities along only portions of their length, e.g., sourced from different locations around the updraft before converging to a common pathway, can still be grouped. However, the physical information inherent in the full length of the trajectory is retained, unlike methods that cluster trajectory segments alone. The conversion of trajectories to an updraft-relative space also allows trajectories separated in time to be clustered. Once the final output trajectory clusters are identified, a method for calculating a representative trajectory for each cluster is proposed. Cluster distributions of hailstone and environmental characteristics at each time step in the representative trajectory can also be calculated.
Significance Statement To understand how a storm produces large hail, we need to understand the paths that hailstones take in a storm when growing. We can simulate these paths using computer models. However, the millions of hailstones in a simulated storm create millions of paths, which is hard to analyze. This article describes a machine learning method that groups together hailstone paths based on how similar their three-dimensional structures look. It will let hail scientists analyze hailstone pathways in storms more easily, and therefore better understand how hail growth happens.
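The final grouping step (parent trajectories whose segments share at least two segment clusters) can be sketched as follows. This is a minimal sketch under stated assumptions: the segment cluster labels are taken as given (for example, from a DBSCAN run that is not shown), the label -1 is assumed to mark noise segments, and merging parents by connected components is an illustrative choice, not necessarily the rule used in the article. The function name `group_parents` is hypothetical.

```python
from itertools import combinations


def group_parents(segment_clusters, min_shared=2):
    """Group parent trajectories that share at least `min_shared` segment clusters.

    segment_clusters maps a parent id to the cluster labels of its segments.
    """
    # Drop noise segments (assumed label -1) and keep the distinct clusters per parent.
    members = {p: {c for c in labels if c != -1}
               for p, labels in segment_clusters.items()}
    root = {p: p for p in members}

    def find(x):
        # Union-find lookup with path compression.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for a, b in combinations(members, 2):
        if len(members[a] & members[b]) >= min_shared:
            root[find(a)] = find(b)

    groups = {}
    for p in members:
        groups.setdefault(find(p), set()).add(p)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# t1 and t2 share clusters 1 and 2; t3 shares only cluster 2 with them; t4 shares none.
example = {"t1": [0, 1, 2], "t2": [1, 2, -1], "t3": [2, 3], "t4": [5, -1]}
parent_groups = group_parents(example)
# parent_groups == [["t1", "t2"], ["t3"], ["t4"]]
```

Requiring two shared clusters rather than one means that trajectories which merely cross a common region are not grouped, while those that follow a common pathway for part of their length are.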
-
Abstract Subspace clustering is the unsupervised grouping of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles of the $K$-subspace (KSS) algorithm via the evidence accumulation clustering framework. Our algorithm, referred to as ensemble $K$-subspaces (EKSS), forms a co-association matrix whose $(i,j)$th entry is the number of times points $i$ and $j$ are clustered together by several runs of KSS with random initializations. We prove general recovery guarantees for any algorithm that forms an affinity matrix with entries close to a monotonic transformation of pairwise absolute inner products. We then show that a specific instance of EKSS results in an affinity matrix with entries of this form, and hence our proposed algorithm can provably recover subspaces under similar conditions to state-of-the-art algorithms. The finding is, to the best of our knowledge, the first recovery guarantee for evidence accumulation clustering and for KSS variants. We show on synthetic data that our method performs well in the traditionally challenging settings of subspaces with large intersection, subspaces with small principal angles and noisy data. Finally, we evaluate our algorithm on six common benchmark datasets and show that unlike existing methods, EKSS achieves excellent empirical performance when there are both a small and large number of points per subspace.
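The co-association matrix described in this abstract is easy to illustrate. The sketch below assumes the label vectors from several clustering runs are already available (the KSS runs themselves are not implemented here), and the function name `co_association` is a hypothetical choice.

```python
def co_association(label_runs):
    """Count, for every pair of points, how many runs placed them in the same cluster."""
    n = len(label_runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts


# Three runs over four points; cluster ids differ between runs, which does not matter
# because only "same cluster or not" is counted.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
A = co_association(runs)
# A[0][1] == 2, A[2][3] == 3, A[1][2] == 1, A[0][3] == 0
```

The resulting matrix can serve as an affinity matrix for a final clustering step, which is the general evidence accumulation idea the abstract refers to.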
-
Abstract This exploratory paper highlights how problem‐based learning (PBL) provided the pedagogical framework used to design and interpret learning analytics from Crystal Island: EcoJourneys, a collaborative game‐based learning environment centred on supporting science inquiry. In Crystal Island: EcoJourneys, students work in teams of four, investigate the problem individually and then utilize a brainstorming board, an in‐game PBL whiteboard that structured the collaborative inquiry process. The paper addresses a central question: how can PBL support the interpretation of the observed patterns in individual actions and collaborative interactions in the collaborative game‐based learning environment? Drawing on a mixed method approach, we first analyzed students' pre‐ and post‐test results to determine if there were learning gains. We then used principal component analysis (PCA) to describe the patterns in game interaction data and clustered students based on the PCA. Based on the pre‐ and post‐test results and PCA clusters, we used interaction analysis to understand how collaborative interactions unfolded across selected groups. Results showed that students learned the targeted content after engaging with the game‐based learning environment. Clusters based on the PCA revealed four main ways of engaging in the game‐based learning environment: students engaged in low to moderate self‐directed actions with (1) high and (2) moderate collaborative sense‐making actions, (3) low self‐directed with low collaborative sense‐making actions and (4) high self‐directed actions with low collaborative sense‐making actions. Qualitative interaction analysis revealed that a key difference among four groups in each cluster was the nature of verbal student discourse: students in the low to moderate self‐directed and high collaborative sense‐making cluster actively initiated discussions and integrated information they learned to the problem, whereas students in the other clusters required more support. These findings have implications for designing adaptive support that responds to students' interactions with in‐game activities.
Practitioner notes
What is already known about this topic
Learning analytic methods have been effective for understanding student learning interactions for the purposes of assessment, profiling student behaviour and the effectiveness of interventions.
However, the interpretation of analytics from these diverse data sets is not always grounded in theory, and the challenges of interpreting student data are further compounded in collaborative inquiry settings, where students work in groups to solve a problem.
What this paper adds
Problem‐based learning as a pedagogical framework allowed for the design to focus on individual and collaborative actions in a game‐based learning environment and, in turn, informed the interpretation of game‐based analytics as they relate to students' self‐directed learning in their individual investigations and collaborative inquiry discussions.
The combination of principal component analysis and qualitative interaction analysis was critical in understanding the nuances of student collaborative inquiry.
Implications for practice and/or policy
Self‐directed actions in individual investigations are critical steps to collaborative inquiry. However, students may need to be encouraged to engage in these actions.
Clustering student data can inform which scaffolds can be delivered to support both self‐directed learning and collaborative inquiry interactions.
All students can engage in knowledge‐integration discourse, but some students may need more direct support from teachers to achieve this.
-
Abstract Biodiversity studies rely heavily on estimates of species' distributions often obtained through ecological niche modelling. Numerous software packages exist that allow users to model ecological niches using machine learning and statistical methods. However, no existing package with a graphical user interface allows users to perform model calibration and selection based on convex forms such as ellipsoids, which may match fundamental ecological niche shapes better, incorporating tools for exploring, modelling, and evaluating niches and distributions that are intuitive for both novice and proficient users.
Here we describe an R package, Niche Tool Box (ntbox), that allows users to conduct all processing steps involved in ecological niche modelling: downloading and curating occurrence data, obtaining and transforming environmental data layers, selecting environmental variables, exploring relationships between geographic and environmental spaces, calibrating and selecting ellipsoid models, evaluating models using binomial and partial ROC tests, assessing extrapolation risk, and performing geographic information system operations via a graphical user interface. A summary of the entire workflow is produced for use as a stand‐alone algorithm or as part of research reports.
The method is explained in detail and tested via modelling the threatened feline species Leopardus wiedii. Georeferenced occurrence data for this species are queried to display both point occurrences and the IUCN extent of occurrence polygon (IUCN, 2007). This information is used to illustrate tools available for accessing, processing and exploring biodiversity data (e.g. number of occurrences and chronology of collecting) and transforming environmental data (e.g. a summary PCA for 19 bioclimatic layers). Visualizations of three‐dimensional ecological niches modelled as minimum volume ellipsoids are developed with ancillary statistics. This niche model is then projected to geographic space, to represent a corresponding potential suitability map.
ntbox provides a fast and straightforward means by which to retrieve and manipulate occurrence and environmental data, which can then be used in model calibration, projection and evaluation for assessing distributions of species in geographic space and their corresponding environmental combinations.
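To make the ellipsoid idea concrete, the Python sketch below fits a two-variable ellipsoid from the sample mean and covariance of occurrence points and scores suitability from the Mahalanobis distance to the centroid. This is a simplified stand-in, not ntbox code and not a minimum volume ellipsoid: the covariance-based fit, the exponential suitability transform, the function names, and the temperature and precipitation values are all assumptions for illustration.

```python
import math


def fit_ellipsoid_2d(points):
    """Return the centroid and inverse covariance matrix of 2-D environmental values."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be non-zero (points not all on one line)
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inverse


def suitability(point, centroid, inverse):
    """Suitability in (0, 1] that decays with squared Mahalanobis distance from the centroid."""
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    d2 = (dx * (inverse[0][0] * dx + inverse[0][1] * dy)
          + dy * (inverse[1][0] * dx + inverse[1][1] * dy))
    return math.exp(-0.5 * d2)


# Made-up occurrences: (mean annual temperature in deg C, annual precipitation in mm).
occurrences = [(20.0, 1000.0), (22.0, 1200.0), (21.0, 1100.0),
               (19.0, 900.0), (23.0, 1150.0), (20.0, 1050.0)]
centroid, inverse = fit_ellipsoid_2d(occurrences)
near = suitability((21.0, 1100.0), centroid, inverse)
far = suitability((30.0, 2000.0), centroid, inverse)
```

Evaluating `suitability` over every cell of a pair of environmental rasters would give the kind of geographic projection the abstract describes, with values highest near the niche centroid.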