Title: Unifying community detection across scales from genomes to landscapes
Biodiversity science encompasses multiple disciplines and biological scales from molecules to landscapes. Nevertheless, biodiversity data are often analyzed separately with discipline‐specific methodologies, constraining resulting inferences to a single scale. To overcome this, we present a topic modeling framework to analyze community composition in cross‐disciplinary datasets, including those generated from metagenomics, metabolomics, field ecology and remote sensing. Using topic models, we demonstrate how community detection in different datasets can inform the conservation of interacting plants and herbivores. We show how topic models can identify members of molecular, organismal and landscape‐level communities that relate to wildlife health, from gut microbes to forage quality. We conclude with a future vision for how topic modeling can be used to design cross‐scale studies that promote a holistic approach to detect, monitor and manage biodiversity.
Award ID(s):
1826801 1757324 1738865
PAR ID:
10227912
Journal Name:
Oikos
ISSN:
0030-1299
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Monitoring the “health” of an ecological community is a critical component of conservation planning. We propose that aggregating intraspecific genetic variation across all species of an ecological community (Community Genetic Distribution; CGD) provides a new way to measure biodiversity that is unifying across taxa, economically scalable, and geographically transferable. Such community-scale data provides information about past dynamics that can unveil processes structuring contemporary biodiversity, and can identify communities that are resilient to perturbation. Using the CGD, high-throughput biodiversity genetic inventories (e.g. metabarcoding/eDNA) can be leveraged to identify the genetic signatures of pristine and disturbed systems. We show examples of the CGD from empirical systems, how it responds through space and time to human disturbance, and how it successfully recovers restoration and succession gradients from metabarcoding datasets with the goal of obtaining insight on community genetic health and developing indicator metrics which can identify communities that are resilient to perturbation. We outline ways in which the CGD complements and extends information in the suite of currently described essential biodiversity variables, and how it can contribute to the targets of the Kunming-Montreal Global Biodiversity Framework. 
  2. This paper discusses opportunities for developments in spatial clustering methods to help leverage broad scale community science data for building species distribution models (SDMs). SDMs are tools that inform the science and policy needed to mitigate the impacts of climate change on biodiversity. Community science data span spatial and temporal scales unachievable by expert surveys alone, but they lack the structure imposed in smaller scale studies to allow adjustments for observational biases. Spatial clustering approaches can construct the necessary structure after surveys have occurred, but more work is needed to ensure that they are effective for this purpose. In this proposal, we describe the role of spatial clustering for realizing the potential of large biodiversity datasets, how existing methods approach this problem, and ideas for future work. 
  3. Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identified 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of the statistical research from 1975 to 2015, from a text analysis perspective. 
  4. There is a growing body of research revealing that longitudinal passive sensing data from smartphones and wearable devices can capture daily behavior signals for human behavior modeling, such as depression detection. Most prior studies build and evaluate machine learning models using data collected from a single population. However, to ensure that a behavior model can work for a larger group of users, its generalizability needs to be verified on multiple datasets from different populations. We present the first work evaluating cross-dataset generalizability of longitudinal behavior models, using depression detection as an application. We collect multiple longitudinal passive mobile sensing datasets with over 500 users from two institutes over a two-year span, leading to four institute-year datasets. Using the datasets, we closely re-implement and evaluate nine prior depression detection algorithms. Our experiment reveals the lack of model generalizability of these methods. We also implement eight recently popular domain generalization algorithms from the machine learning community. Our results indicate that these methods also do not generalize well on our datasets, with barely any advantage over the naive baseline of guessing the majority. We then present two new algorithms with better generalizability. Our new algorithm, Reorder, significantly and consistently outperforms existing methods on most cross-dataset generalization setups. However, the overall advantage is incremental and still has great room for improvement. Our analysis reveals that individual differences (both within and between populations) may play the most important role in the cross-dataset generalization challenge. Finally, we provide an open-source benchmark platform, GLOBEM (short for Generalization of Longitudinal BEhavior Modeling), to consolidate all 19 algorithms. GLOBEM can support researchers in using, developing, and evaluating different longitudinal behavior modeling methods. We call for researchers' attention to model generalizability evaluation for future longitudinal human behavior modeling studies.
  5. Ruis, Andrew; Lee, Seung B. (Ed.)
    When text datasets are very large, manually coding line by line becomes impractical. As a result, researchers sometimes try to use machine learning algorithms to automatically code text data. One of the most popular algorithms is topic modeling. For a given text dataset, a topic model provides probability distributions of words for a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary coding for what proportion of a topic is in the document. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed in topics generated from topic modeling. The results show that (1) top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from human-generated codes appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
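The two outputs of a topic model described in item 5 (a word distribution per topic, and a topic proportion score per document) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the topic names, words, probabilities, and the helper functions `expected_word_distribution` and `top_keywords` are hypothetical and do not come from any of the listed studies. The sketch does not fit a model; it only shows how the two outputs combine under the usual mixture assumption, and how one word can rank among the top keywords of more than one topic.

```python
# Hypothetical toy values, chosen for illustration only (not from any study).
# Each topic is a probability distribution over words.
topic_word = {
    "topic_a": {"design": 0.5, "test": 0.3, "team": 0.2},
    "topic_b": {"design": 0.1, "test": 0.2, "team": 0.7},
}

# Each document gets a proportion score for every topic (scores sum to 1).
doc_topic = {"topic_a": 0.75, "topic_b": 0.25}


def expected_word_distribution(doc_topic, topic_word):
    """Mix the per-topic word distributions by the document's topic proportions."""
    mixed = {}
    for topic, weight in doc_topic.items():
        for word, prob in topic_word[topic].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


def top_keywords(word_probs, n=2):
    """Return the n highest-probability words of one topic."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:n]


mixed = expected_word_distribution(doc_topic, topic_word)
keywords = {topic: top_keywords(probs) for topic, probs in topic_word.items()}

# Words that rank among the top keywords of more than one topic.
shared = set(keywords["topic_a"]) & set(keywords["topic_b"])
```

In this toy setup the word "test" is a top keyword of both topics, which mirrors the kind of overlap the abstract reports when topics are read directly as codes.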