Title: phylogatR : Phylogeographic data aggregation and repurposing
Abstract: Patterns of genetic diversity within species contain information about the history of that species, including how it has responded to historical climate change and how easily the organism is able to disperse across its habitat. More than 40,000 phylogeographic and population genetic investigations have been published to date, each collecting genetic data from hundreds of samples. Despite these millions of data points, meta‐analyses are challenging because the synthesis of results across hundreds of studies, each using different methods and forms of analysis, is a daunting and time‐consuming task. It is more efficient to proceed by repurposing existing data and using automated data analysis. To facilitate data repurposing, we created a database (phylogatR) that aggregates data from different sources and conducts automated multiple sequence alignments and data curation to provide users with nearly ready‐to‐analyse sets of data for thousands of species. Two types of scientific research will be made easier by phylogatR: large meta‐analyses of thousands of species that can address classic questions in evolutionary biology and ecology, and student‐ or citizen‐science‐based investigations that will introduce a broad range of people to the analysis of genetic data. phylogatR enhances the value of existing data via the creation of software and web‐based tools that enable these data to be recycled and reanalysed and increase accessibility to big data for research laboratories and classroom instructors with limited computational expertise and resources.
Award ID(s):
1911293 1910623
PAR ID:
10373305
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Molecular Ecology Resources
Volume:
22
Issue:
8
ISSN:
1755-098X
Page Range / eLocation ID:
p. 2830-2842
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Genetic diversity influences the evolutionary potential of forest trees under changing environmental conditions, thus indirectly the ecosystem services that forests provide. European beech ( Fagus sylvatica L.) is a dominant European forest tree species that increasingly suffers from climate change-related die-back. Here, we conducted a systematic literature review of neutral genetic diversity in European beech and created a meta-data set of expected heterozygosity ( He ) from all past studies providing nuclear microsatellite data. We propose a novel approach, based on population genetic theory and a min–max scaling to make past studies comparable. Using a new microsatellite data set with unprecedented geographic coverage and various re-sampling schemes to mimic common sampling biases, we show the potential and limitations of the scaling approach. The scaled meta-dataset reveals the expected trend of decreasing genetic diversity from glacial refugia across the species range and also supports the hypothesis that different lineages met and admixed north of the European mountain ranges. As a result, we present a map of genetic diversity across the range of European beech which could help to identify seed source populations harboring greater diversity and guide sampling strategies for future genome-wide and functional investigations of genetic variation. Our approach illustrates how to combine information from several nuclear microsatellite data sets to describe patterns of genetic diversity extending beyond the geographic scale or mean number of loci used in each individual study, and thus is a proof-of-concept for synthesizing knowledge from existing studies also in other species. 
  2. Abstract The genomics revolution continues to change how ecologists and evolutionary biologists study the evolution and maintenance of biodiversity. It is now easier than ever to generate large molecular data sets consisting of hundreds to thousands of independently evolving nuclear loci to estimate a suite of evolutionary and demographic parameters. However, any inferences will be incomplete or inaccurate if incorrect taxonomic identities are perpetuated throughout the analytical pipeline. Due to decades of research and comprehensive online databases, sequencing and analysis of mitochondrial DNA (mtDNA), chloroplast DNA (cpDNA) and select nuclear genes can provide researchers with a cost‐effective and simple means to verify the species identity of samples prior to subsequent phylogeographic and population genomic analysis. The addition of these sequences to genomic studies can also shed light on other important evolutionary questions such as explanations for gene tree‐species tree discordance, species limits, sex‐biased dispersal patterns, adaptation, and mtDNA introgression. Although the mtDNA and cpDNA genomes often should not be used exclusively to make historical inferences given their well‐known limitations, the addition of these data to modern genomic studies adds little cost and effort while simultaneously providing a wealth of useful data that can have significant implications for both basic and applied research. 
  3. Abstract Plants display a range of temporal patterns of inter‐annual reproduction, from relatively constant seed production to “mast seeding,” the synchronized and highly variable interannual seed production of plants within a population. Previous efforts have compiled global records of seed production in long‐lived plants to gain insight into seed production, forest and animal population dynamics, and the effects of global change on masting. Existing datasets focus on seed production dynamics at the population scale but are limited in their ability to examine community‐level mast seeding dynamics across different plant species at the continental scale. We harmonized decades of plant reproduction data for 141 woody plant species across nine Long‐Term Ecological Research (LTER) or long‐term ecological monitoring sites from a wide range of habitats across the United States. Plant reproduction data are reported annually between 1957 and 2021 and based on either seed traps or seed and/or cone counts on individual trees. A wide range of woody plant species including trees, shrubs, and lianas are represented within sites allowing for direct community‐level comparisons among species. We share code for filtering of data that enables the comparison of plot and individual tree data across sites. For each species, we compiled relevant life history attributes (e.g., seed mass, dispersal syndrome, seed longevity, sexual system) that may serve as important predictors of mast seeding in future analyses. To aid in phylogenetically informed analyses, we also share a phylogeny and phylogenetic distance matrix for all species in the dataset. These data can be used to investigate continent‐scale ecological properties of seed production, including individual and population variability, synchrony within and across species, and how these properties of seed production vary in relation to plant species traits and environmental conditions. 
In addition, these data can be used to assess how annual variability in seed production is associated with climate conditions and how that varies across populations, species, and regions. The dataset is released under a CC0 1.0 Universal public domain license. 
  4. Koepfli, Klaus-Peter (Ed.)
    Abstract Bison are an icon of the American West and an ecologically, commercially, and culturally important species. Despite numbering in the hundreds of thousands today, conservation concerns remain for the species, including the impact on genetic diversity of a severe bottleneck around the turn of the 20th century and genetic introgression from domestic cattle. Genetic diversity and admixture are best evaluated at genome-wide scale, for which a high-quality reference is necessary. Here, we use trio binning of long reads from a bison–Simmental cattle (Bos taurus taurus) male F1 hybrid to sequence and assemble the genome of the American plains bison (Bison bison bison). The male haplotype genome is chromosome-scale, with a total length of 2.65 Gb across 775 scaffolds (839 contigs) and a scaffold N50 of 87.8 Mb. Our bison genome is ~13× more contiguous overall and ~3400× more contiguous at the contig level than the current bison reference genome. The bison genome sequence presented here (ARS-UCSC_bison1.0) will enable new research into the evolutionary history of this iconic megafauna species and provide a new tool for the management of bison populations in federal and commercial herds. 
  5. This work introduces TrialSieve, a novel framework for biomedical information extraction that enhances clinical meta-analysis and drug repurposing. By extending traditional PICO (Patient, Intervention, Comparison, Outcome) methodologies, TrialSieve incorporates hierarchical, treatment group-based graphs, enabling more comprehensive and quantitative comparisons of clinical outcomes. TrialSieve was used to annotate 1609 PubMed abstracts, yielding 170,557 annotations and 52,638 final spans across 20 unique annotation categories that capture a diverse range of biomedical entities relevant to systematic reviews and meta-analyses. The performance (accuracy, precision, recall, F1-score) of four natural-language processing (NLP) models (BioLinkBERT, BioBERT, KRISSBERT, PubMedBERT) and the large language model (LLM) GPT-4o was evaluated using the human-annotated TrialSieve dataset. BioLinkBERT had the best accuracy (0.875) and recall (0.679) for biomedical entity labeling, whereas PubMedBERT had the best precision (0.614) and F1-score (0.639). Error analysis showed that NLP models trained on noisy, human-annotated data can match or, in most cases, surpass human performance. This finding highlights the feasibility of fully automating biomedical information extraction, even when relying on imperfectly annotated datasets. An annotator user study (n = 39) revealed significant (p < 0.05) gains in efficiency and human annotation accuracy with the unique TrialSieve tree-based annotation approach. In summary, TrialSieve provides a foundation to improve automated biomedical information extraction for frontend clinical research. 