Title: Importance of timely metadata curation to the global surveillance of genetic diversity
Abstract

Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome‐scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 848 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as age of the data set increased. There was a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. 
This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data‐sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever.
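The yearly decay figures above can be made concrete with a small worked calculation. This is only an illustrative sketch: the abstract does not say whether the reported decreases apply to a probability or to odds, so the code assumes, purely for illustration, a constant proportional decline that compounds each year. The function name `fraction_remaining` and the chosen time horizons are hypothetical; the rates and the counts 440 and 561 are taken from the abstract.

```python
def fraction_remaining(yearly_decrease: float, years: int) -> float:
    """Fraction of the starting availability left after `years`.

    Assumption (not stated in the abstract): the yearly decrease is a
    constant proportional decline that compounds from year to year.
    """
    return (1.0 - yearly_decrease) ** years


# Share of wild-population data sets with restored metadata (440 of 561).
restored_share = 440 / 561

# Yearly decreases reported in the abstract.
rates = {
    "papers or repositories": 0.135,
    "authors only (upper bound)": 0.22,
}

if __name__ == "__main__":
    print(f"Restored share: {restored_share:.1%}")
    for source, rate in rates.items():
        for years in (1, 5, 10):
            left = fraction_remaining(rate, years)
            print(f"{source}: {left:.1%} of initial availability after {years} y")
```

Under this compounding assumption, availability from papers or repositories would fall to a little under half of its starting level after five years, and the author-only route would fall faster still.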

 
Award ID(s):
1764316
NSF-PAR ID:
10401185
Author(s) / Creator(s):
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Conservation Biology
Volume:
37
Issue:
4
ISSN:
0888-8892
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Genomic data are being produced and archived at a prodigious rate, and current studies could become historical baselines for future global genetic diversity analyses and monitoring programs. However, when we evaluated the potential utility of genomic data from wild and domesticated eukaryote species in the world’s largest genomic data repository, we found that most archived genomic datasets (87%) lacked the spatiotemporal metadata necessary for genetic biodiversity surveillance. Labor-intensive scouring of a subset of published papers yielded geospatial coordinates and collection years for only 39% (51% if place names were considered) of these genomic datasets. Streamlined data input processes, updated metadata deposition policies, and enhanced scientific community awareness are urgently needed to preserve these irreplaceable records of today’s genetic biodiversity and to plug the growing metadata gap. 
  2. Abstract

    Understanding how genetic diversity is distributed across spatiotemporal scales in species of conservation or management concern is critical for identifying large‐scale mechanisms affecting local conservation status and implementing large‐scale biodiversity monitoring programmes. However, cross‐scale surveys of genetic diversity are often impractical within single studies, and combining datasets to increase spatiotemporal coverage is frequently impeded by using different sets of molecular markers. Recently developed molecular tools make surveys based on standardized single‐nucleotide polymorphism (SNP) panels more feasible than ever, but require existing genomic information. Here, we conduct the first survey of genome‐wide SNPs across the native range of brook trout (Salvelinus fontinalis), a cold‐adapted species that has been the focus of considerable conservation and management effort across eastern North America. Our dataset can be leveraged to easily design SNP panels that allow datasets to be combined for large‐scale analyses. We performed restriction site‐associated DNA sequencing for wild brook trout from 82 locations spanning much of the native range and domestic brook trout from 24 hatchery strains used in stocking efforts. We identified over 24,000 SNPs distributed throughout the brook trout genome. We explored the ability of these SNPs to resolve relationships across spatial scales, including population structure and hatchery admixture. Our dataset captures a wide spectrum of genetic diversity in native brook trout, offering a valuable resource for developing SNP panels. We highlight potential applications of this resource with the goal of increasing the integration of genomic information into decision‐making for brook trout and other species of conservation or management concern.

     
  3. Nielsen, Rasmus (Ed.)
    Abstract

    Drosophila melanogaster is a leading model in population genetics and genomics, and a growing number of whole-genome data sets from natural populations of this species have been published over the last years. A major challenge is the integration of disparate data sets, often generated using different sequencing technologies and bioinformatic pipelines, which hampers our ability to address questions about the evolution of this species. Here we address these issues by developing a bioinformatics pipeline that maps pooled sequencing (Pool-Seq) reads from D. melanogaster to a hologenome consisting of fly and symbiont genomes and estimates allele frequencies using either a heuristic (PoolSNP) or a probabilistic variant caller (SNAPE-pooled). We use this pipeline to generate the largest data repository of genomic data available for D. melanogaster to date, encompassing 271 previously published and unpublished population samples from over 100 locations in >20 countries on four continents. Several of these locations have been sampled at different seasons across multiple years. This data set, which we call Drosophila Evolution over Space and Time (DEST), is coupled with sampling and environmental metadata. A web-based genome browser and web portal provide easy access to the SNP data set. We further provide guidelines on how to use Pool-Seq data for model-based demographic inference. Our aim is to provide this scalable platform as a community resource which can be easily extended via future efforts for an even more extensive cosmopolitan data set. Our resource will enable population geneticists to analyze spatiotemporal genetic patterns and evolutionary dynamics of D. melanogaster populations in unprecedented detail.
  4. Abstract

    Avoiding extinction in a rapidly changing environment often relies on a species’ ability to quickly adapt in the face of extreme selective pressures. In Panamá, two closely related harlequin frog species (Atelopus varius and Atelopus zeteki) are threatened with extinction due to the fungal pathogen Batrachochytrium dendrobatidis (Bd). Once thought to be nearly extirpated from Panamá, A. varius have recently been rediscovered in multiple localities across their historical range; however, A. zeteki are possibly extinct in the wild. By leveraging a unique collection of 186 Atelopus tissue samples collected before and after the Bd outbreak in Panama, we describe the genetics of persistence for these species on the brink of extinction. We sequenced the transcriptome and developed an exome‐capture assay to sequence the coding regions of the Atelopus genome. Using these genetic data, we evaluate the population genetic structure of historical A. varius and A. zeteki populations, describe changes in genetic diversity over time, assess the relationship between contemporary and historical individuals, and test the hypothesis that some A. varius populations have rapidly evolved to resist or tolerate Bd infection. We found a significant decrease in genetic diversity in contemporary (compared to historical) A. varius populations. We did not find strong evidence of directional allele frequency change or selection for Bd resistance genes, but we uncovered a set of candidate genes that warrant further study. Additionally, we found preliminary evidence of recent migration and gene flow in one of the largest persisting A. varius populations in Panamá, suggesting the potential for genetic rescue in this system. Finally, we propose that previous conservation units should be modified, as clear genetic breaks do not exist beyond the local population level. Our data lay the groundwork for genetically informed conservation and advance our understanding of how imperiled species might be rescued from extinction.

     
  5. Abstract

    Stage‐based demographic methods, such as matrix population models (MPMs), are powerful tools used to address a broad range of fundamental questions in ecology, evolutionary biology and conservation science. Accordingly, MPMs now exist for over 3000 species worldwide. These data are being digitised as an ongoing process and periodically released into two large open‐access online repositories: the COMPADRE Plant Matrix Database and the COMADRE Animal Matrix Database. During the last decade, data archiving and curation of COMPADRE and COMADRE, and subsequent comparative research, have revealed pronounced variation in how MPMs are parameterized and reported.

    Here, we summarise current issues related to the parameterisation and reporting of MPMs that arise most frequently and outline how they affect MPM construction, analysis, and interpretation. To quantify variation in how MPMs are reported, we present results from a survey identifying key aspects of MPMs that are frequently unreported in manuscripts. We then screen COMPADRE and COMADRE to quantify how often key pieces of information are omitted from manuscripts using MPMs.

    Over 80% of surveyed researchers (n = 60) state a clear benefit to adopting more standardised methodologies for reporting MPMs. Furthermore, over 85% of the 300 MPMs assessed from COMPADRE and COMADRE omitted one or more elements that are key to their accurate interpretation. Based on these insights, we identify fundamental issues that can arise from MPM construction and communication and provide suggestions to improve clarity, reproducibility and future research utilising MPMs and their required metadata. To fortify reproducibility and empower researchers to take full advantage of their demographic data, we introduce a standardised protocol to present MPMs in publications. This standard is linked to www.compadre‐db.org, so that authors wishing to archive their MPMs can do so prior to submission of publications, following examples from other open‐access repositories such as DRYAD, Figshare and Zenodo.

    Combining and standardising MPMs parameterized from populations around the globe and across the tree of life opens up powerful research opportunities in evolutionary biology, ecology and conservation research. However, this potential can only be fully realised by adopting standardised methods to ensure reproducibility.

     