skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Building a global genomics observatory: Using GEOME (the Genomic Observatories Metadatabase) to expedite and improve deposition and retrieval of genetic data and metadata for biodiversity research
Abstract Genetic data represent a relatively new frontier for our understanding of global biodiversity. Ideally, such data should include both organismal DNA‐based genotypes and the ecological context where the organisms were sampled. Yet most tools and standards for data deposition focus exclusively either on genetic or ecological attributes. The Genomic Observatories Metadatabase (GEOME: geome‐db.org) provides an intuitive solution for maintaining links between genetic data sets stored by the International Nucleotide Sequence Database Collaboration (INSDC) and their associated ecological metadata. GEOME facilitates the deposition of raw genetic data to INSDCs sequence read archive (SRA) while maintaining persistent links to standards‐compliant ecological metadata held in the GEOME database. This approach facilitates findable, accessible, interoperable and reusable data archival practices. Moreover, GEOME enables data management solutions for large collaborative groups and expedites batch retrieval of genetic data from the SRA. The article that follows describes how GEOME can enable genuinely open data workflows for researchers in the field of molecular ecology.  more » « less
Award ID(s):
1756316
PAR ID:
10454275
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Molecular Ecology Resources
Volume:
20
Issue:
6
ISSN:
1755-098X
Page Range / eLocation ID:
p. 1458-1469
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract We present a draft Minimum Information About Geospatial Information System (MIAGIS) standard for facilitating public deposition of geospatial information system (GIS) datasets that follows the FAIR (Findable, Accessible, Interoperable and Reusable) principles. The draft MIAGIS standard includes a deposition directory structure and a minimum javascript object notation (JSON) metadata formatted file that is designed to capture critical metadata describing GIS layers and maps as well as their sources of data and methods of generation. The associated miagis Python package facilitates the creation of this MIAGIS metadata file and directly supports metadata extraction from both Esri JSON and GEOJSON GIS data formats plus options for extraction from user-specified JSON formats. We also demonstrate their use in crafting two example depositions of ArcGIS generated maps. We hope this draft MIAGIS standard along with the supporting miagis Python package will assist in establishing a GIS standards group that will develop the draft into a full standard for the wider GIS community as well as a future public repository for GIS datasets. 
    more » « less
  2. This project was designed to describe fine-scale population genetic differentiation of the stream salamander Gryinophilus porphyriticus among five study streams in the Hubbard Brook Experimental Forest. The data are paired with intensive capture-recapture data to assess direct fitness effects of individual genetic diversity, including effects of individual multilocus heterozygosity on stage-specific survival probabilities. This dataset publishes a manifest of the genomic sequence reads submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). These samples are published at NCBI under the BioProject ID 1090913 (https://www.ncbi.nlm.nih.gov/bioproject/1090913). The tables here include sample metadata and the NCBI URLs to each sample. These data were gathered as part of the Hubbard Brook Ecosystem Study (HBES). The HBES is a collaborative effort at the Hubbard Brook Experimental Forest, which is operated and maintained by the USDA Forest Service, Northern Research Station. 
    more » « less
  3. Abstract Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome‐scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 848 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as age of the data set increased. There was a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data‐sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever. 
    more » « less
  4. Abstract SummarySequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share and download matching sequencing metadata. Availability and implementationThe MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter and download all metadata. MetaSeek source code, metadata scrapers and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/. 
    more » « less
  5. Abstract Over the last couple of decades, there has been a rapid growth in the number and scope of agricultural genetics, genomics and breeding databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as ‘databases’ throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets which requires data sharing, along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, respectively, conducted a Consortium-wide survey to assess the current status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data-sharing practices by AgBioData databases are in a fairly healthy state, but it is not clear whether this is true for all metadata and data types across all databases; and that, ontology use has not substantially changed since a similar survey was conducted in 2017. Based on our evaluation of the survey results, we recommend (i) providing training for database personnel in a specific data-sharing techniques, as well as in ontology use; (ii) further study on what metadata is shared, and how well it is shared among databases; (iii) promoting an understanding of data sharing and ontologies in the stakeholder community; (iv) improving data sharing and ontologies for specific phenotypic data types and formats; and (v) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards. Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means. Database URL https://www.agbiodata.org/databases 
    more » « less