Title: Indexing Biotic Interactions in GBIF data
The Global Biodiversity Information Facility (GBIF 2022a) has indexed more than 2 billion occurrence records from 70,147 datasets. These datasets often include "hidden" biotic interaction data because biodiversity communities use the Darwin Core standard (DwC, Wieczorek et al. 2012) in different ways to document biotic interactions. In this study, we extracted biotic interactions from GBIF data using an approach similar to that employed by Global Biotic Interactions (GloBI; Poelen et al. 2014) and summarized the results. We aim to estimate how much interaction data is available in GBIF and to show that biotic interaction claims can be automatically found and extracted from it. Our results suggest that much can be gained from an increased focus on developing tools that help index and curate biotic interaction data in existing datasets. Combined with data standardization and best practices for sharing biotic interactions, such as the initiative on plant-pollinator interactions (Salim 2022), this approach can rapidly contribute to and meet open data principles (Wilkinson 2016). We used Preston (Elliott et al. 2020), open-source software that versions biodiversity datasets, to copy all GBIF-indexed datasets. The biodiversity data graph version (Poelen 2020) of the GBIF-indexed datasets used during this study contains 58,504 datasets in Darwin Core Archive (DwC-A) format, totaling 574,715,196 records. After retrieval and verification, the datasets were processed using Elton. Elton extracts biotic interaction data and supports 20+ existing file formats, including various types of data elements in DwC records. Elton also helps align interaction claims (e.g., host of, parasite of, associated with) to the Relations Ontology (RO, Mungall 2022), making it easier to discover interaction data across a heterogeneous collection of datasets. 
Using a specific mapping from interaction claims found in the DwC records to terms in RO, Elton found 30,167,984 potential records (with non-empty values for the scanned DwC terms) and 15,248,478 records with recognized interaction types. Taxonomic name validation was performed using Nomer, which maps input names to names found in a variety of taxonomic catalogs. We considered an interaction record valid only where the interaction type could be mapped to a term in RO and where Nomer found a valid name for both the source and target taxa. Based on the workflow described in Fig. 1, we found 7,947,822 valid interaction records (52% of the records with recognized interaction types). Most of them were generic interactions (interacts_with, 87.5%), but the remaining 12.5% (993,477 records) included host-parasite and plant-animal interactions. The majority of the interaction records found involved plants (78%), animals (14%) and fungi (6%). In conclusion, there are many biotic interactions embedded in existing datasets registered in large biodiversity data indexers and aggregators like iDigBio, GBIF, and BioCASE. We exposed these biotic interaction claims using the combined functionality of the biodiversity data tools Elton (for interaction data extraction), Preston (for reliable dataset tracking) and Nomer (for taxonomic name alignment). Nonetheless, the development of new vocabularies, standards and best practice guides would facilitate aggregation of interaction data, including the diversification of the GBIF data model (GBIF 2022b) for sharing biodiversity data beyond occurrence data. That is the aim of the TDWG Interest Group on Biological Interactions Data (TDWG 2022).
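The validity rule described above (a recognized interaction type plus resolved source and target names) can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not Elton's or Nomer's actual implementation: the mapping table, the `KNOWN_NAMES` set standing in for a taxonomic catalog lookup, and the example claims are all assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical mapping from verbatim interaction claims to RO-style labels.
# The real mapping used in the study is not reproduced here.
INTERACTION_TYPE_MAP: Dict[str, str] = {
    "host of": "host of",
    "parasite of": "parasite of",
    "associated with": "interacts with",
}

# Hypothetical stand-in for a name lookup against taxonomic catalogs.
KNOWN_NAMES = {"Ixodes scapularis", "Peromyscus leucopus", "Quercus alba"}


@dataclass
class Claim:
    source_name: str
    verbatim_type: str
    target_name: str


def resolve_type(verbatim: str) -> Optional[str]:
    """Return the mapped label, or None if the claim type is not recognized."""
    return INTERACTION_TYPE_MAP.get(verbatim.strip().lower())


def is_valid(claim: Claim) -> bool:
    """Valid only if the type maps to a term and both names are resolved."""
    return (
        resolve_type(claim.verbatim_type) is not None
        and claim.source_name in KNOWN_NAMES
        and claim.target_name in KNOWN_NAMES
    )


def summarize(claims: List[Claim]) -> Dict[str, int]:
    """Count potential, type-recognized, and fully valid records."""
    recognized = [c for c in claims if resolve_type(c.verbatim_type) is not None]
    valid = [c for c in recognized if is_valid(c)]
    return {
        "potential": len(claims),
        "recognized": len(recognized),
        "valid": len(valid),
    }


# Made-up example claims, for illustration only.
EXAMPLE_CLAIMS = [
    Claim("Ixodes scapularis", "Parasite of", "Peromyscus leucopus"),
    Claim("Ixodes scapularis", "found near", "Quercus alba"),
    Claim("Unknown sp.", "host of", "Ixodes scapularis"),
    Claim("Quercus alba", "associated with", "Ixodes scapularis"),
]
```

Calling `summarize(EXAMPLE_CLAIMS)` yields the same three tiers reported in the abstract (potential records, records with a recognized type, valid records), which makes it easy to see where records drop out.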
Journal Name:
Biodiversity Information Science and Standards
Sponsoring Org:
National Science Foundation
More Like this
  1.
    A wealth of information about how parasites interact with their hosts already exists in collections, scientific publications, specialized databases, and grey literature. The US National Science Foundation-funded Terrestrial Parasite Tracker Thematic Collection Network (TPT) project began in 2019 to help build a comprehensive picture of arthropod ectoparasites including the evolution of these parasite-host biotic associations, distributions, and the ecological interactions of disease vectors. TPT is a network of biodiversity collections whose data can assist scientists, educators, land managers, and policymakers to better understand the complex relationship between hosts and parasites including emergent properties that may explain the causes and frequency of human and wildlife pathogens. TPT member collections make their association information easier to access via Global Biotic Interactions (GloBI, Poelen et al. 2014), which is periodically archived through Zenodo to track progress in the TPT project. TPT leverages GloBI's ability to index biotic associations from specimen occurrence records that come from existing management systems (e.g., Arctos, Symbiota, EMu, Excel, MS Access) to avoid having to completely rework existing, or build new, cyber-infrastructures before collections can share data. TPT-affiliated collection managers use collection-specific translation tables to connect their verbatim (or original) terms used to describe associations (e.g., "ex", "found on", "host") to their interpreted, machine-readable terms in the OBO Relations Ontology (RO). These interpreted terms enable searches across previously siloed association record sets, while the original verbatim values remain accessible to help retain provenance and allow for interpretation improvements. TPT is an ambitious project, with the goal to database label data from over 1.2 million specimens of arthropod parasites of vertebrates coming from 22 collections across North America. 
In the first year of the project, the TPT collections created over 73,700 new records and 41,984 images. In addition, 17 TPT data providers and three other collaborators shared datasets that are now indexed by GloBI and visible on the TPT GloBI project page. These datasets came from collection specimen occurrence records and literature sources. Two TPT data archives that capture and preserve the changes in the data coming from TPT to GloBI were published through Zenodo (Poelen et al. 2020a, Poelen et al. 2020b). The archives document the changes in how data are shared by collections, including the biotic association data format and the quantity of data captured. The Poelen et al. 2020b report included all TPT collections and biotic interactions from Arctos collections in VertNet and the Symbiota Collection of Arthropods Network (SCAN). The total number of interactions included in this report was 376,671 records (500,000 interactions is the overall goal for TPT). In addition, we coordinated closely with TPT collection data managers, through many one-on-one conversations, a workshop, and a webinar (Sullivan et al. 2020), to help guide the data capture of biotic associations. GloBI is an effective tool to help integrate biotic association data coming from occurrence records into an openly accessible, global, linked view of existing species interaction records. The results gleaned from the TPT workshop and Zenodo data archives demonstrate that minimizing changes to existing workflows allows for custom interpretation of collection-specific interaction terms. In addition, including collection data managers in the development of the interaction term vocabularies is an important part of the process that may improve data sharing and the overall downstream data quality. 
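The collection-specific translation tables described in this abstract can be pictured as a small lookup that keeps the verbatim term next to its interpreted term. The Python sketch below is a minimal illustration under stated assumptions: the collection identifiers (`COLL-A`, `COLL-B`), the table contents, and the interpreted label "has host" are hypothetical and are not taken from any actual TPT translation table.

```python
from typing import Dict, Optional

# Hypothetical per-collection translation tables: verbatim term -> interpreted label.
# Collection identifiers and labels are placeholders for illustration.
TRANSLATION_TABLES: Dict[str, Dict[str, str]] = {
    "COLL-A": {"ex": "has host", "found on": "has host"},
    "COLL-B": {"host": "has host"},
}


def interpret(collection: str, verbatim: str) -> Dict[str, Optional[str]]:
    """Look up a verbatim term in the table of the given collection.

    The verbatim value is always returned unchanged so provenance is kept;
    'interpreted' is None when the collection's table has no entry for it.
    """
    table = TRANSLATION_TABLES.get(collection, {})
    key = verbatim.strip().lower()
    return {"verbatim": verbatim, "interpreted": table.get(key)}
```

Because each collection has its own table, the same verbatim term can be interpreted for one collection and left unmapped for another, and an unmapped term stays visible for later review rather than being dropped.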
  2. PLEASE CONTACT AUTHORS IF YOU CONTRIBUTE AND WOULD LIKE TO BE LISTED AS A CO-AUTHOR. (this message will be removed some time weeks/months after the first publication)

    Terrestrial Parasite Tracker indexed biotic interactions and review summary.

    The Terrestrial Parasite Tracker (TPT) project began in 2019 and is funded by the National Science Foundation to mobilize data from vector and ectoparasite collections to data aggregators (e.g., iDigBio, GBIF). Its goal is to help build a comprehensive picture of arthropod host-association evolution, distributions, and the ecological interactions of disease vectors, which will assist scientists, educators, land managers, and policy makers. Arthropod parasites are often important to human and wildlife health and safety as vectors of pathogens, and it is critical to digitize these specimens so that they, and their biotic interaction data, will be available to help understand and predict the spread of human and wildlife disease.

    This data publication contains versioned TPT associated datasets and related data products that were tracked, reviewed and indexed by Global Biotic Interactions (GloBI) and associated tools. GloBI provides open access to finding species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software.

    If you have questions or comments about this publication, please open an issue at or contact the authors by email.

    The creation of this archive was made possible by the National Science Foundation award "Collaborative Research: Digitization TCN: Digitizing collections to trace parasite-host associations and predict the spread of vector-borne disease," Award numbers DBI:1901932 and DBI:1901926

    Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics.

    GloBI Data Review Report

    Datasets under review:
     - University of Michigan Museum of Zoology Insect Division. Full Database Export 2020-11-20 provided by Erika Tucker and Barry Oconner. accessed via on 2022-06-24T14:02:48.801Z
     - Academy of Natural Sciences Entomology Collection for the Parasite Tracker Project accessed via on 2022-06-24T14:04:22.091Z
     - Bernice Pauahi Bishop Museum, J. Linsley Gressitt Center for Research in Entomology accessed via on 2022-06-24T14:04:37.692Z
     - Texas A&M University, Biodiversity Teaching and Research Collections accessed via on 2022-06-24T14:06:40.154Z
     - Brigham Young University Arthropod Museum accessed via on 2022-06-24T14:06:51.420Z
     - California Academy of Sciences Entomology accessed via on 2022-06-24T14:07:16.371Z
     - Clemson University Arthropod Collection accessed via on 2022-06-24T14:07:40.925Z
     - Denver Museum of Nature and Science (DMNS) Parasite specimens (DMNS:Para) accessed via on 2022-06-24T14:08:00.730Z
     - Field Museum of Natural History IPT accessed via on 2022-06-24T14:18:51.995Z
     - Illinois Natural History Survey Insect Collection accessed via on 2022-06-24T14:19:37.563Z
     - UMSP / University of Minnesota / University of Minnesota Insect Collection accessed via on 2022-06-24T14:20:27.232Z
     - Milwaukee Public Museum Biological Collections Data Portal accessed via on 2022-06-24T14:20:46.185Z
     - Museum for Southern Biology (MSB) Parasite Collection accessed via on 2022-06-24T15:16:07.223Z
     - The Albert J. Cook Arthropod Research Collection accessed via on 2022-06-24T16:09:40.702Z
     - Ohio State University Acarology Laboratory accessed via on 2022-06-24T16:10:00.281Z
     - Frost Entomological Museum, Pennsylvania State University accessed via on 2022-06-24T16:10:07.741Z
     - Purdue Entomological Research Collection accessed via on 2022-06-24T16:10:26.654Z
     - Texas A&M University Insect Collection accessed via on 2022-06-24T16:10:58.496Z
     - University of California Santa Barbara Invertebrate Zoology Collection accessed via on 2022-06-24T16:12:29.854Z
     - University of Hawaii Insect Museum accessed via on 2022-06-24T16:12:41.408Z
     - University of New Hampshire Collection of Insects and other Arthropods UNHC-UNHC accessed via on 2022-06-24T16:12:59.500Z
     - Scott L. Gardner and Gabor R. Racz (2021). University of Nebraska State Museum - Parasitology. Harold W. Manter Laboratory of Parasitology. University of Nebraska State Museum. accessed via on 2022-06-24T16:13:06.914Z
     - Data were obtained from specimens belonging to the United States National Museum of Natural History (USNM), Smithsonian Institution, Washington DC and digitized by the Walter Reed Biosystematics Unit (WRBU). accessed via on 2022-06-24T16:13:38.013Z
     - US National Museum of Natural History Ixodes Records accessed via on 2022-06-24T16:13:45.666Z
     - Price Institute of Parasite Research, School of Biological Sciences, University of Utah accessed via on 2022-06-24T16:13:54.724Z
     - University of Wisconsin Stevens Point, Stephen J. Taft Parasitological Collection accessed via on 2022-06-24T16:14:04.745Z
     - Giraldo-Calderón, G. I., Emrich, S. J., MacCallum, R. M., Maslen, G., Dialynas, E., Topalis, P., … Lawson, D. (2015). VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases. Nucleic acids research, 43(Database issue), D707–D713. doi:10.1093/nar/gku1117. accessed via on 2022-06-24T16:14:11.965Z
     - WIRC / University of Wisconsin Madison WIS-IH / Wisconsin Insect Research Collection accessed via on 2022-06-24T16:14:29.743Z
     - Yale University Peabody Museum Collections Data Portal accessed via on 2022-06-24T16:23:29.289Z

    Generated on:

    GloBI's Elton 0.12.4 

    Note that all files ending with .tsv are files formatted 
    as UTF8 encoded tab-separated values files.

    Included in this review archive are:

      This file.

      Summary across all reviewed collections of total number of distinct review comments.

      Summary by reviewed collection of total number of distinct review comments.

      Summary of number of indexed interaction records by institutionCode and collectionCode.

      All review comments by collection.

      All indexed interactions for all reviewed collections.

      All indexed interactions for all reviewed collections selecting only sourceInstitutionCode, sourceCollectionCode, sourceCatalogNumber, sourceTaxonName, interactionTypeName and targetTaxonName.

      Details on the datasets under review.

      Program used to update datasets and generate the review reports and associated indexed interactions.
      Source datasets used by elton.jar in process of executing the script.
      Program used to generate the report

      Log file generated as part of running the script
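As a rough illustration of working with the tab-separated files described above, the following Python sketch counts indexed interaction records per institution and collection code. The column names are the ones listed in the archive description; the sample rows, institution codes, and catalog numbers are invented for this example, and the actual file names in the archive are not assumed here.

```python
import csv
import io
from collections import Counter
from typing import Tuple

# Invented sample content using the column names listed in the archive description.
SAMPLE_TSV = (
    "sourceInstitutionCode\tsourceCollectionCode\tsourceCatalogNumber\t"
    "sourceTaxonName\tinteractionTypeName\ttargetTaxonName\n"
    "INST1\tENT\t0001\tIxodes scapularis\thas host\tPeromyscus leucopus\n"
    "INST1\tENT\t0002\tIxodes scapularis\thas host\tOdocoileus virginianus\n"
    "INST2\tPARA\tA-17\tIxodes scapularis\tinteracts with\tPeromyscus leucopus\n"
)


def count_by_collection(tsv_text: str) -> "Counter[Tuple[str, str]]":
    """Count rows per (sourceInstitutionCode, sourceCollectionCode) pair."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts: "Counter[Tuple[str, str]]" = Counter()
    for row in reader:
        counts[(row["sourceInstitutionCode"], row["sourceCollectionCode"])] += 1
    return counts
```

To use this on a real file, open it with `encoding="utf-8"` and pass its text to `count_by_collection`; quoting is disabled here on the assumption that quote characters in the data are literal.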

  3. We report a dataset of all known and published occurrence records of animals of the phylum Rotifera, including Bdelloidea, Monogononta, and Seisonacea (with the exclusion of Acanthocephala) for Africa and surrounding islands and archipelagos. The dataset includes 27,225 records of 957 taxa (subspecies: 39; species: 819; genus: 81; family: 17; group: 1), gathered from 706 published papers. The published literature spans from 1854 to 2022, with the highest number of records in the decades 1990-1999 and 2010-2019.
    230 records of "species inquirendae", "nomina nuda", and "genera inquirenda" found in the published literature were not included in the dataset. Almost 90 % of the data are georeferenced.

    The African countries with the highest number of taxa are Nigeria, Algeria, South Africa, and Democratic Republic of the Congo, whereas no records are yet available for a dozen countries. The number of species known from each country can be explained mostly by sampling efforts, measured as the number of papers published for each country up to October 2022.

    This detailed literature search increased the number of known rotifer taxa at species, subspecies, form and variety level beyond those reported in previous reviews: 639 in 1986 (De Ridder, 1986) and 765 in 2022 (Smolak et al., 2022). Of the taxa reported in the current dataset, 167 (18%) are Bdelloidea, 665 (69%) Ploima, 97 (10%) Flosculariaceae, 27 (3%) Collothecacea and one representative of Seisonacea, the marine epizoic rotifer Seison africanus Sørensen, Segers & Funch, 2005 described and recorded only from coastal waters of Kenya (Sørensen et al., 2005).

    The data were structured based on the Darwin Core standard (Wieczorek et al., 2012). The dataset is structured to have in each row each record of a rotifer taxon from a sample from Africa and surrounding islands, as cited in the literature. The columns report the original and updated taxon name, additional taxonomic information together with origin of the data and habitat.
    All invalid names (i.e. at the level of species inquirenda, nomen nudum, genus inquirendum) were not included in the records uploaded to GBIF. All names were also checked against the backbone of GBIF.

  4. All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits derived from these species' shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative wide scale research. Downstream tools, such as R packages developed by the R OpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers. [Figure 1: Example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from the Global Biodiversity Information Facility.] For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon. 
By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while the authors used GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited adoption of analyses using data from centralized resources (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remains a hurdle for large scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life + (Bánki et al. 2018). In order to support innovative science, large scale digital data resources need to facilitate data linkage between resources, and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from. 
I will discuss the hurdles to adoption of these large scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases. 
  5. Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse— Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). 
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR, in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017). We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020). 