skip to main content

Title: Global Biotic Interactions: Benefits of Pragmatic Reuse of Species Interaction Datasets, 10.17605/OSF.IO/9JT24
Global Biotic Interactions (GloBI, uses frugal and pragmatic methods to make openly available species interaction datasets (e.g., parasite-host, predator-prey, plant-pollinator) easier to find and reuse. Since 2013, GloBI increased the reach of existing datasets, facilitated research, improved data integration methods and provided dataset reviews. In this talk, GloBI is introduced and various reuse examples are presented to discuss the question: Why should we bother to reuse existing (species-interaction) datasets?  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
Slides of Seminar at Leibniz Institute for Zoo and Wildlife Research Berlin, Germany on 9 January 2020
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    A wealth of information about how parasites interact with their hosts already exists in collections, scientific publications, specialized databases, and grey literature. The US National Science Foundation-funded Terrestrial Parasite Tracker Thematic Collection Network (TPT) project began in 2019 to help build a comprehensive picture of arthropod ectoparasites including the evolution of these parasite-host biotic associations, distributions, and the ecological interactions of disease vectors. TPT is a network of biodiversity collections whose data can assist scientists, educators, land managers, and policymakers to better understand the complex relationship between hosts and parasites including emergent properties that may explain the causes and frequency of human and wildlife pathogens. TPT member collections make their association information easier to access via Global Biotic Interactions (GloBI, Poelen et al. 2014), which is periodically archived through Zenodo to track progress in the TPT project. TPT leverages GloBI's ability to index biotic associations from specimen occurrence records that come from existing management systems (e.g., Arctos, Symbiota, EMu, Excel, MS Access) to avoid having to completely rework existing, or build new, cyber-infrastructures before collections can share data. TPT-affiliated collection managers use collection-specific translation tables to connect their verbatim (or original) terms used to describe associations (e.g., "ex", "found on", "host") to their interpreted, machine-readable terms in the OBO Relations Ontology (RO). These interpreted terms enable searches across previously siloed association record sets, while the original verbatim values remain accessible to help retain provenance and allow for interpretation improvements. TPT is an ambitious project, with the goal to database label data from over 1.2 million specimens of arthropod parasites of vertebrates coming from 22 collections across North America. In the first year of the project, the TPT collections created over 73,700 new records and 41,984 images. In addition, 17 TPT data providers and three other collaborators shared datasets that are now indexed by GloBI, visible on the TPT GloBI project page. These datasets came from collection specimen occurrence records and literature sources. Two TPT data archives that capture and preserve the changes in the data coming from TPT to GloBI were published through Zenodo (Poelen et al. 2020a, Poelen et al. 2020b). The archives document the changes in how data are shared by collections including the biotic association data format and quantity of data captured. The Poelen et al. 2020b report included all TPT collections and biotic interactions from Arctos collections in VertNet and the Symbiota Collection of Arthropods Network (SCAN). The total number of interactions included in this report was 376,671 records (500,000 interactions is the overall goal for TPT). In addition, close coordination with TPT collection data managers including many one-on-one conversations, a workshop, and a webinar (Sullivan et al. 2020) was conducted to help guide the data capture of biotic associations. GloBI is an effective tool to help integrate biotic association data coming from occurrence records into an openly accessible, global, linked view of existing species interaction records. The results gleaned from the TPT workshop and Zenodo data archives demonstrate that minimizing changes to existing workflows allow for custom interpretation of collection-specific interaction terms. In addition, including collection data managers in the development of the interaction term vocabularies is an important part of the process that may improve data sharing and the overall downstream data quality. 
    more » « less
  2. PLEASE CONTACT AUTHORS IF YOU CONTRIBUTE AND WOULD LIKE TO BE LISTED AS A CO-AUTHOR. (this message will be removed some time weeks/months after the first publication)

    Terrestrial Parasite Tracker indexed biotic interactions and review summary.

    The Terrestrial Parasite Tracker (TPT) project began in 2019 and is funded by the National Science foundation to mobilize data from vector and ectoparasite collections to data aggregators (e.g., iDigBio, GBIF) to help build a comprehensive picture of arthropod host-association evolution, distributions, and the ecological interactions of disease vectors which will assist scientists, educators, land managers, and policy makers. Arthropod parasites often are important to human and wildlife health and safety as vectors of pathogens, and it is critical to digitize these specimens so that they, and their biotic interaction data, will be available to help understand and predict the spread of human and wildlife disease.

    This data publication contains versioned TPT associated datasets and related data products that were tracked, reviewed and indexed by Global Biotic Interactions (GloBI) and associated tools. GloBI provides open access to finding species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software.

    If you have questions or comments about this publication, please open an issue at or contact the authors by email.

    The creation of this archive was made possible by the National Science Foundation award "Collaborative Research: Digitization TCN: Digitizing collections to trace parasite-host associations and predict the spread of vector-borne disease," Award numbers DBI:1901932 and DBI:1901926

    Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics.

    GloBI Data Review Report

    Datasets under review:
     - University of Michigan Museum of Zoology Insect Division. Full Database Export 2020-11-20 provided by Erika Tucker and Barry Oconner. accessed via on 2022-06-24T14:02:48.801Z
     - Academy of Natural Sciences Entomology Collection for the Parasite Tracker Project accessed via on 2022-06-24T14:04:22.091Z
     - Bernice Pauahi Bishop Museum, J. Linsley Gressitt Center for Research in Entomology accessed via on 2022-06-24T14:04:37.692Z
     - Texas A&M University, Biodiversity Teaching and Research Collections accessed via on 2022-06-24T14:06:40.154Z
     - Brigham Young University Arthropod Museum accessed via on 2022-06-24T14:06:51.420Z
     - California Academy of Sciences Entomology accessed via on 2022-06-24T14:07:16.371Z
     - Clemson University Arthropod Collection accessed via on 2022-06-24T14:07:40.925Z
     - Denver Museum of Nature and Science (DMNS) Parasite specimens (DMNS:Para) accessed via on 2022-06-24T14:08:00.730Z
     - Field Museum of Natural History IPT accessed via on 2022-06-24T14:18:51.995Z
     - Illinois Natural History Survey Insect Collection accessed via on 2022-06-24T14:19:37.563Z
     - UMSP / University of Minnesota / University of Minnesota Insect Collection accessed via on 2022-06-24T14:20:27.232Z
     - Milwaukee Public Museum Biological Collections Data Portal accessed via on 2022-06-24T14:20:46.185Z
     - Museum for Southern Biology (MSB) Parasite Collection accessed via on 2022-06-24T15:16:07.223Z
     - The Albert J. Cook Arthropod Research Collection accessed via on 2022-06-24T16:09:40.702Z
     - Ohio State University Acarology Laboratory accessed via on 2022-06-24T16:10:00.281Z
     - Frost Entomological Museum, Pennsylvania State University accessed via on 2022-06-24T16:10:07.741Z
     - Purdue Entomological Research Collection accessed via on 2022-06-24T16:10:26.654Z
     - Texas A&M University Insect Collection accessed via on 2022-06-24T16:10:58.496Z
     - University of California Santa Barbara Invertebrate Zoology Collection accessed via on 2022-06-24T16:12:29.854Z
     - University of Hawaii Insect Museum accessed via on 2022-06-24T16:12:41.408Z
     - University of New Hampshire Collection of Insects and other Arthropods UNHC-UNHC accessed via on 2022-06-24T16:12:59.500Z
     - Scott L. Gardner and Gabor R. Racz (2021). University of Nebraska State Museum - Parasitology. Harold W. Manter Laboratory of Parasitology. University of Nebraska State Museum. accessed via on 2022-06-24T16:13:06.914Z
     - Data were obtained from specimens belonging to the United States National Museum of Natural History (USNM), Smithsonian Institution, Washington DC and digitized by the Walter Reed Biosystematics Unit (WRBU). accessed via on 2022-06-24T16:13:38.013Z
     - US National Museum of Natural History Ixodes Records accessed via on 2022-06-24T16:13:45.666Z
     - Price Institute of Parasite Research, School of Biological Sciences, University of Utah accessed via on 2022-06-24T16:13:54.724Z
     - University of Wisconsin Stevens Point, Stephen J. Taft Parasitological Collection accessed via on 2022-06-24T16:14:04.745Z
     - Giraldo-Calderón, G. I., Emrich, S. J., MacCallum, R. M., Maslen, G., Dialynas, E., Topalis, P., … Lawson, D. (2015). VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases. Nucleic acids research, 43(Database issue), D707–D713. doi:10.1093/nar/gku1117. accessed via on 2022-06-24T16:14:11.965Z
     - WIRC / University of Wisconsin Madison WIS-IH / Wisconsin Insect Research Collection accessed via on 2022-06-24T16:14:29.743Z
     - Yale University Peabody Museum Collections Data Portal accessed via on 2022-06-24T16:23:29.289Z

    Generated on:

    GloBI's Elton 0.12.4 

    Note that all files ending with .tsv are files formatted 
    as UTF8 encoded tab-separated values files.

    Included in this review archive are:

      This file.

      Summary across all reviewed collections of total number of distinct review comments.

      Summary by reviewed collection of total number of distinct review comments.

      Summary of number of indexed interaction records by institutionCode and collectionCode.

      All review comments by collection.

      All indexed interactions for all reviewed collections.

      All indexed interactions for all reviewed collections selecting only sourceInstitutionCode, sourceCollectionCode, sourceCatalogNumber, sourceTaxonName, interactionTypeName and targetTaxonName.

      Details on the datasets under review.

      Program used to update datasets and generate the review reports and associated indexed interactions.
      Source datasets used by elton.jar in process of executing the script.
      Program used to generate the report

      Log file generated as part of running the script

    more » « less
  3. The Global Biodiversity Information Facility (GBIF 2022a) has indexed more than 2 billion occurrence records from 70,147 datasets. These datasets often include "hidden" biotic interaction data because biodiversity communities use the Darwin Core standard (DwC, Wieczorek et al. 2012) in different ways to document biotic interactions. In this study, we extracted biotic interactions from GBIF data using an approach similar to that employed in the Global Biotic Interactions (GloBI; Poelen et al. 2014) and summarized the results. Here we aim to present an estimation of the interaction data available in GBIF, showing that biotic interaction claims can be automatically found and extracted from GBIF. Our results suggest that much can be gained by an increased focus on development of tools that help to index and curate biotic interaction data in existing datasets. Combined with data standardization and best practices for sharing biotic interactions, such as the initiative on plant-pollinators interaction (Salim 2022), this approach can rapidly contribute to and meet open data principles (Wilkinson 2016). We used Preston (Elliott et al. 2020), open-source software that versions biodiversity datasets, to copy all GBIF-indexed datasets. The biodiversity data graph version (Poelen 2020) of the GBIF-indexed datasets used during this study contains 58,504 datasets in Darwin Core Archive (DwC-A) format, totaling 574,715,196 records. After retrieval and verification, the datasets were processed using Elton. Elton extracts biotic interaction data and supports 20+ existing file formats, including various types of data elements in DwC records. Elton also helps align interaction claims (e.g., host of, parasite of, associated with) to the Relations Ontology (RO, Mungall 2022), making it easier to discover datasets across a heterogeneous collection of datasets. Using specific mapping between interaction claims found in the DwC records to the terms in RO*1, Elton found 30,167,984 potential records (with non-empty values for the scanned DwC terms) and 15,248,478 records with recognized interaction types. Taxonomic name validation was performed using Nomer, which maps input names to names found in a variety of taxonomic catalogs. We only considered an interaction record valid where the interaction type could be mapped to a term in RO and where Nomer found a valid name for source and target taxa. Based on the workflow described in Fig. 1, we found 7,947,822 interaction records (52% of the potential interactions). Most of them were generic interactions ( interacts_ with , 87.5%), but the remaining 12.5% (993,477 records) included host-parasite and plant-animal interactions. The majority of the interactions records found involved plants (78%), animals (14%) and fungi (6%). In conclusion, there are many biotic interactions embedded in existing datasets registered in large biodiversity data indexers and aggregators like iDigBio, GBIF, and BioCASE. We exposed these biotic interaction claims using the combined functionality of biodiversity data tools Elton (for interaction data extraction), Preston (for reliable dataset tracking) and Nomer (for taxonomic name alignment). Nonetheless, the development of new vocabularies, standards and best practice guides would facilitate aggregation of interaction data, including the diversification of the GBIF data model (GBIF 2022b) for sharing biodiversity data beyond occurrences data. That is the aim of the TDWG Interest Group on Biological Interactions Data (TDWG 2022). 
    more » « less
  4. null (Ed.)
    Abstract Motivation Mapping genetic interactions (GIs) can reveal important insights into cellular function and has potential translational applications. There has been great progress in developing high-throughput experimental systems for measuring GIs (e.g. with double knockouts) as well as in defining computational methods for inferring (imputing) unknown interactions. However, existing computational methods for imputation have largely been developed for and applied in baker’s yeast, even as experimental systems have begun to allow measurements in other contexts. Importantly, existing methods face a number of limitations in requiring specific side information and with respect to computational cost. Further, few have addressed how GIs can be imputed when data are scarce. Results In this article, we address these limitations by presenting a new imputation framework, called Extensible Matrix Factorization (EMF). EMF is a framework of composable models that flexibly exploit cross-species information in the form of GI data across multiple species, and arbitrary side information in the form of kernels (e.g. from protein–protein interaction networks). We perform a rigorous set of experiments on these models in matched GI datasets from baker’s and fission yeast. These include the first such experiments on genome-scale GI datasets in multiple species in the same study. We find that EMF models that exploit side and cross-species information improve imputation, especially in data-scarce settings. Further, we show that EMF outperforms the state-of-the-art deep learning method, even when using strictly less data, and incurs orders of magnitude less computational cost. Availability Implementations of models and experiments are available at: Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Background

    Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming.


    In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database withMproteins can be transformed into a much more simpler problem: to find a number inside a sorted array of lengthM. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship.


    The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from$$O(M^2)$$O(M2)to$$O(M\log M)$$O(MlogM)for performing an all-against-all PPI prediction for a database withMproteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.

    more » « less