Title: PaleoRec: A sequential recommender system for the annotation of paleoclimate datasets
Abstract: Studying past climate variability is fundamental to our understanding of current changes. In the era of Big Data, the value of paleoclimate information critically depends on our ability to analyze large volumes of data, which itself hinges on standardization. Standardization also ensures that these datasets are more Findable, Accessible, Interoperable, and Reusable. Building upon efforts from the paleoclimate community to standardize the format, terminology, and reporting of paleoclimate data, this article describes PaleoRec, a recommender system for the annotation of such datasets. The goal is to assist scientists in the annotation task by reducing and ranking relevant entries in a drop-down menu. Scientists can either choose the best option for their metadata or enter the appropriate information manually. PaleoRec aims to reduce the time to science while ensuring adherence to community standards. PaleoRec is a type of sequential recommender system based on a recurrent neural network that takes into consideration the short-term interest of a user in a particular dataset. The model was developed using 1996 expert-annotated datasets, resulting in 6,512 sequences. The performance of the algorithm, as measured by the Hit Ratio, varies between 0.7 and 1.0. PaleoRec is currently deployed on a web interface used for the annotation of paleoclimate datasets using emerging community standards.
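To make the recommendation step concrete, the following is a minimal sketch, not the authors' implementation, of the two ideas named in the abstract: a recurrent network that scores the next annotation term given the terms entered so far, and the Hit Ratio@k used to evaluate it. The vocabulary size, model dimensions, and data below are placeholder assumptions.

```python
# Minimal sketch of an RNN-based sequential recommender and Hit Ratio@k,
# in the spirit of PaleoRec as described in the abstract. This is NOT the
# authors' implementation: vocabulary, dimensions, and data are placeholders.
import torch
import torch.nn as nn

class NextTermRecommender(nn.Module):
    """Predicts the next annotation term from the terms entered so far."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, seq_len) integer-encoded annotation terms
        emb = self.embed(seq)            # (batch, seq_len, embed_dim)
        _, h = self.gru(emb)             # h: (1, batch, hidden_dim)
        return self.out(h.squeeze(0))    # (batch, vocab_size) scores over the vocabulary

def hit_ratio_at_k(scores: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of cases where the true next term is among the top-k suggestions."""
    topk = scores.topk(k, dim=-1).indices                # (batch, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)   # (batch,)
    return hits.float().mean().item()

# Toy usage with a hypothetical vocabulary of 500 controlled-terminology entries.
model = NextTermRecommender(vocab_size=500)
batch = torch.randint(0, 500, (8, 4))     # 8 partial annotation sequences of length 4
next_terms = torch.randint(0, 500, (8,))  # the terms the user actually chose next
scores = model(batch)
print("Hit Ratio@5 (untrained, random weights):", hit_ratio_at_k(scores, next_terms, k=5))
```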
Award ID(s):
1948822 1948746
NSF-PAR ID:
10337364
Author(s) / Creator(s):
Date Published:
Journal Name:
Environmental Data Science
Volume:
1
ISSN:
2634-4602
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    SARS-CoV-2 RNA detection in wastewater is being rapidly developed and adopted as a public health monitoring tool worldwide. With wastewater surveillance programs being implemented across many different scales and by many different stakeholders, it is critical that the data collected and shared are accompanied by an appropriate minimal set of meta-information to enable meaningful interpretation and use of this new information source and intercomparison across datasets. While some databases are being developed for specific surveillance programs locally, regionally, nationally, and internationally, common globally adopted data standards have not yet been established within the research community. Establishing such standards will require national and international consensus on what meta-information should accompany SARS-CoV-2 wastewater measurements. To establish a recommendation on the minimum information to accompany reporting of SARS-CoV-2 occurrence in wastewater for the research community, the United States National Science Foundation (NSF) Research Coordination Network on Wastewater Surveillance for SARS-CoV-2 hosted a workshop in February 2021 with participants from academia, government agencies, private companies, wastewater utilities, public health laboratories, and research institutes. This report presents the two primary outcomes of the workshop: (i) a recommendation on the set of minimum meta-information that is needed to confidently interpret wastewater SARS-CoV-2 data, and (ii) insights from workshop discussions on how to improve standardization of data reporting.
  2. Rationale

    A major hurdle in identifying chemicals in mass spectrometry experiments is the limited availability of tandem mass spectrometry (MS/MS) reference spectra in public databases. Currently, scientists purchase databases or use public databases such as Global Natural Products Social Molecular Networking (GNPS). The MSMS‐Chooser workflow is an open‐source protocol for the creation of MS/MS reference spectra directly in the GNPS infrastructure.

    Methods

    An MSMS‐Chooser Sample Template is provided and completed manually. The MSMS‐Chooser Submission File and Sequence Table for data acquisition were programmatically generated. Standards from the Mass Spectrometry Metabolite Library (MSMLS) suspended in a methanol–water (1:1) solution were analyzed. Flow injection on an LC/MS/MS system was used to generate negative and positive mode data using data‐dependent acquisition. The MS/MS spectra and Submission File were uploaded to the MSMS‐Chooser workflow in GNPS for automatic selection of MS/MS spectra.

    Results

    Data acquisition and processing required ~2 h and ~2 min, respectively, per 96‐well plate using MSMS‐Chooser. Analysis of the MSMLS (over 600 small molecules) using MSMS‐Chooser added 889 spectra (including multiple adducts) to the public library in GNPS. Manual validation of one plate indicated accurate selection of MS/MS scans (true positive rate of 0.96 and true negative rate of 0.99). The MSMS‐Chooser output includes a table formatted for inclusion in the GNPS library as well as the ability to directly launch searches via MASST.

    Conclusions

    MSMS‐Chooser enables rapid data acquisition and data analysis (selection of MS/MS spectra) and produces a formatted table for inspection and upload to GNPS. Open file‐format data (.mzML or .mzXML) from most mass spectrometry platforms containing MS/MS spectra can be processed using MSMS‐Chooser. MSMS‐Chooser democratizes the creation of MS/MS reference spectra in GNPS, which will improve annotation and strengthen the tools that use the annotation information.
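    As a purely illustrative sketch (not the MSMS-Chooser code), the snippet below shows the kind of logic involved: selecting MS/MS scans whose precursor m/z matches an expected adduct mass within a tolerance, and scoring that selection against a manually validated plate with true positive and true negative rates. The scan records, adduct handling, and tolerance are simplified assumptions.

```python
# Illustrative sketch (not the MSMS-Chooser implementation): select MS/MS scans
# whose precursor m/z matches an expected adduct mass, then score the selection
# against manual labels with true positive / true negative rates.
PROTON = 1.007276  # proton mass, used for the simple [M+H]+ / [M-H]- adducts below

def expected_precursors(neutral_mass: float, polarity: str) -> list:
    """Expected precursor m/z values for simple protonation/deprotonation adducts."""
    if polarity == "positive":
        return [neutral_mass + PROTON]   # [M+H]+
    return [neutral_mass - PROTON]       # [M-H]-

def select_scans(scans, neutral_mass, polarity, tol_mz=0.01):
    """Keep MS2 scans whose precursor m/z is within tol_mz of an expected adduct."""
    targets = expected_precursors(neutral_mass, polarity)
    return [s for s in scans
            if s["ms_level"] == 2
            and any(abs(s["precursor_mz"] - t) <= tol_mz for t in targets)]

def tpr_tnr(selected_ids, manual_good_ids, all_ids):
    """True positive and true negative rates against a manually validated plate."""
    selected, good = set(selected_ids), set(manual_good_ids)
    tp = len(selected & good)
    tn = len((set(all_ids) - selected) - good)
    fn = len(good - selected)
    fp = len(selected - good)
    return (tp / (tp + fn) if tp + fn else 0.0,
            tn / (tn + fp) if tn + fp else 0.0)

# Toy example with fabricated scans for a caffeine-like neutral mass of 194.0804 Da.
scans = [
    {"id": 1, "ms_level": 2, "precursor_mz": 195.0876},  # matches [M+H]+
    {"id": 2, "ms_level": 2, "precursor_mz": 300.1234},  # unrelated precursor
    {"id": 3, "ms_level": 1, "precursor_mz": 195.0876},  # survey (MS1) scan, skipped
]
picked = select_scans(scans, neutral_mass=194.0804, polarity="positive")
print([s["id"] for s in picked])                          # -> [1]
print(tpr_tnr([s["id"] for s in picked], manual_good_ids=[1], all_ids=[1, 2, 3]))
```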

     
  3. The Global Biodiversity Information Facility (GBIF 2022a) has indexed more than 2 billion occurrence records from 70,147 datasets. These datasets often include "hidden" biotic interaction data because biodiversity communities use the Darwin Core standard (DwC, Wieczorek et al. 2012) in different ways to document biotic interactions. In this study, we extracted biotic interactions from GBIF data using an approach similar to that employed in Global Biotic Interactions (GloBI; Poelen et al. 2014) and summarized the results. Here we aim to present an estimate of the interaction data available in GBIF, showing that biotic interaction claims can be automatically found and extracted from GBIF. Our results suggest that much can be gained by an increased focus on developing tools that help index and curate biotic interaction data in existing datasets. Combined with data standardization and best practices for sharing biotic interactions, such as the initiative on plant-pollinator interactions (Salim 2022), this approach can rapidly contribute to and meet open data principles (Wilkinson 2016). We used Preston (Elliott et al. 2020), open-source software that versions biodiversity datasets, to copy all GBIF-indexed datasets. The biodiversity data graph version (Poelen 2020) of the GBIF-indexed datasets used during this study contains 58,504 datasets in Darwin Core Archive (DwC-A) format, totaling 574,715,196 records. After retrieval and verification, the datasets were processed using Elton. Elton extracts biotic interaction data and supports 20+ existing file formats, including various types of data elements in DwC records. Elton also helps align interaction claims (e.g., host of, parasite of, associated with) to the Relations Ontology (RO, Mungall 2022), making it easier to discover datasets across a heterogeneous collection of datasets. Using a specific mapping between interaction claims found in the DwC records and terms in RO, Elton found 30,167,984 potential records (with non-empty values for the scanned DwC terms) and 15,248,478 records with recognized interaction types. Taxonomic name validation was performed using Nomer, which maps input names to names found in a variety of taxonomic catalogs. We only considered an interaction record valid when the interaction type could be mapped to a term in RO and when Nomer found a valid name for the source and target taxa. Based on the workflow described in Fig. 1, we found 7,947,822 interaction records (52% of the potential interactions). Most of them were generic interactions (interacts_with, 87.5%), but the remaining 12.5% (993,477 records) included host-parasite and plant-animal interactions. The majority of the interaction records found involved plants (78%), animals (14%), and fungi (6%). In conclusion, there are many biotic interactions embedded in existing datasets registered in large biodiversity data indexers and aggregators like iDigBio, GBIF, and BioCASE. We exposed these biotic interaction claims using the combined functionality of the biodiversity data tools Elton (for interaction data extraction), Preston (for reliable dataset tracking), and Nomer (for taxonomic name alignment). Nonetheless, the development of new vocabularies, standards, and best-practice guides would facilitate the aggregation of interaction data, including the diversification of the GBIF data model (GBIF 2022b) for sharing biodiversity data beyond occurrence data. That is the aim of the TDWG Interest Group on Biological Interactions Data (TDWG 2022).
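    As a toy illustration of the term alignment this workflow depends on (not Elton's actual implementation), the sketch below maps free-text interaction claims found in DwC associatedTaxa values to Relations Ontology-style labels and tallies the recognized interaction types; the mapping table, field handling, and records are hypothetical.

```python
# Toy illustration of aligning interaction claims found in Darwin Core (DwC)
# records to Relations Ontology (RO) style labels, in the spirit of the
# Elton-based workflow described above. The mapping table, field handling,
# and records are simplified hypothetical examples, not Elton's actual logic.
from collections import Counter

# Hypothetical claim-to-label mapping; a real mapping would use RO term IRIs.
CLAIM_TO_RO = {
    "host of": "host_of",
    "parasite of": "parasite_of",
    "visits flowers of": "visits_flowers_of",
    "associated with": "interacts_with",  # generic fallback
}

def parse_associated_taxa(value: str):
    """Split a DwC associatedTaxa entry of the form 'claim: taxon name'."""
    if ":" not in value:
        return None, value.strip()
    claim, name = value.split(":", 1)
    return claim.strip().lower(), name.strip()

def align_interactions(records):
    """Yield (source taxon, interaction label, target taxon) for recognizable claims."""
    for rec in records:
        claim, target = parse_associated_taxa(rec.get("associatedTaxa", ""))
        label = CLAIM_TO_RO.get(claim)
        if label and rec.get("scientificName") and target:
            yield rec["scientificName"], label, target

records = [  # fabricated records for the sake of the example
    {"scientificName": "Bombus impatiens", "associatedTaxa": "visits flowers of: Solidago canadensis"},
    {"scientificName": "Ixodes scapularis", "associatedTaxa": "parasite of: Odocoileus virginianus"},
    {"scientificName": "Quercus alba", "associatedTaxa": "near: Acer rubrum"},  # claim not mapped
]
aligned = list(align_interactions(records))
print(aligned)
print(Counter(label for _, label, _ in aligned))  # tally per recognized interaction type
```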
  4. Villazón-Terrazas, B. (Ed.)
    Each day, a vast amount of unstructured content is generated in the biomedical domain from sources such as clinical notes, research articles, and medical reports. Such content contains a substantial amount of meaningful information that needs to be converted into actionable knowledge for secondary use. However, accessing precise biomedical content is challenging because of content heterogeneity, missing and imprecise metadata, and the unavailability of the semantic tags required for search engine optimization. We introduce a socio-technical semantic annotation optimization approach that enhances the semantic search of biomedical contents. The proposed approach consists of a layered architecture. In the first layer (Preliminary Semantic Enrichment), it annotates the biomedical contents with ontological concepts from NCBO BioPortal. With growing biomedical information, the semantic annotations suggested by NCBO BioPortal are not always correct. Therefore, in the second layer (Optimizing the Enriched Semantic Information), we introduce a knowledge-sharing scheme through which authors/users can request recommendations from other users to optimize the semantic enrichment process. To gauge the credibility of a human recommender, our system records the recommender's confidence score, collects community voting on previous recommendations, stores the percentage of correctly suggested annotations, and translates these into an index used to connect the right users for suggestions that optimize the semantic enrichment of biomedical contents. At the preliminary annotation layer using NCBO, we analyzed the n-gram strategy for biomedical word-boundary identification and found that NCBO recognizes biomedical terms for n-gram-1 more often than for n-gram-2 through n-gram-5. Similarly, a statistical analysis of significant features was conducted using the Wilson score and data normalization. In contrast, the proposed methodology achieves a suitable accuracy of ≈90% for the semantic optimization approach.
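    The abstract mentions both a vote-based recommender credibility index and the Wilson score; as one way these could fit together (an assumption on our part, not the paper's stated formula), the sketch below computes a credibility index from a self-reported confidence score, community up/down votes via the Wilson lower bound, and the past acceptance rate. The blending weights are illustrative.

```python
# Hypothetical sketch of a recommender-credibility index built from community
# votes using the Wilson score lower bound, in the spirit of the approach
# described above. The combination weights are illustrative, not the paper's.
import math

def wilson_lower_bound(upvotes: int, total_votes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the fraction of positive votes."""
    if total_votes == 0:
        return 0.0
    p_hat = upvotes / total_votes
    denom = 1 + z * z / total_votes
    centre = p_hat + z * z / (2 * total_votes)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total_votes
                           + z * z / (4 * total_votes * total_votes))
    return (centre - margin) / denom

def credibility_index(self_confidence: float, upvotes: int, total_votes: int,
                      accepted_fraction: float, w=(0.2, 0.5, 0.3)) -> float:
    """Blend self-reported confidence, vote support, and past acceptance rate."""
    vote_support = wilson_lower_bound(upvotes, total_votes)
    return w[0] * self_confidence + w[1] * vote_support + w[2] * accepted_fraction

# Example: a recommender with 18 of 20 positive votes and 90% accepted suggestions.
print(round(wilson_lower_bound(18, 20), 3))          # ~0.699, conservative vote support
print(round(credibility_index(0.8, 18, 20, 0.90), 3))
```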
  5. Abstract

    Effective research, education, and outreach efforts by the Arabidopsis thaliana community, as well as other scientific communities that depend on Arabidopsis resources, depend vitally on easily available and publicly shared resources. These resources include reference genome sequence data and an ever-increasing number of diverse data sets and data types. TAIR (The Arabidopsis Information Resource) and Araport (originally named the Arabidopsis Information Portal) are community informatics resources that provide tools, data, and applications to the more than 30,000 researchers worldwide who use Arabidopsis as a primary system of study or who use data derived from Arabidopsis in their work. Four years after Araport's establishment, the IAIC held another workshop to evaluate the current status of Arabidopsis informatics and chart a course for future research and development. The workshop focused on several challenges, including the need for reliable and current annotation, community-defined common standards for data and metadata, and accessible and user-friendly repositories/tools/methods for data integration and visualization. Solutions envisioned included (a) a centralized annotation authority to coalesce annotation from new groups, establish a consistent naming scheme, distribute this format regularly and frequently, and encourage and enforce its adoption; (b) standards for data and metadata formats, which are essential but challenging when comparing across diverse genotypes and in areas with less-established standards (e.g., phenomics, metabolomics), for which community-established guidelines need to be developed; and (c) a searchable, central repository for analysis and visualization tools, in which improved versioning and user access would make tools more accessible. Workshop participants proposed a "one-stop shop" website, an Arabidopsis "Super-Portal", to link tools, data resources, programmatic standards, and best-practice descriptions for each data type. This portal must have community buy-in and participation in its establishment and development to encourage adoption.

     