In the metadata of digital environmental datasets, automated processing is hindered by the wide variety of representations for units, which may be human-readable but may be ambiguous or not machine-interpretable (e.g., grams per square meter, gm/m2, g/m2, gm-2, g/m^2, g.m-2, g m-2 and gramPerMeterSquared). Matching disparate representations of the same unit to a single unit concept from an ontology assists with interpretation and reuse by providing a link to a complete unit definition with label, description, and dimensions. Datasets with shared units can be identified during searches, and are more suitable for automated analysis and potential transformation. This dataset contains data and code associated with a project to map units in ecological metadata collected between 2013 and 2022 by DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network to the QUDT ontology using successive string transformations. Data entities include a) raw metadata as received (355,057 unit instances); b) integrated raw data; c) substitution tables for string transformations; d) the resulting lookup table for 896 distinct units matched to QUDT units; e) associated R code used for QUDT matching, plus a web service and R functions for adding annotation elements to Ecological Metadata Language metadata documents. Using these substitutions and code, 91% of unit instances in the raw metadata could be matched to QUDT. Data and results are discussed in “Porter JH, M O’Brien, M Frants, S Earl, M Martin, C Laney. (in review) Using a Units Ontology to Annotate Pre-Existing Metadata. Submitted to Scientific Data.”
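The idea of matching units through successive string transformations can be illustrated with a minimal Python sketch. It is not the project's R code: the substitution rules, the lookup entry, and the identifier `GM-PER-M2` are assumptions made up for illustration, and the names `normalize_unit` and `match_unit` are hypothetical.

```python
import re

# Hypothetical substitution table (NOT the project's actual table):
# (regex, replacement) pairs applied in order to a lowercased string.
SUBSTITUTIONS = [
    (r"\bsquare\s+(\w+)", r"\1^2"),      # "square meter"  -> "meter^2"
    (r"(\w+)\s+squared\b", r"\1^2"),     # "meter squared" -> "meter^2"
    (r"\s+per\s+", "/"),                 # "gram per meter" -> "gram/meter"
    (r"\b(grams?|gm)\b", "g"),           # spelled-out or "gm" -> "g"
    (r"\b(meters?|metres?)\b", "m"),     # spelled-out meter -> "m"
    (r"\bm(\d)\b", r"m^\1"),             # "m2" -> "m^2"
    (r"[ .]m-(\d)\b", r"/m^\1"),         # "g.m-2" or "g m-2" -> "g/m^2"
]

# Hypothetical lookup from a normalized string to a QUDT-style identifier.
LOOKUP = {"g/m^2": "GM-PER-M2"}


def normalize_unit(raw):
    """Apply the substitutions one after another to a raw unit string."""
    # Split camelCase ("gramPerMeterSquared") into words, then lowercase.
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw.strip()).lower()
    for pattern, replacement in SUBSTITUTIONS:
        s = re.sub(pattern, replacement, s)
    return s


def match_unit(raw):
    """Return the assumed identifier for a raw unit, or None if unmatched."""
    return LOOKUP.get(normalize_unit(raw))
```

In this sketch, forms such as "grams per square meter", "gm/m2" and "gramPerMeterSquared" all reduce to the same normalized string, while a form the rules do not cover (here "gm-2") is left unmatched rather than guessed.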
Semantic Integration in Heterogeneous Databases Using Neural Networks
One important step in integrating heterogeneous databases is matching equivalent attributes: Determining which fields in two databases refer to the same data. The meaning of information may be embodied within a database model, a conceptual schema, application programs, or data contents. Integration involves extracting semantics, expressing them as metadata, and matching semantically equivalent data elements. We present a procedure using a classifier to categorize attributes according to their field specifications and data values, then train a neural network to recognize similar attributes. In our technique, the knowledge of how to match equivalent data elements is "discovered" from metadata, not "pre-programmed".
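As a rough illustration of describing attributes by their field specifications and data values, the Python sketch below builds a small feature vector per attribute and compares vectors with cosine similarity. This is a stand-in only: it does not implement the classifier or the trained neural network described above, and the chosen features, the scaling constants, and the names `attribute_features` and `cosine` are assumptions.

```python
import math
from statistics import mean


def attribute_features(declared_length, values):
    """Describe one attribute by a small numeric vector with entries in [0, 1]."""
    texts = [str(v) for v in values]
    numeric = [v for v in values if isinstance(v, (int, float))]
    return [
        min(declared_length / 100.0, 1.0),               # field specification: declared length
        len(numeric) / len(values),                      # share of numeric values
        min(mean(len(t) for t in texts) / 100.0, 1.0),   # average printed width of values
        len(set(texts)) / len(values),                   # share of distinct values
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up attributes from two hypothetical databases.
emp_id = attribute_features(8, [101, 102, 103])
staff_no = attribute_features(10, [2001, 2002, 2003])
name = attribute_features(30, ["Ann", "Bob", "Cy"])
```

With these made-up values, `cosine(emp_id, staff_no)` is higher than `cosine(emp_id, name)`, which is the kind of signal a learned matcher would be trained to pick up from metadata rather than from hand-written rules.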
- Award ID(s):
- 9210704
- PAR ID:
- 10077889
- Date Published:
- Journal Name:
- Proceedings of the 20th International Conference on Very Large Data Bases
- Page Range / eLocation ID:
- 1-12
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Over the last couple of decades, there has been rapid growth in the number and scope of agricultural genetics, genomics and breeding (GGB) databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as ‘databases’ throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets, which requires data sharing along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, respectively, conducted a Consortium-wide survey to assess the current status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data-sharing practices by AgBioData databases are in a fairly healthy state, but it is not clear whether this is true for all metadata and data types across all databases, and that ontology use has not substantially changed since a similar survey was conducted in 2017. Based on our evaluation of the survey results, we recommend (i) providing training for database personnel in specific data-sharing techniques, as well as in ontology use; (ii) further study on what metadata is shared, and how well it is shared among databases; (iii) promoting an understanding of data sharing and ontologies in the stakeholder community; (iv) improving data sharing and ontologies for specific phenotypic data types and formats; and (v) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards. 
Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means. Database URL https://www.agbiodata.org/databases
-
Abstract Background High-throughput sequencing has increased the number of available microbial genomes recovered from isolates, single cells, and metagenomes. Accordingly, fast and comprehensive functional gene annotation pipelines are needed to analyze and compare these genomes. Although several approaches exist for genome annotation, these are typically not designed for easy incorporation into analysis pipelines, do not combine results from different annotation databases or offer easy-to-use summaries of metabolic reconstructions, and typically require large amounts of computing power for high-throughput analysis not available to the average user. Results Here, we introduce MicrobeAnnotator, a fully automated, easy-to-use pipeline for the comprehensive functional annotation of microbial genomes that combines results from several reference protein databases and returns the matching annotations together with key metadata such as the interlinked identifiers of matching reference proteins from multiple databases [KEGG Orthology (KO), Enzyme Commission (E.C.), Gene Ontology (GO), Pfam, and InterPro]. Further, the functional annotations are summarized into Kyoto Encyclopedia of Genes and Genomes (KEGG) modules as part of a graphical output (heatmap) that allows the user to quickly detect differences among (multiple) query genomes and cluster the genomes based on their metabolic similarity. MicrobeAnnotator is implemented in Python 3 and is freely available under an open-source Artistic License 2.0 from https://github.com/cruizperez/MicrobeAnnotator . Conclusions We demonstrated the capabilities of MicrobeAnnotator by annotating 100 Escherichia coli and 78 environmental Candidate Phyla Radiation (CPR) bacterial genomes and comparing the results to those of other popular tools. 
We showed that the use of multiple annotation databases allows MicrobeAnnotator to recover more annotations per genome compared to faster tools that use reduced databases and is computationally efficient for use in personal computers. The output of MicrobeAnnotator can be easily incorporated into other analysis pipelines while the results of other annotation tools can be seamlessly incorporated into MicrobeAnnotator to generate summary plots.
-
Abstract Motivation As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the non-redundant (NR) database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results We found more than 2 million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability Source code, dataset, documentation, Jupyter notebooks, and Docker container are available at https://github.com/boalang/nr. Supplementary information Supplementary data are available at Bioinformatics online.
-
Over the past five decades, a large number of wild animals have been individually identified by various observation systems and/or temporary tracking methods, providing unparalleled insights into their lives over both time and space. However, so far there is no comprehensive record of uniquely individually identified animals nor where their data and metadata are stored, for example photos, physiological and genetic samples, disease screens, information on social relationships. Databases currently do not offer unique identifiers for living, individual wild animals, similar to the permanent ID labelling for deceased museum specimens. To address this problem, we introduce two new concepts: (1) a globally unique animal ID (UAID) available to define uniquely and individually identified animals archived in any database, including metadata archived at the time of publication; and (2) the digital ‘home’ for UAIDs, the Movebank Life History Museum (MoMu), storing and linking metadata, media, communications and other files associated with animals individually identified in the wild. MoMu will ensure that metadata are available for future generations, allowing permanent linkages to information in other databases. MoMu allows researchers to collect and store photos, behavioural records, genome data and/or resightings of UAIDed animals, encompassing information not easily included in structured datasets supported by existing databases. Metadata is uploaded through the Animal Tracker app, the MoMu website, by email from registered users or through an Application Programming Interface (API) from any database. Initially, records can be stored in a temporary folder similar to a field drawer, as naturalists routinely do. Later, researchers and specialists can curate these materials for individual animals, manage the secure sharing of sensitive information and, where appropriate, publish individual life histories with DOIs. 
The storage of such synthesized lifetime stories of wild animals under a UAID (unique identifier or ‘animal passport’) will support basic science, conservation efforts and public participation.

