Title: Modernized uniform representation of carbohydrate molecules in the Protein Data Bank
Abstract Since 1971, the Protein Data Bank (PDB) has served as the single global archive for experimentally determined 3D structures of biological macromolecules made freely available to the global community according to the FAIR principles of Findability–Accessibility–Interoperability–Reusability. During the first 50 years of continuous PDB operations, standards for data representation have evolved to better represent rich and complex biological phenomena. Carbohydrate molecules present in more than 14,000 PDB structures have recently been reviewed and remediated to conform to a new standardized format. This machine-readable data representation for carbohydrates occurring in the PDB structures and the corresponding reference data improves the findability, accessibility, interoperability and reusability of structural information pertaining to these molecules. The PDB Exchange MacroMolecular Crystallographic Information File data dictionary now supports (i) standardized atom nomenclature that conforms to International Union of Pure and Applied Chemistry-International Union of Biochemistry and Molecular Biology (IUPAC-IUBMB) recommendations for carbohydrates, (ii) uniform representation of branched entities for oligosaccharides, (iii) commonly used linear descriptors of carbohydrates developed by the glycoscience community and (iv) annotation of glycosylation sites in proteins. For the first time, carbohydrates in PDB structures are consistently represented as collections of standardized monosaccharides, which precisely describe oligosaccharide structures and enable improved carbohydrate visualization, structure validation, robust quantitative and qualitative analyses, search for dendritic structures and classification. The uniform representation of carbohydrate molecules in the PDB described herein will facilitate broader usage of the resource by the glycoscience community and researchers studying glycoproteins.
Award ID(s):
1832184
NSF-PAR ID:
10312715
Journal Name:
Glycobiology
Volume:
31
Issue:
9
ISSN:
1460-2423
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Abstract

    Paleoscience data are extremely heterogeneous; hundreds of different types of measurements and reconstructions are routinely made by scientists on a variety of types of physical samples. This heterogeneity is one of the biggest barriers to finding paleoclimatic records, to building large‐scale data products, and to the use of paleoscience data beyond the community of specialists. Here, we document the Paleoenvironmental Standard Terms (PaST) thesaurus, the first authoritative vocabulary of standardized variable names for paleoclimatic and paleoenvironmental data developed in a formal knowledge organization structure. This structure is designed to improve data set discovery, support automated processing of data, and provide connectivity to other vocabularies. PaST is now used operationally at the World Data Service for Paleoclimatology (WDS‐Paleo), one of the largest repositories of paleoscience information. Terms from the PaST thesaurus standardize a broad array of paleoenvironmental and paleoclimatic measured and inferred variables, providing enough detail for accurate and precise data discovery and thereby promoting data reuse. We describe the main design decisions and features of the thesaurus, the governance structure for ongoing maintenance, and WDS‐Paleo services that now employ PaST. These services include an advanced search by variable name, an interface for thesaurus navigation, and a machine‐readable representation in the Simple Knowledge Organization System (SKOS) standard. This overview is designed for developers of thesauri, data contributors, and users of the WDS‐Paleo, and serves as a building block for future efforts within the broader paleoscience community to improve how data are described for long‐term findability, accessibility, interoperability, and reusability.

     
  2. Abstract Background

    The proliferation of metagenomic sequencing technologies has enabled novel insights into the functional genomic potentials and taxonomic structure of microbial communities. However, cyberinfrastructure efforts to manage and enable the reproducible analysis of sequence data have not kept pace. Thus, there is increasing recognition of the need to make metagenomic data discoverable within machine-searchable frameworks compliant with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for data stewardship. Although a variety of metagenomic web services exist, none currently leverage the hierarchically structured terminology encoded within common life science ontologies to programmatically discover data.

    Results

    Here, we integrate large-scale marine metagenomic datasets with community-driven life science ontologies into a novel FAIR web service. This approach enables the retrieval of data discovered by intersecting the knowledge represented within ontologies against the functional genomic potential and taxonomic structure computed from marine sequencing data. Our findings highlight various microbial functional and taxonomic patterns relevant to the ecology of prokaryotes in various aquatic environments.

    Conclusions

    In this work, we present and evaluate a novel Semantic Web architecture that can be used to ask novel biological questions of existing marine metagenomic datasets. Finally, the FAIR ontology searchable data products provided by our API can be leveraged by future research efforts.

     
  3. Abstract

    The FaceBase Consortium was established by the National Institute of Dental and Craniofacial Research in 2009 as a ‘big data’ resource for the craniofacial research community. Over the past decade, researchers have deposited hundreds of annotated and curated datasets on both normal and disordered craniofacial development in FaceBase, all freely available to the research community on the FaceBase Hub website. The Hub has developed numerous visualization and analysis tools designed to promote integration of multidisciplinary data while remaining dedicated to the FAIR principles of data management (findability, accessibility, interoperability and reusability) and providing a faceted search infrastructure for locating desired data efficiently. Summaries of the datasets generated by the FaceBase projects from 2014 to 2019 are provided here. FaceBase 3 now welcomes contributions of data on craniofacial and dental development in humans, model organisms and cell lines. Collectively, the FaceBase Consortium, along with other NIH-supported data resources, provides a continuously growing, dynamic and current resource for the scientific community while improving data reproducibility and fulfilling data sharing requirements.
  4. Summary

    High‐quality microbiome research relies on the integrity, management and quality of supporting data. Currently, biobanks and culture collections have different formats and approaches to data management. This necessitates a standard data format to underpin research, particularly in line with the FAIR data standards of findability, accessibility, interoperability and reusability. We address the importance of a unified, coordinated approach that ensures compatibility of data between biobanks and culture collections and also ensures linkage between bioinformatic databases and the wider research community.

     
  5. Digital publishing platforms and internet resources enable openness of access to scientific findings and data at scales never before realized. Unfortunately, researchers sometimes embrace lock-in systems for data generation and analysis out of necessity because meaningful alternatives do not exist. Scientific advances still take place when this occurs, but they become fragmented with discordant quality control, interoperability, reproducibility, and democratization of access. To maximize the value of these often publicly funded resources, disciplines are turning to the FAIR Guiding Principles for data stewardship. FAIR (Findability, Accessibility, Interoperability, and Reuse) promotes the added value of widespread data sharing that is transparent, equitable, and inclusive. Here we present NoCTURN, an NSF-funded FAIR Open Science Research Coordination Network for computed tomography users. NoCTURN (the Non-clinical Computed Tomography Users Research Network) aims to address the fragmentation of tomography toolkits stemming from proprietary software, non-uniform metadata formats, and repeatability limits. In this presentation, we outline how we will achieve this aim together by 1) developing a community committed to information sharing; 2) coordinating data analysis, storage, and reporting requirements; 3) highlighting underrepresented voices in the field; 4) developing community standards inclusive of industry, research, education, and outreach stakeholders; and 5) modeling FAIR open science strategies for our colleagues and students. NoCTURN is recruiting undergraduates through established investigators from X-ray-, neutron-, and synchrotron-beam computed tomography communities, and we want to hear from you.