Title: DataVault: a data storage infrastructure for the Einstein Toolkit
Abstract Data sharing is essential in numerical simulation research. We introduce DataVault, a data repository designed for data sharing, search, and analysis. We perform a comparative study of existing repositories to identify the features that are critical to a data repository. We describe the architecture, workflow, and deployment of DataVault, and provide three use-case scenarios for different communities to facilitate its use and application. We also propose potential features and outline their future development.
Award ID(s):
2004879
NSF-PAR ID:
10320826
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Classical and Quantum Gravity
Volume:
38
Issue:
13
ISSN:
0264-9381
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Object signatures have been widely used in object detection and classification. Following a similar idea, the authors developed geometric signatures for architecture, engineering, and construction (AEC) objects such as footings, slabs, walls, beams, and columns. The signatures were developed both scientifically and empirically, following a data-driven approach based on analysis of collected building information modeling (BIM) data using geometric theories. Rigorous geometric properties and statistical information were included in the developed geometric signatures. To enable open access to BIM data using these signatures, the authors also initiated a BIM data repository with a preliminary collection of AEC objects and their geometric signatures. The developed geometric signatures were preliminarily tested in a small object classification experiment that used 389 object instances from an architectural model. A rule-based algorithm developed using all parameter values of 14 features from the geometric signatures of the objects successfully classified 336 object instances into the correct categories of beams, columns, slabs, and walls. This accuracy of more than 85% showed that the developed geometric signatures are promising. The collected and processed data were deposited into the Purdue University Research Repository (PURR) for sharing.
  2. As requirements for swift and sustainable data sharing grow, questions of where and how researchers share data are becoming increasingly important for institutions to answer. One of the goals of the Realities of Academic Data Sharing (RADS) Initiative, comprising six academic institutions from the Data Curation Network (DCN), was to answer this question. This presentation will discuss how RADS determined where data from our researchers are shared. To do this, we programmatically pulled DOIs from DataCite, making the naive assumption that the information we were collecting, the metadata fields we were utilizing, and the platforms we were using would present us with a neutral and unbiased view of where data from our affiliated researchers were shared. However, as we dug into the data, we found inconsistencies in the use and completeness of the metadata fields necessary for our questions, as well as differences in how DOIs were assigned across repositories. While we expected some differences, we did not anticipate that these subtle differences would dramatically affect how we interpret the answer to the question of where data are shared. Our presentation will highlight examples in our work that show how these subtleties in the data are systematic and challenge our assumptions of neutrality, not just of the data, but of our platforms and practices as well. By examining these biases, we are forced to reexamine the decisions behind how we practice and, as we move forward as information and repository managers, to consider how to reduce bias and the assumption of neutrality. As a community, we often rely on data-driven decisions, and decision makers need to be aware of these biases, especially as we are likely to see increased investments due to evolving data policies and practices.
  3. Tephra is a unique volcanic product that plays an unparalleled role in understanding past eruptions, the long-term behavior of volcanoes, and the effects of volcanism on climate and the environment. Tephra deposits also provide spatially widespread, extremely high-resolution time-stratigraphic markers across a range of sedimentary settings and are used by many disciplines (e.g. volcanology, seismotectonics, climate science, archaeology, ecology, public health and ash impact assessment). In the last two decades, tephra studies have become more interdisciplinary in nature but are challenged by a lack of standardization that often prevents comparison amongst various regions and across disciplines. To address this challenge, the global tephra community has come together through a series of workshops to establish best practice recommendations for tephra studies from sample collection through analysis and data reporting. This new standardized framework will facilitate consistent tephra documentation and parametrization, foster interdisciplinary communication, and improve effectiveness of data sharing among diverse communities of researchers. One specific goal is to use the best practice guidelines to inform digital tool and data repository development. Here we report on 1) a new set of templates for tephra sample documentation, geochemical method documentation and data reporting using recommended best-practice data and metadata fields, 2) a new tephra module added to StraboSpot, an open source geologic mapping and data-recording multi-platform software application, and 3) new implementations and cross-mapping of metadata requirements at SESAR (System for Earth Sample Registration) and EarthChem. Addition of tephra-specific fields to StraboSpot enables users to consistently collect and report essential tephra data in the field which is then automatically saved to an online data repository. A new tephra portal on the EarthChem website will allow users to follow simple workflows to register tephra samples at SESAR and submit microanalytical data to EarthChem.
  4. Incomplete and inconsistent connections between institutional repository holdings and the global data infrastructure inhibit research data discovery and reusability. Preventing metadata loss on the path from institutional repositories to the global research infrastructure can substantially improve research data reusability. The Realities of Academic Data Sharing (RADS) Initiative, funded by the National Science Foundation, is investigating institutional processes for improving research data FAIRness. Focal points of the RADS inquiry are to understand where researchers are sharing their data and to assess metadata quality, i.e., completeness, at six Data Curation Network (DCN) academic institutions: Cornell University, Duke University, University of Michigan, University of Minnesota, Washington University in St. Louis, and Virginia Tech. RADS is examining where researchers are storing their data, considering local institutional repositories and other popular repositories, and analyzing the completeness of the research data metadata stored in these institutional and other repositories. Metadata FAIRness (Findable, Accessible, Interoperable, Reusable) is used as the metric to assess metadata quality as FAIR complete. Research findings show significant content loss when metadata from local institutional repositories are compared to metadata found in DataCite. After examining the factors contributing to this metadata loss, RADS investigators are developing a set of recommended best practices for institutions to increase the quality of their scholarly metadata. Further, documentation such as README files is of particular importance not only for data reuse, but also as a source of valuable metadata such as Persistent Identifiers (PIDs). DOIs and related PIDs such as ORCID and ROR are still rarely used in institutional repositories. More frequent use would have a positive effect on discoverability, interoperability and reusability, especially when transferring to global infrastructure.
  5. Public genomic repositories are notoriously lacking in racially and ethnically diverse samples. This limits the reach of exploration and has in fact been one of the driving factors for the initiation of the All of Us project. Our particular focus here is to provide a model-based framework for accurately predicting DNA methylation from genetic data using racially sparse public repository data. Epigenetic alterations are of great interest in cancer research, but public repository data is limited in the information it provides. However, genetic data is more plentiful. Our phenotype of interest is cervical cancer in The Cancer Genome Atlas (TCGA) repository. Being able to generate such predictions would nicely complement other work that has generated gene-level predictions of gene expression for normal samples. We develop a new prediction approach which uses shared random effects from a nested error mixed effects regression model. The sharing of random effects allows borrowing of strength across racial groups, greatly improving predictive accuracy. Additionally, we show how to further borrow strength by combining data from different cancers in TCGA even though the focus of our predictions is DNA methylation in cervical cancer. We compare our methodology against other popular approaches including the elastic net shrinkage estimator and random forest prediction. Results are very encouraging, with the shared classified random effects approach uniformly producing more accurate predictions – overall and for each racial group.