Title: DataVault: a data storage infrastructure for the Einstein Toolkit
Abstract: Data sharing is essential in numerical simulation research. We introduce a data repository, DataVault, which is designed for data sharing, search, and analysis. A comparative study of existing repositories is performed to analyze features that are critical to a data repository. We describe the architecture, workflow, and deployment of DataVault, and provide three use-case scenarios for different communities to facilitate the use and application of DataVault. We also propose potential features and outline their future development.
Journal Name: Classical and Quantum Gravity
Sponsoring Org: National Science Foundation
More Like this
  1. Object signatures have been widely used in object detection and classification. Following a similar idea, the authors developed geometric signatures for architecture, engineering, and construction (AEC) objects such as footings, slabs, walls, beams, and columns. The signatures were developed both scientifically and empirically, by following a data-driven approach based on analysis of collected building information modeling (BIM) data using geometric theories. Rigorous geometric properties and statistical information were included in the developed geometric signatures. To enable open access to BIM data using these signatures, the authors also initiated a BIM data repository with a preliminary collection of AEC objects and their geometric signatures. The developed geometric signatures were preliminarily tested in a small object classification experiment that used 389 object instances from an architectural model. A rule-based algorithm developed using all parameter values of 14 features from the geometric signatures of the objects successfully classified 336 object instances into the correct categories of beams, columns, slabs, and walls. This accuracy, higher than 85%, showed that the developed geometric signatures are promising. The collected and processed data were deposited into the Purdue University Research Repository (PURR) for sharing.
  2. Public genomic repositories are notoriously lacking in racially and ethnically diverse samples. This limits the reach of exploration and has in fact been one of the driving factors for the initiation of the All of Us project. Our particular focus here is to provide a model-based framework for accurately predicting DNA methylation from genetic data using racially sparse public repository data. Epigenetic alterations are of great interest in cancer research, but public repository data is limited in the information it provides. However, genetic data is more plentiful. Our phenotype of interest is cervical cancer in The Cancer Genome Atlas (TCGA) repository. Being able to generate such predictions would nicely complement other work that has generated gene-level predictions of gene expression for normal samples. We develop a new prediction approach which uses shared random effects from a nested error mixed effects regression model. The sharing of random effects allows borrowing of strength across racial groups, greatly improving predictive accuracy. Additionally, we show how to further borrow strength by combining data from different cancers in TCGA even though the focus of our predictions is DNA methylation in cervical cancer. We compare our methodology against other popular approaches, including the elastic net shrinkage estimator and random forest prediction. Results are very encouraging, with the shared classified random effects approach uniformly producing more accurate predictions, both overall and for each racial group.
  3. Sharing data can be a journey with various characters, challenges along the way, and uncertain outcomes. These “epic journeys in sharing” teach information professionals about our patrons, our institutions, our community, and ourselves. In this paper, we tell a particularly dramatic data-sharing story, in effect a case study, in the form of a Greek drama. It is the quest of a young, idealistic researcher who collects fascinating, sensitive data and seeks to share it, encountering an institution doing its due diligence, helpful library folks, and an expert repository. Our story has moments of joy, such as when our researcher is solely motivated to share because she wants others to be able to reuse her unique data; dramatic plot twists involving IRBs; and a poignant ending. It explores major tropes and themes about how researchers’ motivations, data types, and data sensitivity can impact sharing; the importance of having clarity concerning institutional policies and procedures; and the role of professional communities and relationships. Just as the chorus in Greek drama provides commentary on the action, a chorus of data elders in our drama points out larger lessons that the case study has for research data management and data sharing. Where actors in the Greek chorus wore masks, our chorus carries different items, symbolizing their message, on every entry. [i] The narrative structure of this paper was inspired by the IASSIST 2018 conference theme of “Once Upon a Data Point: Sustaining Our Data Storytellers.”
  4. Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that, through careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 3.1k stars on GitHub.
  5. Objective: Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real time sharing of person-level surveillance data. Materials and Methods: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.01. Results: When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%. Conclusion: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
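
The PK11 metric in item 5 is defined there as the proportion of records corresponding to a group size less than 11. A minimal sketch of that computation follows, assuming each record has already been reduced to a tuple of generalized quasi-identifiers; the function name `pk` and the sample fields (county, age band, week) are hypothetical and are not taken from the cited framework.

```python
from collections import Counter


def pk(records, k=11):
    """Return the proportion of records whose group has fewer than k members.

    `records` is a list of hashable quasi-identifier tuples (an assumed
    representation); records with identical tuples form one group.
    """
    if not records:
        return 0.0
    group_sizes = Counter(records)
    # Count the records (not the groups) that sit in groups smaller than k.
    in_small_groups = sum(n for n in group_sizes.values() if n < k)
    return in_small_groups / len(records)


# Hypothetical generalized records: (county, age band, ISO week).
records = (
    [("A", "30-39", "2020-W35")] * 12  # group of 12: not counted
    + [("B", "60+", "2020-W35")] * 3   # group of 3: all 3 records counted
)
print(pk(records))  # 3 of the 15 records fall in a group smaller than 11
```

Under these assumptions, a release policy could be checked by comparing `pk(records)` against whichever threshold is in force.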