Title: DataSense: Display-Agnostic Data Documentation
Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising awareness of data quality problems, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time-consuming and error-prone. In this work, we present our vision for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documentation for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation, from free-form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
Award ID(s):
1640864
PAR ID:
10274666
Journal Name:
Conference on Innovative Data Systems Research
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Boncz, Peter; Ozcan, Fatma; Patel, Jignesh (Ed.)
    Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising awareness of data quality problems, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time-consuming and error-prone. In this work, we present our vision for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documentation for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation, from free-form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
  2. EDBT (Ed.)
    Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly crucial. Inspired by research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.
  3. Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo. 
  4. Background: Social and behavioral determinants of health (SBDH) are environmental and behavioral factors that often impede disease management and result in sexually transmitted infections. Despite their importance, SBDH are inconsistently documented in electronic health records (EHRs) and typically collected only in an unstructured format. Evidence suggests that structured data elements present in EHRs can further help identify SBDH in the patient record.
    Objective: Explore the automated inference of both the presence of SBDH documentation and individual SBDH risk factors in patient records. Compare the relative ability of clinical notes and structured EHR data, such as laboratory measurements and diagnoses, to support inference.
    Methods: We attempt to infer the presence of SBDH documentation in patient records, as well as patient status of 11 SBDH, including alcohol abuse, homelessness, and sexual orientation. We compare classification performance when considering clinical notes only, structured data only, and notes and structured data together. We perform an error analysis across several SBDH risk factors.
    Results: Classification models inferring the presence of SBDH documentation achieved good performance (F1 score: 92.7–78.7; F1 considered as the primary evaluation metric). Performance was variable for models inferring patient SBDH risk status; results ranged from F1 = 82.7 for LGBT (lesbian, gay, bisexual, and transgender) status to F1 = 28.5 for intravenous drug use. Error analysis demonstrated that lexical diversity and documentation of historical SBDH status challenge inference of patient SBDH status. Three of five classifiers inferring topic-specific SBDH documentation and 10 of 11 patient SBDH status classifiers achieved highest performance when trained using both clinical notes and structured data.
    Conclusion: Our findings suggest that combining clinical free-text notes and structured data provides the best approach for classifying patient SBDH status. Inferring patient SBDH status is most challenging among SBDH with low prevalence and high lexical diversity.
  5. Marine invasive species can transform coastal ecosystems, yet mitigating their effects can be difficult, and even impractical. Often, marine invasive species are managed at poorly matched spatial scales, and at the same time, rates of spread and establishment are increasing under climate change and can outpace resources available for population suppression. These circumstances challenge traditional conservation goals of maintaining a historic environmental state, especially for a species like the European green crab (Carcinus maenas), a formidable invader with few examples of successful long‐term removal programs.
    A management paradigm where decision alternatives include resisting or accepting a new ecological trajectory may be needed. We apply mathematical concepts from decision theory to develop a quantitative framework for navigating management decisions in this new resist‐accept paradigm. We develop a model of European green crab growth, removal and colonization, and we find optimal levels of removal effort that minimize both ecological change and removal cost.
    We establish a benchmark of colonization pressure at which green crab density becomes decoupled from a decision maker's actions, such that population control can no longer shape the invasion trajectory. For informing the decision boundary between resistance and acceptance, our results highlight that a decision maker's understanding of how removal cost scales with removal effort is more important than understanding the density‐impact relationship.
    We show that assuming stationary system dynamics can result in sub‐optimal levels of species removal effort, highlighting the importance of developing anticipatory management strategies by accounting for non‐stationary dynamics.
    Policy implications. For marine invasive species that can disperse across long distances and recolonize rapidly after removal, the focus of conservation policy should shift away from understanding how to resist change to understanding when to stop resisting change. Navigating this decision problem involves trade‐offs among competing objectives, highlighting the need for structured approaches to elicit objective weights that reflect the values of the decision maker. For natural resource managers facing possible ecosystem transformation, this decision framework can enable proactive and strategic decisions made under uncertainty in a changing world.