NSF PAR Search | NSF Public Access Repository


Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments

https://doi.org/10.1145/3665939.3665965

Pokharel, Pratik; Lee, Juseung; Kennedy, Oliver; Markatou, Marianthi; Talal, Andrew; Good, Jeff; Mukhopadhyay, Raktim (June 2024, ACM)

We explore data management for longitudinal study survey instruments: (i) Survey instrument evolution presents a unique data integration challenge; and (ii) Longitudinal study data frequently requires repeated, task-specific integration efforts. We present DDM (Drag, Drop, Merge), a user interface for documenting relationships among attributes of source schemas into a form that can streamline subsequent efforts to generate task-specific datasets. DDM employs a "human-in-the-loop" approach, allowing users to validate and refine semantic mappings. Through a simulation of user interactions with DDM, we demonstrate its viability as a way to reduce cognitive overhead for longitudinal study data curators.
Full Text Available
Overlay Spreadsheets

https://doi.org/10.1145/3597465.3605220

Kennedy, Oliver; Glavic, Boris; Brachmann, Michael (June 2023, HILDA '23: Proceedings of the Workshop on Human-In-the-Loop Data Analytics)

Full Text Available
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data

https://doi.org/10.14778/3583140.3583151

Feng, Su; Glavic, Boris; Kennedy, Oliver (February 2023, Proceedings of the VLDB Endowment)

Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is, to the best of our knowledge, the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors, and often produces more accurate results.
Full Text Available
The Right Tool for the Job: Data-Centric Workflows in Vizier

Kennedy, Oliver; Glavic, Boris (September 2022, Bulletin of the Technical Committee on Data Engineering)
Roy, Sudeepa; Yang, Jun (Ed.)
Data scientists use a wide variety of systems with a wide variety of user interfaces, such as spreadsheets and notebooks, for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user interfaces on top of a single incremental workflow / dataflow platform with built-in support for versioning, provenance, error tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier led to novel research in uncertain data management and incremental execution of workflows.
Full Text Available
Runtime provenance refinement for notebooks

https://doi.org/10.1145/3530800.3534535

Deo, Nachiket; Glavic, Boris; Kennedy, Oliver (June 2022, Proceedings of the 14th International Workshop on the Theory and Practice of Provenance)

Full Text Available
Reducing Ambiguity in Json Schema Discovery

https://doi.org/10.1145/3448016.3452801

Spoth, William; Kennedy, Oliver; Lu, Ying; Hammerschmidt, Beda; Liu, Zhen Hua (June 2021, SIGMOD '21: International Conference on Management of Data)
Ad-hoc data models like Json simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes Json harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. In this paper, we propose Jxplain, a Json schema discovery algorithm with heuristics that mitigate common forms of ambiguity. Although Jxplain is slightly slower than state-of-the-art schema extractors, we show that it produces significantly more precise schemas.
Full Text Available
DataSense: Display Agnostic Data Documentation

Kumari, Poonam; Brachmann, Michael; Kennedy, Oliver; Feng, Su; Glavic, Boris (January 2021, Conference on Innovative Data Systems Research)
Boncz, Peter; Ozcan, Fatma; Patel, Jignesh (Ed.)
Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising awareness of data quality problems, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time-consuming and error-prone. In this work, we present our vision for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documentation for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation, from free-form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
Full Text Available
Loki: Streamlining Integration and Enrichment

Spoth, William; Kumari, Poonam; Kennedy, Oliver; Nargesian, Fatemeh (June 2020, Human in the Loop Data Analytics)
Abouzied, Azza; Amer-Yahia, Sihem; Ives, Zachary (Ed.)
Data scientists frequently transform data from one form to another while cleaning, integrating, and enriching datasets. Writing such transformations, or “mapping functions,” is time-consuming and often involves significant code re-use. Unfortunately, when every dataset is slightly different from the last, finding the right mapping functions to re-use can be equally difficult. In this paper, we propose “Link Once and Keep It” (Loki), a system which consists of a repository of datasets and mapping functions and relates new datasets to datasets it already knows about, helping a data scientist to quickly locate and re-use mapping functions she developed for other datasets in the past. Loki represents a first step towards building and re-using repositories of domain-specific data integration pipelines.
Full Text Available
Your notebook is not crumby enough, REPLace it

Brachmann, Michael; Spoth, William (January 2020, Conference on Innovative Data Systems Research (CIDR))

Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts to build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
Full Text Available
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers

https://doi.org/10.1145/3299869.3319887

Feng, Su; Huber, Aaron; Glavic, Boris; Kennedy, Oliver (June 2019, SIGMOD)

Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
Full Text Available
