NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

The Right Tool for the Job: Data-Centric Workflows in Vizier

Oliver Kennedy, Boris Glavic (September 2022, Bulletin of the Technical Committee on Data Engineering)
Sudeepa Roy and Jun Yang (Ed.)
Data scientists use a wide variety of systems with a wide variety of user interfaces such as spreadsheets and notebooks for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user-interfaces on top of a single incremental workflow / dataflow platform with built-in support for versioning, provenance, error & tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier lead to novel research in uncertain data management and incremental execution of workflows.
more » « less
Full Text Available
To Not Miss the Forest for the Trees - A Holistic Approach for Explaining Missing Answers over Nested Data

https://doi.org/10.1145/3448016.3457249

Diestelkämper, Ralf; Lee, Seokki; Herschel, Melanie; Glavic, Boris (July 2021, Proceedings of the 46th International Conference on Management of Data)
null (Ed.)
Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanations has proven useful, e.g., to debug complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify.
more » « less
Full Text Available
Putting Things into Context: Rich Explanations for Query Answers using Join Graphs

https://doi.org/10.1145/3448016.3459246

Li, Chenjie; Miao, Zhengjie; Zeng, Qitian; Glavic, Boris; Roy, Sudeepa (July 2021, Proceedings of the 46th International Conference on Management of Data)
null (Ed.)
In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly relied on data provenance, i.e., input tuples contributing to the result(s) of interest. However, some information that is relevant for explaining an answer may not be contained in the provenance. We propose a new approach for explaining query results by augmenting provenance with information from other related tables in the database. Using a suite of optimization techniques, we demonstrate experimentally using real datasets and through a user study that our approach produces meaningful results and is efficient.
more » « less
Full Text Available
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds

https://doi.org/10.1145/3448016.3452791

Feng, Su; Glavic, Boris; Huber, Aaron; Kennedy, Oliver A. (June 2021, SIGMOD '21: International Conference on Management of Data)
null (Ed.)
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
more » « less
Full Text Available
Reducing Ambiguity in Json Schema Discovery

https://doi.org/10.1145/3448016.3452801

Spoth, William; Kennedy, Oliver; Lu, Ying; Hammerschmidt, Beda; Liu, Zhen Hua (June 2021, SIGMOD '21: International Conference on Management of Data)
null (Ed.)
Ad-hoc data models like Json simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes Json harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. In this paper, we propose Jxplain, a Json schema discovery algorithm with heuristics that mitigate common forms of ambiguity. Although Jxplain is slightly slower than state of the art schema extractors, we show that it produces significantly more precise schemas.
more » « less
Full Text Available
DataSense: Display Agnostic Data Documentation

Kumari, Poonam; Brachmann, Michael; Kennedy, Oliver; Feng, Su; Glavic, Boris (January 2021, Conference on Innovative Data Systems Research)
Boncz, Peter; Ozcan, Fatma; Patel, Jignesh (Ed.)
Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising aware- ness of data quality problem, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time consuming and error prone. In this work, we present our vi- sion for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documenta- tion for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation from free form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
more » « less
Full Text Available
DataSense: Display-Agnostic Data Documentation

Kumari, Poonam; Brachmann, Michael; Kennedy, Oliver; Feng, Su; Glavic, Boris (January 2021, Conference on Innovative Data Systems Research)
null (Ed.)
Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising awareness of data quality problem, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time consuming and error prone. In this work, we present our vision for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documentation for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation from free form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
more » « less
Full Text Available
Data Provenance

https://doi.org/10.1561/1900000068

Glavic, Boris (January 2021, Foundations and Trends® in Databases)
null (Ed.)
Full Text Available
Loki: Streamlining Integration and Enrichment

Spoth, William; Kumari, Poonam; Kennedy, Oliver; Nargesian, Fatemeh (June 2020, Human in the Loop Data Analytics)
Abouzied, Azza; Amer-Yahia, Sihem; Ives, Zachary (Ed.)
Data scientists frequently transform data from one form to another while cleaning, integrating, and enriching datasets.Writing such transformations, or “mapping functions" is time-consuming and often involves significant code re-use. Unfortunately, when every dataset is slightly different from the last, finding the right mapping functions to re-use can be equally difficult. In this paper, we propose “Link Once and Keep It" (Loki), a system which consists of a repository of datasets and mapping functions and relates new datasets to datasets it already knows about, helping a data scientist to quickly locate and re-use mapping functions she developed for other datasets in the past. Loki represents a first step towards building and re-using repositories of domain-specific data integration pipelines.
more » « less
Full Text Available
Data-driven domain discovery for structured datasets

https://doi.org/10.14778/3384345.3384346

Ota, Masayo; Müller, Heiko; Freire, Juliana; Srivastava, Divesh (March 2020, Proceedings of the VLDB Endowment)
null (Ed.)
Full Text Available

« Prev Next »

Search for: All records