NSF PAR Search | NSF Public Access Repository

Reducing Ambiguity in Json Schema Discovery

https://doi.org/10.1145/3448016.3452801

Spoth, William; Kennedy, Oliver; Lu, Ying; Hammerschmidt, Beda; Liu, Zhen Hua (June 2021, SIGMOD '21: International Conference on Management of Data)
Ad-hoc data models like Json simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes Json harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. In this paper, we propose Jxplain, a Json schema discovery algorithm with heuristics that mitigate common forms of ambiguity. Although Jxplain is slightly slower than state-of-the-art schema extractors, we show that it produces significantly more precise schemas.
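To make the false-positive problem in this abstract concrete, the following is a minimal sketch of a naive schema extractor that applies one data-independent assumption: every observed key is an optional field of a single record type. It is not Jxplain and not any published extractor; the function names `naive_schema` and `validates` and the sample stream are hypothetical, chosen only for illustration.

```python
def naive_schema(records):
    """Union all top-level keys, recording the observed type names per key."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def validates(record, schema):
    """Accept a record if every key is known and its value has an observed type.

    All keys are treated as optional (an illustrative, data-independent
    assumption), so co-occurrence of keys is never checked.
    """
    return all(
        key in schema and type(value).__name__ in schema[key]
        for key, value in record.items()
    )


# Hypothetical stream that multiplexes two kinds of records.
stream = [
    {"user": "ada", "followers": 10},
    {"tweet_id": 1, "text": "hello"},
]
schema = naive_schema(stream)

# This record mixes fields from both kinds and matches neither of them,
# yet the merged schema accepts it: a false positive during validation.
mixed = {"user": "ada", "text": "hello"}
false_positive = validates(mixed, schema)
```

Because the merged schema forgets which keys occurred together, it accepts records that no source in the stream would ever produce. A more precise schema would have to keep the two record kinds separate.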
Full Text Available
Loki: Streamlining Integration and Enrichment

Spoth, William; Kumari, Poonam; Kennedy, Oliver; Nargesian, Fatemeh (June 2020, Human in the Loop Data Analytics)
Abouzied, Azza; Amer-Yahia, Sihem; Ives, Zachary (Ed.)
Data scientists frequently transform data from one form to another while cleaning, integrating, and enriching datasets. Writing such transformations, or “mapping functions,” is time-consuming and often involves significant code re-use. Unfortunately, when every dataset is slightly different from the last, finding the right mapping functions to re-use can be equally difficult. In this paper, we propose “Link Once and Keep It” (Loki), a system that consists of a repository of datasets and mapping functions and relates new datasets to datasets it already knows about, helping a data scientist to quickly locate and re-use mapping functions she developed for other datasets in the past. Loki represents a first step towards building and re-using repositories of domain-specific data integration pipelines.
Full Text Available
Data Debugging and Exploration with Vizier

https://doi.org/10.1145/3299869.3320246

Brachmann, Mike; Spoth, William; Yang, Ying; Bautista, Carlos; Castelo, Sonia; Feng, Su; Freire, Juliana; Glavic, Boris; Kennedy, Oliver; Müeller, Heiko; et al (January 2019, SIGMOD)

We present Vizier, a multi-modal data exploration and debugging tool. The system supports a wide range of operations by seamlessly integrating Python, SQL, and automated data curation and debugging methods. Using Spark as an execution backend, Vizier handles large datasets in multiple formats. Ease of use is attained through integration of a notebook with a spreadsheet-style interface and with visualizations that guide and support the user in the loop. In addition, native support for provenance and versioning enables collaboration and uncertainty management. In this demonstration we will illustrate the diverse features of the system using several realistic data science tasks based on real data.
Full Text Available
SchemaDrill: Interactive Semi-Structured Schema Design

https://doi.org/10.1145/3209900.3209908

Spoth, William; Xie, Ting; Kennedy, Oliver; Yang, Ying; Hammerschmidt, Beda; Liu, Zhen Hua; Gawlick, Dieter (January 2018, Proceedings of the Workshop on Human-In-the-Loop Data Analytics)

Ad-hoc data models like JSON make it easy to evolve schemas and to multiplex different data types into a single stream. This flexibility makes JSON great for generating data, but also makes it much harder to query, ingest into a database, and index. In this paper, we explore the first step of JSON data loading: schema design. Specifically, we consider the challenge of designing schemas for existing JSON datasets as an interactive problem. We present SchemaDrill, a roll-up/drill-down style interface for exploring collections of JSON records. SchemaDrill helps users to visualize the collection, identify relevant fragments, and map it down into one or more flat, relational schemas. We describe and evaluate two key components of SchemaDrill: (1) a summary schema representation that significantly reduces the complexity of JSON schemas without a meaningful reduction in information content, and (2) a collection of schema visualizations that help users to qualitatively survey variability amongst different schemas in the collection.
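As a rough illustration of what mapping a JSON record down into a flat, relational row can look like, here is a minimal sketch that flattens nested objects into dotted column names. It does not reproduce SchemaDrill's interface or its summary representation; the `flatten` helper, the dotted naming convention, and the sample record are assumptions made for this example.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# Hypothetical nested record.
record = {"id": 7, "user": {"name": "ada", "geo": {"lat": 1.5}}}
row = flatten(record)
```

Each key of `row` can serve as a column in a flat table. Arrays and records with differing shapes are deliberately left out of this sketch; they are where choosing one or more target schemas becomes a design decision.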
Full Text Available
