- NSF-PAR ID:
- 10061133
- Date Published:
- Journal Name:
- Proceedings of the Workshop on Human-In-the-Loop Data Analytics
- Page Range / eLocation ID:
- 1 to 7
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
null (Ed.)Ad-hoc data models like Json simplify schema evolution and enable multiplexing various data sources into a single stream. While useful when writing data, this flexibility makes Json harder to validate and query, forcing such tasks to rely on automated schema discovery techniques. Unfortunately, ambiguity in the schema design space forces existing schema discovery systems to make simplifying, data-independent assumptions about schema structure. When these assumptions are violated, most notably by APIs, the generated schemas are imprecise, creating numerous opportunities for false positives during validation. In this paper, we propose Jxplain, a Json schema discovery algorithm with heuristics that mitigate common forms of ambiguity. Although Jxplain is slightly slower than state of the art schema extractors, we show that it produces significantly more precise schemas.more » « less
-
The popularity of JSON as a data interchange format resulted in big amounts of datasets available for processing. Users would like to analyze this data using SQL queries but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up-to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsersmore » « less
-
The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information extraction techniques, as well as from queries and user-feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, which are fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries, and the quality of the resulting answers. We outline our vision for ASDs, and present a proof of concept implementation as part of the Mimir probabilistic data curation system.more » « less
-
Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a bynow classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for relations over that schema holds, which means that every collection of pairwise consistent relations over the schema is globally consistent. Even though real-life databases consist of bags (multisets), there has not been a study of the interplay between local consistency and global consistency for bags. We embark on such a study here and we first show that the sets of attributes of a schema form an acyclic hypergraph if and only if the local-to-global consistency property for bags over that schema holds. After this, we explore algorithmic aspects of global consistency for bags by analyzing the computational complexity of the global consistency problem for bags: given a collection of bags, are these bags globally consistent? We show that this problem is in NP, even when the schema is part of the input. We then establish the following dichotomy theorem for fixed schemas: if the schema is acyclic, then the global consistency problem for bags is solvable in polynomial time, while if the schema is cyclic, then the global consistency problem for bags is NP-complete. The latter result contrasts sharply with the state of affairs for relations, where, for each fixed schema, the global consistency problem for relations is solvable in polynomial time.more » « less
-
Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the
binder.sqlite
file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in theevents
table, which has the following schema.CREATE TABLE events( version INTEGER, timestamp TEXT, provider TEXT, spec TEXT, origin TEXT, ref TEXT, guessed_ref TEXT ); CREATE INDEX idx_timestamp ON events(timestamp);
version
indicates the version of the record as assigned by Binder. Theorigin
field became available with version 3, and theref
field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.timestamp
is the ISO timestamp of the launchprovider
gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub, other providers may differ.spec
gives the particular branch/release/commit being built. It consists of<github-id>/<repo>/<branch>
.origin
indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.ref
specifies the git commit that was actually used, rather than the named branch referenced byspec
. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.- For records where
ref
is not available, we attempted to clone the named reference given byspec
rather than the specific commit (see below). Theguessed_ref
field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-nullref
) will exclude these guessed commits. May be null.
The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both
ref
andguessed_ref
in theevents
table). The schema is as follows.CREATE TABLE spec_files ( ref TEXT NOT NULL PRIMARY KEY, ls TEXT, runtime BLOB, apt BLOB, conda BLOB, pip BLOB, pipfile BLOB, julia BLOB, r BLOB, nix BLOB, docker BLOB, setup BLOB, postbuild BLOB, start BLOB );
Here
ref
corresponds toref
and/orguessed_ref
from theevents
table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed.runtime
:runtime.txt
apt
:apt.txt
conda
:environment.yml
pip
:requirements.txt
pipfile
:Pipfile.lock
orPipfile
julia
:Project.toml
orREQUIRE
r
:install.R
nix
:default.nix
docker
:Dockerfile
setup
:setup.py
postbuild
:postBuild
start
:start
The
ls
field gives a metadata listing of the repo contents (excluding the.git
directory). This field is JSON encoded with the following structure based on JSON types:- Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
- String: symlink. The string value gives the link target.
- Number: regular file. The number value gives the file size in bytes.
CREATE TABLE clean_specs ( ref TEXT NOT NULL PRIMARY KEY, conda_channels TEXT, conda_packages TEXT, pip_packages TEXT, apt_packages TEXT );
The
clean_specs
table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using theconda.models.match_spec.MatchSpec
class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.The
missing
table gives the repos that were not accessible, andevent_logs
records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.