skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Thursday, January 16 until 2:00 AM ET on Friday, January 17 due to maintenance. We apologize for the inconvenience.


Title: Adaptive Schema Databases
The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information extraction techniques, as well as from queries and user-feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, which are fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries, and the quality of the resulting answers. We outline our vision for ASDs, and present a proof of concept implementation as part of the Mimir probabilistic data curation system.  more » « less
Award ID(s):
1640864
PAR ID:
10048275
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ad-hoc data models like JSON make it easy to evolve schemas and to multiplex different data-types into a single stream. This flexibility makes JSON great for generating data, but also makes it much harder to query, ingest into a database, and index. In this paper, we explore the first step of JSON data loading: schema design. Specifically, we consider the challenge of designing schemas for existing JSON datasets as an interactive problem. We present SchemaDrill, a roll-up/drill-down style interface for exploring collections of JSON records. SchemaDrill helps users to visualize the collection, identify relevant fragments, and map it down into one or more flat, relational schemas. We describe and evaluate two key components of SchemaDrill: (1) A summary schema representation that significantly reduces the complexity of JSON schemas without a meaningful reduction in information content, and (2) A collection of schema visualizations that help users to qualitatively survey variability amongst different schemas in the collection. 
    more » « less
  2. Relational databases play an important role in business, science, and more. However, many users cannot fully unleash the analytical power of relational databases, because they are not familiar with database languages such as SQL. Many techniques have been proposed to automatically generate SQL from natural language, but they suffer from two issues: (1) they still make many mistakes, particularly for complex queries, and (2) they do not provide a flexible way for non-expert users to validate and refine incorrect queries. To address these issues, we introduce a new interaction mechanism that allows users to directly edit a step-by-step explanation of a query to fix errors. Our experiments on multiple datasets, as well as a user study with 24 participants, demonstrate that our approach can achieve better performance than multiple SOTA approaches. 
    more » « less
  3. null (Ed.)
    Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results. 
    more » « less
  4. It has been decades since category theory was applied to databases. In spite of their mathematical elegance, categorical models have traditionally had difficulty representing factual data, such as integers or strings. This paper proposes a categorical dataset for power system computational models, which is used for AC Optimal Power Flows (ACOPF). In addition, categorical databases incorporate factual data using multi-sorted algebraic theories (also known as Lawvere theories) based on the set-valued functor model. In the advanced metering infrastructure of power systems, this approach is capable of handling missing information efficiently. This methodology enables constraints and queries to employ operations on data, such as multiplicative and comparative processes, thereby facilitating the integration between conventional databases and programming languages like Julia and Python’s Pypower. The demonstration illustrates how all elements of the model, including schemas, instances, and functors, can modify the schema in ACOPF instances. 
    more » « less
  5. We consider the problem of answering temporal queries on RDF stores, in presence of atemporal RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source schemas into the target ontology. Our proposed practice-oriented solution consists of two rule-based domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches both the data and the ontology with temporal information from the relational sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology using a lightweight temporal extension of SPARQL, and ensures successful evaluation of the queries on the materialized temporally-enriched RDF data. To study the quality of the information generated by the algorithms, we develop a general framework that formalizes the relational-to-RDF temporal data-exchange problem. The framework includes a chase formalism and a formal solution for the problem of answering temporal queries in the context of relational-to-RDF temporal data exchange. In this article, we present the algorithms and the formal framework that proves correctness of the information output by the algorithms, and also report on the algorithm implementation and experimental results for two application domains. 
    more » « less