skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: SQL++: We Can Finally Relax!
SQL is five decades old and has outlasted many programming and query languages that have come and gone during its lifetime. It was born shortly after the introduction of the relational model, and was designed for querying a flat and typed tabular world. Support for modern, flexible data in the SQL standard and in relational database systems has largely been approached via the addition of new column types (e.g. XML or JSON) together with functions to operate on them. It is time for a cleaner solution that retains the benefits that have allowed SQL to be so successful for so long. We describe SQL++, a SQL extension that relaxes SQL's strictness in terms of both object structure (flat → nested) and schema (mandatory → optional), along with a multi-party effort to agree on a core definition and syntax supportable by multiple vendors. SQL++ sees relational data as a subset of a more flexible object model and it sees collections of document data (e.g., JSON) as a natural and supportable relaxation as opposed to a “bolt on” addition via a SQL column type. We describe the core features of SQL++ and explain how its definition can accommodate flexible data, while staying true to SQL in situations where the target data is tabular and strongly typed. Index Terms-semistructured data, query, JSON, SQL, NoSQL  more » « less
Award ID(s):
1954962 1954644
PAR ID:
10548703
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
2024 IEEE 40th International Conference on Data Engineering (ICDE)
Date Published:
ISBN:
979-8-3503-1715-2
Page Range / eLocation ID:
5501 to 5510
Format(s):
Medium: X
Location:
Utrecht, Netherlands
Sponsoring Org:
National Science Foundation
More Like this
  1. Ad-hoc data models like JSON make it easy to evolve schemas and to multiplex different data-types into a single stream. This flexibility makes JSON great for generating data, but also makes it much harder to query, ingest into a database, and index. In this paper, we explore the first step of JSON data loading: schema design. Specifically, we consider the challenge of designing schemas for existing JSON datasets as an interactive problem. We present SchemaDrill, a roll-up/drill-down style interface for exploring collections of JSON records. SchemaDrill helps users to visualize the collection, identify relevant fragments, and map it down into one or more flat, relational schemas. We describe and evaluate two key components of SchemaDrill: (1) A summary schema representation that significantly reduces the complexity of JSON schemas without a meaningful reduction in information content, and (2) A collection of schema visualizations that help users to qualitatively survey variability amongst different schemas in the collection. 
    more » « less
  2. Comparing relational languages by their logical expressiveness is well understood. Less understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than ''a certain way of writing a query''? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? Our SIGMOD 2024 paper proposes a semantic definition of relational query patterns that uses a variant of structure-preserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational languages. Notably, for the nondisjunctive language fragment, we show that relational calculus (RC) can express a larger class of patterns than the basic operators of relational algebra (RA). We also propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus. These diagrams can represent all query patterns for unions of non-disjunctive queries, in contrast to visual query representations that derive visual marks from the basic operators of algebra. Our anonymously preregistered user study shows that Relational Diagrams allow users to recognize relational patterns meaningfully faster and more accurately than they can with SQL. 
    more » « less
  3. Comparing relational languages by their logical expressiveness is well understood. Less well understood is how to compare relational languages by their ability to represent relational query patterns. Indeed, what are query patterns other than a certain way of writing a query? And how can query patterns be defined across procedural and declarative languages, irrespective of their syntax? To the best of our knowledge, we provide the first semantic definition of relational query patterns by using a variant of structure-preserving mappings between the relational tables of queries. This formalism allows us to analyze the relative pattern expressiveness of relational language fragments and create a hierarchy of languages with equal logical expressiveness yet different pattern expressiveness. Notably, for the non-disjunctive language fragment, we show that relational calculus can express a larger class of patterns than the basic operators of relational algebra. Our language-independent definition of query patterns opens novel paths for assisting database users. For example, these patterns could be leveraged to create visual query representations that faithfully represent query patterns, speed up interpretation, and provide visual feedback during query editing. As a concrete example, we propose Relational Diagrams, a complete and sound diagrammatic representation of safe relational calculus that is provably (i) unambiguous, (ii) relationally complete, and (iii) able to represent all query patterns for unions of non-disjunctive queries. Among all diagrammatic representations for relational queries that we are aware of, ours is the only one with these three properties. Furthermore, our anonymously preregistered user study shows that Relational Diagrams allow users to recognize patterns meaningfully faster and more accurately than SQL. 
    more » « less
  4. Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. Existing research has generally focused on generating SQL statements from text queries, and the broader challenge lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains additional query types limited in prior text-to-SQL datasets, notably, temporal-related queries. Our dataset is sourced from a smart building’s IoT ecosystem exploring sensor read and network traffic data. Second, our dataset allows two-stage processing, where the returned data (network traffic) from a generated SQL can be categorized as malicious or not. Our results show that joint training to query and infer information about the data improves overall text-to-SQL performance, nearly matching that of substantially larger models. We also show that current large language models (e.g., GPT3.5) struggle to infer new information about returned data (i.e., they are bad at tabular data understanding), thus our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs. 
    more » « less
  5. The popularity of JSON as a data interchange format resulted in big amounts of datasets available for processing. Users would like to analyze this data using SQL queries but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up-to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers 
    more » « less