Title: dsJSON: A Distributed SQL JSON Processor
The popularity of JSON as a data interchange format has made a large number of datasets available for processing. Users would like to analyze this data using SQL queries, but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers.
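The idea of projecting nested attributes onto a unified flat schema can be illustrated with a small sketch. This is not dsJSON's implementation: the abstract does not specify the projection tree's layout, so the nested-dictionary tree, the dotted-path notation, and the helper names `build_projection_tree` and `project` are all assumptions made for illustration. The sketch also parses each record fully with Python's `json.loads` and projects afterwards, whereas the abstract describes selective parsing inside the parser itself.

```python
import json


def build_projection_tree(paths):
    """Turn dotted attribute paths into a nested dict (a trie of keys).

    Hypothetical structure: an empty dict marks a leaf, i.e. a requested column.
    """
    tree = {}
    for path in paths:
        node = tree
        for key in path.split("."):
            node = node.setdefault(key, {})
    return tree


def project(value, tree, prefix="", out=None):
    """Copy only the attributes named in the tree into one flat record.

    Missing attributes become None, so every record has the same columns.
    """
    if out is None:
        out = {}
    for key, subtree in tree.items():
        name = prefix + key
        child = value.get(key) if isinstance(value, dict) else None
        if subtree:
            # Inner node: descend into the nested object.
            project(child, subtree, name + ".", out)
        else:
            # Leaf: emit the column value (None if absent).
            out[name] = child
    return out


# Requested columns, written as dotted paths (an assumed notation).
paths = ["id", "user.name", "user.geo.city"]
tree = build_projection_tree(paths)

# Two records with different shapes, one JSON object per line.
lines = [
    '{"id": 1, "user": {"name": "Ann", "geo": {"city": "Riverside", "zip": "92521"}}, "text": "hi"}',
    '{"id": 2, "user": {"name": "Bob"}}',
]
records = [project(json.loads(line), tree) for line in lines]

for record in records:
    print(record)
```

Both output records carry the same three columns even though the second input lacks `user.geo`, which is the property a SQL processor needs. Attributes that were not requested (`text`, `user.geo.zip`) never reach the output.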
Award ID(s):
2046236 1924694 1954644 1838222
Publisher / Repository:
ACM Digital Library
Journal Name:
Proceedings of the ACM on Management of Data
Page Range / eLocation ID:
1 to 25
Sponsoring Org:
National Science Foundation
More Like this
  1. In this demonstration, we present SpeakQL, a speech-driven query system and interface for structured data. SpeakQL supports a tractable and practically useful subset of regular SQL, allowing users to query in any domain with unbounded vocabulary with the help of speech/touch based user-in-the-loop mechanisms for correction. When querying in such domains, automatic speech recognition introduces countless forms of errors in transcriptions, presenting us with a technical challenge. We characterize such errors and leverage our observations along with SQL's unambiguous context-free grammar to first correct the query structure. We then exploit phonetic representation of the queried database to identify the correct Literals, hence delivering the corrected transcribed query. In this demo, we show that SpeakQL helps users reduce time and effort in specifying SQL queries significantly. In addition, we show that SpeakQL, unlike Natural Language Interfaces and conversational assistants, allows users to query over any arbitrary database schema. We allow the audience to explore SpeakQL using an easy-to-use web-based interface to compose SQL queries. 
  2. Ad-hoc data models like JSON make it easy to evolve schemas and to multiplex different data-types into a single stream. This flexibility makes JSON great for generating data, but also makes it much harder to query, ingest into a database, and index. In this paper, we explore the first step of JSON data loading: schema design. Specifically, we consider the challenge of designing schemas for existing JSON datasets as an interactive problem. We present SchemaDrill, a roll-up/drill-down style interface for exploring collections of JSON records. SchemaDrill helps users to visualize the collection, identify relevant fragments, and map it down into one or more flat, relational schemas. We describe and evaluate two key components of SchemaDrill: (1) A summary schema representation that significantly reduces the complexity of JSON schemas without a meaningful reduction in information content, and (2) A collection of schema visualizations that help users to qualitatively survey variability amongst different schemas in the collection. 
  3. Speech-driven querying is becoming popular in new device environments such as smartphones, tablets, and even conversational assistants. However, such querying is largely restricted to natural language. Typed SQL remains the gold standard for sophisticated structured querying although it is painful in many environments, which restricts when and how users consume their data. In this work, we propose to bridge this gap by designing a speech-driven querying system and interface for structured data we call SpeakQL. We support a practically useful subset of regular SQL and allow users to query in any domain with novel touch/speech based human-in-the-loop correction mechanisms. Automatic speech recognition (ASR) introduces myriad forms of errors in transcriptions, presenting us with a technical challenge. We exploit our observations of SQL's properties, its grammar, and the queried database to build a modular architecture. We present the first dataset of spoken SQL queries and a generic approach to generate them for any arbitrary schema. Our experiments show that SpeakQL can automatically correct a large fraction of errors in ASR transcriptions. User studies show that SpeakQL can help users specify SQL queries significantly faster, with an average speedup of 2.7x and up to 6.7x compared to typing on a tablet device. SpeakQL also reduces the user effort in specifying queries by an average factor of 10x and up to 60x compared to raw typing effort.
  4. Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner.
  5. Big spatial data has become ubiquitous, from mobile applications to satellite data. In most of these applications, data is continuously growing to huge volumes. Existing systems for big spatial data organize records at either the record level or the block level. Systems that use record-level structures include key-value stores and LSM-Tree stores, which support insert and delete operations and are optimized for highly selective queries. On the other hand, systems like GeoSpark that use block-level structures (e.g. 128 MB each) are more efficient for analytical queries, but they cannot incrementally maintain the partitioned data and do not support delete operations. This paper proposes a general framework that enables block-level systems to incrementally maintain spatial partitions, in the presence of bulk insertions and deletions, in distributed file system (DFS) blocks. We first formally study the incremental spatial partitioning problem for big data and demonstrate its NP-hardness. Then, we propose a cost model to estimate the performance of queries on the partitioned data and the effect of modifying it as the data grows. After that, we provide three different implementations of the incremental partitioning framework. Comprehensive experiments on large real datasets show that our proposed partitioning algorithms outperform state-of-the-art spatial partitioning methods.