Title: PolyFrame: a retargetable query-based approach to scaling dataframes
In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision-making and applications. Scaling data analyses to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a data analytics library, is implemented as a layer on top of Apache AsterixDB. It addresses these issues by providing data scientists with a familiar interface, the Pandas DataFrame, and by transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a large volume of data, the initial version only generated SQL++ queries and only operated against AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems, making it more flexible for deployment against other data management systems with composable query languages.
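The idea of incremental, retargetable query formation can be illustrated with a minimal sketch. The code below is not taken from PolyFrame or AFrame: the class LazyFrame, the RULES table, and the query templates are all hypothetical names chosen for illustration. It assumes that each dataframe operation wraps the query built so far as a subquery, and that switching the target language only means switching the set of templates.

```python
# Hypothetical sketch of retargetable, incremental query formation.
# LazyFrame and RULES are illustrative names, not the PolyFrame API.
from dataclasses import dataclass

# One set of rewrite templates per target language (assumed, simplified).
# Each non-scan template embeds the previous query as a subquery ({sub}).
RULES = {
    "sqlpp": {
        "scan": "SELECT VALUE t FROM {name} t",
        "filter": "SELECT VALUE t FROM ({sub}) t WHERE t.{col} {op} {val}",
        "limit": "SELECT VALUE t FROM ({sub}) t LIMIT {n}",
    },
    "sql": {
        "scan": "SELECT * FROM {name}",
        "filter": "SELECT * FROM ({sub}) t WHERE t.{col} {op} {val}",
        "limit": "SELECT * FROM ({sub}) t LIMIT {n}",
    },
}


@dataclass(frozen=True)
class LazyFrame:
    query: str  # query text accumulated so far; nothing is executed here
    lang: str   # key into RULES selecting the target language

    @classmethod
    def from_table(cls, name, lang):
        return cls(RULES[lang]["scan"].format(name=name), lang)

    def _wrap(self, rule, **kwargs):
        # Compose a new query around the current one and return a new frame.
        text = RULES[self.lang][rule].format(sub=self.query, **kwargs)
        return LazyFrame(text, self.lang)

    def where(self, col, op, val):
        # repr() is enough for this sketch; real code must escape values.
        return self._wrap("filter", col=col, op=op, val=repr(val))

    def head(self, n):
        return self._wrap("limit", n=n)


if __name__ == "__main__":
    for lang in ("sqlpp", "sql"):
        frame = LazyFrame.from_table("Users", lang).where("age", ">", 30).head(5)
        print(lang, "->", frame.query)
```

In this sketch the same chain of dataframe calls yields a different query string per target, and a query would only be sent to a database when results are actually requested.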
Award ID(s):
1954962
NSF-PAR ID:
10300517
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
11
ISSN:
2150-8097
Page Range / eLocation ID:
2296 to 2304
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Stefanidis, K. ; Golab, L. (Ed.)
    Secondary indexes in relational database systems are traditionally built under the assumption that one data record maps to one indexed value. Nowadays, particularly in NoSQL systems, single data records can hold collections of values that users want to access efficiently in an ad-hoc manner. Multi-valued indexes aim to give users the best of both worlds: (i) to keep a more natural data model of records with collections of values, and (ii) to reap the benefits of a secondary index. In this paper, we detail the steps taken to realize multi-valued indexes in AsterixDB, a Big Data management system with a structured query language operating over a collection of documents. This includes (a) creating the specification language for such indexes, (b) illustrating data flows for bulk-loading and maintaining an index, and (c) discussing query plans to take advantage of multi-valued indexes for use in predicates with existential and universal quantification. We conclude with experiments that compare AsterixDB multi-valued indexes against similar indexes in MongoDB and Couchbase Query.
  2. Query Optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate in the presence of filtering conditions that involve undetected correlations between multiple predicates local to a single dataset, predicates with query parameters, or predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate those settings, thus leading to suboptimal execution plans. Given the volume of today’s data, a suboptimal plan can quickly become very inefficient. In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created from each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimations, thus leading to much better execution plans. While it introduces overhead for materializing these intermediate results, our experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans than the current default AsterixDB plans, plans produced by static cost-based optimization (i.e., based on the initial dataset statistics), and other state-of-the-art approaches.
  3. Building large software systems is always a challenging venture, but it is especially so in academia. This paper describes the experiences that the author and his (mostly UC-based) partners in software crime have had that culminated in the Big Data Management System now available as Apache AsterixDB. It covers a mix of the history and technical content of the nearly ten-year-old project, starting with its inception during the MapReduce craze. It describes the phases that the effort has gone through and some of the lessons learned along the way. The paper also covers some personal reflections and opinions about the challenges of systems-building, as well as writing about it, in our current academic culture. Included is the case for doing this sort of work at all – discussing the pitfalls of doing “systems” research in the absence of an actual system, and why the gain outweighs the pain of building and sharing database software in academia. As of late 2018, Apache AsterixDB is also having a commercial impact as the storage and parallel query engine underlying a new offering called Couchbase Analytics. The last part of the paper explains how we are attempting to balance the uses of AsterixDB as (i) a generally available open source Apache software platform, (ii) an end-to-end research testbed for universities, and (iii) the technology powering a commercial NoSQL product.
  4. Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. 
  5. Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity issues for “normal” data scientists. This paper introduces AFrame, a new scalable data analysis package powered by a Big Data management system that extends the data scientists' familiar DataFrame operations to efficiently operate on managed data at scale. AFrame is implemented as a layer on top of Apache AsterixDB, transparently scaling out the execution of DataFrame operations and machine learning model invocation through a parallel, shared-nothing big data management system. AFrame incrementally constructs SQL++ queries and leverages AsterixDB's semistructured data management facilities, user-defined function support, and live data ingestion support. In order to evaluate the proposed approach, this paper also introduces an extensible micro-benchmark for use in evaluating DataFrame performance in both single-node and distributed settings via a collection of representative analytic operations. This paper presents the architecture of AFrame, describes the underlying capabilities of AsterixDB that efficiently support modern data analytic operations, and utilizes the proposed benchmark to evaluate and compare the performance and support for large-scale data analyses provided by alternative DataFrame libraries.