-
Bit-parallel scanning techniques are characterized by their ability to accelerate compute through the process known as early pruning. Early pruning techniques iterate over the bits of each value, searching for opportunities to safely prune compute early, before processing each data value in its entirety. However, because of this iterative evaluation, the effectiveness of early pruning depends on the relative position of bits that can be used for pruning within each value. Due to this behavior, bit-parallel techniques have faced significant challenges when processing skewed data, especially when values contain many leading zeroes. This problem is further amplified by the inherent trade-off that bit-parallel techniques make between columnar scan and fetch performance: a storage layer that supports early pruning requires multiple memory accesses to fetch a single value. Thus, in the case of skewed data, bit-parallel techniques increase fetch latency without significantly improving scan performance when compared to baseline columnar implementations. To remedy this shortcoming, we transform the values in bit-parallel columns using novel encodings. We propose the concept of forward encodings: a family of encodings that shift pruning-relevant bits closer to the most significant bit. Using this concept, we propose two particular encodings: the Data Forward Encoding and the Extended Data Forward Encoding. We demonstrate the impact of these encodings using multiple real-world datasets. Across these datasets, forward encodings improve the current state-of-the-art bit-parallel technique's scan and fetch performance, in many cases by 1.4x and 1.3x, respectively.
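To make the pruning mechanics concrete, below is a minimal Python sketch of a bit-serial scan for the predicate value < c over a column stored as MSB-first bit-planes. It illustrates only the general early-pruning idea; the storage format is a simplified stand-in, not the paper's layout or its forward encodings. Note how the skewed values share leading zeroes with the constant and so stay undecided through the upper bit-planes, which is exactly the behavior forward encodings target by moving pruning-relevant bits toward the most significant bit.

```python
# Minimal sketch of bit-serial early pruning for the predicate "value < c".
# The column is stored as bit-planes: plane 0 holds every value's most
# significant bit, plane 1 the next bit, and so on. A row is pruned as soon
# as one of its bits differs from the corresponding bit of the constant.

BITS = 8

def to_bitplanes(values, bits=BITS):
    """Transpose a list of integers into MSB-first bit-planes."""
    return [[(v >> (bits - 1 - i)) & 1 for v in values] for i in range(bits)]

def bit_serial_less_than(planes, c, bits=BITS):
    """Evaluate value < c one bit-plane at a time, pruning decided rows."""
    n = len(planes[0])
    undecided = set(range(n))        # rows whose outcome is still unknown
    result = [False] * n
    for i in range(bits):
        c_bit = (c >> (bits - 1 - i)) & 1
        for row in list(undecided):
            v_bit = planes[i][row]
            if v_bit != c_bit:
                result[row] = v_bit < c_bit   # decided: prune this row
                undecided.discard(row)
        if not undecided:
            break                    # early pruning: skip the lower bit-planes
    return result

values = [3, 5, 200, 7]   # skewed: mostly small values with many leading zeroes
planes = to_bitplanes(values)
print(bit_serial_less_than(planes, 6))   # [True, True, False, False]
```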
-
There have been many decades of work on optimizing query processing in database management systems. Recently, modern machine learning (ML), and specifically reinforcement learning (RL), has gained increased attention as a means to develop a query optimizer (QO). In this work, we take a closer look at two recent state-of-the-art (SOTA) RL-based QO methods to better understand their behavior. We find that these RL-based methods do not generalize as well as it may seem at first glance. Thus, we ask a simple question: How do SOTA RL-based QOs compare to a simple, modern, adaptive query processing approach? To answer this question, we choose two simple adaptive query processing techniques and implement them in PostgreSQL. The first adapts an individual join operation on the fly, switching between a Nested Loop Join algorithm and a Hash Join algorithm to avoid sub-optimal join algorithm decisions. The second is a technique called Lookahead Information Passing (LIP), in which adaptive semijoin techniques are used to make a pipeline of join operations execute efficiently. To our surprise, we find that this simple adaptive query processing approach is not only competitive with the SOTA RL-based approaches but, in some cases, outperforms them. The adaptive approach is also appealing because it does not require an expensive training step, and it is fully interpretable, unlike the RL-based QO approaches. Further, the adaptive method works across complex query constructs that RL-based QO methods currently cannot optimize.
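As a rough illustration of the second technique, the sketch below implements the core LIP loop: a semijoin filter built from each dimension table's join keys is probed against every fact-table row before any join executes, with the probe order adapted toward the most selective filter. Plain Python sets stand in for the Bloom filters a real implementation would use, and the per-row reordering is a deliberate simplification.

```python
# Sketch of Lookahead Information Passing (LIP): rows that a later join in
# the pipeline would drop are eliminated up front by probing all semijoin
# filters first. Sets stand in for Bloom filters here.

def build_filters(dim_tables):
    """dim_tables: list of (join_key_name, rows); one filter per table."""
    return [{row[key] for row in rows} for key, rows in dim_tables]

def lip_scan(fact_rows, fact_keys, filters):
    order = list(range(len(filters)))   # current probe order
    drops = [0] * len(filters)          # rows dropped by each filter so far
    passed = []
    for row in fact_rows:
        for idx in order:
            if row[fact_keys[idx]] not in filters[idx]:
                drops[idx] += 1
                break
        else:
            passed.append(row)          # survived every filter
        # adapt: probe the filter that drops the most rows first
        order.sort(key=lambda i: -drops[i])
    return passed

dims = [("d_key", [{"d_key": 1}, {"d_key": 2}]), ("s_key", [{"s_key": 9}])]
facts = [{"d_key": 1, "s_key": 9}, {"d_key": 3, "s_key": 9}]
print(lip_scan(facts, ["d_key", "s_key"], build_filters(dims)))
# -> [{'d_key': 1, 's_key': 9}]
```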
-
In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. Several factors are likely responsible for its ubiquity, including its in-process design, standalone codebase, extensive test suite, and cross-platform file format. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares. Furthermore, we are aware of no work that attempts to identify root causes for SQLite's performance behavior on OLAP workloads. In this paper, we discuss SQLite in the context of this changing workload landscape. We describe how SQLite evolved from its humble beginnings to the full-featured database engine it is today. We evaluate the performance of modern SQLite on three benchmarks, each representing a different flavor of in-process data management: transactional, analytical, and blob processing. We delve into analytical data processing on SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2X faster on the Star Schema Benchmark (SSB). Finally, we discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.
-
All data is not equally popular. Often, some portion of the data is accessed far more frequently than the rest, producing a skew in the popularity of data items. Adapting to this skew can improve performance, and this topic has been studied extensively in the past for disk-based settings. In this work, we consider an in-memory data structure, namely the hash table, and show how one can leverage the skew in popularity for higher performance. Hashing is a low-latency operation, sensitive to the effects of caching and code complexity, among other factors. These factors make learning in the loop challenging, as the overhead of performing additional operations can have a significant impact on performance. In this paper, we propose VIP hashing, a hash table method that uses lightweight mechanisms for learning the skew in popularity and adapting the hash table layout on the fly. These mechanisms are non-blocking, i.e., the hash table is operational at all times. The overhead is controlled by sensing changes in the popularity distribution and dynamically switching the mechanisms on and off as needed. We ran extensive tests against a host of workloads generated by Wiscer, a homegrown benchmarking tool, and found that VIP hashing improves performance in the presence of skew (a 22% increase in fetch operation throughput for a hash table with 1M keys under low skew) while adapting to insert and delete operations and to changes in the popularity distribution of keys on the fly. Our experiments on DuckDB show that VIP hashing reduces the end-to-end execution time of TPC-H query 9 by 20% under low skew.
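The sketch below conveys the flavor of this approach under stated assumptions: a chained hash table in which a small sampled fraction of fetches moves the accessed entry to the front of its chain, so frequently fetched keys tend to be found on the first probe. The sampling and move-to-front heuristic are illustrative simplifications, not VIP hashing's actual mechanisms.

```python
# Toy illustration of popularity-aware hashing: sample accesses to learn
# which keys are hot, and adapt the layout so hot keys are found on the
# first probe. This is a simplification of the ideas in VIP hashing, whose
# real mechanisms are lighter-weight and non-blocking.

import random

class PopularityAwareHashTable:
    def __init__(self, n_buckets=1024, sample_rate=0.01):
        self.buckets = [[] for _ in range(n_buckets)]
        self.sample_rate = sample_rate

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        self._chain(key).append((key, value))

    def fetch(self, key):
        chain = self._chain(key)
        for i, (k, v) in enumerate(chain):
            if k == key:
                # Lightweight learning: only a sampled fraction of fetches
                # pays the cost of adapting the layout (move-to-front).
                if i > 0 and random.random() < self.sample_rate:
                    chain.insert(0, chain.pop(i))
                return v
        return None
```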
-
As large-scale data processing becomes an ever more prominent component of modern computing tasks, databases now exist as a fundamental necessity of most computational platforms. However, in many cases there exists a disparity between the specializations of database management systems and the needs of the applications that run on them. The distinction between transactional and analytical workloads for databases has been well established, but not fully addressed within the space of the most widely used embedded database system, namely SQLite3. To overcome this shortcoming, we present SQLite3/HE, an analytical database engine implemented as an alternative execution path for SQLite. Through the use of an additional, complementary storage layer, SQLite3/HE transforms SQLite into a hybrid database system, able to fully leverage the benefits of both row and columnar storage layouts. SQLite3/HE improves the performance of analytical queries by 100x-1000x, at no cost to existing transactional query performance. These results validate the decision to implement SQLite3/HE as an alternative execution path, enabling it to serve as a drop-in replacement for SQLite3 in existing systems.
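Schematically, a hybrid design of this kind amounts to routing each statement to the storage layout that suits it, as in the hypothetical dispatcher below; the keyword-based classifier and the engine interfaces are placeholders for illustration, not SQLite3/HE's actual design.

```python
# Hypothetical sketch of hybrid execution: analytical queries run against a
# complementary columnar copy of the data, while transactional statements
# continue to use the row store. Both engine objects are assumed to expose
# a run(sql) method; the keyword test is a crude stand-in for a real
# query classifier.

def is_analytical(sql: str) -> bool:
    s = sql.upper().lstrip()
    return s.startswith("SELECT") and any(
        kw in s for kw in ("SUM(", "AVG(", "COUNT(", "GROUP BY"))

def execute(sql, row_engine, column_engine):
    engine = column_engine if is_analytical(sql) else row_engine
    return engine.run(sql)
```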
-
Transaction isolation is conventionally achieved by restricting access to the physical items in a database. To maximize performance, isolation functionality is often packaged with recovery, I/O, and data access methods in a monolithic transactional storage manager. While this design has historically afforded high performance in online transaction processing systems, industry trends indicate a growing need for a new approach in which intertwined components of the transactional storage manager are disaggregated into modular services. This paper presents a new method to modularize the isolation component. Our work builds on predicate locking, an isolation mechanism that enables this modularization by locking logical rather than physical items in a database. Predicate locking is rarely used as the core isolation mechanism because of its high theoretical complexity and perceived overhead. However, we show that this overhead can be substantially reduced in practice by optimizing for common predicate structures. We present DIBS, a transaction scheduler that employs our predicate locking optimizations to guarantee isolation as a modular service. We evaluate the performance of DIBS as the sole isolation mechanism in a data processing system. In this setting, DIBS scales up to 10.5 million transactions per second on a TATP workload. We also explore how DIBS can be applied to existing database systems to increase transaction throughput. DIBS reduces per-transaction file system writes by 90% on TATP in SQLite, resulting in a 3X improvement in throughput. Finally, DIBS reduces row contention on YCSB in MySQL, providing serializable isolation with a 1.4X improvement in throughput.
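As a simplified example of why common predicate structures make predicate locking tractable, the snippet below checks conflicts between conjunctions of equality predicates: two such predicates can both match some row only if they agree on every attribute they both constrain. This illustrates the general predicate-locking test, not DIBS's actual algorithm or its further optimizations.

```python
# Minimal sketch of predicate-based conflict detection for the common case
# of conjunctions of equality predicates. Two predicates conflict when some
# row could satisfy both and at least one access is a write.

def may_overlap(p, q):
    """p, q: dicts mapping attribute -> required constant (equality only).
    They can match a common row unless they pin a shared attribute to
    different constants."""
    return all(p[a] == q[a] for a in p.keys() & q.keys())

def conflicts(p, p_is_write, q, q_is_write):
    return (p_is_write or q_is_write) and may_overlap(p, q)

# A write locking {user_id: 7} conflicts with a read of {user_id: 7, city: "NYC"}...
print(conflicts({"user_id": 7}, True, {"user_id": 7, "city": "NYC"}, False))  # True
# ...but not with a read of {user_id: 8}.
print(conflicts({"user_id": 7}, True, {"user_id": 8}, False))                 # False
```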