skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: An Evaluation of B-tree Compression Techniques
Abstract B-trees are widely recognized as one of the most important index structures in database systems, providing efficient query processing capabilities. Over the past few decades, many techniques have been developed to enhance the efficiency of B-trees from various perspectives. Among them,B-tree compressionis an important technique introduced as early as the 1970s to improve both space efficiency and query performance. Since then, several B-tree compression techniques have been developed. However, to our surprise, we have found that these B-tree compression techniques werenevercompared against each other in prior works. Consequently, many important questions remain unanswered, such as whether B-tree compression is truly effective or not. If it is effective, under what scenarios and which B-tree compression methods should be employed? In this paper, we conduct an experimental evaluation of seven widely used B-tree compression techniques using both synthetic and real datasets. Based on our evaluation, we present lessons and insights regarding the use of B-tree compression that can be leveraged to guide system design decisions in modern databases.  more » « less
Award ID(s):
2337806
PAR ID:
10650006
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
The VLDB Journal
Volume:
35
Issue:
1
ISSN:
1066-8888
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Waves of miseryis a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery have been first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update operations deteriorate when a wave of misery takes place, but is more predictable between the waves. This paper investigates the presence or lack of waves of misery in several R-tree variants, and studies the extent of which these waves impact the performance of each variant. Interestingly, although having poorer query performance, the Linear and Quadratic R-trees are found to be more resilient to waves of misery than both the Hilbert and R*-trees. This paper presents several techniques to reduce the impact in performance of the waves of misery for the Hilbert and R*-trees. One way to eliminate waves of misery is to force node splits to take place at regular times before nodes become full to achieve deterministic performance. The other way is that upon splitting a node, do not split it evenly but rather at different node utilization factors. This allows leaf nodes not to fill at the same pace. We study the impact of two new techniques to mitigate waves of misery after the tree index has been constructed, namely Regular Elective Splits (RES, for short) and Unequal Random Splits (URS, for short). Our experimental investigation highlights the trade-offs in performance of the introduced techniques and the pros and cons of each technique. 
    more » « less
  2. The kd-tree is one of the most widely used data structures to manage multi-dimensional data. Due to the ever-growing data volume, it is imperative to consider parallelism in kd-trees. However, we observed challenges in existing parallel kd-tree implementations, for both constructions and updates. The goal of this paper is to develop efficient in-memory kd-trees by supporting high parallelism and cache-efficiency. We propose the Pkd-tree (Parallel kd-tree), a parallel kd-tree that is efficient both in theory and in practice. The Pkd-tree supports parallel tree construction, batch update (insertion and deletion), and various queries including k-nearest neighbor search, range query, and range count. We proved that our algorithms have strong theoretical bounds in work (sequential time complexity), span (parallelism), and cache complexity. Our key techniques include 1) an efficient construction algorithm that optimizes work, span, and cache complexity simultaneously, and 2) reconstruction-based update algorithms that guarantee the tree to be weight-balanced. With the new algorithmic insights and careful engineering effort, we achieved a highly optimized implementation of the Pkd-tree. We tested Pkd-tree with various synthetic and real-world datasets, including both uniform and highly skewed data. We compare the Pkd-tree with state-of-the-art parallel kd-tree implementations. In all tests, with better or competitive query performance, Pkd-tree is much faster in construction and updates consistently than all baselines. We released our code. 
    more » « less
  3. Satta, Yoko (Ed.)
    Abstract Likelihood-based tests of phylogenetic trees are a foundation of modern systematics. Over the past decade, an enormous wealth and diversity of model-based approaches have been developed for phylogenetic inference of both gene trees and species trees. However, while many techniques exist for conducting formal likelihood-based tests of gene trees, such frameworks are comparatively underdeveloped and underutilized for testing species tree hypotheses. To date, widely used tests of tree topology are designed to assess the fit of classical models of molecular sequence data and individual gene trees and thus are not readily applicable to the problem of species tree inference. To address this issue, we derive several analogous likelihood-based approaches for testing topologies using modern species tree models and heuristic algorithms that use gene tree topologies as input for maximum likelihood estimation under the multispecies coalescent. For the purpose of comparing support for species trees, these tests leverage the statistical procedures of their original gene tree-based counterparts that have an extended history for testing phylogenetic hypotheses at a single locus. We discuss and demonstrate a number of applications, limitations, and important considerations of these tests using simulated and empirical phylogenomic data sets that include both bifurcating topologies and reticulate network models of species relationships. Finally, we introduce the open-source R package SpeciesTopoTestR (SpeciesTopology Tests in R) that includes a suite of functions for conducting formal likelihood-based tests of species topologies given a set of input gene tree topologies. 
    more » « less
  4. Having dominated databases and various data management systems for decades,B+-tree is infamously subject to a logging dilemma: One could improveB+-tree speed performance by equipping it with a larger log, which nevertheless will degrade its crash recovery speed. Such a logging dilemma is particularly prominent in the presence of modern workloads that involve intensive small writes. In this paper, we propose a novel solution, calledper-page loggingbasedB+-tree, which leverages the emerging computational storage drive (CSD) with built-in transparent compression to fundamentally resolve the logging dilemma. Our key idea is to divide the large single log into many small (e.g., 4KB), highly compressibleper-page logs, each being statically bounded with aB+-tree page. All per-page logs together form a very large over-provisioned log space forB+-tree to improve its operational speed performance. Meanwhile, during crash recovery,B+-tree does not need to scan any per-page logs, leading to a recovery latency independent from the total log size. We have developed and open-sourced a fully functional prototype. Our evaluation results show that, under small-write intensive workloads, our design solution can improveB+-tree operational throughput by up to 625.6% and maintain a crash recovery time of as low as 19.2 ms, while incurring a minimal storage overhead of only 0.5-1.6%. 
    more » « less
  5. Ensemble models (bagging and gradient-boosting) of relational decision trees have proved to be some of the most effective learning methods in the area of probabilistic logic models (PLMs). While effective, they lose one of the most important benefits of PLMs—interpretability. In this paper we consider the problem of compressing a large set of learned trees into a single explainable model. To this effect, we propose CoTE—Compression of Tree Ensembles—that produces a single small decision list as a compressed representation. CoTE first converts the trees to decision lists and then performs the combination and compression with the aid of the original training set. An experimental evaluation demonstrates the effectiveness of CoTE in several benchmark relational data sets. 
    more » « less