Title: JoinBoost: Grow Trees over Normalized Data Using Only SQL
Although tree models are dominant for tabular data, ML libraries that train them (e.g., LightGBM, XGBoost) require data in normalized databases to be denormalized into a single table, materialized, and exported. This process is not scalable, is slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer tree training performance competitive with specialized ML libraries... with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support RMSE, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
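To make the two key ideas concrete, here is a minimal SQL sketch (not JoinBoost's actual generated queries) over an assumed toy schema sales(pid, y) and product(pid, category), with a hypothetical leaf_pred(category, pred) table holding the previous tree's leaf predictions:

    -- Variance semi-ring: (count, sum, sum of squares) is enough to score an
    -- RMSE split, and it aggregates over the join without materializing it.
    SELECT p.category,
           COUNT(*)       AS cnt,
           SUM(s.y)       AS s1,
           SUM(s.y * s.y) AS s2   -- variance = s2/cnt - (s1/cnt)^2
    FROM sales s JOIN product p USING (pid)
    GROUP BY p.category;

    -- Residual update as a new column plus projection (cheap on a columnar
    -- engine such as DuckDB): instead of rewriting y in a materialized join,
    -- store y minus the previous tree's prediction in a fresh column.
    ALTER TABLE sales ADD COLUMN resid DOUBLE;
    UPDATE sales
    SET resid = y - (SELECT lp.pred
                     FROM product p JOIN leaf_pred lp USING (category)
                     WHERE p.pid = sales.pid);

Subsequent trees would then aggregate the same semi-ring statistics over resid instead of y.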
Award ID(s):
2008295
PAR ID:
10559959
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
VLDB Endowment
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
16
Issue:
11
ISSN:
2150-8097
Page Range / eLocation ID:
3071 to 3084
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Dashboards are vital in modern business intelligence tools, providing non-technical users with an interface to access comprehensive business data. With the rise of cloud technology, there is an increased number of data sources providing enriched context for various analytical tasks, leading to a demand for interactive dashboards over a large number of joins. Nevertheless, joins are among the most expensive operations in DBMSes, making the support of interactive dashboards over joins challenging. In this paper, we present Treant, a dashboard accelerator for queries over large joins. Treant uses factorized query execution to handle aggregation queries over large joins, which alone is still insufficient for interactive speeds. To address this, we exploit the incremental nature of user interactions using the Calibrated Junction Hypertree (CJT), a novel data structure that applies lightweight materialization of intermediates during factorized execution. A CJT ensures that the work needed to compute a query is proportional to how different it is from the previous query, rather than to its overall complexity. Treant manages CJTs to share work between queries and performs materialization offline or during user think times. Implemented as middleware that rewrites SQL, Treant is portable to any SQL-based DBMS. Our experiments on single-node and cloud DBMSes show that Treant improves dashboard interactions by two orders of magnitude, and provides a 10x improvement for ML augmentation compared to a SOTA factorized ML system.
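As a rough illustration of the factorized execution Treant builds on (not Treant's actual rewrite), assume toy tables orders(cust_id, amount) and customers(cust_id, region); the pushed-down partial aggregate below is the kind of intermediate a CJT could materialize and reuse across interactions:

    -- Naive: aggregate after materializing the join (touches every order row).
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c USING (cust_id)
    GROUP BY c.region;

    -- Factorized: push the aggregate below the join, so the join sees one
    -- pre-aggregated row per cust_id instead of every order row.
    SELECT c.region, SUM(o.part_sum) AS total
    FROM (SELECT cust_id, SUM(amount) AS part_sum
          FROM orders GROUP BY cust_id) o
    JOIN customers c USING (cust_id)
    GROUP BY c.region;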
  2. Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but to bring MOP-based DL to DB-resident data, we show that there is no single "best" approach; rather, an interesting tradeoff space of approaches exists. We explain four canonical approaches, build prototypes upon Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance), and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and that there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.
  3. Vector databases have recently gained significant attention due to the emergence of large language models that produce vector embeddings for text. Existing vector databases can be broadly categorized into two types: specialized and generalized. Specialized vector databases are explicitly designed and optimized for managing vector data, while generalized ones support vector data management within a general-purpose database. While specialized vector databases are interesting, there is a substantial customer base interested in generalized vector databases for various reasons, e.g., a reluctance to move data out of relational databases to reduce data silos and costs, the desire to use SQL, and the need for more sophisticated query processing of vector and non-vector data. However, generalized vector databases face two main challenges: performance, and the interoperability of vector search with SQL, such as combining vector search with filters, joins, or even full-text search. In this paper, we present SingleStore-V, a full-fledged generalized vector database integrated into SingleStore, a modern distributed relational database optimized for both OLAP and OLTP workloads. SingleStore-V achieves high performance and interoperability via a suite of optimizations. Experiments on standard vector benchmarks show that SingleStore-V performs comparably to Milvus, a highly optimized specialized vector database, and significantly outperforms pgvector, a popular generalized vector database in PostgreSQL. We believe this paper will shed light on integrating vector search into relational databases in general, as many design concepts and optimizations apply to other databases.
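The interoperability challenge can be made concrete with the abstract's PostgreSQL baseline, pgvector, whose <-> operator orders rows by L2 distance; the table and column names below are assumptions for illustration:

    -- A hybrid query: vector similarity combined with a relational filter and
    -- a join, the kind of workload a generalized vector database must optimize.
    SELECT d.id, d.title
    FROM docs d JOIN authors a ON d.author_id = a.id
    WHERE a.verified
    ORDER BY d.embedding <-> '[0.1, 0.2, 0.3]'  -- L2 distance to the query vector
    LIMIT 10;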
  4. In file systems and database management systems (DBMSes), deleting data merely marks it as unallocated storage rather than explicitly erasing it. This data can be reconstructed from raw storage, making it vulnerable to theft and exposing organizations to liability and compliance risks by violating data retention and destruction policies. The problem is further magnified in DBMSes because (unlike in file systems) DBMS backups are performed in pages and will include such deleted records. Data erasure (or sanitization) is a process that eliminates this vulnerability, providing users with “the right to be forgotten”. However, most of the work in data sanitization is relevant only to erasing data at the file system level, not in DBMSes. The limited existing work in database sanitization takes an erase-on-commit approach, which can introduce significant I/O bottlenecks. In this paper, we describe a novel data sanitization method, DBSanitizer, that 1) is DBMS agnostic, 2) can batch value erasure, and 3) targets specific data to erase. DBSanitizer is designed as a template for DBMS vendors to support backup sanitization and ensure that no undesirable data is retained in backups. In this paper, we demonstrate how our approach can be used in any row-store relational DBMS (including Oracle, PostgreSQL, MySQL, and SQLite). As there are no backup sanitization tools available on the market or in the research literature, we evaluate DBSanitizer against a live database that supports an erase-on-commit sanitization approach.
  5. Cyberattacks continue to evolve and adapt to state-of-the-art security mechanisms. Therefore, it is critical for security experts to routinely inspect audit logs to detect complex security breaches. However, if a system was compromised during a cyberattack, the validity of the audit logs themselves cannot necessarily be trusted. Specifically, for a database management system (DBMS), an attacker with elevated privileges may temporarily disable the audit logs, bypassing logging altogether along with any tamper-proof logging mechanisms. Thus, security experts need techniques to validate logs independently of a potentially compromised system in order to detect security breaches. This paper demonstrates that SQL query operations produce a repeatable set of patterns within DBMS process memory. Operations such as full table scans, index accesses, and joins each produce their own set of distinct forensic artifacts in memory. Given these known patterns, we propose that collecting forensic artifacts from a trusted memory snapshot allows one to reverse-engineer query activity and validate audit logs independently of the DBMS itself and outside the scope of a database administrator's privileges. We rely on the fact that queries must ultimately be processed in memory, regardless of any security mechanisms they may have bypassed. This work is generalized to all relational DBMSes by using two representative DBMSes, Oracle and MySQL.