Title: Distributed deep learning on data systems: a comparative analysis of approaches
Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but we show that there is no single "best" way to bring MOP-based DL to DB-resident data; rather, an interesting tradeoff space of approaches exists. We explain four canonical approaches, build prototypes on Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance), and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and that a Pareto frontier exists; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.
Award ID(s):
1942724
PAR ID:
10337018
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
10
ISSN:
2150-8097
Page Range / eLocation ID:
1769 to 1782
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
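To make the MOP scheduling idea from the abstract above concrete, here is a minimal sketch of one MOP-style epoch, assuming the training data is already partitioned across workers and that there are at most as many model configurations as partitions; the names (mop_epoch, train_subepoch) are illustrative placeholders, not the actual Cerebro or Greenplum API.

```python
# Minimal sketch of Model Hopper Parallelism (MOP) scheduling.
# Data stays put on its partition; only model state "hops" between rounds.

def mop_epoch(models, partitions, train_subepoch):
    """Run one epoch: every model visits every partition exactly once.

    In round r, model i trains on partition (i + r) % p, so each partition
    hosts at most one model per round. The inner loop is sequential here for
    clarity; in a real system the (model, partition) pairs of a round run in
    parallel, one per worker.
    """
    p = len(partitions)
    assert len(models) <= p, "extra models would be queued by a real scheduler"
    for r in range(p):
        for i, model in enumerate(models):
            part = partitions[(i + r) % p]
            train_subepoch(model, part)  # one pass over this single partition
    return models
```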
More Like this
  1. Although tree models are dominant for tabular data, ML libraries that train them over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized into a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer training performance competitive with specialized ML libraries... with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preservation, the key property of the variance semi-ring, to support RMSE, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, database size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
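As a rough illustration of the residual-update idea, the sketch below adds a residual column to a DuckDB table and updates it in place; the table, column names, and toy data are assumptions for illustration and do not reflect JoinBoost's actual schema or API.

```python
# Sketch: residual updates kept as a new column on a columnar DBMS (DuckDB).
import duckdb

con = duckdb.connect()
# Toy "fact" table with a label y and the model's current prediction.
con.execute("""
    CREATE TABLE fact AS
    SELECT * FROM (VALUES (1, 10.0, 8.0), (2, 4.0, 5.0)) t(id, y, pred)
""")

# Rather than materializing a joined training table, keep residuals as an
# extra column; on a columnar engine this only appends one new column.
con.execute("ALTER TABLE fact ADD COLUMN residual DOUBLE")
con.execute("UPDATE fact SET residual = y - pred")

print(con.execute("SELECT id, residual FROM fact ORDER BY id").fetchall())
# [(1, 2.0), (2, -1.0)]
```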
  2. In file systems and database management systems (DBMSes), deleting data marks it as unallocated storage rather than explicitly erasing it. This data can be reconstructed from raw storage, making it vulnerable to data theft, exposing organizations to liability and compliance risks, and violating data retention and destruction policies. The problem is further magnified in DBMSes because (unlike in file systems) DBMS backups are performed at page granularity and will therefore include such deleted records. Data erasure (or sanitization) is a process that eliminates this vulnerability, providing users with “the right to be forgotten”. However, most work on data sanitization addresses erasing data at the file system level, not in DBMSes. The limited existing work on database sanitization takes an erase-on-commit approach, which can introduce significant I/O bottlenecks. In this paper, we describe a novel data sanitization method, DBSanitizer, that 1) is DBMS agnostic, 2) can batch value erasure, and 3) targets specific data to erase. DBSanitizer is designed as a template for DBMS vendors to support backup sanitization and ensure that no undesirable data is retained in backups. In this paper, we demonstrate how our approach can be used in any row-store relational DBMS (including Oracle, PostgreSQL, MySQL, and SQLite). As there are no backup sanitization tools available on the market or in the research literature, we evaluate DBSanitizer against a live database that supports an erase-on-commit sanitization approach.
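The sketch below gives a heavily simplified flavor of batched backup sanitization: it scans fixed-size pages of a backup file and overwrites targeted byte patterns in place. A real tool such as DBSanitizer would parse DBMS-specific page layouts to locate unallocated records rather than pattern-matching raw bytes; the page size and file-based approach here are assumptions.

```python
# Simplified sketch of page-wise, batched sanitization of a backup file.
PAGE_SIZE = 8192  # assumed page size; real DBMS page sizes vary

def sanitize_backup(path, targets):
    """Overwrite every occurrence of each targeted byte string with zeros."""
    with open(path, "r+b") as f:
        page_no = 0
        while (page := f.read(PAGE_SIZE)):
            for t in targets:
                idx = page.find(t)
                while idx != -1:
                    f.seek(page_no * PAGE_SIZE + idx)
                    f.write(b"\x00" * len(t))
                    idx = page.find(t, idx + len(t))
            page_no += 1
            f.seek(page_no * PAGE_SIZE)  # position at the start of the next page
```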
  3. Real-time applications such as autonomous and connected cars, surveillance, and online learning have to train on streaming data. They require low-latency, high-throughput machine learning (ML) functions resident in the network and in the cloud to perform learning and inference. NFV on edge cloud platforms can support these applications through heterogeneous computing, including GPUs and other accelerators, to offload ML-related computation. GPUs provide the speedup needed for learning and inference to meet the needs of these latency-sensitive real-time applications. Supporting ML inference and learning efficiently for streaming data on NFV platforms poses several challenges. In this paper, we present a framework, NetML, that runs existing ML applications on a heterogeneous NFV platform that includes both CPUs and GPUs. NetML efficiently transfers the appropriate packet payload to the GPU, minimizing overheads, avoiding locks, and avoiding CPU-based data copies. Additionally, NetML minimizes latency by maximizing overlap between data movement and GPU computation. We evaluate the efficiency of our approach for training and inference using popular object detection algorithms on our platform. NetML reduces the latency for inferring images by more than 20% and increases training throughput by 30% while reducing CPU utilization compared to other state-of-the-art alternatives.
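The general technique of overlapping host-to-GPU copies with GPU computation can be sketched as below; this uses PyTorch CUDA streams purely for illustration and is not how NetML itself is implemented.

```python
# Sketch: overlap host-to-GPU transfers with GPU compute using a copy stream.
import torch

def pipelined_inference(batches, model, device="cuda"):
    copy_stream = torch.cuda.Stream()
    results, pending = [], None
    for host_batch in batches:                 # host_batch should be a pinned CPU tensor
        with torch.cuda.stream(copy_stream):   # stage the next copy asynchronously
            staged = host_batch.to(device, non_blocking=True)
        if pending is not None:
            results.append(model(pending))     # compute on the default stream while copying
        torch.cuda.current_stream().wait_stream(copy_stream)  # copy done before use
        pending = staged
    if pending is not None:
        results.append(model(pending))         # drain the last staged batch
    return results
```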
  4. In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes. 
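A gym-style interface for DBMS training-data collection might look roughly like the sketch below; the class and method names are illustrative assumptions, not the paper's actual API.

```python
# Sketch of a gym-style environment where the DBMS itself is the simulator.
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Observation:
    workload_features: dict   # e.g., query mix, arrival rates
    system_metrics: dict      # e.g., buffer hit rate, tail latency

class DatabaseEnv(Protocol):
    """Pluggable component wrapping a live DBMS and a workload replayer."""

    def reset(self) -> Observation:
        """Restore the DBMS to a known state and return the first observation."""
        ...

    def step(self, action: Any) -> tuple[Observation, float, bool]:
        """Apply an action (e.g., build an index), replay the workload, and
        return (observation, reward, done) as one training sample."""
        ...
```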
  5. From the United States’ Health Insurance Portability and Accountability Act (HIPAA) to the European Union’s General Data Protection Regulation (GDPR), there has been an increased focus on individual data privacy protection. Because multiple enforcement agencies (such as legal entities and external governing bodies) have jurisdiction over data governance, the same data value may be subject to multiple (and potentially conflicting) policies. As a result, managing and enforcing all applicable legal requirements has become a complex task. In this paper, we present a comprehensive overview of the steps to integrating data retention and purging into a database management system (DBMS). We describe the changes necessary at each step of data lifecycle management, the minimum functionality that any DBMS (relational or NoSQL) must support, and the guarantees provided by this system. Our proposed solution 1) is completely transparent from the perspective of the DBMS user; 2) requires only minimal tuning by the database administrator; 3) imposes a negligible performance overhead and a modest storage overhead; and 4) automates the enforcement of both retention and purging policies in the database.
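Automated retention enforcement of this kind can be sketched with a policy table and a periodic purge pass, as below; the SQLite schema, table names, and retention period are assumptions for illustration, not the paper's design.

```python
# Sketch: a retention-policy table plus a purge pass, using SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patient_records (id INTEGER, data TEXT, created_at TEXT);
    -- One row per governed table: how long its rows may be kept.
    CREATE TABLE retention_policy (table_name TEXT, retain_days INTEGER);
    INSERT INTO retention_policy VALUES ('patient_records', 2555);  -- ~7 years
""")

def enforce_purge(con):
    """Delete rows older than their table's retention period.

    A real system would also sanitize the freed storage and backups; the
    table-name interpolation here is acceptable only for a sketch.
    """
    for table, days in con.execute(
        "SELECT table_name, retain_days FROM retention_policy"
    ):
        con.execute(
            f"DELETE FROM {table} WHERE created_at < datetime('now', ?)",
            (f"-{days} days",),
        )
    con.commit()

enforce_purge(con)
```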