skip to main content


Title: Distributed deep learning on data systems: a comparative analysis of approaches
Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but to bring MOP-based DL to DB-resident data, we show that there is no single "best" approach, and an interesting tradeoff space of approaches exists. We explain four canonical approaches and build prototypes upon Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance) and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.  more » « less
Award ID(s):
1942724
NSF-PAR ID:
10337018
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
10
ISSN:
2150-8097
Page Range / eLocation ID:
1769 to 1782
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Today’s systems, rely on sending all the data to the cloud, and then use complex algorithms, such as Deep Neural Networks, which require billions of parameters and many hours to train a model. In contrast, the human brain can do much of this learning effortlessly. Hyperdimensional (HD) Computing aims to mimic the behavior of the human brain by utilizing high dimensional representations. This leads to various desirable properties that other Machine Learning (ML) algorithms lack such as: robustness to noise in the system and simple, highly parallel operations. In this paper, we propose \(\mathsf {HyDREA} \) , a Hy per D imensional Computing system that is R obust, E fficient, and A ccurate. We propose a Processing-in-Memory (PIM) architecture that works in a federated learning environment with challenging communication scenarios that cause errors in the transmitted data. \(\mathsf {HyDREA} \) adaptively changes the bitwidth of the model based on the signal to noise ratio (SNR) of the incoming sample to maintain the accuracy of the HD model while achieving significant speedup and energy efficiency. Our PIM architecture is able to achieve a speedup of 28 × and 255 × better energy efficiency compared to the baseline PIM architecture for Classification and achieves 32 × speed up and 289 × higher energy efficiency than the baseline architecture for Clustering. \(\mathsf {HyDREA} \) is able to achieve this by relaxing hardware parameters to gain energy efficiency and speedup while introducing computational errors. We show experimentally, HD Computing is able to handle the errors without a significant drop in accuracy due to its unique robustness property. For wireless noise, we found that \(\mathsf {HyDREA} \) is 48 × more robust to noise than other comparable ML algorithms. Our results indicate that our proposed system loses less than \(1\% \) Classification accuracy, even in scenarios with an SNR of 6.64. We additionally test the robustness of using HD Computing for Clustering applications and found that our proposed system also looses less than \(1\% \) in the mutual information score, even in scenarios with an SNR under 7 dB , which is 57 × more robust to noise than K-means. 
    more » « less
  2. Real-time applications such as autonomous and connected cars, surveillance, and online learning applications have to train on streaming data. They require low-latency, high throughput machine learning (ML) functions resident in the network and in the cloud to perform learning and inference. NFV on edge cloud platforms can provide support for these applications by having heterogeneous computing including GPUs and other accelerators to offload ML-related computation. GPUs provide the necessary speedup for performing learning and inference to meet the needs of these latency sensitive real-time applications. Supporting ML inference and learning efficiently for streaming data in NFV platforms has several challenges. In this paper, we present a framework, NetML, that runs existing ML applications on an heterogeneous NFV platform that includes both CPUs and GPUs. NetML efficiently transfers the appropriate packet payload to the GPU, minimizing overheads, avoiding locks, and avoiding CPU-based data copies. Additionally, NetML minimizes latency by maximizing overlap between the data movement and GPU computation. We evaluate the efficiency of our approach for training and inference using popular object detection algorithms on our platform. NetML reduces the latency for inferring images by more than 20% and increases the training throughput by 30% while reducing CPU utilization compared to other state-of-the-art alternatives. 
    more » « less
  3. Many applications that use large-scale machine learning (ML) increasingly prefer different models for subgroups (e.g., countries) to improve accuracy, fairness, or other desiderata. We call this emerging popular practice learning over groups , analogizing to GROUP BY in SQL, albeit for ML training instead of SQL aggregates. From the systems standpoint, this practice compounds the already data-intensive workload of ML model selection (e.g., hyperparameter tuning). Often, thousands of models may need to be trained, necessitating high-throughput parallel execution. Alas, most ML systems today focus on training one model at a time or at best, parallelizing hyperparameter tuning. This status quo leads to resource wastage, low throughput, and high runtimes. In this work, we take the first step towards enabling and optimizing learning over groups from the data systems standpoint for three popular classes of ML: linear models, neural networks, and gradient-boosted decision trees. Analytically and empirically, we compare standard approaches to execute this workload today: task-parallelism and data-parallelism. We find neither is universally dominant. We put forth a novel hybrid approach we call grouped learning that avoids redundancy in communications and I/O using a novel form of parallel gradient descent we call Gradient Accumulation Parallelism (GAP). We prototype our ideas into a system we call Kingpin built on top of existing ML tools and the flexible massively-parallel runtime Ray. An extensive empirical evaluation on large ML benchmark datasets shows that Kingpin matches or is 4x to 14x faster than state-of-the-art ML systems, including Ray's native execution and PyTorch DDP. 
    more » « less
  4. Software security depends on coordinated vulnerability disclosure (CVD) from researchers, a process that the community has continually sought to measure and improve. Yet, CVD practices are only as effective as the data that informs them. In this paper, we use DScope, a cloud-based interactive Internet telescope, to build statistical models of vulnerability lifecycles, bridging the data gap in over 20 years of CVD research. By analyzing application-layer Internet scanning traffic over two years, we identify real-world exploitation timelines for 63 threats. We bring this data together with six additional datasets to build a complete birth-to-death model of these vulnerabilities, the most complete analysis of vulnerability lifecycles to date. Our analysis reaches three key recommendations: (1) CVD across diverse vendors shows lower effectiveness than previously thought, (2) intrusion detection systems are underutilized to provide protection for critical vulnerabilities, and (3) existing data sources of CVD can be augmented by novel approaches to Internet measurement. In this way, our vantage point offers new opportunities to improve the CVD process, achieving a safer software ecosystem in practice. 
    more » « less
  5. Software security depends on coordinated vulnerability disclosure (CVD) from researchers, a process that the community has continually sought to measure and improve. Yet, CVD practices are only as effective as the data that informs them. In this paper, we use DScope, a cloud-based interactive Internet telescope, to build statistical models of vulnerability lifecycles, bridging the data gap in over 20 years of CVD research. By analyzing application-layer Internet scanning traffic over two years, we identify real-world exploitation timelines for 63 threats. We bring this data together with six additional datasets to build a complete birth-to-death model of these vulnerabilities, the most complete analysis of vulnerability lifecycles to date. Our analysis reaches three key recommendations: (1) CVD across diverse vendors shows lower effectiveness than previously thought, (2) intrusion detection systems are underutilized to provide protection for critical vulnerabilities, and (3) existing data sources of CVD can be augmented by novel approaches to Internet measurement. In this way, our vantage point offers new opportunities to improve the CVD process, achieving a safer software ecosystem in practice. 
    more » « less