

Title: Orchestrating data placement and query execution in heterogeneous CPU-GPU DBMS
There has been growing interest in using GPUs to accelerate data analytics due to their massive parallelism and high memory bandwidth. The main constraint on using GPUs for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigating limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve state-of-the-art CPU-GPU data analytics engines by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware fine-grained caching policy that takes into account various aspects of the workload, such as query semantics, data correlation, and query frequency, when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor that can fully exploit data in both CPU and GPU memory and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy outperforms the best traditional caching policy by up to 3×. Mordred also outperforms existing GPU DBMSs by an order of magnitude.
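For intuition, here is a minimal sketch of what a semantic-aware, fine-grained caching policy might look like: column segments are scored by access frequency weighted by the role they play in queries, then greedily packed into the GPU memory budget. The Segment structure, weights, and scoring formula are illustrative assumptions, not Mordred's published design.

```python
from dataclasses import dataclass

# Hypothetical semantic weights: columns used in filters benefit more from
# GPU residency than columns only touched late in the plan. These numbers
# are illustrative, not Mordred's published formula.
SEMANTIC_WEIGHT = {"filter": 3.0, "join": 2.0, "groupby": 1.5, "project": 1.0}

@dataclass
class Segment:
    name: str        # e.g. a column segment such as "lo_orderdate[0:1M]"
    size_bytes: int
    freq: int        # how many recent queries touched this segment
    role: str        # dominant role in the workload: filter/join/...

def score(seg: Segment) -> float:
    """Weigh access frequency by query semantics, per byte of GPU memory."""
    return seg.freq * SEMANTIC_WEIGHT.get(seg.role, 1.0) / seg.size_bytes

def place_in_gpu(segments: list, gpu_budget: int) -> set:
    """Greedily cache the highest-scoring segments until the budget is full."""
    chosen, used = set(), 0
    for seg in sorted(segments, key=score, reverse=True):
        if used + seg.size_bytes <= gpu_budget:
            chosen.add(seg.name)
            used += seg.size_bytes
    return chosen

segments = [
    Segment("lo_orderdate", 8_000, 90, "filter"),
    Segment("lo_revenue", 8_000, 90, "project"),
    Segment("lo_comment", 64_000, 5, "project"),
]
print(place_in_gpu(segments, gpu_budget=16_000))
# {'lo_orderdate', 'lo_revenue'}  (set order may vary)
```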
Award ID(s):
2028818
PAR ID:
10373971
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
15
Issue:
11
ISSN:
2150-8097
Page Range / eLocation ID:
2491 to 2503
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the rise of machine learning and data analytics, the ability to process large and diverse sets of data efficiently has become crucial. Research has shown that data transformation is a key performance bottleneck for applications across a variety of domains, from data analytics to scientific computing. Custom hardware accelerators and GPU implementations targeting specific data transformation tasks can alleviate the problem, but suffer from narrow applicability and lack of generality. To tackle this problem, we propose a GPU-accelerated data transformation engine grounded in pushdown transducers. We define an extended pushdown transducer abstraction (effPDT) that allows expressing a wide range of data transformations in a memory-efficient fashion and is thus amenable to GPU deployment. The effPDT execution engine utilizes a data streaming model that significantly reduces the application's memory requirements, facilitating deployment on high- and low-end systems. We showcase our GPU-accelerated engine on a diverse set of transformation tasks covering data encoding/decoding, parsing and querying of structured data, and matrix transformation, and we evaluate it against publicly available CPU and GPU library implementations of the considered data transformation tasks. To understand the benefits of the effPDT abstraction, we extend our data transformation engine to also support finite state transducers (FSTs), map the considered data transformation tasks onto FSTs, and compare the performance and resource requirements of the FST-based and effPDT-based implementations.
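For intuition about the pushdown-transducer model (not the effPDT encoding itself), the toy transducer below streams through a nested input and tags each value with its nesting depth, using memory proportional to the maximum nesting rather than the input size:

```python
def pdt_depth_tagger(stream):
    """A toy pushdown transducer: emits each value token tagged with its
    nesting depth. Transitions are hard-coded here; effPDT would express
    them as a transition table (our assumption of a plausible encoding)."""
    stack = []      # the pushdown store: one entry per open scope
    token = []
    for ch in stream:
        if ch in "{},":
            if token:                           # flush the pending token
                yield (len(stack), "".join(token))
                token = []
            if ch == "{":
                stack.append(ch)                # push: enter a nested scope
            elif ch == "}":
                stack.pop()                     # pop: leave the scope
        else:
            token.append(ch)
    if token:
        yield (len(stack), "".join(token))

# Streaming usage: output is produced as input is consumed.
print(list(pdt_depth_tagger("a,{b,{c},d},e")))
# [(0, 'a'), (1, 'b'), (2, 'c'), (1, 'd'), (0, 'e')]
```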
  2. Despite the popularity of in-situ analytics in scientific computing, there is only limited work to date on in-situ analytics for simulations running on GPUs. Notably, two unaddressed challenges are 1) performing memory-efficient in-situ analysis on accelerators and 2) automatically choosing the processing resources and suitable data representation for a given query and platform. This paper addresses both problems. First, GAP makes several new contributions toward making bitmap indices suitable, effective, and efficient as a compressed data summary structure for GPUs: it introduces a layout structure, a method for generating multi-attribute bitmaps, and novel techniques for bitmap-based processing of the major operators that comprise complex data analytics. Second, this paper presents a performance modeling methodology that aims to predict the placement (i.e., CPU or GPU) and the data representation choice (summarization or original) that yield the best performance on a given configuration. Our extensive evaluation of complex in-situ queries and real-world simulations shows that with our methods, analytics on the GPU using bitmaps almost always outperforms the other options, and the GAP performance model predicts the optimal placement and data representation for most scenarios.
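As a rough sketch of the bitmap-as-summary idea (Python ints standing in for GAP's packed GPU bit vectors), the code below builds one bitmap per value bin and answers a range predicate by OR-ing the bitmaps of overlapping bins:

```python
def build_bitmaps(values, bins):
    """One bitmap per value bin; bit i is set if row i falls in that bin.
    Python ints stand in for packed bit vectors a GPU would process in
    parallel (an illustration, not GAP's actual layout)."""
    bitmaps = [0] * len(bins)
    for row, v in enumerate(values):
        for b, (lo, hi) in enumerate(bins):
            if lo <= v < hi:
                bitmaps[b] |= 1 << row
                break
    return bitmaps

def select_range(bitmaps, bins, lo, hi):
    """Answer lo <= x < hi by OR-ing bitmaps of overlapping bins
    (exact when the query aligns with bin boundaries, as here)."""
    out = 0
    for b, (blo, bhi) in enumerate(bins):
        if blo < hi and lo < bhi:     # bin overlaps the query range
            out |= bitmaps[b]
    return out

temps = [12, 55, 73, 40, 90]
bins = [(0, 25), (25, 50), (50, 75), (75, 101)]
bm = build_bitmaps(temps, bins)
mask = select_range(bm, bins, 50, 101)                    # temp >= 50
print([i for i in range(len(temps)) if mask >> i & 1])    # [1, 2, 4]
```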
  3. Alternating Least Squares (ALS) is a classic algorithm for matrix factorization, widely used in recommendation systems. Existing efforts focus on parallelizing ALS on multi-/many-core platforms to handle large datasets. Recently, an optimized ALS variant called eALS was proposed; it yields significantly lower time complexity and higher recommendation accuracy than ALS. However, it is challenging to parallelize eALS on modern parallel architectures (e.g., CPUs and GPUs) because: 1) eALS' data dependences prevent fine-grained parallel execution, so eALS cannot fully utilize the GPU's massive parallelism; 2) the sparsity of the input data causes poor data locality and unbalanced workloads; and 3) its large memory usage cannot fit into the GPU's limited on-device memory, particularly for real-world large datasets. This paper proposes the first efficient CPU/GPU heterogeneous recommendation system based on fast eALS (called HEALS), which consists of a set of system optimizations. HEALS employs newly designed architecture-adaptive data formats to achieve load balance and good data locality on the CPU and GPU. HEALS also presents a CPU/GPU collaboration model that can exploit both task parallelism and data parallelism, and it optimizes this collaboration model with data communication overlapping and dynamic workload partitioning between CPU and GPU (see the sketch after this item). Moreover, HEALS is further enhanced by various parallel techniques (e.g., loop unrolling, vectorization, and GPU parallel reduction). Evaluation results show that HEALS outperforms other state-of-the-art baselines in both performance and recommendation quality. In particular, HEALS achieves up to 5.75× better performance than a state-of-the-art ALS GPU library. This work also demonstrates the possibility of conducting fast recommendation on large datasets with constrained (or relaxed) hardware resources, e.g., a single CPU/GPU node.
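As a toy illustration of dynamic workload partitioning (a simple proportional heuristic of our own, not HEALS' actual policy), the sketch below nudges the CPU's share of each batch toward whichever device finished early:

```python
def rebalance(ratio, cpu_time, gpu_time, step=0.05):
    """Shift the CPU's share of the next batch toward the faster device.
    A proportional-step heuristic for illustration only; HEALS' published
    partitioning scheme is more involved."""
    if cpu_time > gpu_time:            # CPU was the straggler: shrink its share
        return max(0.0, ratio - step)
    return min(1.0, ratio + step)      # GPU was the straggler: grow CPU share

ratio = 0.3                            # start with 30% of items on the CPU
timings = [(1.2, 0.8), (1.0, 0.9), (0.7, 0.9)]   # stub (cpu_s, gpu_s) per batch
for cpu_s, gpu_s in timings:
    ratio = rebalance(ratio, cpu_s, gpu_s)
print(f"converged CPU share: {ratio:.2f}")        # converged CPU share: 0.25
```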
  4. As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt incrementally: they only support a subset of their CPU equivalents, only work with datasets that fit in device memory, and require developers to reason about data placement and transfers manually. To address these shortcomings, we present a new approach called offload annotations (OAs) that enables heterogeneous GPU computing in existing workloads with few or no code modifications. An annotator annotates the types and functions in a CPU library with equivalent kernel library functions and provides an offloading API to specify how the inputs and outputs of a function can be partitioned into chunks that fit in device memory and transferred between devices. A runtime then maps existing CPU functions to equivalent GPU kernels and schedules execution, data transfers, and paging. In data science workloads using CPU libraries such as NumPy and Pandas, OAs enable speedups of up to 1200× and a median speedup of 6.3× by transparently offloading functions to a GPU using existing kernel libraries. In many cases, OAs match the performance of handwritten heterogeneous implementations. Finally, OAs can automatically page data in these workloads to scale to datasets larger than GPU memory, which would need to be done manually with most current GPU libraries.
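A minimal sketch of the annotation idea, using an invented decorator API rather than the paper's actual OA interface: a CPU function is annotated with a "GPU" equivalent plus a chunking scheme, and the wrapper pages large inputs through the kernel in device-sized chunks:

```python
def offload(gpu_fn, chunk_rows, merge):
    """Annotate a CPU function with a GPU equivalent and a chunking scheme.
    The decorator name, parameters, and dispatch rule are our invention,
    loosely in the spirit of offload annotations."""
    def wrap(cpu_fn):
        def run(data):
            if len(data) < chunk_rows:                 # small input: stay on CPU
                return cpu_fn(data)
            parts = [gpu_fn(data[i:i + chunk_rows])    # page chunk-by-chunk
                     for i in range(0, len(data), chunk_rows)]
            return merge(parts)                        # combine partial results
        return run
    return wrap

@offload(gpu_fn=sum, chunk_rows=4, merge=sum)          # sum stands in for a kernel
def total(xs):
    return sum(xs)

print(total([1, 2, 3]))          # 6: runs the CPU path
print(total(list(range(10))))    # 45: chunked "GPU" path, 3 pages of <= 4 rows
```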