Floratos, Sofoklis; Xiao, Mengbai; Wang, Hao; Guo, Chengxin; Yuan, Yuan; Lee, Rubao; Zhang, Xiaodong (Proceedings of the 2021 IEEE International Conference on Data Engineering)
Nested queries are commonly used to express complex use cases by connecting the output of a subquery as an input to the outer query block. However, their execution is highly time-consuming. Researchers have proposed various algorithms and techniques that unnest subqueries to improve performance. Because unnesting is a customized approach that demands substantial algorithmic and engineering effort, it is largely unavailable as a feature in most existing database systems.
Our approach is general-purpose and based on GPU acceleration, aiming for high performance at minimum development cost. We examine the major differences between nested and unnested query structures to identify their merits and limits for GPU processing. We then focus on the nested approach, which is algorithmically simple, rich in parallelism, relatively low in space complexity, and generic in program structure. We create a new code-generation framework that tailors the nested method to the GPU. We also make several critical system optimizations, including massively parallel scanning with indexing, effective vectorization of join operations, exploitation of cache locality in loops, and efficient GPU memory management. We have implemented the proposed solutions in NestGPU, a GPU-based column-store database system that is GPU-device independent. We have extensively evaluated and tested the system to show the effectiveness of our proposed methods.
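To make the nested-versus-unnested distinction concrete, the following Python sketch (illustrative only; it is not NestGPU code, and the table and column names are invented) evaluates a correlated subquery in nested form by re-running the inner block once per outer row. Each inner evaluation is independent, which is the parallelism the nested approach exposes to a GPU.

# Minimal sketch (not NestGPU's implementation) of a correlated subquery such as
#   SELECT o.id FROM orders o
#   WHERE o.amount > (SELECT AVG(i.amount) FROM orders i WHERE i.cust = o.cust)
# evaluated in "nested" form: the inner block runs once per outer row, and every
# evaluation is independent, so the outer loop is trivially parallel.
orders = [
    {"id": 1, "cust": "a", "amount": 50},
    {"id": 2, "cust": "a", "amount": 150},
    {"id": 3, "cust": "b", "amount": 80},
    {"id": 4, "cust": "b", "amount": 20},
]

def inner_block(outer_row, table):
    """Correlated subquery: average amount for the outer row's customer."""
    amounts = [r["amount"] for r in table if r["cust"] == outer_row["cust"]]
    return sum(amounts) / len(amounts)

# Nested evaluation: one independent inner-block evaluation per outer row.
result = [o["id"] for o in orders if o["amount"] > inner_block(o, orders)]
print(result)  # [2, 3]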
Campbell, C.; Mecca, N.; Duong, T.; Obeid, I.; Picone, J. (IEEE Signal Processing in Medicine and Biology Symposium (SPMB)); Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Eds.)
The goal of this work was to design a low-cost computing facility that can support the development of an
open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital
pathology scanner can range in size from hundreds of megabytes to five gigabytes. A 1M image database
requires over a petabyte (PB) of disk space. To do meaningful work in this problem space requires a
significant allocation of computing resources. The improvements and expansions to our HPC (high-performance
computing) cluster, known as Neuronix [2], required to support working with digital
pathology fall into two broad categories: computation and storage. To handle the increased computational
burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For
storage, we have designed and implemented a multi-layer filesystem architecture to distribute a filesystem
across multiple machines. These enhancements, which are entirely based on open source software, have
extended the capabilities of our cluster and increased its cost-effectiveness.
Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most
notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous
performance increase in machine learning applications [4], and Slurm's built-in mechanisms for handling
them were a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be
used to configure and enable support for resources beyond the ones provided by the traditional HPC
scheduler (e.g. memory, wall-clock time), and GPUs are among the GRES types that can be supported by
Slurm [5]. In addition to being able to track resources, Slurm does strict enforcement of resource allocation.
This becomes very important as the computational demands of jobs increase: each job is guaranteed the
resources it requests and cannot take resources away from other jobs. It is a common practice among
GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs,
attempting to establish a context on all of them. Slurm is able to affect the hardware discovery process of
these jobs, which enables a number of these jobs to run alongside each other, even if the GPUs are in
exclusive-process mode.
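This discovery behavior can be sketched briefly in Python (an assumed, typical Slurm GRES setup rather than the cluster's actual configuration): when Slurm grants a job a subset of a node's GPUs, it constrains which devices the CUDA runtime exposes, so a framework that enumerates GPUs only sees its allocation.

# Illustrative sketch: Slurm's GRES allocation typically restricts device visibility,
# e.g. via CUDA_VISIBLE_DEVICES, so a framework that "iterates over the list of GPUs"
# discovers only the devices granted to this job. Setup details are assumptions.
import os

visible = os.environ.get("CUDA_VISIBLE_DEVICES")
print("GPUs exposed to this job:", visible if visible else "all devices (unconstrained)")

try:
    import torch  # any CUDA-aware framework behaves similarly
    print("devices discovered by the framework:", torch.cuda.device_count())
except ImportError:
    pass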
To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage
solution. We utilized a number of open source tools to create a single filesystem, which can be mounted
by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into
four 60-disk chassis populated with 8 TB drives. To support these disks, we have two server units, each equipped with
Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we have implemented a multi-layer solution
that: (1) connects the disks together into a single filesystem/mountpoint using the ZFS (Zettabyte File
System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using
Gluster [7].
ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem which takes
advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data
corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent
log, or ZIL) and copy-on-write functionality. Each machine (one controller plus two disk chassis) has its own
separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these and provides the means
to connect them together over the network, using distributed (similar to RAID 0, but without striping
individual files) and mirrored (similar to RAID 1) configurations [8].
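As a rough illustration of this two-layer design (pool names, volume names, brick paths, and the replica count below are assumptions, not the cluster's actual configuration), the following Python dry run prints one plausible sequence of ZFS and Gluster commands:

# Hypothetical sketch: one ZFS pool per storage server, then a Gluster volume that
# aggregates the per-server filesystems over the network. Printed as a dry run.
zfs_layer = [
    # One RAID-Z2 pool per server, built from that server's disk chassis.
    "zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd",
    "zfs set compression=lz4 tank",
]
gluster_layer = [
    # Distributed-replicated volume: files are spread across servers (RAID 0-like,
    # whole files rather than stripes) and mirrored between brick pairs (RAID 1-like).
    "gluster volume create pathology replica 2 transport tcp "
    "node1:/tank/brick node2:/tank/brick node3:/tank/brick node4:/tank/brick",
    "gluster volume start pathology",
    # Any client on the network can then mount the single namespace:
    "mount -t glusterfs node1:/pathology /mnt/pathology",
]

for cmd in zfs_layer + gluster_layer:
    print(cmd)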
By implementing these improvements, it has been possible to expand the storage and computational power
of the Neuronix cluster arbitrarily to support the most computationally-intensive endeavors by scaling
horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent
price/performance ratio [1].
Hu, Yu-Ching; Li, Yuliang; Tseng, Hung-Wei (Proceedings of the 2022 International Conference on Management of Data)
The emergence of novel hardware accelerators has powered the
tremendous growth of machine learning in recent years. These
accelerators deliver incomparable performance gains in processing
high-volume matrix operators, particularly matrix multiplication, a
core component of neural network training and inference. In this
work, we explored opportunities of accelerating database systems
using NVIDIA’s Tensor Core Units (TCUs). We present TCUDB, a
TCU-accelerated query engine processing a set of query operators
including natural joins and group-by aggregates as matrix operators
within TCUs. Expressing query operators as matrix multiplication was considered inefficient in
the past, and this strategy has remained largely unexplored in
conventional GPU-based databases, which primarily rely on vector
or scalar processing. We demonstrate the significant performance
gain of TCUDB in a range of real-world applications including
entity matching, graph query processing, and matrix-based data
analytics. TCUDB achieves up to 288× speedup compared to a
baseline GPU-based query engine.
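To make the join-as-matrix-multiplication idea concrete, here is a small NumPy sketch (an illustration of the general technique, not TCUDB's implementation; the example tables are invented):

# A natural join on a key column cast as matrix multiplication: encode each
# table's join keys as a one-hot matrix over the key domain; the product counts
# matching pairs, which is exactly what a join (or a group-by count) needs, and
# it maps onto dense matrix hardware such as Tensor Cores.
import numpy as np

r_keys = np.array([0, 1, 1, 2])       # join-key column of relation R
s_keys = np.array([1, 2, 2, 3])       # join-key column of relation S
domain = 4                            # size of the key domain

R = np.zeros((len(r_keys), domain))   # one row per R tuple, one-hot on its key
R[np.arange(len(r_keys)), r_keys] = 1
S = np.zeros((len(s_keys), domain))
S[np.arange(len(s_keys)), s_keys] = 1

# (R @ S.T)[i, j] == 1 exactly when R tuple i and S tuple j share a join key.
match = R @ S.T
print(np.argwhere(match == 1))        # index pairs that would appear in the join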
Floratos, Sofoklis; Ghazal, Ahmed (2021 37th IEEE International Conference on Data Engineering)
Relational database management systems (RDBMS) have limited iterative processing support. Recursive queries were added to ANSI SQL; however, their semantics do not allow aggregation functions, which disqualifies their use for several applications, such as PageRank and shortest-path computations. Recently, another SQL extension, iterative Common Table Expressions (CTEs), was proposed to enable users to perform general iterative computations on RDBMSs. In this work, we demonstrate how iterative CTEs can be efficiently incorporated into a production RDBMS without major intrusion to the system. We have prototyped our approach on Futurewei's MPPDB, a shared-nothing relational parallel database engine. The implementation is based on a functional rewrite that translates iterative CTEs to other existing SQL operators. Thus, query plans of iterative CTEs can be optimized and executed by the engine with minimal modification to the code base. We have also applied several optimizations specifically for iterative CTEs to i) minimize data movement, ii) reuse results that remain constant, and iii) push down predicates to avoid unnecessary data processing. We verified our implementation through extensive experimental evaluation using real-world datasets and queries. The results show the feasibility of the rewrite approach and the effectiveness of the optimizations, which improve performance by an order of magnitude in some cases.
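As a rough sketch of what an iterative CTE expresses and what a functional rewrite has to drive (illustrative Python over an invented edge table, not the MPPDB implementation), single-source shortest path iterates a join-plus-MIN-aggregate step until a fixed point:

# An iterative CTE amounts to a loop that repeatedly applies ordinary relational
# operators (join + aggregate) to a working table until it stops changing. The
# MIN aggregation inside the iteration is what plain recursive SQL cannot express.
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]
dist = {"a": 0.0}                       # working table: node -> best known distance

for _ in range(len({u for u, v, w in edges} | {v for u, v, w in edges})):
    # "Join" the working table with the edge table, then "aggregate" with MIN.
    candidates = dict(dist)
    for u, v, w in edges:
        if u in dist:
            candidates[v] = min(candidates.get(v, float("inf")), dist[u] + w)
    if candidates == dist:               # fixed point reached: stop iterating
        break
    dist = candidates

print(dist)   # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}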
Mailthody, Vikram Sharma; Qureshi, Zaid; Liang, Weixin; Feng, Ziyan; de Gonzalo, Simon Garcia; Li, Youjie; Franke, Hubertus; Xiong, Jinjun; Huang, Jian; Hwu, Wen-mei (Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO'19))
Recent advancements in deep learning techniques facilitate intelligent-query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization of various intelligent-query workloads developed with deep neural networks (DNNs) shows that the storage I/O bandwidth is still the major bottleneck, contributing 56%--90% of the query execution time.
To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically for supporting DNN-based intelligent queries, under the resource constraints in modern SSD controllers; (2) a similarity-based in-storage query cache to exploit the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system working as the query engine, which provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelisms with design space exploration for achieving the maximal energy efficiency for in-storage accelerators. We validate DeepStore design with an SSD simulator, and evaluate it with a variety of vision, text, and audio based intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves the query performance by up to 17.7×, and energy-efficiency by up to 78.6×.
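The similarity-based query cache can be illustrated with a small sketch (the class, threshold, and embeddings below are invented for illustration and are not DeepStore's implementation):

# A cache keyed on query feature vectors: a new query reuses a cached answer when
# its embedding is within a cosine-similarity threshold of a previously answered
# query, exploiting the temporal locality of user queries.
import numpy as np

class SimilarityCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.keys = []          # cached query embeddings
        self.values = []        # corresponding query results

    def lookup(self, embedding):
        for key, value in zip(self.keys, self.values):
            cos = np.dot(key, embedding) / (np.linalg.norm(key) * np.linalg.norm(embedding))
            if cos >= self.threshold:
                return value     # similar enough: reuse the cached result
        return None              # miss: the query must be executed

    def insert(self, embedding, result):
        self.keys.append(embedding)
        self.values.append(result)

cache = SimilarityCache()
cache.insert(np.array([1.0, 0.0, 0.2]), result="top-k images for query A")
print(cache.lookup(np.array([0.98, 0.02, 0.21])))  # hit: nearly identical embedding
print(cache.lookup(np.array([0.0, 1.0, 0.0])))     # miss: returns None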