Title: Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams
Today’s large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics that combines a “sketch of sketches,” which avoids the overhead of monitoring exponentially many subpopulations, with universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while ensuring interactive estimation times.
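To make the core idea concrete, here is a minimal Python sketch of the “sketch of sketches” structure: a fixed-size outer table of inner sketches, indexed by hashing the subpopulation key, so memory does not grow with the number of subpopulations. A count-min sketch stands in for Hydra's universal sketch (which supports a broader class of statistics); all names and parameters are illustrative, not Hydra's actual API.

```python
# Illustrative "sketch of sketches": colliding subpopulations share an
# inner sketch instead of each getting its own, bounding total memory.
import hashlib

class CountMin:
    """Count-min sketch: approximate frequencies in sublinear space.
    Stands in here for Hydra's universal sketch."""
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

class SketchOfSketches:
    """Outer hash table whose buckets are themselves sketches."""
    def __init__(self, num_buckets=1024):
        self.buckets = [CountMin() for _ in range(num_buckets)]

    def _bucket(self, subpop_key):
        h = hashlib.blake2b(repr(subpop_key).encode(), digest_size=8)
        return self.buckets[int.from_bytes(h.digest(), "big") % len(self.buckets)]

    def add(self, subpop_key, item):
        self._bucket(subpop_key).add(item)

    def estimate(self, subpop_key, item):
        return self._bucket(subpop_key).estimate(item)

# Hypothetical stream of (region, device, video_id) records.
s = SketchOfSketches()
for region, device, video in [("us", "ios", "v1"), ("us", "ios", "v1"),
                              ("eu", "web", "v2")]:
    s.add((region, device), video)
print(s.estimate(("us", "ios"), "v1"))  # ~2
```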
Award ID(s): 2132643, 2107086, 2106946
PAR ID: 10350423
Journal Name: Proceedings of the VLDB Endowment
Volume: 15
Issue: 11
ISSN: 2150-8097
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Lauw H., Wong RW. (Ed.)
    Multidimensional data appear in various interesting applications, e.g., sales data indexed by store, item, and time. Oftentimes, data are observed aggregated over multiple data atoms and thus exhibit low resolution. Temporal aggregation is most common, but many datasets are also aggregated over other attributes. Multidimensional data, in particular, are sometimes available in multiple coarse views, aggregated across different dimensions, especially when sourced by different agencies. For instance, item sales can be aggregated temporally, and over groups of stores based on their location or affiliation. However, data at finer granularity significantly benefit forecasting and data analytics, prompting increasing interest in data disaggregation methods. In this paper, we propose Tendi, a principled model that efficiently disaggregates multidimensional (tensor) data from multiple views aggregated over different dimensions. Tendi employs coupled tensor factorization to fuse the multiple views and provides recovery guarantees under realistic conditions. We also propose a variant of Tendi, called TendiB, which performs the disaggregation task without any knowledge of the aggregation mechanism. Experiments on real data from different domains demonstrate the high effectiveness of the proposed methods.
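As an illustration of the multi-view setup that Tendi inverts, the following Python sketch builds one ground-truth tensor and observes it only through two views aggregated along different dimensions; all shapes and names are hypothetical, and the recovery step itself is omitted.

```python
# Hypothetical multi-view aggregation: one ground-truth tensor X,
# observed only through views aggregated along different dimensions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4, 12))  # stores x items x days (hypothetical shapes)

# View 1: temporal aggregation, daily -> 4-day totals.
A_time = np.kron(np.eye(3), np.ones((1, 4)))       # (3, 12)
view_time = np.einsum('sit,wt->siw', X, A_time)    # stores x items x periods

# View 2: spatial aggregation, 6 stores grouped into 2 regions of 3.
A_store = np.kron(np.eye(2), np.ones((1, 3)))      # (2, 6)
view_store = np.einsum('rs,sit->rit', A_store, X)  # regions x items x days

# Tendi couples CP factorizations of such views (sharing the factors of
# the unaggregated modes) to recover X; TendiB additionally avoids
# needing A_time / A_store. The recovery step itself is omitted here.
print(view_time.shape, view_store.shape)  # (6, 4, 3) (2, 4, 12)
```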
  2. Data analysts commonly use statistics to summarize large datasets. While it is often sufficient to explore only a dataset's summary statistics (e.g., min/mean/max), Anscombe's Quartet demonstrates how such statistics can be misleading. We consider a similar problem in the context of graph mining. To study the relationships between different graph properties and summary statistics, we examine low-order non-isomorphic graphs and provide a simple visual analytics system to explore correlations across multiple graph properties. However, for larger graphs, studying the entire space quickly becomes intractable. We use different random graph generation methods to further look into the distribution of graph properties for higher-order graphs and investigate the impact of various sampling methodologies. We also describe a method for generating many graphs that are identical over a number of graph properties and statistics yet are identifiably distinct.
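In the same spirit, a small sketch using networkx (not the paper's visual analytics system) shows how distinct low-order graphs can collide on several summary statistics at once:

```python
# Compare summary statistics across low-order non-isomorphic graphs:
# structurally distinct graphs can agree on several properties.
import networkx as nx

# All non-isomorphic graphs on exactly 6 nodes, from the Graph Atlas.
graphs = [g for g in nx.graph_atlas_g() if g.number_of_nodes() == 6]

def summary(g):
    return (g.number_of_edges(),
            round(nx.density(g), 3),
            round(nx.average_clustering(g), 3))

# Group graphs that share the same (edges, density, clustering) tuple.
by_stats = {}
for g in graphs:
    by_stats.setdefault(summary(g), []).append(g)

collisions = {k: v for k, v in by_stats.items() if len(v) > 1}
print(f"{len(collisions)} statistic tuples are shared by two or more "
      f"non-isomorphic 6-node graphs")
```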
  3. The increasingly collaborative and distributed nature of scientific research, along with the exploding volume and variety of datasets, points to an urgent need for data publication frameworks that allow researchers to publish data rapidly and reliably. However, current scientific data publication solutions support only one of these requirements at a time. The most common data publication models today are either centralized or ad-hoc. While the centralized model (e.g., publishing via a repository controlled by a central organization) can provide reliability through replication, publication tends to be slower due to inevitable curation and processing delays, and such models may restrict what data can be published through them. Ad-hoc models, on the contrary, raise concerns such as the lack of replication and of a robust security model. We present Hydra, a peer-to-peer, decentralized storage system that enables decentralized and reliable data publication. Hydra lets collaborating organizations create a loosely interconnected, federated storage overlay atop community-provided storage servers. The overlay is entirely decentralized: Hydra enables secure publication of and access to data from anywhere, automatically replicates published data to enhance availability and reliability, and makes replication decisions without a central controller while accommodating local policies. Hydra embodies a significant stride toward next-generation scientific data management, fostering a decentralized, reliable, and accessible system that fits the changing landscape of scientific collaborations.
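The abstract does not specify Hydra's placement algorithm, so as a hedged illustration of how replication decisions can be made without a central controller, here is a sketch using rendezvous (HRW) hashing, a standard technique under which every node independently computes the same replica set. This is a generic stand-in, not Hydra's actual protocol; server names and the data name are hypothetical.

```python
# Rendezvous (HRW) hashing: every node independently ranks all servers
# for a given data name, so replica placement needs no coordination.
import hashlib

def hrw_replicas(data_name, servers, k=3):
    """Top-k servers for an object; identical on every node."""
    def score(server):
        h = hashlib.sha256(f"{server}/{data_name}".encode())
        return int.from_bytes(h.digest()[:8], "big")
    return sorted(servers, key=score, reverse=True)[:k]

servers = ["storeA.edu", "storeB.edu", "storeC.edu", "storeD.edu"]
name = "/genomics/run42/reads.fastq"  # hypothetical data name
print(hrw_replicas(name, servers))

# Any server can decide locally whether it should hold a replica:
should_replicate = "storeB.edu" in hrw_replicas(name, servers)
```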
  4. In the era of big data and cloud computing, large amounts of data are generated by user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementing iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left to the programmer's discretion, and the LRU policy is used to evict RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms that adaptively determines and caches the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying the nodes, in a directed acyclic graph (DAG) representing the computation, whose outputs should persist in memory. Our experimental results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark, decreasing the total work to recompute RDDs by 12%.
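As a hedged illustration of the kind of decision being automated (not the paper's algorithm), the following Python sketch greedily caches the DAG nodes whose reuse saves the most recomputation per unit of memory; all costs, sizes, and names are hypothetical.

```python
# Greedy, cost-based selection of which RDDs to cache under a memory
# budget, instead of relying on LRU eviction.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    compute_cost: float   # cost to (re)compute this RDD once
    size: float           # memory footprint if cached
    consumers: int        # downstream stages that read this RDD
    parents: list = field(default_factory=list)

def lineage_cost(n):
    # Simplification: recomputing an RDD recomputes its whole lineage.
    return n.compute_cost + sum(lineage_cost(p) for p in n.parents)

def choose_cached(nodes, capacity):
    # Savings if cached: lineage cost avoided on each reuse after the first.
    ranked = sorted(nodes,
                    key=lambda n: (n.consumers - 1) * lineage_cost(n) / n.size,
                    reverse=True)
    chosen, used = set(), 0.0
    for n in ranked:
        if n.consumers > 1 and used + n.size <= capacity:
            chosen.add(n.name)
            used += n.size
    return chosen

raw = Node("raw", compute_cost=5, size=40, consumers=1)
feats = Node("features", 20, 10, consumers=3, parents=[raw])
model = Node("model", 8, 5, consumers=1, parents=[feats])
print(choose_cached([raw, feats, model], capacity=15))  # {'features'}
```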
  5. Kariz is a new architecture for caching data from data lakes accessed, potentially concurrently, by multiple analytic platforms. It integrates rich information from analytics platforms with global knowledge about demand and resource availability to enable sophisticated cache-management and prefetching strategies that, for example, combine historical runtime information with job dependency graphs (DAGs) and with information about cache state and sharing across compute clusters. Our prototype supports multiple analytic frameworks (Pig/Hadoop and Spark), and we show that the required changes are modest. We have implemented three algorithms in Kariz for optimizing the caching of individual queries (one from the literature and two novel to our platform) and three policies for optimizing across queries from potentially multiple different clusters. With an algorithm that fully exploits the rich information available from Kariz, we demonstrate major speedups (as much as 3×) for TPC-H and TPC-DS.
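To ground the idea of DAG-aware prefetching, here is a hedged Python sketch (not Kariz's actual planner): while one stage runs, its historical runtime gives a time budget during which the inputs of successor stages can be pulled from the data lake into the cache. All stage names, sizes, and the bandwidth figure are hypothetical.

```python
# DAG-aware prefetching: use a running stage's historical runtime as a
# time budget for fetching successor stages' inputs into the cache.
def plan_prefetch(dag, runtimes, bandwidth, running):
    """dag: stage -> [(successor, input_bytes)]; runtimes: stage -> seconds.
    Returns successors whose inputs fit within the running stage's budget."""
    budget = runtimes[running] * bandwidth   # bytes fetchable meanwhile
    plan = []
    for succ, size in sorted(dag.get(running, []), key=lambda e: e[1]):
        if size <= budget:
            plan.append(succ)
            budget -= size
    return plan

# Hypothetical query plan: a scan feeds a join and an aggregation.
dag = {"scan": [("join", 8e9), ("agg", 2e9)]}
runtimes = {"scan": 120.0}                   # seconds, from history
print(plan_prefetch(dag, runtimes, bandwidth=100e6, running="scan"))
# 120 s * 100 MB/s = 12 GB budget -> ['agg', 'join']
```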