Stratified Random Sampling over Streaming and Stored Data

Nguyen, T; Shih, M; Srivastava, D; Tirthapura, S; Xu, B

doi:10.5441/002/edbt.2019.04

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must, in the worst case, have a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Experiments on real and synthetic data show that S-VOILA typically yields a variance close to that of an optimal offline algorithm that is given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
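To make the "abundant stratum" assumption concrete, here is a minimal Python sketch of classical Neyman allocation, which assigns each stratum a share of the sample budget proportional to its size times its standard deviation. It is not VOILA or S-VOILA, and the record above does not describe those algorithms in enough detail to reproduce them. The function names (neyman_allocation, overallocated) and the example numbers are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def neyman_allocation(sizes: Sequence[int],
                      stds: Sequence[float],
                      budget: int) -> List[float]:
    """Fractional Neyman allocation: s_i = budget * n_i*sigma_i / sum_j n_j*sigma_j.

    Hypothetical helper for illustration; ignores stratum capacity on purpose.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # Every stratum has zero spread: fall back to proportional allocation.
        n_total = sum(sizes)
        return [budget * n / n_total for n in sizes]
    return [budget * w / total for w in weights]


def overallocated(sizes: Sequence[int], allocation: Sequence[float]) -> List[int]:
    """Indices of strata that are asked for more samples than they contain."""
    return [i for i, (n, a) in enumerate(zip(sizes, allocation)) if a > n]


# Illustrative (made-up) strata: the third is tiny but highly variable.
sizes = [1000, 1000, 10]
stds = [1.0, 1.0, 500.0]
alloc = neyman_allocation(sizes, stds, budget=100)

print(alloc)                        # per-stratum sample sizes, summing to the budget
print(overallocated(sizes, alloc))  # strata where the allocation exceeds the stratum size
```

In this made-up example the third stratum is asked for roughly 71 samples although it holds only 10 data points, so the allocation cannot be realized as stated. This is the kind of bounded-stratum case in which, per the abstract, Neyman allocation is no longer optimal.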

Award ID(s): 1725702 1527541

PAR ID: 10110905

Author(s) / Creator(s): Nguyen, T; Shih, M; Srivastava, D; Tirthapura, S; Xu, B

Date Published: 2019-03-01

Journal Name: Advances in Database Technology - 22nd International Conference on Extending Database Technology (EDBT)

Page Range / eLocation ID: 25-36

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Conference Paper:
https://doi.org/10.5441/002/edbt.2019.04
