Title: One-Pass Diversified Sampling with Application to Terabyte-Scale Genomic Sequence Streams
A popular approach to reducing the size of a massive dataset is to apply efficient online sampling to the stream of data as it is read or generated. Online sampling routines are currently restricted to variations of reservoir sampling, where each sample is selected uniformly and independently of other samples. This renders them unsuitable for large-scale applications in computational biology, such as metagenomic community profiling and protein function annotation, which suffer from severe class imbalance. To maintain a representative and diverse sample, we must identify and preferentially select data that are likely to belong to rare classes. We argue that existing schemes for diversity sampling have prohibitive overhead for large-scale problems and high-throughput streams. We propose an efficient sampling routine that uses an online representation of the data distribution as a prefilter to retain elements from rare groups. We apply this method to several genomic data analysis tasks and demonstrate a significant speedup in downstream analysis without sacrificing the quality of the results. Because our algorithm is 2x faster and uses 1000x less memory than coreset-, reservoir-, and sketch-based alternatives, we anticipate that it will become a useful preprocessing step for applications with large-scale streaming data.
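The prefilter idea can be sketched compactly. The Python below is a minimal illustration, assuming a Count-Min Sketch as the online representation of the data distribution and an accept-with-probability-1/count rule as the prefilter; the class name, parameters, and acceptance rule are illustrative stand-ins, not the paper's exact construction.

    import hashlib
    import random

    class DiversifiedSampler:
        """Sketch of one-pass diversified sampling (illustrative only)."""

        def __init__(self, capacity, width=2**16, depth=4, seed=0):
            self.capacity = capacity                   # max items retained
            self.width, self.depth = width, depth      # sketch dimensions
            self.counts = [[0] * width for _ in range(depth)]
            self.sample = []
            self.rng = random.Random(seed)

        def _cells(self, key):
            # One cell per sketch row, from independent-ish hashes.
            for row in range(self.depth):
                h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
                yield row, int.from_bytes(h.digest(), "little") % self.width

        def offer(self, item, group_key):
            # Update the sketch; read the (over)estimated group count.
            est = float("inf")
            for row, col in self._cells(group_key):
                self.counts[row][col] += 1
                est = min(est, self.counts[row][col])
            # Rare groups are kept with high probability; common groups
            # are heavily down-sampled. Assumed rule, not the paper's.
            if self.rng.random() < 1.0 / est:
                if len(self.sample) < self.capacity:
                    self.sample.append(item)
                else:
                    self.sample[self.rng.randrange(self.capacity)] = item

    # Demo on a heavily imbalanced synthetic stream.
    sampler = DiversifiedSampler(capacity=200, seed=1)
    stream = ["common"] * 50_000 + [f"rare-{i}" for i in range(200)]
    random.Random(1).shuffle(stream)
    for x in stream:
        sampler.offer(x, group_key=x)
    print("distinct classes kept:", len(set(sampler.sample)))

On this toy stream a uniform reservoir of size 200 would retain, in expectation, fewer than one of the rare classes, while the prefiltered sample keeps most of them.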
Award ID(s):
2126387
PAR ID:
10378504
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
162
ISSN:
2640-3498
Page Range / eLocation ID:
4202 - 4218
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges, the maximum amount of data available to the algorithms at each step. To obtain an unbiased, low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, the other layers are reservoir samples of sub-structures of the desired motif. By storing the more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes Tiered Sampling to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort by exploiting specific properties of the pattern of interest. We present a complete theoretical analysis and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single-edge-sample approach for counting sparse motifs in large-scale graphs.
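    A minimal Python sketch of the tiered structure, assuming a standard edge reservoir as the base tier and a triangle reservoir as the second tier; the inverse-probability weighting that makes the paper's estimator unbiased (and its handling of multiple detections per clique) is omitted, so the counter below only tallies raw detections.

        import random

        class TieredSampler:
            """Two-tier reservoir over an edge stream (simplified sketch)."""

            def __init__(self, edge_budget, triangle_budget, seed=0):
                self.edges, self.edge_budget = [], edge_budget         # tier 0
                self.triangles, self.tri_budget = [], triangle_budget  # tier 1
                self.edges_seen = self.tris_seen = self.raw_detections = 0
                self.rng = random.Random(seed)

            def _nbrs(self, v):
                return {b if a == v else a for a, b in self.edges if v in (a, b)}

            def process(self, u, v):
                self.edges_seen += 1
                # A stored triangle through one endpoint, plus sampled edges
                # from the other endpoint to its two remaining vertices,
                # completes a 4-clique with the incoming edge.
                for tri in self.triangles:
                    for x, y in ((u, v), (v, u)):
                        if x in tri and y not in tri:
                            rest = [w for w in tri if w != x]
                            if all(tuple(sorted((y, w))) in self.edges for w in rest):
                                self.raw_detections += 1
                # Offer every triangle the new edge closes among sampled
                # edges to the triangle tier (reservoir update).
                for w in self._nbrs(u) & self._nbrs(v):
                    self.tris_seen += 1
                    tri = tuple(sorted((u, v, w)))
                    if len(self.triangles) < self.tri_budget:
                        self.triangles.append(tri)
                    elif self.rng.random() < self.tri_budget / self.tris_seen:
                        self.triangles[self.rng.randrange(self.tri_budget)] = tri
                # Standard reservoir update for the edge tier.
                e = tuple(sorted((u, v)))
                if len(self.edges) < self.edge_budget:
                    self.edges.append(e)
                elif self.rng.random() < self.edge_budget / self.edges_seen:
                    self.edges[self.rng.randrange(self.edge_budget)] = e

        # Feeding a 5-clique edge by edge yields several raw detections.
        ts = TieredSampler(edge_budget=100, triangle_budget=50)
        for a in range(5):
            for b in range(a + 1, 5):
                ts.process(a, b)
        print("raw 4-clique detections:", ts.raw_detections)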
  2. This work addresses inverse linear optimization, where the goal is to infer the unknown cost vector of a linear program. Specifically, we consider the data-driven setting in which the available data are noisy observations of optimal solutions that correspond to different instances of the linear program. We introduce a new formulation of the problem that, compared with other existing methods, allows the recovery of a less restrictive and generally more appropriate admissible set of cost estimates. It can be shown that this inverse optimization problem yields a finite number of solutions, and we develop an exact two-phase algorithm to determine all such solutions. Moreover, we propose an efficient decomposition algorithm to solve large instances of the problem. The algorithm extends naturally to an online learning environment where it can be used to provide quick updates of the cost estimate as new data become available over time. For the online setting, we further develop an effective adaptive sampling strategy that guides the selection of the next samples. The efficacy of the proposed methods is demonstrated in computational experiments involving two applications: customer preference learning and cost estimation for production planning. The results show significant reductions in computation and sampling efforts. Summary of Contribution: Using optimization to facilitate decision making is at the core of operations research. This work addresses the inverse problem (i.e., inverse optimization), which aims to infer unknown optimization models from decision data. It is, conceptually and computationally, a challenging problem. Here, we propose a new formulation of the data-driven inverse linear optimization problem and develop an efficient decomposition algorithm that can solve problem instances up to a scale that has not been addressed previously. The computational performance is further improved by an online adaptive sampling strategy that substantially reduces the number of required data points. 
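    As a rough illustration of the setting (a brute-force grid search, not the paper's two-phase or decomposition algorithm), the Python below scores candidate cost vectors by the total suboptimality they assign to noisy observed optima; all problem data, and the restriction to two variables, are made up for the demo.

        import numpy as np
        from scipy.optimize import linprog

        rng = np.random.default_rng(0)
        c_true = np.array([1.0, 2.0])          # unknown cost (to be inferred)
        g = rng.uniform(0.5, 4.0, size=20)     # instance k: x1 + g[k]*x2 >= b[k]
        b = rng.uniform(1.0, 5.0, size=20)

        def solve(c, gk, bk):
            # min c.x  s.t.  x1 + gk*x2 >= bk,  x >= 0
            res = linprog(c, A_ub=[[-1.0, -gk]], b_ub=[-bk], bounds=[(0, None)] * 2)
            return res.x, res.fun

        # Noisy observations of optimal solutions across instances.
        x_obs = [solve(c_true, gk, bk)[0] + rng.normal(0, 0.01, 2)
                 for gk, bk in zip(g, b)]

        def loss(c):
            # Total suboptimality the candidate cost assigns to the data;
            # near zero only when c rationalizes the observed decisions.
            return sum(c @ x - solve(c, gk, bk)[1]
                       for x, gk, bk in zip(x_obs, g, b))

        # Grid search over the normalized cost simplex (direction only,
        # since scale is not identifiable from optimal solutions).
        thetas = np.linspace(0.05, 0.95, 91)
        best = min(thetas, key=lambda t: loss(np.array([t, 1.0 - t])))
        print("estimated cost direction:", round(best, 2), round(1 - best, 2))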
  3. Physical sampling of water for off-site analysis is necessary for many applications, such as monitoring the quality of drinking water in reservoirs, understanding marine ecosystems, and measuring contamination levels in fresh-water systems. This paper focuses on algorithms for efficient measurement and sampling using a multi-robot, data-driven water-sampling behavior, where autonomous surface vehicles plan and execute water sampling using chlorophyll density as a cue for plankton-rich water samples. We use two Autonomous Surface Vehicles (ASVs): one equipped with a water-quality sensor and the other with a water-sampling apparatus. The ASV with the sensor acts as an explorer, measuring and building a spatial map of chlorophyll density in the given region of interest. The ASV with the sampling apparatus decides in real time where to sample the water based on the suggestions of the explorer robot. We evaluate the system in the context of measuring chlorophyll distributions, both in simulation based on real geophysical data from MODIS measurements and on real robots in a water reservoir. We demonstrate the effectiveness of the proposed approach in several ways, including the mean error of the interpolated data as a function of distance traveled.
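    A toy Python version of the explorer/sampler division of labor, assuming inverse-distance-weighted interpolation in place of the paper's spatial modeling; the measurement locations, the synthetic bloom, and the top-k waypoint rule are all made up for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        # Explorer ASV: chlorophyll measurements along its track, with a
        # synthetic "bloom" centered near (70, 70).
        meas_xy = rng.uniform(0, 100, size=(40, 2))
        meas_chl = np.exp(-np.sum((meas_xy - 70.0) ** 2, axis=1) / 500.0)

        def idw(query_xy, pts, vals, power=2.0, eps=1e-9):
            # Inverse-distance-weighted interpolation onto query points.
            d = np.linalg.norm(query_xy[:, None, :] - pts[None, :, :], axis=2)
            w = 1.0 / (d ** power + eps)
            return (w @ vals) / w.sum(axis=1)

        # Build the spatial map on a grid, then send the sampler ASV to
        # the k cells with the highest estimated chlorophyll density.
        gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
        grid = np.column_stack([gx.ravel(), gy.ravel()])
        chl_map = idw(grid, meas_xy, meas_chl)
        waypoints = grid[np.argsort(chl_map)[-5:]]
        print("water-sampling waypoints:\n", waypoints.round(1))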
  4. ABSTRACT: Zoonotic pathogens pose a significant risk to human health, with spillover into human populations contributing to chronic disease, sporadic epidemics, and occasional pandemics. Despite the widely recognized burden of zoonotic spillover, our ability to identify which animal populations serve as primary reservoirs for these pathogens remains incomplete. This challenge is compounded when prevalence reaches detectable levels only at specific times of year. In these cases, statistical models designed to predict the timing of peak prevalence could guide field sampling for active infections. Here we develop a general model that leverages routinely collected serosurveillance data to optimize sampling for elusive pathogens. Using simulated data sets, we show that our methodology reliably identifies times when pathogen prevalence is expected to peak. We then apply our method to two putative Ebola virus reservoirs, straw-colored fruit bats (Eidolon helvum) and hammer-headed bats (Hypsignathus monstrosus), to predict when these species should be sampled to maximize the probability of detecting active infections. In addition to guiding future sampling of these species, our method yields predictions for the times of year that are most likely to produce future spillover events. The generality and simplicity of our methodology make it broadly applicable to a wide range of putative reservoir species where seasonal patterns of birth lead to predictable, but potentially short-lived, pulses of pathogen prevalence. AUTHOR SUMMARY: Many deadly pathogens, such as Ebola, Lassa, and Nipah viruses, originate in wildlife and jump to human populations. When this occurs, human health is at risk. At the extreme, this can lead to pandemics such as the West African Ebola epidemic and the COVID-19 pandemic. Despite the widely recognized risk wildlife pathogens pose to humans, identifying host species that serve as primary reservoirs for many pathogens remains challenging. Ebola is a notable example of a pathogen with an unconfirmed wildlife reservoir. A key obstacle to confirming reservoir hosts is sampling animals with active infections. Often, disease prevalence fluctuates seasonally in wildlife populations and only reaches detectable levels at certain times of year. In these cases, statistical models designed to predict the timing of peak prevalence could guide efficient field sampling for active infections. Therefore, we have developed a general model that uses serological data to predict times of year when pathogen prevalence is likely to peak. We demonstrate with simulated data that our method produces reliable predictions, and then apply it to two hypothesized reservoirs for Ebola virus, straw-colored fruit bats and hammer-headed bats. Our method can be broadly applied to a range of potential reservoir species where seasonal patterns of birth lead to predictable pulses of peak pathogen prevalence. Overall, our method can guide future sampling of reservoir populations and can also be used to make predictions for times of year that future outbreaks in human populations are most likely to occur.
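    As a toy stand-in for the model (the paper's approach is mechanistic and driven by serosurveillance data and birth pulses; this is only a first-harmonic fit to fabricated monthly counts), one can locate a seasonal peak as follows:

        import numpy as np

        # Fabricated monthly seroconversion counts with a mid-year peak.
        months = np.arange(12)
        counts = np.array([2, 1, 3, 8, 15, 22, 18, 9, 4, 2, 1, 1], float)

        # Least-squares fit of y = a + b*cos(w t) + c*sin(w t), w = 2*pi/12.
        w = 2 * np.pi / 12
        X = np.column_stack([np.ones(12), np.cos(w * months), np.sin(w * months)])
        a, b, c = np.linalg.lstsq(X, counts, rcond=None)[0]

        # The fitted curve peaks where w t = atan2(c, b).
        t_peak = (np.arctan2(c, b) / w) % 12
        print(f"predicted peak prevalence near month {t_peak:.1f}")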
  5. Graph mining is an important data analysis methodology, but it struggles as the input graph size increases. The scalability and usability challenges posed by such large graphs make it imperative to sample the input graph and reduce its size. The critical challenge in sampling is to identify the appropriate algorithm, so that the resulting analysis does not suffer heavily from the data reduction. Predicting the expected performance degradation for a given graph and sampling algorithm is also useful. In this paper, we present different sampling approaches for graph mining applications such as Frequent Subgraph Mining (FSM) and Community Detection (CD). We explore graph metrics such as PageRank, Triangles, and Diversity for sampling a graph and conclude that, for heterogeneous graphs, Triangles and Diversity perform better than degree-based metrics. We also present two new sampling variations for targeted graph mining applications. We present empirical results showing that knowledge of the target application, along with input graph properties, can be used to select the best sampling algorithm. We also conclude that performance degradation is an abrupt, rather than gradual, phenomenon as the sample size decreases; the empirical results show that the degradation follows a logistic function.
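    One plausible reading of the triangle-based sampling above, sketched in Python with networkx (the keep fraction and the keep-top-scorers rule are illustrative choices, not necessarily the paper's):

        import networkx as nx

        def triangle_sample(G, keep_frac=0.3):
            # Score nodes by the number of triangles they participate in,
            # keep the top fraction, and return the induced subgraph.
            tri = nx.triangles(G)
            k = max(1, int(keep_frac * G.number_of_nodes()))
            keep = sorted(tri, key=tri.get, reverse=True)[:k]
            return G.subgraph(keep).copy()

        G = nx.karate_club_graph()
        H = triangle_sample(G)
        print(H.number_of_nodes(), "nodes and", H.number_of_edges(), "edges kept")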