Title: One-Pass Diversified Sampling with Application to Terabyte-Scale Genomic Sequence Streams
A popular approach to reducing the size of a massive dataset is to apply efficient online sampling to the stream of data as it is read or generated. Online sampling routines are currently restricted to variations of reservoir sampling, where each sample is selected uniformly and independently of other samples. This renders them unsuitable for large-scale applications in computational biology, such as metagenomic community profiling and protein function annotation, which suffer from severe class imbalance. To maintain a representative and diverse sample, we must identify and preferentially select data that are likely to belong to rare classes. We argue that existing schemes for diversity sampling have prohibitive overhead for large-scale problems and high-throughput streams. We propose an efficient sampling routine that uses an online representation of the data distribution as a prefilter to retain elements from rare groups. We apply this method to several genomic data analysis tasks and demonstrate a significant speedup in downstream analysis without sacrificing the quality of the results. Because our algorithm is 2x faster and uses 1000x less memory than coreset-, reservoir-, and sketch-based alternatives, we anticipate that it will become a useful preprocessing step for applications with large-scale streaming data.
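The prefilter idea can be sketched compactly. The Python below is a minimal illustration, assuming a Count-Min Sketch as the online representation of the data distribution and an accept-with-probability-1/count rule as the prefilter; the class name, parameters, and acceptance rule are illustrative stand-ins, not the paper's exact construction.

    import hashlib
    import random

    class DiversifiedSampler:
        """Sketch of one-pass diversified sampling (illustrative only)."""

        def __init__(self, capacity, width=2**16, depth=4, seed=0):
            self.capacity = capacity                   # max items retained
            self.width, self.depth = width, depth      # sketch dimensions
            self.counts = [[0] * width for _ in range(depth)]
            self.sample = []
            self.rng = random.Random(seed)

        def _cells(self, key):
            # One cell per sketch row, from independent-ish hashes.
            for row in range(self.depth):
                h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
                yield row, int.from_bytes(h.digest(), "little") % self.width

        def offer(self, item, group_key):
            # Update the sketch; read the (over)estimated group count.
            est = float("inf")
            for row, col in self._cells(group_key):
                self.counts[row][col] += 1
                est = min(est, self.counts[row][col])
            # Rare groups are kept with high probability; common groups
            # are heavily down-sampled. Assumed rule, not the paper's.
            if self.rng.random() < 1.0 / est:
                if len(self.sample) < self.capacity:
                    self.sample.append(item)
                else:
                    self.sample[self.rng.randrange(self.capacity)] = item

    # Demo on a heavily imbalanced synthetic stream.
    sampler = DiversifiedSampler(capacity=200, seed=1)
    stream = ["common"] * 50_000 + [f"rare-{i}" for i in range(200)]
    random.Random(1).shuffle(stream)
    for x in stream:
        sampler.offer(x, group_key=x)
    print("distinct classes kept:", len(set(sampler.sample)))

On this toy stream a uniform reservoir of size 200 would retain, in expectation, fewer than one of the rare classes, while the prefiltered sample keeps most of them.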
Award ID(s):
2126387
PAR ID:
10378504
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
162
ISSN:
2640-3498
Page Range / eLocation ID:
4202 - 4218
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges, the maximum amount of data available to the algorithms at each step. To obtain an unbiased, low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, the other layers are reservoir samples of sub-structures of the desired motif. By storing the more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes Tiered Sampling to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort by exploiting specific properties of the pattern of interest. We present a complete theoretical analysis and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single-edge-sample approach for counting sparse motifs in large-scale graphs.
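    A minimal Python sketch of the tiered structure, assuming a standard edge reservoir as the base tier and a triangle reservoir as the second tier; the inverse-probability weighting that makes the paper's estimator unbiased (and its handling of multiple detections per clique) is omitted, so the counter below only tallies raw detections.

        import random

        class TieredSampler:
            """Two-tier reservoir over an edge stream (simplified sketch)."""

            def __init__(self, edge_budget, triangle_budget, seed=0):
                self.edges, self.edge_budget = [], edge_budget         # tier 0
                self.triangles, self.tri_budget = [], triangle_budget  # tier 1
                self.edges_seen = self.tris_seen = self.raw_detections = 0
                self.rng = random.Random(seed)

            def _nbrs(self, v):
                return {b if a == v else a for a, b in self.edges if v in (a, b)}

            def process(self, u, v):
                self.edges_seen += 1
                # A stored triangle through one endpoint, plus sampled edges
                # from the other endpoint to its two remaining vertices,
                # completes a 4-clique with the incoming edge.
                for tri in self.triangles:
                    for x, y in ((u, v), (v, u)):
                        if x in tri and y not in tri:
                            rest = [w for w in tri if w != x]
                            if all(tuple(sorted((y, w))) in self.edges for w in rest):
                                self.raw_detections += 1
                # Offer every triangle the new edge closes among sampled
                # edges to the triangle tier (reservoir update).
                for w in self._nbrs(u) & self._nbrs(v):
                    self.tris_seen += 1
                    tri = tuple(sorted((u, v, w)))
                    if len(self.triangles) < self.tri_budget:
                        self.triangles.append(tri)
                    elif self.rng.random() < self.tri_budget / self.tris_seen:
                        self.triangles[self.rng.randrange(self.tri_budget)] = tri
                # Standard reservoir update for the edge tier.
                e = tuple(sorted((u, v)))
                if len(self.edges) < self.edge_budget:
                    self.edges.append(e)
                elif self.rng.random() < self.edge_budget / self.edges_seen:
                    self.edges[self.rng.randrange(self.edge_budget)] = e

        # Feeding a 5-clique edge by edge yields several raw detections.
        ts = TieredSampler(edge_budget=100, triangle_budget=50)
        for a in range(5):
            for b in range(a + 1, 5):
                ts.process(a, b)
        print("raw 4-clique detections:", ts.raw_detections)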
  2. This work addresses inverse linear optimization, where the goal is to infer the unknown cost vector of a linear program. Specifically, we consider the data-driven setting in which the available data are noisy observations of optimal solutions that correspond to different instances of the linear program. We introduce a new formulation of the problem that, compared with other existing methods, allows the recovery of a less restrictive and generally more appropriate admissible set of cost estimates. It can be shown that this inverse optimization problem yields a finite number of solutions, and we develop an exact two-phase algorithm to determine all such solutions. Moreover, we propose an efficient decomposition algorithm to solve large instances of the problem. The algorithm extends naturally to an online learning environment where it can be used to provide quick updates of the cost estimate as new data become available over time. For the online setting, we further develop an effective adaptive sampling strategy that guides the selection of the next samples. The efficacy of the proposed methods is demonstrated in computational experiments involving two applications: customer preference learning and cost estimation for production planning. The results show significant reductions in computation and sampling efforts. Summary of Contribution: Using optimization to facilitate decision making is at the core of operations research. This work addresses the inverse problem (i.e., inverse optimization), which aims to infer unknown optimization models from decision data. It is, conceptually and computationally, a challenging problem. Here, we propose a new formulation of the data-driven inverse linear optimization problem and develop an efficient decomposition algorithm that can solve problem instances up to a scale that has not been addressed previously. The computational performance is further improved by an online adaptive sampling strategy that substantially reduces the number of required data points. 
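    As a rough illustration of the setting (a brute-force grid search, not the paper's two-phase or decomposition algorithm), the Python below scores candidate cost vectors by the total suboptimality they assign to noisy observed optima; all problem data, and the restriction to two variables, are made up for the demo.

        import numpy as np
        from scipy.optimize import linprog

        rng = np.random.default_rng(0)
        c_true = np.array([1.0, 2.0])          # unknown cost (to be inferred)
        g = rng.uniform(0.5, 4.0, size=20)     # instance k: x1 + g[k]*x2 >= b[k]
        b = rng.uniform(1.0, 5.0, size=20)

        def solve(c, gk, bk):
            # min c.x  s.t.  x1 + gk*x2 >= bk,  x >= 0
            res = linprog(c, A_ub=[[-1.0, -gk]], b_ub=[-bk], bounds=[(0, None)] * 2)
            return res.x, res.fun

        # Noisy observations of optimal solutions across instances.
        x_obs = [solve(c_true, gk, bk)[0] + rng.normal(0, 0.01, 2)
                 for gk, bk in zip(g, b)]

        def loss(c):
            # Total suboptimality the candidate cost assigns to the data;
            # near zero only when c rationalizes the observed decisions.
            return sum(c @ x - solve(c, gk, bk)[1]
                       for x, gk, bk in zip(x_obs, g, b))

        # Grid search over the normalized cost simplex (direction only,
        # since scale is not identifiable from optimal solutions).
        thetas = np.linspace(0.05, 0.95, 91)
        best = min(thetas, key=lambda t: loss(np.array([t, 1.0 - t])))
        print("estimated cost direction:", round(best, 2), round(1 - best, 2))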
  3. Physical sampling of water for off-site analysis is necessary for many applications, such as monitoring the quality of drinking water in reservoirs, understanding marine ecosystems, and measuring contamination levels in fresh-water systems. This paper focuses on algorithms for efficient measurement and sampling using a multi-robot, data-driven water-sampling behavior, where autonomous surface vehicles plan and execute water sampling using chlorophyll density as a cue for plankton-rich water samples. We use two Autonomous Surface Vehicles (ASVs): one equipped with a water-quality sensor and the other with a water-sampling apparatus. The ASV with the sensor acts as an explorer, measuring and building a spatial map of chlorophyll density in the given region of interest. The ASV with the sampling apparatus decides in real time where to sample the water based on the suggestions of the explorer robot. We evaluate the system in the context of measuring chlorophyll distributions, both in simulation based on real geophysical data from MODIS measurements and on real robots in a water reservoir. We demonstrate the effectiveness of the proposed approach in several ways, including the mean error of the interpolated data as a function of distance traveled.
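    A toy Python version of the explorer/sampler division of labor, assuming inverse-distance-weighted interpolation in place of the paper's spatial modeling; the measurement locations, the synthetic bloom, and the top-k waypoint rule are all made up for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        # Explorer ASV: chlorophyll measurements along its track, with a
        # synthetic "bloom" centered near (70, 70).
        meas_xy = rng.uniform(0, 100, size=(40, 2))
        meas_chl = np.exp(-np.sum((meas_xy - 70.0) ** 2, axis=1) / 500.0)

        def idw(query_xy, pts, vals, power=2.0, eps=1e-9):
            # Inverse-distance-weighted interpolation onto query points.
            d = np.linalg.norm(query_xy[:, None, :] - pts[None, :, :], axis=2)
            w = 1.0 / (d ** power + eps)
            return (w @ vals) / w.sum(axis=1)

        # Build the spatial map on a grid, then send the sampler ASV to
        # the k cells with the highest estimated chlorophyll density.
        gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
        grid = np.column_stack([gx.ravel(), gy.ravel()])
        chl_map = idw(grid, meas_xy, meas_chl)
        waypoints = grid[np.argsort(chl_map)[-5:]]
        print("water-sampling waypoints:\n", waypoints.round(1))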
  4. ABSTRACT: Zoonotic pathogens pose a significant risk to human health, with spillover into human populations contributing to chronic disease, sporadic epidemics, and occasional pandemics. Despite the widely recognized burden of zoonotic spillover, our ability to identify which animal populations serve as primary reservoirs for these pathogens remains incomplete. This challenge is compounded when prevalence reaches detectable levels only at specific times of year. In these cases, statistical models designed to predict the timing of peak prevalence could guide field sampling for active infections. Here we develop a general model that leverages routinely collected serosurveillance data to optimize sampling for elusive pathogens. Using simulated data sets, we show that our methodology reliably identifies times when pathogen prevalence is expected to peak. We then apply our method to two putative Ebola virus reservoirs, straw-colored fruit bats (Eidolon helvum) and hammer-headed bats (Hypsignathus monstrosus), to predict when these species should be sampled to maximize the probability of detecting active infections. In addition to guiding future sampling of these species, our method yields predictions for the times of year that are most likely to produce future spillover events. The generality and simplicity of our methodology make it broadly applicable to a wide range of putative reservoir species where seasonal patterns of birth lead to predictable, but potentially short-lived, pulses of pathogen prevalence. AUTHOR SUMMARY: Many deadly pathogens, such as Ebola, Lassa, and Nipah viruses, originate in wildlife and jump to human populations. When this occurs, human health is at risk. At the extreme, this can lead to pandemics such as the West African Ebola epidemic and the COVID-19 pandemic. Despite the widely recognized risk wildlife pathogens pose to humans, identifying host species that serve as primary reservoirs for many pathogens remains challenging. Ebola is a notable example of a pathogen with an unconfirmed wildlife reservoir. A key obstacle to confirming reservoir hosts is sampling animals with active infections. Often, disease prevalence fluctuates seasonally in wildlife populations and only reaches detectable levels at certain times of year. In these cases, statistical models designed to predict the timing of peak prevalence could guide efficient field sampling for active infections. Therefore, we have developed a general model that uses serological data to predict times of year when pathogen prevalence is likely to peak. We demonstrate with simulated data that our method produces reliable predictions, and then apply it to two hypothesized reservoirs for Ebola virus, straw-colored fruit bats and hammer-headed bats. Our method can be broadly applied to a range of potential reservoir species where seasonal patterns of birth lead to predictable pulses of peak pathogen prevalence. Overall, our method can guide future sampling of reservoir populations and can also be used to make predictions for times of year that future outbreaks in human populations are most likely to occur.
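    As a toy stand-in for the model (the paper's approach is mechanistic and driven by serosurveillance data and birth pulses; this is only a first-harmonic fit to fabricated monthly counts), one can locate a seasonal peak as follows:

        import numpy as np

        # Fabricated monthly seroconversion counts with a mid-year peak.
        months = np.arange(12)
        counts = np.array([2, 1, 3, 8, 15, 22, 18, 9, 4, 2, 1, 1], float)

        # Least-squares fit of y = a + b*cos(w t) + c*sin(w t), w = 2*pi/12.
        w = 2 * np.pi / 12
        X = np.column_stack([np.ones(12), np.cos(w * months), np.sin(w * months)])
        a, b, c = np.linalg.lstsq(X, counts, rcond=None)[0]

        # The fitted curve peaks where w t = atan2(c, b).
        t_peak = (np.arctan2(c, b) / w) % 12
        print(f"predicted peak prevalence near month {t_peak:.1f}")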
  5. Graph mining is an important data analysis methodology, but it struggles as the input graph size increases. The scalability and usability challenges posed by such large graphs make it imperative to sample the input graph and reduce its size. The critical challenge in sampling is to identify the appropriate algorithm, so that the resulting analysis does not suffer heavily from the data reduction. Predicting the expected performance degradation for a given graph and sampling algorithm is also useful. In this paper, we present different sampling approaches for graph mining applications such as Frequent Subgraph Mining (FSM) and Community Detection (CD). We explore graph metrics such as PageRank, Triangles, and Diversity for sampling a graph and conclude that, for heterogeneous graphs, Triangles and Diversity perform better than degree-based metrics. We also present two new sampling variations for targeted graph mining applications. We present empirical results showing that knowledge of the target application, along with input graph properties, can be used to select the best sampling algorithm. We also conclude that performance degradation is an abrupt, rather than gradual, phenomenon as the sample size decreases; the empirical results show that the degradation follows a logistic function.
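    One plausible reading of the triangle-based sampling above, sketched in Python with networkx (the keep fraction and the keep-top-scorers rule are illustrative choices, not necessarily the paper's):

        import networkx as nx

        def triangle_sample(G, keep_frac=0.3):
            # Score nodes by the number of triangles they participate in,
            # keep the top fraction, and return the induced subgraph.
            tri = nx.triangles(G)
            k = max(1, int(keep_frac * G.number_of_nodes()))
            keep = sorted(tri, key=tri.get, reverse=True)[:k]
            return G.subgraph(keep).copy()

        G = nx.karate_club_graph()
        H = triangle_sample(G)
        print(H.number_of_nodes(), "nodes and", H.number_of_edges(), "edges kept")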