Title: Combining Spatial and Temporal Properties for Improvements in Data Reduction
Due to I/O bandwidth limitations, intelligent in situ data reduction methods are needed to enable post-hoc workflows. Current state-of-the-art sampling methods save data points they deem spatially or temporally important. Analyzing the data values at each time-step shows that two consecutive steps may be very similar; this research follows the notion that if neighboring time-steps are very similar, samples from both are unnecessary, which frees storage for more useful samples. Here, we present an investigation of combining spatial and temporal sampling to drastically reduce data size without losing valuable information. We demonstrate that, by reusing samples, our method reduces the overall data size while achieving higher post-reconstruction quality than other reduction methods.
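The core idea (reuse samples from the previous time-step when consecutive steps are similar, and spend the freed budget on spatially important points) can be sketched as follows. This is a minimal illustration under assumed choices; the similarity metric, importance measure, and names such as temporal_spatial_sample and similarity_threshold are placeholders, not the paper's exact algorithm.

```python
import numpy as np

def temporal_spatial_sample(prev_step, curr_step, prev_samples, budget,
                            similarity_threshold=0.95):
    """Reuse the previous time-step's samples when two consecutive steps are
    similar, and spend the freed budget on new spatially important points.
    Similarity metric, importance measure, and threshold are illustrative."""
    # Similarity between consecutive time-steps: correlation of the raw values.
    similarity = np.corrcoef(prev_step.ravel(), curr_step.ravel())[0, 1]

    if prev_samples is not None and similarity >= similarity_threshold:
        # Steps are nearly identical: keep the previous samples and use only
        # the remaining budget for new points.
        reused = np.asarray(prev_samples, dtype=int)
    else:
        reused = np.empty(0, dtype=int)
    remaining = max(budget - reused.size, 0)

    # Spatial importance (stand-in metric): deviation from the field mean.
    importance = np.abs(curr_step.ravel() - curr_step.mean())
    order = np.argsort(importance)[::-1]        # most important first
    order = order[~np.isin(order, reused)]      # skip already-reused indices
    new_samples = order[:remaining]

    return np.concatenate([reused, new_samples])

# Example on synthetic 2D fields; returns flat indices of the retained points.
rng = np.random.default_rng(0)
step_a = rng.random((64, 64))
step_b = step_a + 0.001 * rng.random((64, 64))  # nearly unchanged time-step
kept = temporal_spatial_sample(step_a, step_b, prev_samples=np.arange(100),
                               budget=400)
```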
Award ID(s):
1910197
PAR ID:
10294531
Author(s) / Creator(s):
Date Published:
Journal Name:
2020 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
2654 to 2663
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sampling-based methods promise scalability improvements when paired with stochastic gradient descent in training Graph Convolutional Networks (GCNs). While effective in alleviating the neighborhood explosion, these methods incur computational overheads in preprocessing and loading new samples on heterogeneous systems, due to bandwidth and memory bottlenecks, which significantly degrade sampling performance. By decoupling the frequency of sampling from the sampling strategy, we propose LazyGCN, a general yet effective framework that can be integrated with any sampling strategy to substantially improve training time. The basic idea behind LazyGCN is to perform sampling periodically and effectively recycle the sampled nodes to mitigate data preparation overhead. We theoretically analyze the proposed algorithm and show that, under a mild condition on the recycling size, by reducing the variance of inner layers we obtain the same convergence rate as the underlying sampling method. We also give corroborating empirical evidence on large real-world graphs, demonstrating that the proposed scheme can significantly reduce the number of sampling steps and yield superior speedup without compromising accuracy.
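A minimal sketch of the recycling schedule described above: the underlying sampler runs only once every recycle_period iterations, and the iterations in between redraw mini-batches from the cached mega-batch. The names sample_fn and sgd_step, and the batch sizing, are assumptions for illustration, not the authors' implementation.

```python
import random

def lazy_gcn_training(graph, train_nodes, sample_fn, sgd_step,
                      num_iters=1000, recycle_period=10, batch_size=512):
    """Run the expensive sampler only every `recycle_period` iterations and
    recycle the cached mega-batch in between. `sample_fn` and `sgd_step` are
    placeholders for any sampling strategy and any GCN training step."""
    mega_batch = []
    for it in range(num_iters):
        if it % recycle_period == 0:
            # Expensive: preprocess and load a fresh mega-batch of sampled
            # nodes, sized to cover the next recycle_period mini-batches.
            mega_batch = list(sample_fn(graph, train_nodes,
                                        num_nodes=batch_size * recycle_period))
        # Cheap: draw this iteration's mini-batch from the cached mega-batch,
        # avoiding the per-iteration sampling and data-movement overhead.
        mini_batch = random.sample(mega_batch, k=min(batch_size, len(mega_batch)))
        sgd_step(mini_batch)
```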
  2. This is a descriptive, tabular dataset of publications related to microbial or genomic research conducted within PIE. Accession numbers for genetic sequences generated from PIE samples are provided where available, followed by a very brief description of the analysis type and study objectives. Sampling locations within PIE, sampling dates, and habitat type (sea water, fresh water, sediment, marsh) are also given. Environmental data are included in some publications and are listed here (if brief) or their availability is described. Links to sequence archives are given in Methods.
  3. The calibration of the wake effect in wind turbines is computationally expensive and carries high risk due to noise in the data. Wake represents the energy loss in downstream turbines, and characterizing it is essential to designing wind farm layouts and controlling turbines for maximum power generation. With big data, calibrating the wake parameters is a derivative-free optimization that can be computationally expensive. But with stochastic optimization combined with variance reduction, we can reach robust solutions by harnessing the uncertainty through two sampling mechanisms: the sample size and the sample choices. We address the former by generating a varying number of samples and the latter by using variance-reduced sampling methods.
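The two variance-reduction levers can be illustrated with a small derivative-free search sketch: the sample size grows over iterations, and incumbent and candidate parameters are compared on the same batch (common random numbers). The function loss_fn, the growth schedule, and the random-walk proposal are illustrative assumptions, not the paper's calibration procedure.

```python
import numpy as np

def calibrate_wake(loss_fn, theta0, data, num_iters=50, n0=32, growth=1.2,
                   step_size=0.05, seed=0):
    """Derivative-free stochastic search with two variance-reduction levers:
    a growing sample size and common random numbers (incumbent and candidate
    evaluated on the same batch). `loss_fn(theta, batch)` is a placeholder for
    the wake-model misfit on a batch of turbine measurements."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = n0
    for _ in range(num_iters):
        # Lever 1: larger samples as iterations progress, so later (more
        # decisive) comparisons are made with less noise.
        n = int(n * growth)
        idx = rng.choice(len(data), size=min(n, len(data)), replace=False)
        batch = data[idx]
        # Lever 2: compare candidate and incumbent on the SAME batch so that
        # shared measurement noise largely cancels out.
        candidate = theta + step_size * rng.standard_normal(theta.shape)
        if loss_fn(candidate, batch) < loss_fn(theta, batch):
            theta = candidate
    return theta
```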
  4. Synopsis: Understanding recent population trends is critical to quantifying species vulnerability and implementing effective management strategies. To evaluate the accuracy of genomic methods for quantifying recent declines (beginning <120 generations ago), we simulated genomic data using forward-time methods (SLiM) coupled with coalescent simulations (msprime) under a number of demographic scenarios. We evaluated both site frequency spectrum (SFS)-based methods (momi2, Stairway Plot) and methods that employ linkage disequilibrium information (NeEstimator, GONE) with a range of sampling schemes (contemporary-only samples, sampling two time points, and serial sampling) and data types (RAD-like data and whole-genome sequencing). GONE and momi2 performed best overall, with >80% power to detect severe declines with large sample sizes. Two-sample and serial sampling schemes could accurately reconstruct changes in population size, and serial sampling was particularly valuable for making accurate inferences when genotyping errors or minor allele frequency cutoffs distort the SFS or under model mis-specification. However, sampling only contemporary individuals provided reliable inferences about contemporary size and size change using either site frequency or linkage-based methods, especially when large sample sizes or whole genomes from contemporary populations were available. These findings provide a guide for researchers designing genomics studies to evaluate recent demographic declines.
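For context, a coalescent simulation of a recent decline of the kind evaluated in that study might be set up with msprime roughly as follows; the population sizes, decline time, sampling times, and mutation/recombination rates are placeholder values, not those used in the study.

```python
import msprime

# Illustrative sketch (not the study's pipeline): a population that declined
# from 10,000 to 500 diploids 40 generations ago, sampled at two time points.
dem = msprime.Demography()
dem.add_population(name="pop", initial_size=500)            # present-day size
dem.add_population_parameters_change(time=40, population="pop",
                                     initial_size=10_000)   # pre-decline size

ts = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(30, population="pop", time=0),    # contemporary sample
        msprime.SampleSet(30, population="pop", time=50),   # historical sample
    ],
    demography=dem,
    sequence_length=1_000_000,
    recombination_rate=1e-8,
    random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

# The resulting site frequency spectrum (or an exported VCF) is the kind of
# input that SFS- or LD-based inference methods would consume.
print(ts.allele_frequency_spectrum(polarised=True))
```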
  5. In this paper we present a reconstruction technique for the reduction of unsteady flow data based on neural representations of time-varying vector fields. Our approach is motivated by the large amount of data typically generated in numerical simulations, and in turn the types of data that domain scientists can generate in situ that are compact, yet useful, for post hoc analysis. One type of data commonly acquired during simulation is samples of the flow map, where a single sample is the result of integrating the underlying vector field for a specified time duration. In our work, we treat a collection of flow map samples for a single dataset as a meaningful, compact, and yet incomplete, representation of unsteady flow, and our central objective is to find a representation that enables us to best recover arbitrary flow map samples. To this end, we introduce a technique for learning implicit neural representations of time-varying vector fields that are specifically optimized to reproduce flow map samples sparsely covering the spatiotemporal domain of the data. We show that, despite aggressive data reduction, our optimization problem (learning a function-space neural network to reproduce flow map samples under a fixed integration scheme) leads to representations that demonstrate strong generalization, both in the field itself and in using the field to approximate the flow map. Through quantitative and qualitative analysis across different datasets, we show that our approach improves on a variety of data reduction methods across a variety of measures, including improved vector fields, flow maps, and features derived from the flow map.
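A minimal sketch of the approach as described: an implicit network v(x, t) is trained so that integrating it with a fixed RK4 scheme reproduces stored flow map samples. The architecture, integrator step count, and data layout are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class VectorFieldINR(nn.Module):
    """Implicit neural representation v(x, t) of an unsteady 2D vector field.
    The architecture is an assumption for illustration."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x, t):
        # x: [batch, 2] positions, t: [batch, 1] times -> [batch, 2] velocities
        return self.net(torch.cat([x, t], dim=-1))

def flow_map(model, x0, t0, t1, steps=8):
    """Approximate the flow map by integrating the learned field with a fixed
    RK4 scheme, mirroring the fixed-integration training objective."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        k1 = model(x, t)
        k2 = model(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = model(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = model(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return x

def fit_to_flow_map_samples(model, x0, t0, t1, x1, epochs=200, lr=1e-4):
    """Optimize the representation so that integrating it reproduces sparse
    flow map samples (x0, t0) -> (x1, t1) saved in situ (assumed data layout)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = torch.mean((flow_map(model, x0, t0, t1) - x1) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```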