Title: Combining Spatial and Temporal Properties for Improvements in Data Reduction
Due to I/O bandwidth limitations, intelligent in situ data reduction methods are needed to enable post-hoc workflows. Current state-of-the-art sampling methods save data points they deem spatially or temporally important. Analyzing the data values at each time-step can reveal that two consecutive steps are very similar. This research follows the notion that if neighboring time-steps are very similar, saving samples from both is unnecessary, which frees storage for more useful samples. Here, we present an investigation of combining spatial and temporal sampling to drastically reduce data size without losing valuable information. We demonstrate that, by reusing samples, our approach reduces the overall data size while achieving higher post-reconstruction quality than other reduction methods.
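The sample-reuse idea described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical Python illustration, not the authors' implementation: the gradient-based importance score, the similarity threshold, and the per-step sample budget are placeholder assumptions.

    import numpy as np

    def spatial_importance(values):
        # Placeholder spatial metric: gradient magnitude at each grid point.
        gx, gy = np.gradient(values)
        return np.sqrt(gx**2 + gy**2)

    def sample_timestep(values, prev_values, budget, similarity_threshold=0.02):
        # If this step is nearly identical to the previous one, reuse the
        # previous step's samples (signalled by returning None) instead of
        # spending new storage on it.
        if prev_values is not None:
            if np.mean(np.abs(values - prev_values)) < similarity_threshold:
                return None
        # Otherwise keep the `budget` spatially most important grid points.
        scores = spatial_importance(values).ravel()
        return np.argsort(scores)[-budget:]

    # Toy demonstration: two nearly identical steps followed by a changed one.
    rng = np.random.default_rng(0)
    base = rng.random((64, 64))
    steps = [base, base + 1e-4 * rng.random((64, 64)), rng.random((64, 64))]

    prev, kinds = None, []
    for field in steps:
        idx = sample_timestep(field, prev, budget=256)
        kinds.append("reused" if idx is None else "new")
        prev = field
    print(kinds)  # ['new', 'reused', 'new']

The storage freed by reused steps could then be spent on time-steps that change rapidly, matching the abstract's point that skipping similar steps leaves room for more useful samples.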
Award ID(s):
1910197
PAR ID:
10294531
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
2654 to 2663
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sampling-based methods promise scalability improvements when paired with stochastic gradient descent in training Graph Convolutional Networks (GCNs). While effective in alleviating the neighborhood explosion, these methods incur computational overheads in preprocessing and loading new samples on heterogeneous systems, due to bandwidth and memory bottlenecks, which significantly degrade sampling performance. By decoupling the frequency of sampling from the sampling strategy, we propose LazyGCN, a general yet effective framework that can be integrated with any sampling strategy to substantially improve the training time. The basic idea behind LazyGCN is to perform sampling periodically and effectively recycle the sampled nodes to mitigate data preparation overhead (a minimal sketch of this recycling pattern appears after this list). We theoretically analyze the proposed algorithm and show that, under a mild condition on the recycling size, by reducing the variance of inner layers we obtain the same convergence rate as the underlying sampling method. We also give corroborating empirical evidence on large real-world graphs, demonstrating that the proposed scheme can significantly reduce the number of sampling steps and yield superior speedup without compromising accuracy.
  2. This is a descriptive, tabular dataset of publications related to microbial or genomic research conducted within PIE. Accession numbers for genetic sequences generated from PIE samples are provided where available, followed by a very brief description of the analysis type and study objectives. Sampling locations within PIE, sampling dates, and habitat type (sea water, fresh water, sediment, marsh) are also given. Environmental data are included in some publications and are listed here (if brief), or their availability is described. Links to sequence archives are given in Methods.
  3. Recent approaches have shown promise in distilling diffusion models into efficient one-step generators. Among them, Distribution Matching Distillation (DMD) produces one-step generators that match their teacher in distribution, without enforcing a one-to-one correspondence with the sampling trajectories of their teachers. However, to ensure stable training, DMD requires an additional regression loss computed using a large set of noise-image pairs generated by the teacher with many steps of a deterministic sampler. This is costly for large-scale text-to-image synthesis and limits the student's quality, tying it too closely to the teacher's original sampling paths. We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. We show that the resulting instability is due to the fake critic not estimating the distribution of generated samples accurately, and we propose a two time-scale update rule as a remedy. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images. This lets us train the student model on real data, mitigating the imperfect real score estimation from the teacher model and enhancing quality. Lastly, we modify the training procedure to enable multi-step sampling. We identify and address the training-inference input mismatch problem in this setting by simulating inference-time generator samples during training. Taken together, our improvements set new benchmarks in one-step image generation, with FID scores of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the original teacher despite a 500X reduction in inference cost. Further, we show our approach can generate megapixel images by distilling SDXL, demonstrating exceptional visual quality among few-step methods.
  4. The calibration of the wake effect in wind turbines is computationally expensive and risky due to noise in the data. Wake represents the energy loss in downstream turbines, and characterizing it is essential for designing wind farm layouts and controlling turbines for maximum power generation. With big data, calibrating the wake parameters is a derivative-free optimization that can be computationally expensive. But with stochastic optimization combined with variance reduction, we can reach robust solutions by harnessing the uncertainty through two sampling mechanisms: the sample size and the sample choices. We control the former by generating a varying number of samples and the latter by using variance-reduced sampling methods (a minimal sketch of these two levers appears after this list).
  5. Understanding recent population trends is critical to quantifying species vulnerability and implementing effective management strategies. To evaluate the accuracy of genomic methods for quantifying recent declines (beginning <120 generations ago), we simulated genomic data using forward-time methods (SLiM) coupled with coalescent simulations (msprime) under a number of demographic scenarios. We evaluated both site frequency spectrum (SFS)-based methods (momi2, Stairway Plot) and methods that employ linkage disequilibrium information (NeEstimator, GONE) with a range of sampling schemes (contemporary-only samples, sampling two time points, and serial sampling) and data types (RAD-like data and whole-genome sequencing). GONE and momi2 performed best overall, with >80% power to detect severe declines with large sample sizes. Two-sample and serial sampling schemes could accurately reconstruct changes in population size, and serial sampling was particularly valuable for making accurate inferences when genotyping errors or minor allele frequency cutoffs distort the SFS or under model mis-specification. However, sampling only contemporary individuals provided reliable inferences about contemporary size and size change using either site frequency or linkage-based methods, especially when large sample sizes or whole genomes from contemporary populations were available. These findings provide a guide for researchers designing genomics studies to evaluate recent demographic declines.
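As referenced in item 1, the periodic-sampling-and-recycling pattern behind LazyGCN can be sketched generically. This is a hedged illustration only: sample_fn and step_fn are hypothetical stand-ins for a real graph sampler and training step, and the recycling size is arbitrary.

    import random

    def train_with_recycling(num_iterations, recycling_size, sample_fn, step_fn):
        # Draw a fresh sample only every `recycling_size` iterations and
        # recycle it in between, amortizing the sampling/loading cost.
        batch = None
        for it in range(num_iterations):
            if it % recycling_size == 0:
                batch = sample_fn()   # expensive: sample and load a mini-batch
            step_fn(batch)            # cheap: gradient step on the cached sample

    # Toy usage with stand-in functions.
    def toy_sampler():
        return random.sample(range(1000), 32)   # pretend node ids

    def toy_step(batch):
        pass   # a real step would compute a loss on `batch` and update weights

    train_with_recycling(num_iterations=100, recycling_size=10,
                         sample_fn=toy_sampler, step_fn=toy_step)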
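The two sampling levers mentioned in item 4, growing the sample size and controlling the sample choices, can also be illustrated with a small, hypothetical sketch. The wake model, the noise level, and the use of common random numbers as the variance-reduction device below are assumptions for illustration, not the study's actual setup.

    import numpy as np

    rng = np.random.default_rng(1)

    def calibration_error(wake_param, noise):
        # Hypothetical stand-in for a wake model compared against noisy data:
        # squared error between a modelled and an "observed" power deficit.
        observed = 0.3 + noise
        modelled = 0.5 * wake_param
        return (modelled - observed) ** 2

    def estimate_objective(wake_param, n_samples, noise_pool):
        # Sample choice: reuse the same noise draws (common random numbers)
        # for every candidate, so comparisons between candidates are less noisy.
        return np.mean(calibration_error(wake_param, noise_pool[:n_samples]))

    # Derivative-free search with a growing sample size (the other lever).
    noise_pool = 0.05 * rng.standard_normal(10_000)
    candidates = np.linspace(0.0, 2.0, 41)
    best = None
    for n_samples in (100, 1_000, 10_000):
        errors = [estimate_objective(c, n_samples, noise_pool) for c in candidates]
        best = candidates[int(np.argmin(errors))]
    print("calibrated wake parameter:", best)   # close to 0.6 for this toy setup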