

Title: Recursive nearest neighbor co‐kriging models for big multi‐fidelity spatial data sets
Abstract

Big datasets are gathered daily from different remote sensing platforms. Recently, statistical co‐kriging models, aided by scalable computational techniques, have been able to combine such datasets by using spatially varying bias corrections. Bayesian inference for these models is usually carried out via Markov chain Monte Carlo (MCMC) methods, which often exhibit (sometimes prohibitively) slow mixing and convergence because they require simulating high‐dimensional random effect vectors from their posteriors given large datasets. To enable fast inference in big spatial data problems, we propose the recursive nearest neighbor co‐kriging (RNNC) model. Based on this model, we develop two computationally efficient inferential procedures: (a) the collapsed RNNC, which reduces the posterior sampling space by integrating out the latent processes, and (b) the conjugate RNNC, an MCMC‐free inference procedure that significantly reduces computational time without sacrificing prediction accuracy. An important highlight of the conjugate RNNC is that it enables fast inference in massive multi‐fidelity datasets by avoiding expensive integration algorithms. The computational efficiency and good predictive performance of our proposed algorithms are demonstrated on benchmark examples and on the analysis of High‐resolution Infrared Radiation Sounder data gathered from two NOAA polar‐orbiting satellites, for which we reduced the computational time from multiple hours to just a few minutes.
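Nearest neighbor co-kriging models of this kind build on nearest-neighbor (Vecchia-type) likelihood approximations, in which each observation conditions only on a small set of neighboring points rather than on the full dataset. The sketch below is a minimal single-fidelity illustration of that building block only; the exponential covariance, the simple data ordering, and all parameter names are assumptions for illustration, not the authors' RNNC specification.

import numpy as np
from scipy.spatial import cKDTree

def exp_cov(d, sigma2=1.0, phi=0.1):
    """Exponential covariance, a stand-in for whatever kernel a given model uses."""
    return sigma2 * np.exp(-d / phi)

def nn_gp_loglik(y, coords, m=10, sigma2=1.0, phi=0.1, tau2=0.01):
    """Approximate GP log-likelihood: observation i conditions only on (at most) m
    previously ordered nearest neighbors instead of all earlier observations."""
    n = len(y)
    loglik = 0.0
    for i in range(n):
        if i == 0:
            mean_i, var_i = 0.0, sigma2 + tau2
        else:
            # nearest neighbors among points earlier in the ordering
            tree = cKDTree(coords[:i])
            k = min(m, i)
            _, idx = tree.query(coords[i], k=k)
            idx = np.atleast_1d(idx)
            d_nn = np.linalg.norm(coords[idx][:, None] - coords[idx][None, :], axis=-1)
            C_nn = exp_cov(d_nn, sigma2, phi) + tau2 * np.eye(k)
            c_in = exp_cov(np.linalg.norm(coords[idx] - coords[i], axis=1), sigma2, phi)
            w = np.linalg.solve(C_nn, c_in)
            mean_i = w @ y[idx]
            var_i = sigma2 + tau2 - c_in @ w
        loglik += -0.5 * (np.log(2 * np.pi * var_i) + (y[i] - mean_i) ** 2 / var_i)
    return loglik

In practice the ordering and neighbor sets are precomputed once rather than rebuilt inside the loop; the point of the sketch is only that each conditional term involves a small k-by-k solve, which is what makes these approximations scale to large n.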

 
Award ID(s):
2053668
PAR ID:
10506727
Author(s) / Creator(s):
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Environmetrics
Volume:
35
Issue:
4
ISSN:
1180-4009
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While society continues to be transformed by insights from processing big data, the increasing rate at which this data is gathered is making processing in private clusters obsolete. A vast amount of big data already resides in the cloud, and cloud infrastructures provide a scalable platform for both the computational and I/O needs of big data processing applications. Virtualization is used as a base technology in the cloud; however, existing virtual machine placement techniques do not consider data replication and I/O bottlenecks of the infrastructure, yielding sub-optimal data retrieval times. This paper targets efficient big data processing in the cloud and proposes novel virtual machine placement techniques, which minimize data retrieval time by considering data replication, storage performance, and network bandwidth. We first present an integer-programming based optimal virtual machine placement algorithm and then propose two low cost data- and energy-aware virtual machine placement heuristics. Our proposed heuristics are compared with optimal and existing algorithms through extensive evaluation. Experimental results provide strong indications for the superiority of our proposed solutions in both performance and energy, and clearly outline the importance of big data aware virtual machine placement for efficient processing of large datasets in the cloud. 
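    As a toy illustration of the data-aware placement idea described in this abstract, the sketch below scores candidate hosts by an estimated data retrieval time that accounts for replica locations, disk read rates, and network bandwidth, and greedily picks the cheapest host. The scoring model and every name are illustrative assumptions, not the paper's algorithms.

from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    disk_read_mbps: float   # local storage read throughput
    net_mbps: float         # network bandwidth to the rest of the cluster
    local_replicas: set = field(default_factory=set)

def estimated_retrieval_time(host, datasets, sizes_mb, replica_map):
    """Total estimated time to pull every dataset a VM needs onto this host."""
    total = 0.0
    for ds in datasets:
        if ds in host.local_replicas:
            total += sizes_mb[ds] / host.disk_read_mbps          # local read
        else:
            # remote read: bottlenecked by the slower of the replica host's disk
            # and this host's network link; pick the best available replica
            best = max(replica_map[ds],
                       key=lambda h: min(h.disk_read_mbps, host.net_mbps))
            total += sizes_mb[ds] / min(best.disk_read_mbps, host.net_mbps)
    return total

def place_vm(datasets, hosts, sizes_mb, replica_map):
    """Greedy data-aware placement: choose the host with the smallest estimate."""
    return min(hosts, key=lambda h: estimated_retrieval_time(h, datasets, sizes_mb, replica_map))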
  2. Many problems in the physical sciences, machine learning, and statistical inference necessitate sampling from a high-dimensional, multimodal probability distribution. Markov Chain Monte Carlo (MCMC) algorithms, the ubiquitous tool for this task, typically rely on random local updates to propagate configurations of a given system in a way that ensures that generated configurations will be distributed according to a target probability distribution asymptotically. In high-dimensional settings with multiple relevant metastable basins, local approaches require either immense computational effort or intricately designed importance sampling strategies to capture information about, for example, the relative populations of such basins. Here, we analyze an adaptive MCMC, which augments MCMC sampling with nonlocal transition kernels parameterized with generative models known as normalizing flows. We focus on a setting where there are no preexisting data, as is commonly the case for problems in which MCMC is used. Our method uses 1) an MCMC strategy that blends local moves obtained from any standard transition kernel with those from a generative model to accelerate the sampling and 2) the data generated this way to adapt the generative model and improve its efficacy in the MCMC algorithm. We provide a theoretical analysis of the convergence properties of this algorithm and investigate numerically its efficiency, in particular in terms of its propensity to equilibrate fast between metastable modes whose rough location is known a priori but respective probability weight is not. We show that our algorithm can sample effectively across large free energy barriers, providing dramatic accelerations relative to traditional MCMC algorithms. 
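    The sketch below is a schematic of the blended-kernel idea in this abstract: local random-walk moves are mixed with nonlocal independence proposals drawn from an adaptively refit generative model, with the appropriate Metropolis-Hastings correction. A fitted Gaussian stands in for the normalizing flow, and the bimodal toy target is an assumption, so this illustrates the mechanism rather than the authors' method.

import numpy as np
from scipy.stats import multivariate_normal

def log_target(x):
    # toy unnormalized bimodal target with two well-separated modes
    return np.logaddexp(multivariate_normal.logpdf(x, mean=[-4.0, 0.0]),
                        multivariate_normal.logpdf(x, mean=[4.0, 0.0]))

def blended_mcmc(n_iter=5000, p_global=0.2, refit_every=1000, step=0.5, seed=None):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=2)
    samples, model = [], None          # "model" plays the role of the normalizing flow
    for t in range(n_iter):
        if model is not None and rng.random() < p_global:
            # nonlocal independence proposal from the fitted generative model
            y = model.rvs(random_state=rng)
            log_alpha = (log_target(y) + model.logpdf(x)) - (log_target(x) + model.logpdf(y))
        else:
            # local symmetric random-walk proposal
            y = x + step * rng.normal(size=2)
            log_alpha = log_target(y) - log_target(x)
        if np.log(rng.random()) < log_alpha:
            x = y
        samples.append(x)
        if (t + 1) % refit_every == 0:
            # adapt: refit the generative model to the samples gathered so far
            data = np.asarray(samples)
            model = multivariate_normal(mean=data.mean(0),
                                        cov=np.cov(data.T) + 1e-6 * np.eye(2))
    return np.asarray(samples)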
  3. De Vico Fallani, Fabrizio (Ed.)
    The exponential family random graph modeling (ERGM) framework provides a highly flexible approach for the statistical analysis of networks (i.e., graphs). As ERGMs with dyadic dependence involve normalizing factors that are extremely costly to compute, practical strategies for ERGM inference generally employ a variety of approximations or other workarounds. Markov Chain Monte Carlo maximum likelihood (MCMC MLE) provides a powerful tool to approximate the maximum likelihood estimator (MLE) of ERGM parameters, and is generally feasible for typical models on single networks with as many as a few thousand nodes. MCMC-based algorithms for Bayesian analysis are more expensive, and high-quality answers are challenging to obtain on large graphs. For both strategies, extension to the pooled case—in which we observe multiple networks from a common generative process—adds further computational cost, with both time and memory scaling linearly in the number of graphs. This becomes prohibitive for large networks, or cases in which large numbers of graph observations are available. Here, we exploit some basic properties of discrete exponential families to develop an approach for ERGM inference in the pooled case that (where applicable) allows an arbitrarily large number of graph observations to be fit at no additional computational cost beyond preprocessing the data itself. Moreover, a variant of our approach can also be used to perform Bayesian inference under conjugate priors, again with no additional computational cost in the estimation phase. The latter can be employed either for single graph observations, or for observations from graph sets. As we show, the conjugate prior is easily specified, and is well-suited to applications such as regularization. Simulation studies show that the pooled method leads to estimates with good frequentist properties, and posterior estimates under the conjugate prior are well-behaved. We demonstrate the usefulness of our approach with applications to pooled analysis of brain functional connectivity networks and to replicated x-ray crystal structures of hen egg-white lysozyme.
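    A small sketch of the exponential-family property this abstract exploits: with shared parameters, the pooled ERGM likelihood depends on the observed graphs only through the sum of their sufficient statistics, which can be computed once in preprocessing. The particular statistics below (edge and triangle counts) and all names are illustrative assumptions.

import numpy as np

def graph_stats(adj):
    """Toy sufficient statistics for one simple undirected graph: edges and triangles."""
    edges = adj.sum() / 2.0
    triangles = np.trace(adj @ adj @ adj) / 6.0
    return np.array([edges, triangles])

def pooled_stat_sum(adj_list):
    """One pass over the data; this cost is not paid again during estimation."""
    return sum(graph_stats(a) for a in adj_list)

def pooled_unnormalized_loglik(theta, t_sum, n_graphs, log_kappa):
    """Pooled log-likelihood theta . sum_g t(g) - G * log kappa(theta), where
    log kappa(theta) is the per-graph normalizing constant that MCMC MLE approximates."""
    return theta @ t_sum - n_graphs * log_kappa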
    Modeling the brain as a three-dimensional spatial object, similar to a geographical landscape, has paved the way for the successful application of Kriging methods to the seizure detection problem with good performance, albeit at cubic computational time complexity. The Deep Neural Network (DNN) has been widely used for seizure detection due to its effectiveness in classification tasks, although at the cost of protracted training time. While Kriging exploits the spatial correlation between data locations, the DNN relies on its capacity to learn intrinsic representations within the dataset from its most basic constituent parts. This paper presents a Distributed Kriging-Bootstrapped Deep Neural Network model as a twofold solution for fast and accurate seizure detection using brain signals collected with the electroencephalogram (EEG) from healthy subjects and patients with epilepsy. The proposed model parallelizes the Kriging computation across different cores in a machine and then produces strongly correlated, unified quasi-output data that serves as input to the Deep Neural Network. Experimental results validate the proposed model as superior to conventional Kriging methods and the DNN, training in 91% less time than the basic DNN and about three times as fast as the ordinary Kriging-Bootstrapped Deep Neural Network model, while maintaining good performance in terms of sensitivity, specificity, and testing accuracy compared to other models and existing works.
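    A rough sketch of the two-stage pipeline idea in this abstract: kriging-style smoothing is parallelized over partitions of EEG channels, and the stacked quasi-output is then handed to a neural-network classifier. The exponential covariance, the channel partitioning, and every name here are illustrative assumptions, not the paper's model.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def krige_partition(args):
    """Simple-kriging smoother for one partition of channels:
    K (K + tau2 I)^{-1} y evaluated at the observed electrode locations."""
    coords, signal, phi, tau2 = args
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    K = np.exp(-d / phi)
    W = K @ np.linalg.inv(K + tau2 * np.eye(len(coords)))
    return W @ signal                      # (n_channels_in_partition, n_times)

def kriging_bootstrapped_features(coords, signal, n_parts=4, phi=0.2, tau2=0.1):
    """Distribute the kriging computation across cores, then stack the smoothed
    partitions into a unified input for the downstream classifier."""
    parts = np.array_split(np.arange(len(coords)), n_parts)
    jobs = [(coords[p], signal[p], phi, tau2) for p in parts]
    with ProcessPoolExecutor() as pool:
        smoothed = list(pool.map(krige_partition, jobs))
    return np.vstack(smoothed)

# the stacked features could then train any classifier, e.g. (hypothetically):
# MLPClassifier(hidden_layer_sizes=(64, 32)).fit(features.T, labels)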
  5. Abstract

    Bayesian hierarchical models allow ecologists to account for uncertainty and make inference at multiple scales. However, hierarchical models are often computationally intensive to fit, especially with large datasets, and researchers face trade‐offs between capturing ecological complexity in statistical models and implementing these models.

    We present a recursive Bayesian computing (RB) method that can be used to fit Bayesian models efficiently in sequential MCMC stages to ease computation and streamline hierarchical inference. We also introduce transformation‐assisted RB (TARB) to create unsupervised MCMC algorithms and improve interpretability of parameters. We demonstrate TARB by fitting a hierarchical animal movement model to obtain inference about individual‐ and population‐level migratory characteristics.

    Our recursive procedure reduced the computation time for fitting our hierarchical movement model by half compared to fitting the model with a single MCMC algorithm. Fitting our model with TARB yielded the same inference as fitting it with a single algorithm.

    For complex ecological statistical models, like those for animal movement, multi‐species systems, or large spatial and temporal scales, the computational demands of fitting models with conventional computing techniques can limit model specification, thus hindering scientific discovery. Transformation‐assisted RB is one of the most accessible methods for reducing these limitations, enabling us to implement new statistical models and advance our understanding of complex ecological phenomena.
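    The sketch below is a toy illustration of the recursive idea in this abstract, not the authors' RB/TARB algorithms: a first-stage MCMC samples each group-level parameter under a flat working prior, and a second stage recycles those draws as proposals while updating the population-level parameter, with a Metropolis-Hastings correction that reduces to a ratio of hierarchical priors. The Gaussian hierarchy and all tuning choices are assumptions.

import numpy as np
from scipy.stats import norm

def stage_one(y_groups, n_iter=2000, sigma=1.0, seed=None):
    """Independent random-walk MH runs for each group mean under a flat prior."""
    rng = np.random.default_rng(seed)
    draws = []
    for y in y_groups:
        theta, chain = y.mean(), []
        for _ in range(n_iter):
            prop = theta + 0.3 * rng.normal()
            log_a = norm.logpdf(y, prop, sigma).sum() - norm.logpdf(y, theta, sigma).sum()
            if np.log(rng.random()) < log_a:
                theta = prop
            chain.append(theta)
        draws.append(np.array(chain))
    return draws

def stage_two(stage1_draws, n_iter=2000, tau=2.0, seed=None):
    """Recycle stage-one draws as proposals for the group means; because the
    stage-one (flat-prior) posterior cancels the likelihood, the MH ratio is
    just a ratio of hierarchical priors. The population mean gets a Gibbs update."""
    rng = np.random.default_rng(seed)
    J = len(stage1_draws)
    theta = np.array([d[-1] for d in stage1_draws])
    mu, out = theta.mean(), []
    for _ in range(n_iter):
        for j in range(J):
            prop = rng.choice(stage1_draws[j])      # proposal from stage-one posterior
            log_a = norm.logpdf(prop, mu, tau) - norm.logpdf(theta[j], mu, tau)
            if np.log(rng.random()) < log_a:
                theta[j] = prop
        mu = rng.normal(theta.mean(), tau / np.sqrt(J))   # Gibbs step, flat prior on mu
        out.append((mu, theta.copy()))
    return out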

     