NSF PAR Search | NSF Public Access Repository


The PEcAn+SIPNET Terrestrial Carbon Cycle Reanalysis: Development and Validation

https://doi.org/10.3897/aca.8.e152356

Dietze, Michael; Zhang, Dongchen; Finkeldei, Chaney; Gu, Yang; Huggins, Jonathan; Lai, Meng; Li, Qianyu; Ramachandran, Shashank; Roberts, Andrew; Serbin, Shawn; et al. (May 2025, ARPHA Conference Abstracts)

Improving our ability to understand and predict the dynamics of the terrestrial carbon cycle remains a pressing challenge despite a rapidly growing volume and diversity of Earth Observation data. State data assimilation represents a path forward via an iterative cycle of making process-based forecasts and then statistically reconciling these forecasts against numerous ground-based and remotely-sensed data constraints into a “reanalysis” data product that provides full spatiotemporal carbon budgets with robust uncertainty accounting. Here we report on a >100x expansion of the PEcAn+SIPNET reanalysis from 500 sites across CONUS, 25 ensemble members, and 2 data constraints to 6400 sites across North America, 100 ensemble members, and 5 data constraints: GEDI and LandTrendr AGB, MODIS LAI, SoilGrids soil C, and SMAP soil moisture. We also report on an ensemble-based machine learning (ML) downscaling to a 1 km product that preserves spatial, temporal, and across-variable covariances and demonstrate the impacts of these covariances on uncertainty accounting (Fig. 1). Synergistically, we use the same ML models to assess what climate, vegetation, and soil variables explain the spatiotemporal variability in different C pools and fluxes. In addition, we review a wide range of ongoing validation activities, comparing the outputs of the reanalysis against withheld data from: AmeriFlux and NEON NEE and LE; USFS Forest Inventory biomass, biomass increment, tree rings, soil C, and litter; and NEON soil C and soil respiration. Finally, we touch on ML analyses to diagnose and correct systematic biases and emulator-based recalibration efforts.
Free, publicly-accessible full text available May 28, 2026
Validated Variational Inference via Practical Posterior Error Bounds

Huggins, Jonathan; Kasprzak, Mikolaj; Campbell, Trevor; Broderick, Tamara (January 2020, Proceedings of Machine Learning Research)
Chiappa, Silvia; Calandra, Roberto (Ed.)
Full Text Available
LR-GLM: High-Dimensional Bayesian Inference Using Low-Rank Data Approximations

Trippe, Brian; Huggins, Jonathan; Agrawal, Raj; Broderick, Tamara (June 2019, Proceedings of Machine Learning Research)

Due to the ease of modern data collection, applied statisticians often have access to a large set of covariates that they wish to relate to some observed outcome. Generalized linear models (GLMs) offer a particularly interpretable framework for such an analysis. In these high-dimensional problems, the number of covariates is often large relative to the number of observations, so we face non-trivial inferential uncertainty; a Bayesian approach allows coherent quantification of this uncertainty. Unfortunately, existing methods for Bayesian inference in GLMs require running times roughly cubic in parameter dimension, and so are limited to settings with at most tens of thousands of parameters. We propose to reduce time and memory costs with a low-rank approximation of the data in an approach we call LR-GLM. When used with the Laplace approximation or Markov chain Monte Carlo, LR-GLM provides a full Bayesian posterior approximation and admits running times reduced by a full factor of the parameter dimension. We rigorously establish the quality of our approximation and show how the choice of rank allows a tunable computational–statistical trade-off. Experiments support our theory and demonstrate the efficacy of LR-GLM on real large-scale datasets.
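The low-rank idea in this abstract can be illustrated on the simplest conjugate case. The following is a minimal sketch, assuming a Gaussian linear model with an isotropic Gaussian prior; it is not the authors' implementation, and the function names and the `noise_var` and `prior_var` parameters are hypothetical. The data matrix is replaced by its projection onto the top right singular vectors, so the expensive solve happens in `rank` dimensions rather than in the full parameter dimension.

```python
import numpy as np


def exact_gaussian_posterior(X, y, noise_var=1.0, prior_var=1.0):
    """Exact posterior for y ~ N(X beta, noise_var I), beta ~ N(0, prior_var I)."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)  # cubic in d
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov


def low_rank_gaussian_posterior(X, y, rank, noise_var=1.0, prior_var=1.0):
    """Posterior obtained after replacing X by its projection X U U^T (illustrative)."""
    d = X.shape[1]
    # Columns of U are the top `rank` right singular vectors of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    U = vt[:rank].T  # shape (d, rank), orthonormal columns
    Z = X @ U  # projected data, shape (n, rank)
    # Solve only a rank x rank system in the projected coordinates.
    small_precision = np.eye(rank) / prior_var + Z.T @ Z / noise_var
    small_cov = np.linalg.inv(small_precision)
    mean = U @ (small_cov @ (Z.T @ y) / noise_var)
    # Directions orthogonal to U are untouched by the data and keep the prior.
    # The dense d x d covariance is formed here only for easy comparison.
    cov = prior_var * (np.eye(d) - U @ U.T) + U @ small_cov @ U.T
    return mean, cov
```

When the data matrix has exact rank equal to `rank`, the two posteriors coincide; for smaller `rank` the second is an approximation whose quality depends on how quickly the singular values decay, which is the tunable trade-off the abstract describes.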
Full Text Available
The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions

Agrawal, Raj; Trippe, Brian; Huggins, Jonathan; Broderick, Tamara (June 2019, Proceedings of Machine Learning Research)

Discovering interaction effects on a response of interest is a fundamental problem faced in biology, medicine, economics, and many other scientific disciplines. In theory, Bayesian methods for discovering pairwise interactions enjoy many benefits such as coherent uncertainty quantification, the ability to incorporate background knowledge, and desirable shrinkage properties. In practice, however, Bayesian methods are often computationally intractable for even moderate-dimensional problems. Our key insight is that many hierarchical models of practical interest admit a Gaussian process representation such that rather than maintaining a posterior over all O(p^2) interactions, we need only maintain a vector of O(p) kernel hyper-parameters. This implicit representation allows us to run Markov chain Monte Carlo (MCMC) over model hyper-parameters in time and memory linear in p per iteration. We focus on sparsity-inducing models and show on datasets with a variety of covariate behaviors that our method: (1) reduces runtime by orders of magnitude over naive applications of MCMC, (2) provides lower Type I and Type II error relative to state-of-the-art LASSO-based approaches, and (3) offers improved computational scaling in high dimensions relative to existing Bayesian and LASSO-based approaches.
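The reason a kernel can stand in for O(p^2) explicit interaction terms can be seen with the generic degree-2 polynomial kernel. The sketch below is an illustration of that standard identity only; it does not reproduce the hierarchical models or kernel hyper-parameters from the paper, and the function names are hypothetical. The explicit feature map lists every main effect, square, and pairwise product, while the kernel returns the same inner product in time linear in p.

```python
from itertools import combinations

import numpy as np


def interaction_features(x):
    """Explicit feature map with all pairwise products: O(p^2) entries."""
    x = np.asarray(x, dtype=float)
    pairs = np.array([x[i] * x[j] for i, j in combinations(range(len(x)), 2)])
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, np.sqrt(2) * pairs))


def interaction_kernel(x, y):
    """Degree-2 polynomial kernel: same inner product, computed in O(p) time."""
    return (1.0 + np.dot(x, y)) ** 2
```

For two vectors `x` and `y`, `interaction_features(x) @ interaction_features(y)` and `interaction_kernel(x, y)` agree up to floating-point error, even though the first touches every pairwise term and the second never forms them.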
Full Text Available
Data-dependent compression of random features for large-scale kernel approximation

Agrawal, Raj; Campbell, Trevor; Huggins, Jonathan; Broderick, Tamara (April 2019, Proceedings of Machine Learning Research)

Kernel methods offer the flexibility to learn complex relationships in modern, large data sets while enjoying strong theoretical guarantees on quality. Unfortunately, these methods typically require cubic running time in the data set size, a prohibitive cost in the large-data setting. Random feature maps (RFMs) and the Nyström method both consider low-rank approximations to the kernel matrix as a potential solution. But, in order to achieve desirable theoretical guarantees, the former may require a prohibitively large number of features J_+, and the latter may be prohibitively expensive for high-dimensional problems. We propose to combine the simplicity and generality of RFMs with a data-dependent feature selection scheme to achieve desirable theoretical approximation properties of Nyström with just O(log J_+) features. Our key insight is to begin with a large set of random features, then reduce them to a small number of weighted features in a data-dependent, computationally efficient way, while preserving the statistical guarantees of using the original large set of features. We demonstrate the efficacy of our method with theory and experiments, including on a data set with over 50 million observations. In particular, we show that our method achieves small kernel matrix approximation error and better test set accuracy with provably fewer random features than state-of-the-art methods.
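For readers unfamiliar with random feature maps, the baseline that this work starts from can be sketched as follows. This is the standard random Fourier feature construction for the squared-exponential kernel, shown as an assumption-laden illustration; it does not include the data-dependent compression step the abstract proposes, and the function names and the `lengthscale` and `seed` parameters are hypothetical.

```python
import numpy as np


def rbf_kernel(X, Y, lengthscale=1.0):
    """Exact kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def random_fourier_features(X, num_features, lengthscale=1.0, seed=0):
    """Map X (n x d) to Z (n x num_features) so that Z @ Z.T approximates the kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density, phases uniformly.
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```

The approximation error of `Z @ Z.T` shrinks slowly as `num_features` grows, which is why a large number of random features may be needed and why reducing them to a small weighted subset is attractive.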
Full Text Available
Scalable Gaussian Process Inference with Finite-data Mean and Variance Guarantees

Huggins, Jonathan H.; Campbell, Trevor; Kasprzak, Mikolaj; Broderick, Tamara (April 2019, Proceedings of Machine Learning Research)

Gaussian processes (GPs) offer a flexible class of priors for nonparametric Bayesian regression, but popular GP posterior inference methods are typically prohibitively slow or lack desirable finite-data guarantees on quality. We develop a scalable approach to approximate GP regression, with finite-data guarantees on the accuracy of our pointwise posterior mean and variance estimates. Our main contribution is a novel objective for approximate inference in the nonparametric setting: the preconditioned Fisher (pF) divergence. We show that unlike the Kullback–Leibler divergence (used in variational inference), the pF divergence bounds the 2-Wasserstein distance, which in turn provides tight bounds on the pointwise error of mean and variance estimates. We demonstrate that, for sparse GP likelihood approximations, we can minimize the pF divergence bounds efficiently. Our experiments show that optimizing the pF divergence bounds has the same computational requirements as variational sparse GPs while providing comparable empirical performance, in addition to our novel finite-data quality guarantees.
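Why a bound on the 2-Wasserstein distance controls pointwise estimates can be seen in the univariate Gaussian case, where the distance has a closed form. The sketch below illustrates only that textbook special case, with a hypothetical function name; it does not implement the pF divergence or the paper's bounds.

```python
import math


def w2_gaussian_1d(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between N(mean_a, std_a^2) and N(mean_b, std_b^2).

    For univariate Gaussians, W2^2 = (mean_a - mean_b)^2 + (std_a - std_b)^2.
    """
    return math.hypot(mean_a - mean_b, std_a - std_b)
```

Because the distance is the Euclidean norm of the mean gap and the standard-deviation gap, each gap is individually no larger than the distance, so a small 2-Wasserstein distance at a test point implies small errors in both the mean and the standard deviation there.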
Full Text Available
