Title: Bivariate small‐area estimation for binary and gaussian variables based on a conditionally specified model

Many large‐scale surveys collect both discrete and continuous variables. Small‐area estimates may be desired for means of continuous variables, proportions in each level of a categorical variable, or for domain means defined as the mean of the continuous variable for each level of the categorical variable. In this paper, we introduce a conditionally specified bivariate mixed‐effects model for small‐area estimation, and provide a necessary and sufficient condition under which the conditional distributions render a valid joint distribution. The conditional specification allows better model interpretation. We use the valid joint distribution to calculate empirical Bayes predictors and use the parametric bootstrap to estimate the mean squared error. Simulation studies demonstrate the superior performance of the bivariate mixed‐effects model relative to univariate model estimators. We apply the bivariate mixed‐effects model to construct estimates for small watersheds using data from the Conservation Effects Assessment Project, a survey developed to quantify the environmental impacts of conservation efforts. We construct predictors of mean sediment loss, the proportion of land where the soil loss tolerance is exceeded, and the average sediment loss on land where the soil loss tolerance is exceeded. In the data analysis, the bivariate mixed‐effects model leads to more scientifically interpretable estimates of domain means than those based on two independent univariate models.
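The parametric bootstrap step mentioned in the abstract can be illustrated with a much simpler model than the paper's bivariate one. The sketch below assumes a univariate normal-normal area-level model (y_i = theta_i + e_i, with theta_i ~ N(mu, a) and e_i ~ N(0, d), d known) and a basic bootstrap MSE estimator without bias correction. The function names `fit`, `eb_predict`, and `bootstrap_mse` are hypothetical and are not taken from the paper.

```python
import random
import statistics


def fit(y, d):
    """Moment estimates of (mu, a) under the assumed model
    y_i = theta_i + e_i, theta_i ~ N(mu, a), e_i ~ N(0, d), d known."""
    mu_hat = statistics.fmean(y)
    a_hat = max(0.0, statistics.variance(y) - d)  # truncate at zero
    return mu_hat, a_hat


def eb_predict(y, d, mu_hat, a_hat):
    """Empirical Bayes predictor: shrink each y_i toward mu_hat."""
    gamma = a_hat / (a_hat + d)
    return [mu_hat + gamma * (yi - mu_hat) for yi in y]


def bootstrap_mse(y, d, n_boot=500, seed=0):
    """Parametric bootstrap estimate of the MSE of the EB predictor, per area."""
    rng = random.Random(seed)
    mu_hat, a_hat = fit(y, d)
    m = len(y)
    sq_err = [0.0] * m
    for _ in range(n_boot):
        # Generate bootstrap area means and data from the fitted model.
        theta_star = [rng.gauss(mu_hat, a_hat ** 0.5) for _ in range(m)]
        y_star = [t + rng.gauss(0.0, d ** 0.5) for t in theta_star]
        # Re-estimate the parameters and recompute the predictor.
        mu_b, a_b = fit(y_star, d)
        pred_star = eb_predict(y_star, d, mu_b, a_b)
        for i in range(m):
            sq_err[i] += (pred_star[i] - theta_star[i]) ** 2
    return [s / n_boot for s in sq_err]


# Illustrative (made-up) direct estimates for eight small areas.
y_obs = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 3.8]
print(bootstrap_mse(y_obs, d=0.5, n_boot=200, seed=1))
```

Because the bootstrap truth `theta_star` is known in each replicate, the squared prediction error can be averaged directly; the paper's estimator for the bivariate model may differ in its details.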

Page Range / eLocation ID:
p. 1555-1565
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Many variables of interest in agricultural or economic surveys have skewed distributions and can equal zero. Our data are measures of sheet and rill erosion called Revised Universal Soil Loss Equation 2 (RUSLE2). Small area estimates of mean RUSLE2 erosion are of interest. We use a zero‐inflated lognormal mixed effects model for small area estimation. The model combines a unit‐level lognormal model for the positive RUSLE2 responses with a unit‐level logistic mixed effects model for the binary indicator that the response is nonzero. In the Conservation Effects Assessment Project (CEAP) data, counties with a higher probability of nonzero responses also tend to have a higher mean among the positive RUSLE2 values. We capture this property of the data through an assumption that the pair of random effects for a county are correlated. We develop empirical Bayes (EB) small area predictors and a bootstrap estimator of the mean squared error (MSE). In simulations, the proposed predictor is superior to simpler alternatives. We then apply the method to construct EB predictors of mean RUSLE2 erosion for South Dakota counties. To obtain auxiliary variables for the population of cropland in South Dakota, we integrate a satellite‐derived land cover map with a geographic database of soil properties. We provide an R Shiny application called viscover to visualize the overlay operations required to construct the covariates. On the basis of bootstrap estimates of the MSE, we conclude that the EB predictors of mean RUSLE2 erosion are superior to direct estimators.
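For a zero-inflated lognormal response, the mean conditional on the linear predictors is the probability of a nonzero value times the lognormal mean. The sketch below shows only that arithmetic. The argument names `eta_binary`, `mu_log`, and `sigma_log` are hypothetical, and the code does not reproduce the EB predictor of the abstract, which also accounts for correlated random effects.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def zi_lognormal_mean(eta_binary, mu_log, sigma_log):
    """Mean of Y, where Y = 0 with probability 1 - p and
    log(Y) ~ N(mu_log, sigma_log**2) with probability p = expit(eta_binary)."""
    p_nonzero = expit(eta_binary)
    positive_mean = math.exp(mu_log + 0.5 * sigma_log ** 2)  # lognormal mean
    return p_nonzero * positive_mean


# Hypothetical values: logit-scale predictor 0.8, log-scale mean 0.2, log-scale SD 1.1.
print(zi_lognormal_mean(0.8, 0.2, 1.1))
```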

  2. Abstract

    Using the existing measures for training numerical (non-categorical) prediction models can cause misclassification of droughts. Thus, developing a drought category-based measure is critical. Moreover, the existing fixed drought category thresholds need to be improved. The objective of this research is to develop a category-based scoring support vector regression (CBS-SVR) model based on an improved drought categorization method to overcome misclassification in drought prediction. To derive variable threshold levels for drought categorization, K-means (KM) and Gaussian mixture (GM) clustering are compared with the traditional drought categorization. For drought prediction, CBS-SVR is performed by using the best categorization method. The new drought model was applied to the Red River of the North Basin (RRB) in the USA. In the model training and testing, precipitation, temperature, and actual evapotranspiration were selected as the predictors, and the target variables consisted of multivariate drought indices, as well as bivariate and univariate standardized drought indices. Results indicated that the drought categorization method, variable threshold levels, and the type of drought index were the major factors that influenced the accuracy of drought prediction. The CBS-SVR outperformed the support vector classification and traditional SVR by avoiding overfitting and miscategorization in drought prediction.
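One way to picture how clustering might yield variable drought thresholds is to run one-dimensional K-means on drought index values and place a threshold midway between adjacent cluster centers. The abstract does not say how thresholds are read off the clusters, so the midpoint rule, the quantile initialization, and the names `kmeans_1d` and `thresholds_from_centroids` are assumptions made for illustration.

```python
def kmeans_1d(values, k, n_iter=100):
    """Plain 1-D K-means (Lloyd's algorithm) with quantile initialization."""
    data = sorted(values)
    centroids = [data[int((i + 0.5) * len(data) / k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def thresholds_from_centroids(centroids):
    """Assumed rule: category boundary = midpoint of adjacent centers."""
    return [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]


# Made-up standardized drought index values forming three groups.
index_values = [-2.1, -1.9, -2.0, -0.1, 0.0, 0.1, 1.9, 2.0, 2.1]
centers = kmeans_1d(index_values, k=3)
cutoffs = thresholds_from_centroids(centers)
print(centers, cutoffs)
```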

  3. Abstract

    Soil moisture (SM) and evapotranspiration (ET) are key variables of the terrestrial water cycle with a strong relationship. This study examines remotely sensed soil moisture and evapotranspiration data assimilation (DA) with the aim of improving drought monitoring. Although numerous efforts have gone into assimilating satellite soil moisture observations into land surface models to improve their predictive skills, little attention has been given to the combined use of soil moisture and evapotranspiration to better characterize hydrologic fluxes. In this study, we assimilate two remotely sensed datasets, namely, Soil Moisture Operational Product System (SMOPS) and MODIS evapotranspiration (MODIS16 ET), at 1-km spatial resolution, into the VIC land surface model by means of an evolutionary particle filter method. To achieve this, a fully parallelized framework based on model and domain decomposition using a parallel divide-and-conquer algorithm was implemented. The findings show improvement in soil moisture predictions by multivariate assimilation of both ET and SM as compared to univariate scenarios. In addition, monthly and weekly drought maps are produced using the updated root-zone soil moisture percentiles over the Apalachicola–Chattahoochee–Flint basin in the southeastern United States. The model-based estimates are then compared against the corresponding U.S. Drought Monitor (USDM) archive maps. The results are consistent with the USDM maps in terms of drought extent during the winter and spring seasons; however, the drought severity was found to be slightly higher according to the DA method. Comparing different assimilation scenarios showed that ET assimilation results in wetter conditions compared to open-loop and univariate SM DA. The multivariate DA then combines the effects of the two variables and provides an in-between condition.
  4. Windecker, Saras (Ed.)
    1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the ‘learning’ hidden in the ML models. 2. We revisit the theoretical background and effectiveness of four approaches for deriving insights from ML: ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI), and two approaches for inference of bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multi-variate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables. 3. We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately impacted by sample size. Removing spurious variables improves interpretation of ML models. Meanwhile, increasing sample size has limited value in the presence of spurious variables, but increasing sample size does improve performance once spurious variables are omitted. Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors and the response variable can be enhanced using surrogate models, including three-dimensional visualizations and use of loess planes to represent independent variable effects and interactions. 4. Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to ‘learn from machine learning’.
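Of the ranking methods listed, permutation importance (PI) is the simplest to sketch: shuffle one predictor column and record how much the prediction error grows. The example below is a minimal illustration, assuming a simulated response that depends only on the first predictor and using a fixed function `predict` as a stand-in for a fitted ML model. The names and the simulated data are hypothetical and do not come from the study.

```python
import random


def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X."""
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += mse(predict, X_perm, y) - baseline
        importances.append(increase / n_repeats)
    return importances


# Simulated data: y depends on x1 only; x2 is a non-influential predictor.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]


def predict(row):
    """Stand-in for a fitted model that has learned y = 3 * x1."""
    return 3.0 * row[0]


imp = permutation_importance(predict, X, y)
print(imp)
```

Here the stand-in model ignores the second predictor entirely, so its importance is zero by construction. A model fitted to correlated predictors can assign nonzero importance to a non-influential variable, which is the interference the abstract warns about.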
  5. Two popular approaches for relating correlated measurements of a non‐Gaussian response variable to a set of predictors are to fit a marginal model using generalized estimating equations and to fit a generalized linear mixed model (GLMM) by introducing latent random variables. The first approach is effective for parameter estimation, but leaves one without a formal model for the data with which to assess quality of fit or make individual‐level predictions for future observations. The second approach overcomes these deficiencies, but leads to parameter estimates that must be interpreted conditional on the latent variables. To obtain marginal summaries, one needs to evaluate an analytically intractable integral or use attenuation factors as an approximation. Further, we note an unpalatable implication of the standard GLMM. To resolve these issues, we turn to a class of marginally interpretable GLMMs that lead to parameter estimates with a marginal interpretation while maintaining the desirable statistical properties of a conditionally specified model and avoiding problematic implications. We establish the form of these models under the most commonly used link functions and address computational issues. For logistic mixed effects models, we introduce an accurate and efficient method for evaluating the logistic‐normal integral.
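The logistic-normal integral referred to here is the marginal mean E[expit(mu + sigma Z)] with Z standard normal, which has no closed form. The sketch below approximates it with a plain midpoint rule, purely to show the quantity and the attenuation effect. It is not the method introduced in the paper, and the function name `logistic_normal_mean` and its default grid settings are assumptions.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_normal_mean(mu, sigma, n=4000, z_max=10.0):
    """Midpoint-rule approximation of E[expit(mu + sigma * Z)], Z ~ N(0, 1).

    The integration range is truncated to [-z_max, z_max], where the
    standard normal density is negligible beyond the endpoints.
    """
    h = 2.0 * z_max / n
    total = 0.0
    for k in range(n):
        z = -z_max + (k + 0.5) * h
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += expit(mu + sigma * z) * density
    return total * h


# The marginal mean is pulled toward 0.5 relative to the conditional mean expit(mu).
print(expit(1.0), logistic_normal_mean(1.0, 2.0))
```

With sigma = 0 the function returns expit(mu) up to the integration error, and for mu = 0 it returns 0.5 by symmetry; larger sigma attenuates the marginal mean toward 0.5.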
