Title: Consistent and scalable composite likelihood estimation of probit models with crossed random effects
Summary: Estimation of crossed random effects models commonly incurs computational costs that grow faster than linearly in the sample size $N$, often as fast as $\Omega(N^{3/2})$, making them unsuitable for large datasets. For non-Gaussian responses, integrating out the random effects to obtain a marginal likelihood poses significant challenges, especially for high-dimensional integrals for which the Laplace approximation may not be accurate. In this article we develop a composite likelihood approach to probit models that replaces the crossed random effects model with some hierarchical models that require only one-dimensional integrals. We show how to consistently estimate the crossed effects model parameters from the hierarchical model fits. We find that the computation scales linearly in the sample size. The method is illustrated by applying it to approximately five million observations from Stitch Fix, where the crossed effects formulation would require an integral of dimension larger than $700\,000$.
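To make the phrase "only one-dimensional integrals" concrete, the following is a minimal sketch, not the authors' implementation, of the log-likelihood contribution of a single group in a probit model with one Gaussian random intercept. The function names, the use of Gauss-Hermite quadrature, and the choice of 40 nodes are assumptions made for illustration only; the summary does not say how the integrals are evaluated.

```python
import math
from numpy.polynomial.hermite_e import hermegauss


def std_normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def group_loglik(y, eta, sigma, n_nodes=40):
    """Log-likelihood of one group under a random-intercept probit model.

    Hypothetical helper: y holds 0/1 responses, eta the fixed-effect linear
    predictors, and sigma the random-intercept standard deviation. Computes
    log of the integral over z ~ N(0, 1) of
    prod_j Phi((2 * y_j - 1) * (eta_j + sigma * z)).
    """
    # Nodes and weights for the weight function exp(-z^2 / 2);
    # dividing by sqrt(2 * pi) turns the rule into an N(0, 1) expectation.
    nodes, weights = hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)
    total = 0.0
    for z, w in zip(nodes, weights):
        prod = 1.0
        for y_j, eta_j in zip(y, eta):
            sign = 1.0 if y_j == 1 else -1.0
            prod *= std_normal_cdf(sign * (eta_j + sigma * z))
        total += w * prod
    return math.log(total)
```

For a group with a single observation the integral has the closed form $\Phi(\eta / \sqrt{1 + \sigma^2})$, which gives a quick sanity check: `group_loglik([1], [0.5], 1.2)` should agree with `math.log(std_normal_cdf(0.5 / math.sqrt(1 + 1.2**2)))` up to quadrature error. The cost of each group's integral depends only on that group's size, not on the number of groups.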
Award ID(s):
2152780
PAR ID:
10646155
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Biometrika
Date Published:
Journal Name:
Biometrika
Volume:
112
Issue:
3
ISSN:
1464-3510
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a random sample of size $n$ from a $p$-dimensional random vector, we are interested in testing whether the $p$ components of the random vector are mutually independent. This is the so-called complete independence test. In the multivariate normal case, it is equivalent to testing whether the correlation matrix is an identity matrix. In this paper, we propose a one-sided empirical likelihood method for the complete independence test based on squared sample correlation coefficients. The limiting distribution for our one-sided empirical likelihood test statistic is proved to be $Z^2 I(Z > 0)$ when both $n$ and $p$ tend to infinity, where $Z$ is a standard normal random variable. In order to improve the power of the empirical likelihood test statistic, we also introduce a rescaled empirical likelihood test statistic. We carry out an extensive simulation study to compare the performance of the rescaled empirical likelihood method and two other statistics.
  2. Abstract Recent empirical studies have quantified correlation between survival and recovery by estimating these parameters as correlated random effects with hierarchical Bayesian multivariate models fit to tag‐recovery data. In these applications, increasingly negative correlation between survival and recovery has been interpreted as evidence for increasingly additive harvest mortality. The power of these hierarchical models to detect nonzero correlations has rarely been evaluated, and these few studies have not focused on tag‐recovery data, which is a common data type. We assessed the power of multivariate hierarchical models to detect negative correlation between annual survival and recovery. Using three priors for multivariate normal distributions, we fit hierarchical effects models to a mallard (Anas platyrhynchos) tag‐recovery data set and to simulated data with sample sizes corresponding to different levels of monitoring intensity. We also demonstrate more robust summary statistics for tag‐recovery data sets than total individuals tagged. Different priors led to substantially different estimates of correlation from the mallard data. Our power analysis of simulated data indicated most prior distribution and sample size combinations could not estimate strongly negative correlation with useful precision or accuracy. Many correlation estimates spanned the available parameter space (−1,1) and underestimated the magnitude of negative correlation. Only one prior combined with our most intensive monitoring scenario provided reliable results. Underestimating the magnitude of correlation coincided with overestimating the variability of annual survival, but not annual recovery. The inadequacy of prior distributions and sample size combinations previously assumed adequate for obtaining robust inference from tag‐recovery data represents a concern in the application of Bayesian hierarchical models to tag‐recovery data. Our analysis approach provides a means for examining prior influence and sample size on hierarchical models fit to capture–recapture data while emphasizing transferability of results between empirical and simulation studies.
  3. Abstract Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches. 
  4. Abstract We propose a Bayesian model selection approach for generalized linear mixed models (GLMMs). We consider covariance structures for the random effects that are widely used in areas such as longitudinal studies, genome-wide association studies, and spatial statistics. Since the random effects cannot be integrated out of GLMMs analytically, we approximate the integrated likelihood function using a pseudo-likelihood approach. Our Bayesian approach assumes a flat prior for the fixed effects and includes both approximate reference prior and half-Cauchy prior choices for the variances of random effects. Since the flat prior on the fixed effects is improper, we develop a fractional Bayes factor approach to obtain posterior probabilities of the several competing models. Simulation studies with Poisson GLMMs with spatial random effects and overdispersion random effects show that our approach performs favorably when compared to widely used competing Bayesian methods including deviance information criterion and Watanabe–Akaike information criterion. We illustrate the usefulness and flexibility of our approach with three case studies including a Poisson longitudinal model, a Poisson spatial model, and a logistic mixed model. Our proposed approach is implemented in the R package GLMMselect that is available on CRAN. 
  5. Abstract Integral projection models (IPMs) are widely used for studying continuously size‐structured populations. IPMs require a growth sub‐model that describes the probability of future size conditional on current size and any covariates. Most IPM studies assume that this distribution is Gaussian, despite calls for non‐Gaussian models that accommodate skewness and excess kurtosis. We provide a general workflow for accommodating non‐Gaussian growth patterns while retaining important covariates and random effects. Our approach emphasizes visual diagnostics from pilot Gaussian models and quantile‐based metrics of skewness and kurtosis that guide selection of a non‐Gaussian alternative, if necessary. Across six case studies, skewness and excess kurtosis were common features of growth data, and non‐Gaussian models consistently generated simulated data that were more consistent with real data than pilot Gaussian models. However, effects of “improved” growth modeling on IPM results were moderate to weak and differed in direction or magnitude between different outputs from the same model. Using tools not available when IPMs were first developed, it is now possible to fit non‐Gaussian models to growth data without sacrificing ecological complexity. Doing so, as guided by careful interrogation of the data, will result in models that better represent the populations for which they are intended. 