Title: Semiparametric mixed‐scale models using shared Bayesian forests
Abstract

This paper demonstrates the advantages of sharing information about unknown features of covariates across multiple model components in various nonparametric regression problems including multivariate, heteroscedastic, and semicontinuous responses. In this paper, we present a methodology which allows for information to be shared nonparametrically across various model components using Bayesian sum‐of‐tree models. Our simulation results demonstrate that sharing of information across related model components is often very beneficial, particularly in sparse high‐dimensional problems in which variable selection must be conducted. We illustrate our methodology by analyzing medical expenditure data from the Medical Expenditure Panel Survey (MEPS). To facilitate the Bayesian nonparametric regression analysis, we develop two novel models for analyzing the MEPS data using Bayesian additive regression trees—a heteroskedastic log‐normal hurdle model with a “shrink‐toward‐homoskedasticity” prior and a gamma hurdle model.

 
Award ID(s):
2015636
NSF-PAR ID:
10122945
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
76
Issue:
1
ISSN:
0006-341X
Format(s):
Medium: X Size: p. 131-144
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Causal inference practitioners have increasingly adopted machine learning techniques with the aim of producing principled uncertainty quantification for causal effects while minimizing the risk of model misspecification. Bayesian nonparametric approaches have attracted attention as well, both for their flexibility and their promise of providing natural uncertainty quantification. Priors on high‐dimensional or nonparametric spaces, however, can often unintentionally encode prior information that is at odds with substantive knowledge in causal inference—specifically, the regularization required for high‐dimensional Bayesian models to work can indirectly imply that the magnitude of the confounding is negligible. In this paper, we explain this problem and provide tools for (i) verifying that the prior distribution does not encode an inductive bias away from confounded models and (ii) verifying that the posterior distribution contains sufficient information to overcome this issue if it exists. We provide a proof‐of‐concept on simulated data from a high‐dimensional probit‐ridge regression model, and illustrate on a Bayesian nonparametric decision tree ensemble applied to a large medical expenditure survey.

     
  2. ABSTRACT

    Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.

     
  3. We consider the problem of nonparametric regression in the high-dimensional setting in which P≫N. We study the use of overlapping group structures to improve prediction and variable selection. These structures arise commonly when analyzing DNA microarray data, where genes can naturally be grouped according to genetic pathways. We incorporate overlapping group structure into a Bayesian additive regression trees model using a prior constructed so that, if a variable from some group is used to construct a split, this increases the probability that subsequent splits will use predictors from the same group. We refer to our model as an overlapping group Bayesian additive regression trees (OG-BART) model, and our prior on the splits an overlapping group Dirichlet (OG-Dirichlet) prior. Like the sparse group lasso, our prior encourages sparsity both within and between groups. We study the correlation structure of the prior, illustrate the proposed methodology on simulated data, and apply the methodology to gene expression data to learn which genetic pathways are predictive of breast cancer tumor metastasis. 
  4. In group anagram games, players cooperate to form words by sharing letters that they are initially given. The aim is to form as many words as possible as a group, within five minutes. Players take several different actions: requesting letters from their neighbors, replying to letter requests, and forming words. Agent-based models (ABMs) for the game compute likelihoods of each player’s next action, which contain uncertainty, as they are estimated from experimental data. We adopt a Bayesian approach as a natural means of quantifying uncertainty, to enhance the ABM for the group anagram game. Specifically, a Bayesian nonparametric clustering method is used to group player behaviors into different clusters without pre-specifying the number of clusters. Bayesian multinomial regression is adopted to model the transition probabilities among different actions of the players in the ABM. We describe the methodology and its benefits, and perform agent-based simulations of the game.
  5. Emergency medical services (EMS) teams are first responders providing urgent medical care to severely ill or injured patients in the field. Despite their criticality, EMS work is one of the very few medical domains with limited technical support. This paper describes a study conducted to examine technology opportunities for supporting EMS data work and decision-making. We transcribed and analyzed 25 simulation videos. Using the distributed cognition framework, we examined EMS teams' work practices that support information acquisition and sharing. Our results showed that EMS teams leveraged various mechanisms (e.g., verbal communication and external cognitive aids) to distribute cognitive labor in managing, collecting, and using patient data. However, we observed a set of prominent challenges in EMS data work, including lack of detailed documentation in real time, situation recall issues, situation awareness problems, and challenges in decision making and communication. Based on the results, we discuss implications for technology opportunities to support rapid information acquisition, integration, and sharing in time-critical, high-risk medical settings. 