Title: GPSRL: Learning Semi-Parametric Bayesian Survival Rule Lists from Heterogeneous Patient Data
Survival data in medical applications are often collected from heterogeneous patient populations. While popular survival models have traditionally focused on modeling the average effect of covariates on survival outcomes, rapidly advancing sensing and information technologies provide opportunities to also model the heterogeneity of the population and the non-linearity of survival risk. With this motivation, we propose a new semi-parametric Bayesian Survival Rule List model. Our model derives a rule-based decision-making approach, while within the regime defined by each rule, survival risk is modeled via a Gaussian process latent variable model. Markov chain Monte Carlo with a nested Laplace approximation of the Gaussian process posterior is used to search the posterior over rule lists efficiently. The use of ordered rule lists enables us to model heterogeneity while keeping model complexity in check. Performance evaluations on a synthetic heterogeneous survival dataset and a real-world sepsis survival dataset demonstrate the effectiveness of our model.
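
To make the model structure concrete, here is a minimal, hypothetical sketch of the rule-list idea: an ordered list of rules partitions patients into regimes, and a small Gaussian-process regression stands in for the per-regime risk model. The rule predicates, covariates, and data below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a survival rule list: the first rule whose
# predicate fires assigns a patient to a regime; within each regime a
# toy GP regression predicts a log-risk score.
import numpy as np

def rbf_kernel(A, B, length_scale=15.0, variance=1.0):
    """Squared-exponential kernel; broad length scale since the toy
    covariates are left unscaled."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean of GP regression with an RBF kernel."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf_kernel(X_test, X_train) @ np.linalg.solve(K, y_train)

# Ordered rule list over (age, lactate); the catch-all rule closes the list.
rules = [
    ("age > 65 and lactate > 2", lambda x: x[0] > 65 and x[1] > 2),
    ("age > 65",                 lambda x: x[0] > 65),
    ("default",                  lambda x: True),
]

def regime_of(x):
    """Index of the first rule whose predicate matches."""
    return next(i for i, (_, pred) in enumerate(rules) if pred(x))

# Toy covariates and log-risk targets.
rng = np.random.default_rng(0)
X = rng.uniform([40.0, 0.5], [90.0, 5.0], size=(60, 2))
y = 0.02 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.1, 60)
regimes = np.array([regime_of(x) for x in X])

x_new = np.array([72.0, 3.1])
r = regime_of(x_new)
mask = regimes == r
risk = gp_posterior_mean(X[mask], y[mask], x_new[None, :])[0]
print(f"rule fired: {rules[r][0]!r}, predicted log-risk: {risk:.3f}")
```

In the paper the per-regime model is a Gaussian process latent variable model and the rule list itself is sampled via MCMC; this sketch fixes one rule list and uses plain GP regression only to keep the structure visible.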
Award ID(s):
1718513
NSF-PAR ID:
10220434
Date Published:
Journal Name:
International Conference on Pattern Recognition
Volume:
2020
ISSN:
1051-4651
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Integrated population models (IPMs) have become increasingly popular for the modelling of populations, as investigators seek to combine survey and demographic data to understand processes governing population dynamics. These models are particularly useful for identifying and exploring knowledge gaps within life histories, because they allow investigators to estimate biologically meaningful parameters, such as immigration or reproduction, that were previously unidentifiable without additional data. As IPMs have been developed relatively recently, there is much to learn about model behaviour. The behaviour of parameters, such as estimates near boundaries, and the consequences of varying degrees of dependency among datasets have been explored. However, the reliability of parameter estimates remains underexamined, particularly when models include parameters that are not identifiable from one data source, but are indirectly identifiable from multiple datasets and a presumed model structure, such as the estimation of immigration using capture-recapture, fecundity and count data, combined with a life-history model.

    To examine the behaviour of model parameter estimates, we simulated stable populations closed to immigration and emigration. We simulated two scenarios that might induce error into survival estimates: marker induced bias in the capture–mark–recapture data and heterogeneity in the mortality process. We subsequently fit capture–mark–recapture, state-space and fecundity models, as well as IPMs that estimated additional parameters.

    Simulation results suggested that when model assumptions are violated, estimation of additional, previously unidentifiable parameters using IPMs may be extremely sensitive to those violations. For example, when annual marker loss was simulated, estimates of survival rates were low and estimates of the immigration rate from an IPM were high. When heterogeneity in the mortality process was induced, there were substantial relative differences between the medians of posterior distributions and truth for juvenile survival and fecundity.

    Our results have important implications for biological inference when using IPMs, as well as future model development and implementation. Specifically, using multiple datasets to identify additional parameters resulted in the posterior distributions of additional parameters directly reflecting the effects of the violations of model assumptions in integrated modelling frameworks. We suggest that investigators interpret posterior distributions of these parameters as a combination of biological process and systematic error.
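
    As a toy illustration of the marker-loss scenario described above (all rates and code here are hypothetical, not the paper's simulation), the sketch below shows how annual marker loss depresses apparent survival even when true survival is known:

```python
# Hypothetical marker-loss simulation: an observer can only resight
# animals that are alive AND still carry their marker, so per-year
# apparent survival is roughly true_survival * (1 - marker_loss).
import numpy as np

rng = np.random.default_rng(1)
n, years = 2000, 6
true_survival, marker_loss = 0.8, 0.1  # assumed annual rates

alive = np.ones(n, dtype=bool)
marked = np.ones(n, dtype=bool)
apparent = []
for _ in range(years):
    alive &= rng.random(n) < true_survival     # survival process
    marked &= rng.random(n) < 1 - marker_loss  # annual marker loss
    apparent.append((alive & marked).mean())   # what the observer sees

ratios = [apparent[t + 1] / apparent[t] for t in range(years - 1)]
print("true annual survival:    ", true_survival)
print("apparent annual survival:", round(float(np.mean(ratios)), 3))
```

    An IPM that attributes this missing fraction of animals to demographic processes would absorb the bias into other parameters, which is the behaviour the abstract describes for immigration estimates.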

     
  2. Each T cell expresses only one T cell receptor (TCR), and a T cell clone is the collection of T cells sharing the same TCR sequence. Thus, the number of distinct T cell clones in an organism reflects the number of distinct TCRs that arise from recombination of the V(D)J gene segments during T cell development in the thymus. TCR diversity and, more specifically, the clone abundance distribution are important factors in immune function. Specific recombination patterns occur more frequently than others, while subsequent interactions between TCRs and self-antigens are known to trigger proliferation and sustain naive T cell survival. These processes are TCR-dependent, leading to clone-dependent thymic export and naive T cell proliferation rates. We describe the heterogeneous steady-state population of naive T cells (those that have not yet been antigenically triggered) using a mean-field model of a regulated birth-death-immigration process. After accounting for random sampling, we investigate how TCR-dependent heterogeneities in immigration and proliferation rates affect the shape of clone abundance distributions (the number of different clones that are represented by a specific number of cells, or "clone counts"). Using reasonable physiological parameter values and fitting predicted clone counts to experimentally sampled clone abundances, we show that realistic levels of heterogeneity in immigration rates cause very little change to predicted clone counts, but that modest heterogeneity in proliferation rates can generate the observed clone abundances. Our analysis provides constraints among physiological parameters that are necessary to yield predictions that qualitatively match the data. Assumptions of the model and potentially other important mechanistic factors are discussed.
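
    A hedged sketch of the kind of process described here (the per-step rates are assumptions of this illustration, not the paper's fitted parameters): simulate many independent clones under a discrete-time birth-death-immigration process and tabulate the clone abundance distribution.

```python
# Toy birth-death-immigration simulation of naive T cell clones:
# clone counts c_k = number of clones represented by exactly k cells.
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
n_clones, steps = 5000, 500
birth_p, death_p, imm_p = 0.095, 0.100, 0.005  # assumed; death > birth

sizes = np.zeros(n_clones, dtype=int)
for _ in range(steps):
    births = rng.binomial(sizes, birth_p)             # proliferation
    deaths = rng.binomial(sizes, death_p)
    imm = (rng.random(n_clones) < imm_p).astype(int)  # thymic export
    sizes = np.maximum(sizes + births - deaths + imm, 0)

# Heterogeneity in proliferation could be added by drawing birth_p
# per clone instead of using a shared value.
clone_counts = Counter(int(s) for s in sizes if s > 0)
for k in sorted(clone_counts)[:6]:
    print(f"clones of size {k}: {clone_counts[k]}")
```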
  3. Abstract We propose and analyze a family of epidemiological models that extend the classic Susceptible-Infectious-Recovered/Removed (SIR)-like framework to account for dynamic heterogeneity in infection risk. The family of models takes the form of a system of reaction–diffusion equations given populations structured by heterogeneous susceptibility to infection. These models describe the evolution of population-level macroscopic quantities S, I, R as in the classical case, coupled with a microscopic variable f, giving the distribution of individual behavior in terms of exposure to contagion in the population of susceptibles. The reaction terms represent the impact of sculpting the distribution of susceptibles by the infection process. The diffusion and drift terms that appear in a Fokker–Planck type equation represent the impact of behavior change both during and in the absence of an epidemic. We first study the mathematical foundations of this system of reaction–diffusion equations and prove a number of its properties. In particular, we show that the system will converge back to the unique equilibrium distribution after an epidemic outbreak. We then derive a simpler system by seeking self-similar solutions to the reaction–diffusion equations in the case of Gaussian profiles. Notably, these self-similar solutions lead to a system of ordinary differential equations including classic SIR-like compartments and a new feature: the average risk level in the remaining susceptible population. We show that the simplified system exhibits a rich dynamical structure during epidemics, including plateaus, shoulders, rebounds and oscillations. Finally, we offer perspectives and caveats on ways that this family of models can help interpret the non-canonical dynamics of emerging infectious diseases, including COVID-19.
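
    As an illustrative stand-in (this is not the paper's exact self-similar system; the form of the risk dynamics here is an assumption), the following sketch augments a classic SIR model with a variable m(t) for the average risk level of the remaining susceptibles:

```python
# Toy reduced model: the effective transmission rate is beta * m(t),
# where m(t) is the mean risk level of the susceptible pool; infection
# removes the riskiest susceptibles first (m falls), while behavior
# drift relaxes m back toward a baseline m0.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.3, 0.1  # assumed transmission / recovery rates
relax, m0 = 0.02, 1.0   # assumed relaxation rate and baseline risk

def rhs(t, y):
    S, I, R, m = y
    dS = -beta * m * S * I
    dI = beta * m * S * I - gamma * I
    dR = gamma * I
    dm = -beta * m * m * I + relax * (m0 - m)  # sculpting + relaxation
    return [dS, dI, dR, dm]

sol = solve_ivp(rhs, (0.0, 400.0), [0.999, 0.001, 0.0, m0], max_step=1.0)
S, I, R, m = sol.y[:, -1]
print(f"final size R = {R:.3f}, residual mean risk m = {m:.3f}")
```

    In models of this type, the interplay between depletion of m by infection and its relaxation toward baseline is what can produce plateau and rebound behavior of the kind the abstract describes.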
  4. Yang, Junyuan (Ed.)
    In this work, we develop a new set of Bayesian models to perform registration of real-valued functions. A Gaussian process prior is assigned to the parameter space of time warping functions, and a Markov chain Monte Carlo (MCMC) algorithm is utilized to explore the posterior distribution. While the proposed model can be defined on the infinite-dimensional function space in theory, dimension reduction is needed in practice because one cannot store an infinite-dimensional function on the computer. Existing Bayesian models often rely on some pre-specified, fixed truncation rule to achieve dimension reduction, either by fixing the grid size or the number of basis functions used to represent a functional object. In comparison, the new models in this paper randomize the truncation rule. Benefits of the new models include the ability to make inference on the smoothness of the functional parameters, a data-informative feature of the truncation rule, and the flexibility to control the amount of shape-alteration in the registration process. For instance, using both simulated and real data, we show that when the observed functions exhibit more local features, the posterior distribution on the warping functions automatically concentrates on a larger number of basis functions. Supporting materials including code and data to perform registration and reproduce some of the results presented herein are available online. 
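
    A minimal sketch of the random-truncation idea (the basis family and prior below are assumptions of this illustration): represent a warping function on [0, 1] by a truncated expansion whose truncation level K is itself drawn at random, so inference can adapt K to how wiggly the observed functions are.

```python
# Toy random-truncation representation of a time-warping function:
# a positive slope via exp(.) guarantees monotonicity, and
# normalization pins gamma(0) = 0 and gamma(1) = 1.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 201)

def warping_from_coeffs(coeffs):
    g = np.zeros_like(t)
    for k, c in enumerate(coeffs, start=1):
        g += c * np.sin(np.pi * k * t) / k  # smooth Fourier-type basis
    slope = np.exp(g)                       # positivity => monotone warp
    gamma = np.cumsum(slope)
    gamma -= gamma[0]
    return gamma / gamma[-1]

# One prior draw: a random truncation level K, then K Gaussian coefficients.
K = rng.geometric(0.3)  # assumed prior on K
coeffs = rng.normal(0.0, 0.5, size=K)
gamma = warping_from_coeffs(coeffs)
print(f"K = {K} basis functions; gamma(0.5) = {np.interp(0.5, t, gamma):.3f}")
```

    In a full MCMC implementation K would be sampled alongside the coefficients, which is how the posterior can concentrate on more basis functions when the observed functions exhibit more local features.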
  5. Abstract Particle filters avoid parametric estimates for Bayesian posterior densities, which alleviates Gaussian assumptions in nonlinear regimes. These methods, however, are more sensitive to sampling errors than Gaussian-based techniques such as ensemble Kalman filters. A recent study by the authors introduced an iterative strategy for particle filters that match posterior moments, where iterations improve the filter's ability to draw samples from non-Gaussian posterior densities. The iterations follow from a factorization of particle weights, providing a natural framework for combining particle filters with alternative filters to mitigate the impact of sampling errors. The current study introduces a novel approach to forming an adaptive hybrid data assimilation methodology, exploiting the theoretical strengths of nonparametric and parametric filters. At each data assimilation cycle, the iterative particle filter performs a sequence of updates while the prior sample distribution is non-Gaussian, then an ensemble Kalman filter provides the final adjustment when Gaussian distributions for marginal quantities are detected. The method employs the Shapiro–Wilk test, which has outstanding power for detecting departures from normality, to determine when to make the transition between filter algorithms. Experiments using low-dimensional models demonstrate that the approach has significant value, especially for nonhomogeneous observation networks and unknown model process errors. Moreover, hybrid factors are extended to consider marginals of more than one collocated variable using a test for multivariate normality. Findings from this study motivate the use of the proposed method for geophysical problems characterized by diverse observation networks and various dynamic instabilities, such as numerical weather prediction models.

    Significance Statement: Data assimilation statistically processes observation errors and model forecast errors to provide optimal initial conditions for the forecast, playing a critical role in numerical weather forecasting. The ensemble Kalman filter, which has been widely adopted and developed in many operational centers, assumes Gaussianity of the prior distribution and solves a linear system of equations, leading to bias in strongly nonlinear regimes. On the other hand, particle filters avoid many of those assumptions but are sensitive to sampling errors and are computationally expensive. We propose an adaptive hybrid strategy that combines their advantages and minimizes the disadvantages of the two methods. The hybrid particle filter–ensemble Kalman filter is achieved with the Shapiro–Wilk test, which detects the Gaussianity of the ensemble members and determines the timing of the transition between these filter updates. Demonstrations in this study show that the proposed method is advantageous when observations are heterogeneous and when the model has an unknown bias. Furthermore, by extending the statistical hypothesis test to a test for multivariate normality, we consider marginals of more than one collocated variable. These results encourage further testing on real geophysical problems characterized by various dynamic instabilities, such as real numerical weather prediction models.
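
    A toy scalar sketch of the gating idea (a simplified stand-in, not the authors' full iterative scheme): run the Shapiro–Wilk test on the prior ensemble and route the update to a particle-filter-style step when non-Gaussianity is detected, otherwise to an ensemble-Kalman-style step.

```python
# Hypothetical hybrid update for a directly observed scalar state:
# Shapiro-Wilk decides between a weight/resample (particle) step and
# a stochastic EnKF step.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(4)

def particle_update(ens, obs, obs_err):
    """Importance-weight the particles by the likelihood, then resample."""
    w = np.exp(-0.5 * ((ens - obs) / obs_err) ** 2)
    w /= w.sum()
    return rng.choice(ens, size=ens.size, p=w)

def enkf_update(ens, obs, obs_err):
    """Stochastic EnKF update with perturbed observations."""
    gain = np.var(ens) / (np.var(ens) + obs_err ** 2)  # Kalman gain
    return ens + gain * (obs + rng.normal(0, obs_err, ens.size) - ens)

def hybrid_update(ens, obs, obs_err, alpha=0.05):
    _, p = shapiro(ens)
    if p < alpha:  # departure from normality detected
        return particle_update(ens, obs, obs_err), "particle"
    return enkf_update(ens, obs, obs_err), "EnKF"

# Toy priors: one Gaussian ensemble, one strongly skewed ensemble.
for name, ens in [("gaussian", rng.normal(1.0, 0.5, 100)),
                  ("skewed", rng.gamma(1.2, 1.0, 100))]:
    _, branch = hybrid_update(ens, obs=1.0, obs_err=0.3)
    print(f"{name} prior -> {branch} update")
```

    The multivariate extension mentioned above would replace the univariate test with a multivariate normality test over collocated variables.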