

Title: Multivariate survival analysis in big data: A divide‐and‐combine approach
Abstract

Multivariate failure time data are frequently analyzed using marginal proportional hazards models and frailty models. When the sample size is extraordinarily large, either approach may face computational challenges. In this paper, we focus on the marginal model approach and propose a divide-and-combine method to analyze large-scale multivariate failure time data. Our method is motivated by the Myocardial Infarction Data Acquisition System (MIDAS), a New Jersey statewide database that includes 73,725,160 admissions to nonfederal hospitals and emergency rooms (ERs) from 1995 to 2017. We randomly divide the full data into multiple subsets, obtain an estimator from each subset, and combine the subset estimators using a weighted method with three choices of weights. Under mild conditions, we show that the combined estimator is asymptotically equivalent to the estimator obtained from the full data as if the data were analyzed all at once. In addition, to screen out risk factors with weak signals, we propose to perform regularized estimation on the combined estimator using its combined confidence distribution. Theoretical properties, such as consistency, oracle properties, and asymptotic equivalence between the divide-and-combine approach and the full-data approach, are studied. The performance of the proposed method is investigated in simulation studies. Our method is applied to the MIDAS data to identify risk factors related to multivariate cardiovascular-related health outcomes.
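For a concrete picture of the divide-and-combine step, the following is a minimal, hypothetical sketch, not the authors' implementation: it assumes independent (non-clustered) failure times, uses lifelines' CoxPHFitter on each random subset, and combines subset estimates with inverse-variance weights as one plausible weighting scheme. The paper's marginal models for multivariate failure times, its three specific weights, and the regularized estimation based on the combined confidence distribution are not reproduced here.

```python
# Minimal sketch of a divide-and-combine Cox regression (illustrative only).
# Assumptions: independent failure times, lifelines installed, and
# inverse-variance weighting as one plausible combination rule.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def divide_and_combine_cox(df, duration_col, event_col, n_subsets=10, seed=0):
    """Randomly split rows into subsets, fit a Cox model on each subset,
    and combine the coefficients by inverse-variance (precision) weighting."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    betas, precisions = [], []
    for chunk in np.array_split(idx, n_subsets):
        sub = df.iloc[chunk]
        cph = CoxPHFitter().fit(sub, duration_col=duration_col, event_col=event_col)
        betas.append(cph.params_.to_numpy())                       # subset estimate
        precisions.append(np.linalg.inv(cph.variance_matrix_.to_numpy()))  # inverse covariance
    total_prec = sum(precisions)
    combined = np.linalg.solve(total_prec, sum(P @ b for P, b in zip(precisions, betas)))
    combined_cov = np.linalg.inv(total_prec)    # covariance of the combined estimator
    return pd.Series(combined, index=cph.params_.index), combined_cov
```

Because each subset is processed independently, the loop can be run in parallel; only the subset coefficient vectors and their covariance matrices need to be collected for the combination step.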

 
Award ID(s):
2027855 2015373 1812048 1737857
NSF-PAR ID:
10397026
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
78
Issue:
3
ISSN:
0006-341X
Format(s):
Medium: X Size: p. 852-866
Sponsoring Org:
National Science Foundation
More Like this
  1. Monte Carlo algorithms, such as Markov chain Monte Carlo (MCMC) and Hamiltonian Monte Carlo (HMC), are routinely used for Bayesian inference; however, these algorithms are prohibitively slow in massive data settings because they require multiple passes through the full data in every iteration. Addressing this problem, we develop a scalable extension of these algorithms using the divide-and-conquer (D&C) technique: the data are divided into a sufficiently large number of subsets, parameters are drawn in parallel on the subsets using a powered likelihood, and Monte Carlo draws of the parameter are produced by combining the parameter draws obtained from each subset. The combined parameter draws play the role of draws from the original sampling algorithm. Our main contributions are twofold. First, we demonstrate through diverse simulated and real data analyses focusing on generalized linear models (GLMs) that our distributed algorithm delivers results comparable to the current state-of-the-art D&C algorithms in terms of statistical accuracy and computational efficiency. Second, providing theoretical support for our empirical observations, we identify regularity assumptions under which the proposed algorithm leads to asymptotically optimal inference. We also provide illustrative examples focusing on normal linear and logistic regressions where parts of our D&C algorithm are analytically tractable.
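As a toy illustration of the powered-likelihood idea (hypothetical code, not the authors' algorithm), consider a normal mean with known variance: raising each subset's likelihood to the power K, the number of subsets, gives every subset posterior roughly the full-data spread, and one simple combination rule is to average the sorted draws across subsets (the one-dimensional Wasserstein-barycenter rule). Other combination rules are possible.

```python
# Toy divide-and-conquer sampler with a powered likelihood
# (normal mean, known variance, conjugate prior). Illustrative only.
import numpy as np

def dac_powered_normal(y, sigma2=1.0, tau2=10.0, n_subsets=20, n_draws=5000, seed=0):
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(y), n_subsets)
    draws = []
    for s in subsets:
        # Subset posterior with the likelihood raised to the power K:
        # precision = 1/tau2 + K*m/sigma2, mean = (K*sum(s)/sigma2) / precision.
        K, m = n_subsets, len(s)
        prec = 1.0 / tau2 + K * m / sigma2
        mean = (K * s.sum() / sigma2) / prec
        draws.append(np.sort(rng.normal(mean, np.sqrt(1.0 / prec), n_draws)))
    # Combine: average order statistics across subsets (1-D Wasserstein barycenter).
    return np.mean(draws, axis=0)

# Usage: the combined draws approximate the full-data posterior of the mean.
y = np.random.default_rng(1).normal(2.0, 1.0, 100_000)
post = dac_powered_normal(y)
print(post.mean(), post.std())
```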

     
  2. Identifying the factors that structure host–parasite interactions is fundamental to understanding the drivers of species distributions and to predicting novel cross-species transmission events. More phylogenetically related host species tend to have more similar parasite associations, but parasite specificity may vary as a function of transmission mode, parasite taxonomy or life history. Accordingly, analyses that attempt to infer host–parasite associations using combined data on different parasite groups may perform quite differently relative to analyses on each parasite subset. In essence, are more data always better when predicting host–parasite associations, or does parasite taxonomic resolution matter? Here, we explore how taxonomic resolution affects predictive models of host–parasite associations using the London Natural History Museum's database of host–helminth interactions. Using boosted regression trees, we demonstrate that taxon-specific models (i.e. of Acanthocephalans, Nematodes and Platyhelminthes) consistently outperform full models in predicting mammal–helminth associations. At finer spatial resolutions, full and taxon-specific model performance does not vary, suggesting tradeoffs between phylogenetic and spatial scales of analysis. Although all models identify similar host and parasite covariates as important to such patterns, our results emphasize the importance of phylogenetic scale in the study of host–parasite interactions and suggest that using taxonomic subsets of data may improve predictions of parasite distributions and cross-species transmission. Predictive models of host–pathogen interactions should thus attempt to encompass the spatial resolution and phylogenetic scale desired for inference and prediction and potentially use model averaging or ensemble models to combine predictions from separately trained models.
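A schematic of the full-model versus taxon-specific comparison might look like the following sketch. Column names are hypothetical and scikit-learn's gradient boosting is used as a generic stand-in for the boosted regression trees; this is not the study's pipeline.

```python
# Schematic comparison of one pooled model versus per-taxon models for
# predicting host-parasite associations. Column names are hypothetical.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def compare_full_vs_taxon_models(df, feature_cols, label_col="interacts",
                                 taxon_col="parasite_taxon"):
    """Return cross-validated AUC for a full model and for taxon-specific models."""
    scores = {}
    full = GradientBoostingClassifier()
    scores["full"] = cross_val_score(full, df[feature_cols], df[label_col],
                                     scoring="roc_auc", cv=5).mean()
    for taxon, sub in df.groupby(taxon_col):
        model = GradientBoostingClassifier()
        scores[taxon] = cross_val_score(model, sub[feature_cols], sub[label_col],
                                        scoring="roc_auc", cv=5).mean()
    return scores
```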
  3. Density estimation is a building block for many other statistical methods, such as classification, nonparametric testing, and data compression. In this paper, we focus on a nonparametric approach to multivariate density estimation, and study its asymptotic properties under both frequentist and Bayesian settings. The estimated density function is obtained by considering a sequence of approximating spaces to the space of densities. These spaces consist of piecewise constant density functions supported by binary partitions with increasing complexity. To obtain an estimate, the partition is learned by maximizing either the likelihood of the corresponding histogram on that partition, or the marginal posterior probability of the partition under a suitable prior. We analyze the convergence rate of the maximum likelihood estimator and the posterior concentration rate of the Bayesian estimator, and conclude that for a relatively rich class of density functions the rate does not directly depend on the dimension. We also show that the Bayesian method can adapt to the unknown smoothness of the density function. The method is applied to several specific function classes and explicit rates are obtained. These include spatially sparse functions, functions of bounded variation, and Hölder continuous functions. We also introduce an ensemble approach, obtained by aggregating multiple density estimates fit under carefully designed perturbations, and show that for density functions lying in a Hölder space (H^(1,β), 0 < β ≤ 1), the ensemble method can achieve minimax convergence rate up to a logarithmic term, while the corresponding rate of the density estimator based on a single partition is suboptimal for this function class.
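A minimal one-dimensional sketch of the likelihood-driven binary-partition idea (hypothetical code, not the paper's multivariate or Bayesian procedure): starting from a single interval, a cell is split at its midpoint whenever the split increases the histogram log-likelihood by more than a BIC-style penalty.

```python
# 1-D sketch of density estimation by recursive binary partitioning:
# split a cell at its midpoint when the gain in histogram log-likelihood
# exceeds a penalty. Illustrative only.
import numpy as np

def cell_loglik(n_cell, width, n_total):
    """Log-likelihood contribution of a cell at its MLE height n_cell/(n_total*width)."""
    return 0.0 if n_cell == 0 else n_cell * np.log(n_cell / (n_total * width))

def binary_partition_density(x, lo=0.0, hi=1.0, max_depth=12, penalty=None):
    """Return a list of (left, right, height) cells forming a piecewise-constant density."""
    x = np.asarray(x)
    n = len(x)
    if penalty is None:
        penalty = 0.5 * np.log(n)           # BIC-style cost per extra cell
    def split(lo, hi, pts, depth):
        mid = 0.5 * (lo + hi)
        left, right = pts[pts < mid], pts[pts >= mid]
        gain = (cell_loglik(len(left), mid - lo, n)
                + cell_loglik(len(right), hi - mid, n)
                - cell_loglik(len(pts), hi - lo, n))
        if depth == 0 or gain <= penalty:
            return [(lo, hi, len(pts) / (n * (hi - lo)))]
        return split(lo, mid, left, depth - 1) + split(mid, hi, right, depth - 1)
    return split(lo, hi, x, max_depth)

# Usage: cells = binary_partition_density(np.random.default_rng(0).beta(2, 5, 10_000))
```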
    more » « less
  4. Background: Multivariate pattern analysis (MVPA or pattern decoding) has attracted considerable attention as a sensitive analytic tool for investigations using functional magnetic resonance imaging (fMRI) data. With the introduction of MVPA, however, has come a proliferation of methodological choices confronting the researcher, with few studies to date offering guidance from the vantage point of controlled datasets detached from specific experimental hypotheses. New method: We investigated the impact of four data processing steps on support vector machine (SVM) classification performance aimed at maximizing information capture in the presence of common noise sources. The four techniques included: trial averaging (classifying on separate trial estimates versus condition-based averages), within-run mean centering (centering the data or not), method of cost selection (using a fixed or tuned cost value), and motion-related denoising approach (comparing no denoising versus a variety of nuisance regressions capturing motion-related reference signals). The impact of these approaches was evaluated on real fMRI data from two control ROIs, as well as on simulated pattern data constructed with carefully controlled voxel- and trial-level noise components. Results: We find significant improvements in classification performance across both real and simulated datasets with run-wise trial averaging and mean centering. When averaging trials within conditions of each run, we note a simultaneous increase in the between-subject variability of SVM classification accuracies, which we attribute to the reduced size of the test set used to assess the classifier's prediction error. Therefore, we propose a hybrid technique whereby randomly sampled subsets of trials are averaged per run and demonstrate that it helps mitigate the tradeoff between improving signal-to-noise ratio by averaging and losing exemplars in the test set. Comparison with existing methods: Though a handful of empirical studies have employed run-based trial averaging, mean centering, or their combination, such studies have done so without theoretical justification or rigorous testing using control ROIs. Conclusions: Therefore, we intend this study to serve as a practical guide for researchers wishing to optimize pattern decoding without risk of introducing spurious results.
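The hybrid averaging scheme described above can be sketched roughly as follows (hypothetical variable names and a generic scikit-learn linear SVM, not the study's pipeline): within each run and condition, trials are randomly grouped into small subsets, each subset is averaged into a pseudo-trial, patterns are mean-centered within run, and accuracy is assessed with leave-one-run-out cross-validation at a fixed cost value.

```python
# Sketch of run-wise subset averaging and within-run mean centering before SVM decoding.
# Generic scikit-learn stand-in; variable names are hypothetical.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def average_trial_subsets(X, y, runs, trials_per_average=4, seed=0):
    """Average randomly sampled subsets of trials within each run and condition."""
    rng = np.random.default_rng(seed)
    Xa, ya, ra = [], [], []
    for run in np.unique(runs):
        run_mask = runs == run
        X_run = X[run_mask] - X[run_mask].mean(axis=0)      # within-run mean centering
        y_run = y[run_mask]
        for cond in np.unique(y_run):
            idx = rng.permutation(np.where(y_run == cond)[0])
            for chunk in np.array_split(idx, max(1, len(idx) // trials_per_average)):
                Xa.append(X_run[chunk].mean(axis=0))        # one pseudo-trial per subset
                ya.append(cond)
                ra.append(run)
    return np.array(Xa), np.array(ya), np.array(ra)

# Usage with hypothetical trial-by-voxel data X, condition labels y, run labels runs:
# Xa, ya, ra = average_trial_subsets(X, y, runs)
# acc = cross_val_score(SVC(kernel="linear", C=1.0), Xa, ya,
#                       cv=LeaveOneGroupOut(), groups=ra).mean()
```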