This content will become publicly available on February 4, 2026

Title: DsubCox: a fast subsampling algorithm for Cox model with distributed and massive survival data
Abstract: To ensure privacy protection and alleviate computational burden, we propose a fast subsampling procedure for the Cox model with massive survival datasets from multi-centered, decentralized sources. The proposed estimator is computed from optimal subsampling probabilities that we derive, and it allows subsample-based summary-level statistics to be transmitted between storage sites in a single round of communication. For inference, the asymptotic properties of the proposed estimator are rigorously established. An extensive simulation study demonstrates that the proposed approach is effective. The methodology is applied to a large dataset from U.S. airlines.
Award ID(s): 2105571
PAR ID: 10596303
Author(s) / Creator(s): ; ;
Publisher / Repository: De Gruyter
Date Published:
Journal Name: The International Journal of Biostatistics
ISSN: 2194-573X
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Summary A within-cluster resampling method is proposed for fitting a multilevel model in the presence of informative cluster size. Our method is based on the idea of removing the information in the cluster sizes by drawing bootstrap samples which contain a fixed number of observations from each cluster. We then estimate the parameters by maximizing an average, over the bootstrap samples, of a suitable composite loglikelihood. The consistency of the proposed estimator is shown and does not require that the correct model for cluster size is specified. We give an estimator of the covariance matrix of the proposed estimator, and a test for the noninformativeness of the cluster sizes. A simulation study shows, as in Neuhaus & McCulloch (2011), that the standard maximum likelihood estimator exhibits little bias for some regression coefficients. However, for those parameters which exhibit nonnegligible bias, the proposed method is successful in correcting for this bias. 
  2. Summary Panel count data arise when the number of recurrent events experienced by each subject is observed intermittently at discrete examination times. The examination time process can be informative about the underlying recurrent event process even after conditioning on covariates. We consider a semiparametric accelerated mean model for the recurrent event process and allow the two processes to be correlated through a shared frailty. The regression parameters have a simple marginal interpretation of modifying the time scale of the cumulative mean function of the event process. A novel estimation procedure for the regression parameters and the baseline rate function is proposed based on a conditioning technique. In contrast to existing methods, the proposed method is robust in the sense that it requires neither the strong Poisson-type assumption for the underlying recurrent event process nor a parametric assumption on the distribution of the unobserved frailty. Moreover, the distribution of the examination time process is left unspecified, allowing for arbitrary dependence between the two processes. Asymptotic consistency of the estimator is established, and the variance of the estimator is estimated by a model-based smoothed bootstrap procedure. Numerical studies demonstrated that the proposed point estimator and variance estimator perform well with practical sample sizes. The methods are applied to data from a skin cancer chemoprevention trial. 
  3. Abstract A constrained multivariate linear model is a multivariate linear model with the columns of its coefficient matrix constrained to lie in a known subspace. This class of models includes those typically used to study growth curves and longitudinal data. Envelope methods have been proposed to improve the estimation efficiency in unconstrained multivariate linear models, but have not yet been developed for constrained models. We pursue that development in this article. We first compare the standard envelope estimator with the standard estimator arising from a constrained multivariate model in terms of bias and efficiency. To further improve efficiency, we propose a novel envelope estimator based on a constrained multivariate model. We show the advantage of our proposals by simulations and by studying the probiotic capacity to reduce Salmonella infection.
  4. Abstract We consider high‐dimensional inference for potentially misspecified Cox proportional hazard models based on low‐dimensional results by Lin and Wei (1989). A desparsified Lasso estimator is proposed based on the log partial likelihood function and shown to converge to a pseudo‐true parameter vector. Interestingly, the sparsity of the true parameter can be inferred from that of the above limiting parameter. Moreover, each component of the above (nonsparse) estimator is shown to be asymptotically normal with a variance that can be consistently estimated even under model misspecifications. In some cases, this asymptotic distribution leads to valid statistical inference procedures, whose empirical performances are illustrated through numerical examples. 
  5. We develop an analytical framework to appropriately model and adequately analyze A/B tests in the presence of nonparametric nonstationarities in the targeted business metrics. A/B tests, also known as online randomized controlled experiments, have been used at scale by data-driven enterprises to guide decisions and test innovative ideas to improve core business metrics. Meanwhile, nonstationarities, such as the time-of-day effect and the day-of-week effect, can often arise nonparametrically in key business metrics involving purchases, revenue, conversions, customer experiences, and so on. First, we develop a generic nonparametric stochastic model to capture nonstationarities in A/B test experiments, where each sample represents a visit or action associated with a time label. We build a practically relevant limiting regime to facilitate analyzing large-sample estimator performances under nonparametric nonstationarities. Second, we show that ignoring or inadequately addressing nonstationarities can cause standard A/B test estimators to have suboptimal variance and nonvanishing bias, therefore leading to loss of statistical efficiency and accuracy. We provide a new estimator that views time as a continuous stratum and performs poststratification with a data-dependent number of stratification levels. Without making parametric assumptions, we prove a central limit theorem for the proposed estimator and show that the estimator attains the best achievable asymptotic variance and is asymptotically unbiased. Third, we propose a time-grouped randomization that is designed to balance treatment and control assignments at granular time scales. We show that when the time-grouped randomization is integrated into standard experimental designs to generate experiment data, simple A/B test estimators can achieve asymptotically optimal variance. A brief account of numerical experiments is given to illustrate the analysis.
     This paper was accepted by Baris Ata, stochastic models and simulation. Supplemental Material: The online appendices and data files are available at https://doi.org/10.1287/mnsc.2022.01205.