Censored survival data are common in clinical trial studies. We propose a unified framework for sensitivity analysis to censoring at random in survival data using multiple imputation and martingale, called SMIM. The proposed framework adopts the δ‐adjusted and control‐based models, indexed by the sensitivity parameter, entailing censoring at random and a wide collection of censoring not at random assumptions. Also, it targets a broad class of treatment effect estimands defined as functionals of treatment‐specific survival functions, taking into account missing data due to censoring. Multiple imputation facilitates the use of simple full‐sample estimation; however, the standard Rubin's combining rule may overestimate the variance for inference in the sensitivity analysis framework. We decompose the multiple imputation estimator into a martingale series based on the sequential construction of the estimator and propose the wild bootstrap inference by resampling the martingale series. The new bootstrap inference has a theoretical guarantee for consistency and is computationally efficient compared to the nonparametric bootstrap counterpart. We evaluate the finite‐sample performance of the proposed SMIM through simulation and an application on an HIV clinical trial.
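For reference, the standard Rubin's combining rule mentioned above can be sketched as follows. This is a minimal illustration of the textbook rule only, not the SMIM estimator or its wild bootstrap; the function name `rubin_combine` and its arguments are hypothetical and chosen for illustration.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine M imputed-data point estimates and their variance estimates.

    estimates: point estimate from each imputed dataset
    variances: estimated variance of each of those point estimates
    Returns (pooled_estimate, total_variance) under the standard Rubin's rule.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)        # average of the M point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (divisor M - 1)

    # Total variance: within + (1 + 1/M) * between
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


if __name__ == "__main__":
    est, var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(est, var)
```

The between-imputation term is what the abstract suggests may lead to overestimated variance in the sensitivity analysis setting, which motivates the alternative bootstrap inference proposed there.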
Missing data is inevitable in longitudinal clinical trials. Conventionally, missingness is handled under the missing at random assumption, which, however, cannot be verified empirically. Thus, sensitivity analyses are critically important to assess the robustness of the study conclusions against untestable assumptions. Toward this end, regulatory agencies and the pharmaceutical industry use sensitivity models such as return-to-baseline, control-based, and washout imputation, following the ICH E9(R1) guidance. Multiple imputation is popular in sensitivity analyses; however, it may be inefficient and yield unsatisfactory interval estimates under Rubin’s combining rule. We propose distributional imputation in sensitivity analysis, which imputes each missing value by samples from its target imputation model given the observed data. Drawing on the idea of Monte Carlo integration, the distributional imputation estimator solves the mean estimating equations of the imputed dataset. It is fully efficient with theoretical guarantees. Moreover, we propose a weighted bootstrap to obtain a consistent variance estimator, taking into account the variabilities due to model parameter estimation and target parameter estimation. The superiority of the distributional imputation framework is validated in a simulation study and an antidepressant longitudinal clinical trial.
- NSF-PAR ID:
- Publisher / Repository:
- SAGE Publications
- Date Published:
- Journal Name:
- Statistical Methods in Medical Research
- Medium: X Size: p. 181-194
- Sponsoring Org:
- National Science Foundation
More Like this
The human microbiome, which is linked to various diseases by growing evidence, has a profound impact on human health. Since changes in the composition of the microbiome across time are associated with disease and clinical outcomes, microbiome analysis should be performed in a longitudinal study. However, due to limited sample sizes and differing numbers of timepoints for different subjects, a significant amount of data cannot be utilized, directly affecting the quality of analysis results. Deep generative models have been proposed to address this lack of data issue. Specifically, a generative adversarial network (GAN) has been successfully utilized for data augmentation to improve prediction tasks. Recent studies have also shown improved performance of GAN-based models for missing value imputation in a multivariate time series dataset compared with traditional imputation methods.
This work proposes DeepMicroGen, a bidirectional recurrent neural network-based GAN model, trained on the temporal relationship between the observations, to impute the missing microbiome samples in longitudinal studies. DeepMicroGen outperforms standard baseline imputation methods, showing the lowest mean absolute error for both simulated and real datasets. Finally, the proposed model improved the predicted clinical outcome for allergies, by providing imputation for an incomplete longitudinal dataset used to train the classifier.
Availability and implementation
DeepMicroGen is publicly available at https://github.com/joungmin-choi/DeepMicroGen.
We consider the situation where there is a known regression model that can be used to predict an outcome, Y, from a set of predictor variables X. A new variable B is expected to enhance the prediction of Y. A dataset of size n containing Y, X and B is available, and the challenge is to build an improved model for Y | X, B that uses both the available individual level data and some summary information obtained from the known model for Y | X. We propose a synthetic data approach, which consists of creating m additional synthetic data observations, and then analyzing the combined dataset of size n + m to estimate the parameters of the Y | X, B model. This combined dataset of size n + m now has missing values of B for m of the observations, and is analyzed using methods that can handle missing data (e.g., multiple imputation). We present simulation studies and illustrate the method using data from the Prostate Cancer Prevention Trial. Though the synthetic data method is applicable to a general regression context, to provide some justification, we show in two special cases that the asymptotic variances of the parameter estimates in the Y | X, B model are identical to those from an alternative constrained maximum likelihood estimation approach. This correspondence in special cases and the method's broad applicability makes it appealing for use across diverse scenarios. The Canadian Journal of Statistics 47: 580–603; 2019 © 2019 Statistical Society of Canada
Global surface water classification layers, such as the European Joint Research Centre’s (JRC) Monthly Water History dataset, provide a starting point for accurate and large scale analyses of trends in waterbody extents. On the local scale, there is an opportunity to increase the accuracy and temporal frequency of these surface water maps by using locally trained classifiers and gap-filling missing values via imputation in all available satellite images. We developed the Surface Water IMputation (SWIM) classification framework using R and the Google Earth Engine computing platform to improve water classification compared to the JRC study. The novel contributions of the SWIM classification framework include (1) a cluster-based algorithm to improve classification sensitivity to a variety of surface water conditions and produce approximately unbiased estimation of surface water area, (2) a method to gap-fill every available Landsat image for a region of interest to generate submonthly classifications at the highest possible temporal frequency, and (3) an outlier detection method for identifying images that contain classification errors due to failures in cloud masking. Validation and several case studies demonstrate the SWIM classification framework outperforms the JRC dataset in spatiotemporal analyses of small waterbody dynamics with previously unattainable sensitivity and temporal frequency. Most importantly, this study shows that reliable surface water classifications can be obtained for all pixels in every available Landsat image, even those containing cloud cover, after performing gap-fill imputation. By using this technique, the SWIM framework supports monitoring water extent on a submonthly basis, which is especially applicable to assessing the impact of short-term flood and drought events. Additionally, our results contribute to addressing the challenges of training machine learning classifiers with biased ground truth data and identifying images that contain regions of anomalous classification errors.
In causal inference problems, one is often tasked with estimating causal effects which are analytically intractable functionals of the data‐generating mechanism. Relevant settings include estimating intention‐to‐treat effects in longitudinal problems with missing data or computing direct and indirect effects in mediation analysis. One approach to computing these effects is to use the g‐formula implemented via Monte Carlo integration; when simulation‐based methods such as the nonparametric bootstrap or Markov chain Monte Carlo are used for inference, Monte Carlo integration must be nested within an already computationally intensive algorithm. We develop a widely‐applicable approach to accelerating this Monte Carlo integration step which greatly reduces the computational burden of existing g‐computation algorithms. We refer to our method as accelerated g‐computation (AGC). The algorithms we present are similar in spirit to multiple imputation, but require removing within‐imputation variance from the standard error rather than adding it. We illustrate the use of AGC on a mediation analysis problem using a beta regression model and in a longitudinal clinical trial subject to nonignorable missingness using a Bayesian additive regression trees model.