
Title: Sampling‐based estimation for massive survival data with additive hazards model

For massive survival data, we propose a subsampling algorithm that efficiently approximates the regression parameter estimates in the additive hazards model. We establish the consistency and asymptotic normality of the subsample‐based estimator given the full data. The optimal subsampling probabilities are obtained by minimizing the asymptotic variance of the resulting estimator. The subsample‐based procedure substantially reduces the computational cost compared with the full-data method. In numerical simulations, our method has low bias and satisfactory coverage probabilities. We provide an illustrative example on the survival analysis of patients with lymphoma from the Surveillance, Epidemiology, and End Results Program.
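
As a rough illustration of the idea in this abstract, the sketch below fits an additive hazards model, lambda(t | Z) = lambda0(t) + beta'Z, on a full simulated data set and on an inverse-probability-weighted subsample. It rests on several assumptions that are not taken from the abstract: the closed-form estimating-equation estimator usually attributed to Lin and Ying is used as a stand-in, covariates are time-independent, and uniform subsampling probabilities replace the paper's optimal probabilities, which the abstract does not state. The names `lin_ying_estimate` and `simulate` are hypothetical.

```python
import numpy as np


def lin_ying_estimate(obs_time, event, Z, weights=None):
    """Weighted closed-form estimator for lambda(t | Z) = lambda0(t) + beta'Z.

    Assumes time-independent covariates. With weights=None every observation
    gets weight 1, which gives the full-data estimator.
    """
    obs_time = np.asarray(obs_time, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n = len(obs_time)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    order = np.argsort(obs_time)
    t, d = obs_time[order], np.asarray(event, dtype=float)[order]
    Z, w = Z[order], w[order]

    # Reverse cumulative sums give weighted risk-set totals: in sorted order,
    # the subjects at risk on the interval ending at t[k] are k, ..., n-1.
    S0 = np.cumsum(w[::-1])[::-1]
    S1 = np.cumsum((w[:, None] * Z)[::-1], axis=0)[::-1]
    ZZ = Z[:, :, None] * Z[:, None, :]
    S2 = np.cumsum((w[:, None, None] * ZZ)[::-1], axis=0)[::-1]

    # A = sum_i w_i * integral of Y_i(t) (Z_i - Zbar(t))(Z_i - Zbar(t))' dt,
    # with Zbar(t) constant between consecutive observed times.
    dt = np.diff(t, prepend=0.0)
    centered = S2 - S1[:, :, None] * S1[:, None, :] / S0[:, None, None]
    A = np.einsum("k,kij->ij", dt, centered)

    # B = sum_i w_i * delta_i * (Z_i - Zbar(X_i)). Tied times (duplicates from
    # sampling with replacement) share the risk set of the first tied index.
    first = np.searchsorted(t, t, side="left")
    Zbar = (S1 / S0[:, None])[first]
    B = ((w * d)[:, None] * (Z - Zbar)).sum(axis=0)
    return np.linalg.solve(A, B)


def simulate(n, beta, rng, baseline=0.5, censor_rate=0.3):
    """Simulate right-censored data with constant hazard baseline + Z @ beta."""
    Z = rng.uniform(size=(n, len(beta)))
    T = rng.exponential(1.0 / (baseline + Z @ beta))
    C = rng.exponential(1.0 / censor_rate, size=n)
    return np.minimum(T, C), (T <= C).astype(float), Z


rng = np.random.default_rng(0)
beta_true = np.array([1.0, 0.5])
obs_time, event, Z = simulate(20_000, beta_true, rng)
beta_full = lin_ying_estimate(obs_time, event, Z)

# Placeholder: uniform subsampling probabilities (NOT the paper's optimal ones).
n, r = len(obs_time), 2_000
probs = np.full(n, 1.0 / n)
idx = rng.choice(n, size=r, replace=True, p=probs)
beta_sub = lin_ying_estimate(
    obs_time[idx], event[idx], Z[idx], weights=1.0 / (r * probs[idx])
)
print("full data:", beta_full, "subsample:", beta_sub)
```

Swapping `probs` for any other probability vector leaves the rest of the code unchanged, because the inverse-probability weights are what keep the subsample estimating equation aligned with the full-data one.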

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistics in Medicine
Page Range / eLocation ID:
p. 441-450
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose an improved estimation method for logistic regression based on subsamples taken according to the optimal subsampling probabilities developed in Wang et al. (2018). Both asymptotic results and numerical results show that the new estimator has a higher estimation efficiency. We also develop a new algorithm based on Poisson subsampling, which does not require approximating the optimal subsampling probabilities all at once. This is computationally advantageous when available random-access memory is not enough to hold the full data. Interestingly, asymptotic distributions also show that Poisson subsampling produces a more efficient estimator if the sampling ratio, the ratio of the subsample size to the full data sample size, does not converge to zero. We also obtain the unconditional asymptotic distribution for the estimator based on Poisson subsampling. Pilot estimators are required to calculate subsampling probabilities and to correct biases in unweighted estimators; interestingly, even if pilot estimators are inconsistent, the proposed method still produces consistent and asymptotically normal estimators.
  2. This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large data sets. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points that minimize the asymptotic mean squared errors of the general estimator and linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates using the subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
  3. Summary: We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed, and the asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities in the linearly transformed parameter estimation, which scales well to make use of available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
  4. Abstract

    To tackle massive data, subsampling is a practical approach for selecting the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling approach under measurement constraints was developed to meet this challenge. This method uses the inverses of the optimal sampling probabilities to reweight the objective function, which assigns smaller weights to the more important data points. Thus, the estimation efficiency of the resulting estimator leaves room for improvement. In this paper, we propose an unweighted estimating procedure based on optimal subsamples to obtain a more efficient estimator. We obtain the unconditional asymptotic distribution of the estimator via martingale techniques without conditioning on the pilot estimate, which has been less investigated in the existing subsampling literature. Both asymptotic results and numerical results show that the unweighted estimator is more efficient in parameter estimation.

  5. Abstract

    Subsampling has seen a resurgence in the big data era where the standard, full-resample size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size b can be computationally challenging with both b and the sample size n being very large. This paper shows how a set of appropriately chosen, nonrandom subsamples can be used to conduct effective, and computationally feasible, subsampling distribution estimation. Furthermore, the same set of subsamples can be used to yield a procedure for subsampling aggregation, also known as subagging, that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same, or better, rate of convergence than that of $\hat{\theta}_n$. Statistical inference could then be based on the scalable subagging estimator instead of the original $\hat{\theta}_n$.

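
The subagging idea in the last abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the nonrandom subsamples are taken to be consecutive blocks of size `b` whose starting points are `h` apart, the estimator is the sample median, and the function name `block_subagging` is hypothetical. The abstract does not specify how the subsamples are chosen or tuned.

```python
import numpy as np


def block_subagging(x, b, h, estimator=np.median):
    """Average an estimator over deterministic blocks of the data.

    Block j covers x[j*h : j*h + b]; h == b gives non-overlapping blocks and
    h < b gives overlapping ones. This block layout is an assumption made
    for illustration only.
    """
    x = np.asarray(x)
    if b > len(x):
        raise ValueError("block size b must not exceed the sample size")
    q = (len(x) - b) // h + 1  # number of complete blocks
    stats = np.array([estimator(x[j * h : j * h + b]) for j in range(q)])
    return stats.mean(), stats


rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Each block statistic only touches b observations, so no random index
# generation over the full sample is needed.
estimate, block_stats = block_subagging(data, b=1_000, h=1_000)
print(estimate, len(block_stats))
```

The spread of `block_stats` is the raw material for a subsampling distribution estimate, while their average is the aggregated (subagging) estimate.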