Title: Confidence intervals for the Cox model test error from cross‐validation
Summary: Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. Standard confidence intervals for the test error based on CV estimates can have coverage below the nominal level. This happens because each sample is used in both the training and testing procedures during CV, so the CV estimates of the errors become correlated; without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is to instead estimate the mean squared error of the prediction error using nested CV, an approach that has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
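To make the nested-CV mechanism concrete, here is a minimal sketch of the general idea, assuming a stand-in ridge regression with squared-error loss in place of the Cox model and a rough analogue of the MSE estimate; it is not the paper's exact estimator.

```python
# Minimal sketch of nested CV for a test-error interval (hypothetical
# simplification): ridge regression with squared-error loss stands in
# for the Cox model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def holdout_errors(X_tr, y_tr, X_te, y_te):
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return (y_te - model.predict(X_te)) ** 2

K = 5
all_errs, gaps = [], []
for tr, te in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Inner CV on the training portion gives the usual CV error estimate.
    inner = [holdout_errors(X[tr[itr]], y[tr[itr]], X[tr[ite]], y[tr[ite]])
             for itr, ite in KFold(K - 1, shuffle=True, random_state=1).split(tr)]
    e_in = np.concatenate(inner).mean()
    # Outer holdout errors of the model trained on the whole training portion.
    e_out = holdout_errors(X[tr], y[tr], X[te], y[te])
    all_errs.append(e_out)
    # Squared gap between inner estimate and outer holdout mean, debiased by
    # the sampling noise of the outer fold: a handle on the MSE of CV itself.
    gaps.append((e_in - e_out.mean()) ** 2 - e_out.var(ddof=1) / len(e_out))

errs = np.concatenate(all_errs)
est = errs.mean()
mse_hat = max(np.mean(gaps), errs.var(ddof=1) / n)  # widened variance proxy
half = 1.96 * np.sqrt(mse_hat)
print(f"test error {est:.3f}, interval ({est - half:.3f}, {est + half:.3f})")
```

The key point is that the interval half-width uses an estimate of the MSE of the CV point estimate rather than the naive (too small) variance of the pooled fold errors.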
Award ID(s):
2113389
PAR ID:
10552981
Author(s) / Creator(s):
Publisher / Repository:
Statistics in Medicine
Date Published:
Journal Name:
Statistics in Medicine
Volume:
42
Issue:
25
ISSN:
0277-6715
Page Range / eLocation ID:
4532 to 4541
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many machine learning models have tuning parameters to be determined from the training data, and cross-validation (CV) is perhaps the most commonly used method for selecting them. This work concerns the problem of estimating the generalization error of a CV-tuned predictive model. We propose an honest leave-one-out cross-validation framework that produces a nearly unbiased estimator of the post-tuning generalization error. Using the kernel support vector machine and kernel logistic regression as examples, we demonstrate that honest leave-one-out cross-validation performs competitively even against the state-of-the-art .632+ estimator.
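A minimal sketch of the honest leave-one-out idea, assuming scikit-learn's kernel SVM and a toy parameter grid: the tuning step is repeated inside every leave-one-out split, so the held-out point never influences tuning.

```python
# Hedged sketch of honest leave-one-out CV: the tuning parameter is
# re-selected within every leave-one-out split. Data and grid are
# illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]}

errors = []
for tr, te in LeaveOneOut().split(X):
    # Tune by inner CV on the n-1 training points only.
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X[tr], y[tr])
    errors.append(search.predict(X[te])[0] != y[te][0])

print(f"honest LOO estimate of post-tuning error: {np.mean(errors):.3f}")
```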
  2. We propose a method for constructing confidence intervals that account for many forms of spatial correlation. The interval has the familiar “estimator plus and minus a standard error times a critical value” form, but we propose new methods for constructing the standard error and the critical value. The standard error is constructed using population principal components from a given “worst‐case” spatial correlation model. The critical value is chosen to ensure coverage in a benchmark parametric model for the spatial correlations. The method is shown to control coverage in finite sample Gaussian settings in a restricted but nonparametric class of models and in large samples whenever the spatial correlation is weak, that is, with average pairwise correlations that vanish as the sample size gets large. We also provide results on the efficiency of the method. 
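A schematic of the two-ingredient construction, assuming an exponential "worst-case" correlation model and a Monte Carlo calibrated critical value in place of the paper's population principal-components and benchmark-model machinery:

```python
# Hedged sketch of a spatially robust CI for a mean. The SE comes from an
# assumed worst-case exponential covariance (the range 0.2 is a guess),
# and the critical value is calibrated by simulation under that benchmark
# Gaussian model, mimicking the "estimate +/- cv * se" form.
import numpy as np

rng = np.random.default_rng(0)
n = 100
coords = rng.uniform(size=(n, 2))          # observation locations
dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
Sigma = np.exp(-dists / 0.2)               # worst-case correlation model

y = rng.multivariate_normal(np.zeros(n), Sigma)  # toy spatial data
w = np.full(n, 1.0 / n)                    # weights of the sample mean
se = np.sqrt(w @ Sigma @ w)                # SE under the worst-case model

# Calibrate the critical value so coverage holds in the benchmark model.
sims = rng.multivariate_normal(np.zeros(n), Sigma, size=2000)
t_stats = np.abs(sims.mean(axis=1)) / se
cv = np.quantile(t_stats, 0.95)

est = y.mean()
print(f"95% CI: ({est - cv * se:.3f}, {est + cv * se:.3f})")
```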
  3. The accurate estimation of prediction errors in time series is an important problem. It directly affects the accuracy of prediction intervals as well as the quality of widely used time series model selection criteria such as AIC. Except in simple cases, however, it is difficult or even infeasible to obtain exact analytical expressions for one-step and multi-step prediction errors. This may be one reason why, unlike in the independent case (see Efron, 2004), no fully established methodology for time series prediction error estimation exists to date. Starting from an approximation to the bias-variance decomposition of the squared prediction error, this work is concerned with estimating prediction errors in both univariate and multivariate stationary time series. In particular, several estimates are developed for a general class of predictors that includes most of the popular linear, nonlinear, parametric, and nonparametric time series models used in practice, with causal invertible ARMA and nonparametric AR processes discussed as lead examples. Simulation results indicate that the proposed estimators perform quite well in finite samples. The estimates may also be used for model selection when the purpose of modeling is prediction.
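A rough illustration of why in-sample error is optimistic and how a bias-variance-style correction inflates it, assuming an AR(1) fit by ordinary least squares and a crude Cp-style adjustment rather than the estimators developed in the paper:

```python
# Hedged sketch: one-step prediction error of an AR(1) fit by OLS.
# The in-sample residual variance is optimistic; a Cp-style inflation
# by (1 + p/n) is a simple stand-in for the paper's estimators.
import numpy as np

rng = np.random.default_rng(0)
n, phi = 500, 0.6
x = np.zeros(n)
for t in range(1, n):                      # simulate an AR(1) path
    x[t] = phi * x[t - 1] + rng.normal()

X, y = x[:-1], x[1:]
phi_hat = (X @ y) / (X @ X)                # OLS slope, no intercept
resid = y - phi_hat * X
sigma2_in = resid @ resid / len(y)         # in-sample (optimistic)

p = 1                                      # one fitted parameter
pred_err = sigma2_in * (1 + p / len(y))    # inflated estimate
print(f"in-sample {sigma2_in:.3f}, one-step prediction error {pred_err:.3f}")
```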
  4. In this paper, we propose a flexible nested error regression small area model with a high-dimensional parameter that incorporates heterogeneity in regression coefficients and variance components. We develop a new robust small-area-specific estimating equations method that allows appropriate pooling of a large number of areas when estimating small-area-specific model parameters. We propose parametric bootstrap and jackknife methods to estimate not only the mean squared errors but also other commonly used uncertainty measures such as standard errors and coefficients of variation. We conduct both model-based and design-based simulation experiments and a real-life data analysis to evaluate the proposed methodology.
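A toy sketch of the parametric bootstrap for per-area MSE, assuming a simplified nested error model with known variance components; estimating them, and allowing area-specific heterogeneity, is the harder problem the paper addresses.

```python
# Hedged sketch: parametric bootstrap MSE for a small-area predictor in a
# toy nested error model y_ij = beta * x_ij + v_i + e_ij, with variance
# components treated as known for brevity.
import numpy as np

rng = np.random.default_rng(0)
m, n_j = 20, 10                           # areas, samples per area
beta, sv2, se2 = 2.0, 1.0, 4.0
x = rng.normal(size=(m, n_j))
v = rng.normal(scale=np.sqrt(sv2), size=m)
y = beta * x + v[:, None] + rng.normal(scale=np.sqrt(se2), size=(m, n_j))

def area_predictor(y, x):
    """EBLUP-style shrinkage of area means (known variance components)."""
    b = (y * x).sum() / (x * x).sum()     # pooled slope estimate
    resid_mean = (y - b * x).mean(axis=1)
    gamma = sv2 / (sv2 + se2 / n_j)       # shrinkage factor
    return b * x.mean(axis=1) + gamma * resid_mean

pred = area_predictor(y, x)

# Parametric bootstrap: regenerate data from the model and track the
# squared error of the predictor against each bootstrap truth.
B = 200
sq_err = np.zeros(m)
for _ in range(B):
    v_b = rng.normal(scale=np.sqrt(sv2), size=m)
    y_b = beta * x + v_b[:, None] + rng.normal(scale=np.sqrt(se2), size=(m, n_j))
    sq_err += (area_predictor(y_b, x) - (beta * x.mean(axis=1) + v_b)) ** 2

print("area predictions:", np.round(pred[:3], 2),
      "bootstrap MSE:", np.round(sq_err[:3] / B, 2))
```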
  5. Segata, Nicola (Ed.)
    The cost of sequencing a genome is dropping at a much faster rate than the cost of assembling and finishing it. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach for identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. Using a mix of theoretical and empirical analysis, we show that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, which has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation, compared to the 27% error previously achieved. In shotgun-sequenced read samples with contaminants, RESPECT length estimates had a median error of 4%, in contrast to other methods with a median error of 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
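For context, here is a sketch of the baseline histogram-peak estimate of genome length that methods like RESPECT refine; at very low coverage the peak becomes unidentifiable, which is the ill-conditioning the paper works around.

```python
# Hedged sketch of the classical genome-length estimate from a k-mer
# count histogram: divide total k-mer mass by the coverage peak. At very
# low coverage (e.g. 1X) the peak is not identifiable, which is the
# regime RESPECT targets with constrained optimization.
import numpy as np

def genome_length_estimate(histogram):
    """histogram[i] = number of distinct k-mers seen exactly (i+1) times."""
    counts = np.asarray(histogram, dtype=float)
    mult = np.arange(1, len(counts) + 1)
    total_kmers = (counts * mult).sum()       # total k-mer occurrences
    # Peak multiplicity away from the error spike at 1 ~ k-mer coverage.
    peak = mult[1:][np.argmax(counts[1:])]
    return total_kmers / peak

# Toy histogram: error k-mers at multiplicity 1, single-copy peak near 10.
hist = [5000, 80, 150, 400, 900, 1500, 2100, 2400, 2500, 2300, 1800, 1100]
print(f"estimated genome length: {genome_length_estimate(hist):.0f} k-mers")
```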