<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Honest leave‐one‐out cross‐validation for estimating post‐tuning generalization error</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10329391</idno>
					<idno type="doi">10.1002/sta4.413</idno>
					<title level='j'>Stat</title>
					<idno type="ISSN">2049-1573</idno>
					<biblScope unit="volume">10</biblScope>
					<biblScope unit="issue">1</biblScope>

					<author>Boxiang Wang</author><author>Hui Zou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many machine learning models have tuning parameters to be determined by the training data, and cross-validation (CV) is perhaps the most commonly used method for selecting tuning parameters. This work concerns the problem of estimating the generalization error of a CV-tuned predictive model. We propose to use an honest leave-one-out cross-validation framework to produce a nearly unbiased estimator of the post-tuning generalization error. By using the kernel support vector machine and the kernel logistic regression as examples, we demonstrate that the honest leave-one-out cross-validation has very competitive performance even when competing with the state-of-the-art .632+ estimator.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p><ref type="bibr">Wang and Zou (2021)</ref> have shown that LOOCV does not have higher variance than 10-fold CV and 5-fold CV in the context of binary classification.</p><p>Here, we show a graphical illustration. We fit the kernel support vector machine (SVM) on a data set drawn from a mixture Gaussian distribution whose Bayes decision boundary is non-linear. The details of the data-generating model are given in the caption of Figure <ref type="figure">1</ref>, from which we observe that (i) LOOCV has almost no bias in estimating the generalization error, while the bias increases substantially as V decreases; (ii) the variance of the CV error exhibits very little difference among V = 5, 10, n, but the variance is much larger in two-fold CV.</p><p>A legitimate complaint about LOOCV arises from its expensive computation. The vanilla implementation of LOOCV requires n repeats of model fitting, which seems too expensive. Thus, many people prefer 10-fold or 5-fold CV over LOOCV. This preference can be traced back to the invention of V-fold CV, which was proposed at the outset as a computational remedy for LOOCV <ref type="bibr">(Geisser, 1975)</ref>. However, the CV error depends on the way one splits the data and hence would vary even if the same CV procedure is deployed with a different data split. <ref type="bibr">Rodríguez et al. (2010)</ref> studied repeated V-fold CV to stabilize the CV procedure, but the total computational cost becomes much more expensive. In contrast, LOOCV enjoys the advantage of having a deterministic data-split scheme. Moreover, a smart implementation of LOOCV can make the total computation time much less than that of the vanilla implementation. Let us take ridge regression as an example. <ref type="bibr">Golub et al. (1979)</ref> introduced a neat formula that allows efficient computation of the exact LOOCV error at essentially the same cost as fitting a single ridge regression model. 
This nice result can be generalized to a class of linear smoothers that possess the self-stable property <ref type="bibr">(Fan et al., 2020)</ref>. As such, computational cost need not be a deciding factor that prevents users from using LOOCV to estimate the generalization error in real applications. In <ref type="bibr">Wang and Zou (2021)</ref>, the neat formula for ridge regression is extended to large-margin classifiers, and a new algorithm is proposed for fast and exact LOOCV computation of a kernel SVM or related kernel classifiers.</p><p>The above discussion focuses on a single given model and how to estimate its generalization error. In practice, we select a final model from a list of candidate models. Now, suppose that the final model is already selected and one is willing to use this model for predicting future unseen data. Naturally, we would like to know its true accuracy. Statistically speaking, we aim to find a good estimator of the generalization error of the selected model. In particular, given the final model chosen by CV (10-fold CV or LOOCV), how do we estimate the generalization error of this CV-tuned model?</p><p>In a series of papers <ref type="bibr">(Efron, 1983, 1986, 2004; Efron &amp; Tibshirani, 1997)</ref>, Efron studied the problem of estimating the generalization error of any given model and applied his methods to tuning. Conceptually, his results can be applied to a tuned model to obtain a good estimate of its generalization error. The caveat is to consider the tuning process as a part of the selected model. According to Efron, the state-of-the-art method is the .632+ bootstrap estimator <ref type="bibr">(Efron &amp; Tibshirani, 1997)</ref>. On the other hand, we have not seen the use of the .632+ estimator in the context of post-tuning estimation. 
In this paper, we consider the kernel SVM and the kernel logistic regression as examples. We find that the .632+ estimator still performs very well for estimating the generalization error of a CV-tuned kernel SVM and kernel logistic regression. Moreover, we propose an honest LOOCV (HLOOCV) estimator for estimating the generalization error of a CV-tuned kernel SVM and kernel logistic regression. We point out that a naive plug-in LOOCV estimator is biased and the honest LOOCV estimator is nearly unbiased. We further compare the honest LOOCV estimator with the .632+ estimator on various numerical examples and show that the honest LOOCV has very competitive performance even against the .632+ estimator.</p></div>
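<div xmlns="http://www.tei-c.org/ns/1.0"><p>The efficient LOOCV computation for ridge regression mentioned above rests on a closed-form shortcut: with hat matrix H = X(XᵀX + λI)⁻¹Xᵀ, each leave-one-out residual equals the full-fit residual divided by 1 − Hᵢᵢ, so the exact LOOCV error costs essentially one fit. A minimal numpy sketch (simulated data and an arbitrary λ; not the original implementation of Golub et al., 1979) checks the shortcut against brute force:</p><p>
```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

# Ridge hat matrix: H = X (X'X + lam I)^{-1} X'
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
# Shortcut: leave-one-out residual_i = full-fit residual_i / (1 - H_ii)
loo_short = resid / (1.0 - np.diag(H))

# Brute force: refit n times, each time leaving one observation out
loo_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    loo_brute[i] = y[i] - X[i] @ beta

assert np.allclose(loo_short, loo_brute)  # the identity is exact for ridge
```
</p><p>Linear smoothers with the self-stable property admit the same kind of shortcut, which is what the algorithms of Fan et al. (2020) and Wang and Zou (2021) exploit.</p></div>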
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F I G U R E 1</head><p>The mean and variance of the classification error of the kernel SVM. The training data (sample size n = 100, number of input variables p = 5) are generated from a mixture of Gaussian models. We draw 10 centres μ_k from N((2, 2, 0, 0, 0), I), where I is the p-dimensional identity matrix; we then randomly pick one centre and generate one positive-class observation from N(μ_k, 4I). Observations from the negative class are generated similarly, with centres drawn from N((0, 2, 2, 0, 0), I). The Bayes error is 24.3%. We assess the true error on 10,000 observations that are generated independently. The mean and variance of the classification error are calculated on the basis of 100 independent runs.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">| METHOD</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">| Honest leave-one-out cross-validation</head><p>We first formally define the generalization error of a tuned model. Suppose the training data F_n = {(x_i, y_i)}_{i=1}^n are a random sample drawn from an unknown distribution P_F. Denote by A_λ a family of generic algorithms indexed by a parameter λ. By implementing A_λ on the training data F_n, we obtain a model f̂_λ which predicts the outcome of data x to be f̂_λ(x). An example of the generic algorithm A_λ that we use to demonstrate the idea is the kernel large-margin classifier:</p><p>f̂_λ = argmin_{f ∈ H_K} (1/n) Σ_{i=1}^n L(y_i f(x_i)) + λ‖f‖²_{H_K},   (2.1)</p><p>where L(u) is a margin-based loss function for classification and H_K is the reproducing kernel Hilbert space (RKHS) induced by a kernel function K.</p><p>For example, the kernel SVM uses L(u) = max(1 − u, 0) (the hinge loss), and formulation (2.1) leads to the kernel logistic regression if L(u) = log(1 + e^{−u}) (<ref type="bibr">Hastie et al., 2009</ref>). To evaluate the accuracy of the model f̂_λ, a loss function φ(y, s) is used. For classification, φ(y, s) is the 0-1 loss by default, but it can also be the binomial deviance or other loss functions. To select the tuning parameter λ from a candidate set Λ, a generic tuning procedure Δ such as LOOCV, 10-fold CV or the bootstrap is deployed. In this work, we consider LOOCV as the tuning method.</p></div>
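<div xmlns="http://www.tei-c.org/ns/1.0"><p>For concreteness, the hinge loss of the SVM and the logistic loss can be written out directly (a minimal numpy sketch; the margin values u = y·f(x) are arbitrary):</p><p>
```python
import numpy as np

def hinge(u):
    """Hinge loss of the SVM: L(u) = max(1 - u, 0)."""
    return np.maximum(1.0 - u, 0.0)

def logistic(u):
    """Logistic (binomial deviance) loss: L(u) = log(1 + exp(-u))."""
    return np.log1p(np.exp(-u))

# Margins u = y * f(x): positive values mean correct classification
u = np.array([-2.0, 0.0, 1.0, 3.0])
print(hinge(u))     # [3. 1. 0. 0.]
print(logistic(u))  # strictly positive and decreasing in u
```
</p><p>Both losses penalize small or negative margins; the hinge loss is exactly zero for margins above 1, while the logistic loss is strictly positive everywhere, which is why formulation (2.1) produces different classifiers under the two choices.</p></div>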
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Denote by F_n^[−i] the training data with the ith observation removed, and denote by f̂_λ^[−i] the model obtained by implementing A_λ on F_n^[−i]. The LOOCV error is</p><p>Êrr_φ(f̂_λ) = (1/n) Σ_{i=1}^n φ(y_i, f̂_λ^[−i](x_i)).   (2.2)</p><p>The tuning procedure thereby outputs Δ(Λ) = argmin_{λ ∈ Λ} Êrr_φ(f̂_λ) as an estimate of the "optimal" tuning parameter. After selecting the tuning parameter λ for the algorithm A_λ, we have finalized a tuned model from the training data. We write the tuned model as M̂_{A,Δ,n}, where the subscript n is appended to highlight that the model is constructed with the sample size n. The tuned model predicts the response of unseen test data x as M̂_{A,Δ,n}(x). Consequently, the generalization error of the tuned model M̂_{A,Δ,n} under the φ-loss is defined as</p><p>Err_φ(M̂_{A,Δ,n}) = E_{(x,y) ∼ P_F}[φ(y, M̂_{A,Δ,n}(x))],</p><p>which is referred to as the post-tuning generalization error of M̂_{A,Δ,n}. Our goal is to construct a good estimate of Err_φ(M̂_{A,Δ,n}).</p><p>Since the procedure for producing M̂_{A,Δ,n} includes the LOOCV tuning, the same tuning procedure should be echoed when evaluating the performance of M̂_{A,Δ,n} using CV. Therefore, we suggest the following CV-within-CV framework to yield an honest LOOCV estimator of the generalization error of M̂_{A,Δ,n}.</p><p>• For each i = 1, …, n:</p><p>1. for each j ≠ i, construct F_n^[−i,−j] and, for each λ ∈ Λ, perform the algorithm A_λ on F_n^[−i,−j] to obtain the predictive model f̂_λ^[−i,−j];</p><p>2. select the tuning parameter according to the procedure Δ, that is, compute Δ_i(Λ) = argmin_{λ ∈ Λ} Êrr_φ(f̂_λ^[−i]) = argmin_{λ ∈ Λ} Σ_{j ≠ i} φ(y_j, f̂_λ^[−i,−j](x_j));</p><p>3. perform the algorithm A_{Δ_i(Λ)} on F_n^[−i] to obtain the predictive model f̂_{Δ_i(Λ)}^[−i], and evaluate φ(y_i, f̂_{Δ_i(Λ)}^[−i](x_i)).</p><p>• The honest LOOCV (HLOOCV) estimator is defined as</p><p>Êrr_φ^H = (1/n) Σ_{i=1}^n φ(y_i, f̂_{Δ_i(Λ)}^[−i](x_i)).</p><p>By its construction, it is easy to see that</p><p>E[Êrr_φ^H] = E[Err_φ(M̂_{A,Δ,n−1})].</p><p>Therefore, the honest LOOCV estimator is an exactly unbiased estimator of the post-tuning generalization error of the model based on a training set with sample size n − 1. Consequently, the honest LOOCV estimator is a nearly unbiased estimator of Err_φ(M̂_{A,Δ,n}), the post-tuning generalization error of the model based on a training set with sample size n. This nice property holds for each sample size n, and other popular methods for estimating the generalization error, discussed in the next subsection, do not necessarily have this property.</p></div>
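<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: a k-nearest-neighbour classifier stands in for the kernel classifier, with the neighbourhood size k playing the role of the tuning parameter λ, and the data are an arbitrary two-class Gaussian sample.</p><p>
```python
import numpy as np

def knn_predict(X_tr, y_tr, x, k):
    """Majority vote among the k nearest training points."""
    d = np.sum((X_tr - x) ** 2, axis=1)
    nn = np.argsort(d)[:k]
    return 1 if y_tr[nn].mean() > 0.5 else 0

def loocv_error(X, y, k):
    """Plain LOOCV misclassification error for a fixed k (role of lambda)."""
    n = len(y)
    errs = [knn_predict(np.delete(X, j, 0), np.delete(y, j), X[j], k) != y[j]
            for j in range(n)]
    return np.mean(errs)

def honest_loocv(X, y, ks):
    """Honest LOOCV: re-run the entire tuning procedure inside each fold."""
    n = len(y)
    errs = []
    for i in range(n):
        Xi, yi = np.delete(X, i, 0), np.delete(y, i)
        # Steps 1-2: tune k by LOOCV on the remaining n - 1 observations
        k_i = min(ks, key=lambda k: loocv_error(Xi, yi, k))
        # Step 3: refit with the tuned k and score the held-out point
        errs.append(knn_predict(Xi, yi, X[i], k_i) != y[i])
    return np.mean(errs)

rng = np.random.default_rng(1)
n = 40
X = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(1.5, 1, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)
ks = [1, 3, 5, 7]

plug_in = min(loocv_error(X, y, k) for k in ks)  # naive plug-in estimate
honest = honest_loocv(X, y, ks)                  # nearly unbiased estimate
print(plug_in, honest)  # the plug-in value is typically the smaller one
```
</p><p>The plug-in estimate reports the minimized inner criterion and is therefore typically optimistic, whereas the honest estimate re-runs the whole tuning procedure before scoring each held-out point.</p></div>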
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">| Other proposals</head><p>In the literature, there are only a few papers on the estimation of the generalization error of a post-tuning model. <ref type="bibr">Tibshirani and Rosset (2019)</ref> studied the prediction error of SURE-tuned regression models. SURE is mostly used for regression models, and the covariance penalty method <ref type="bibr">(Efron, 1986, 2004)</ref> can generalize SURE to classification models under the 0-1 loss. However, SURE and the covariance penalty method have to work with the fixed-X case, which is not the typical setting for some classification methods such as the SVM.</p><p>It appears straightforward to directly use the observed smallest LOOCV error to estimate the generalization error of the tuned model:</p><p>Êrr_φ^P = Êrr_φ(f̂_λ̂) = min_{λ ∈ Λ} Êrr_φ(f̂_λ), where λ̂ = Δ(Λ),</p><p>because λ̂ is the chosen regularization parameter for fitting the tuned classifier. We call it the plug-in LOOCV estimator because Êrr_φ^P can be obtained by replacing λ with λ̂ in Equation (2.2). When the sample size n is large enough that the variability in selection is ignorable, the plug-in estimator is expected to perform well. Given a small sample size, the plug-in estimator is too optimistic. To correct the bias, <ref type="bibr">Tibshirani and Tibshirani (2009)</ref> proposed a bias-adjusted estimator:</p><p>Êrr_φ^TT = Êrr_φ(f̂_λ̂) + B̂ias,</p><p>in which the bias term is estimated as</p><p>B̂ias = (1/n) Σ_{i=1}^n [φ(y_i, f̂_λ̂^[−i](x_i)) − φ(y_i, f̂_λ̂_i^[−i](x_i))], with λ̂_i = argmin_{λ ∈ Λ} φ(y_i, f̂_λ^[−i](x_i)).</p><p>We also consider two bootstrap-based methods, namely, the .632 estimator and the .632+ estimator. <ref type="bibr">Efron (1983)</ref> proposed estimating the prediction error with bootstrap samples: for b = 1, …, B, a bootstrap sample F_{n,b}^* is drawn with replacement from F_n, the model is fitted and tuned on F_{n,b}^*, and the out-of-bootstrap prediction error is given as</p><p>Êrr^(1) = (1/n) Σ_{i=1}^n (1/|B_i|) Σ_{b ∈ B_i} φ(y_i, M̂_b(x_i)),</p><p>where B_i is the set of bootstrap samples that do not contain the ith observation and M̂_b is the model fitted and tuned on F_{n,b}^*. The .632 estimator is then defined as follows:</p><p>Êrr^.632 = 0.368 · err + 0.632 · Êrr^(1),</p><p>where err is the training error.</p><p>A more sophisticated .632+ estimator was proposed in <ref type="bibr">Efron and Tibshirani (1997)</ref> as an improvement over the .632 estimator:</p><p>Êrr^.632+ = (1 − ŵ) · err + ŵ · Êrr^(1), with ŵ = 0.632/(1 − 0.368 R̂) and R̂ = (Êrr^(1) − err)/(γ̂ − err),</p><p>where γ̂ is the no-information error rate. The .632+ estimator is regarded as the state-of-the-art method for estimating the generalization error of a generic algorithm. Our experiments show that it also works very well for LOOCV-tuned classifiers.</p></div>
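<div xmlns="http://www.tei-c.org/ns/1.0"><p>The out-of-bootstrap, .632 and .632+ computations above can be sketched as follows (a minimal illustration of the Efron and Tibshirani, 1997, formulas; a simple nearest-centroid classifier stands in for the tuned kernel classifier, and the tuning step is omitted to keep the sketch short):</p><p>
```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: store one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dist = np.stack([np.sum((X - model[c]) ** 2, axis=1) for c in classes])
    return np.array(classes)[np.argmin(dist, axis=0)]

def err632_and_plus(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # Out-of-bootstrap error: average each point's error over fits excluding it
    wrong = np.zeros(n)
    count = np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, n)            # bootstrap sample F*_{n,b}
        oob = np.setdiff1d(np.arange(n), idx)  # observations left out of F*_{n,b}
        if oob.size == 0:
            continue
        model = fit_centroids(X[idx], y[idx])
        wrong[oob] += predict(model, X[oob]) != y[oob]
        count[oob] += 1
    err1 = np.mean(wrong[count > 0] / count[count > 0])
    err_train = np.mean(predict(fit_centroids(X, y), X) != y)  # training error
    err632 = 0.368 * err_train + 0.632 * err1
    # .632+: adjust the weight by the relative overfitting rate R
    p1 = np.mean(y == 1)                                # observed class proportion
    q1 = np.mean(predict(fit_centroids(X, y), X) == 1)  # predicted class proportion
    gamma = p1 * (1 - q1) + (1 - p1) * q1               # no-information error rate
    err1p = min(err1, gamma)
    R = ((err1p - err_train) / (gamma - err_train)
         if err1p > err_train and gamma > err_train else 0.0)
    w = 0.632 / (1 - 0.368 * R)
    return err632, (1 - w) * err_train + w * err1p

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.repeat([0, 1], 30)
e632, e632p = err632_and_plus(X, y)
print(e632, e632p)  # two error estimates, both between 0 and 1
```
</p><p>When the relative overfitting rate R̂ is 0, the weight ŵ reduces to 0.632 and the .632+ estimator coincides with the .632 estimator; as R̂ approaches 1, ŵ approaches 1 and all the weight shifts to the out-of-bootstrap error.</p></div>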
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">| DISCUSSION</head><p>In this work, we have proposed an honest LOOCV estimator for estimating the post-tuning generalization error. We have also compared it with several proposals in the literature, including a naive plug-in LOOCV estimator, a bias-adjusted method proposed by <ref type="bibr">Tibshirani and Tibshirani (2009)</ref> and two improved bootstrap methods, the .632 and .632+ estimators. The .632+ estimator is so far the state-of-the-art estimator for the generalization error. Although all these methods aim to reduce the bias, our studies reveal that they still have a large bias when estimating the generalization error. By contrast, our honest LOOCV estimator is nearly unbiased, and its variance is roughly on the same scale as that of the competitors.</p><p>One may be concerned about the high computational cost of the honest LOOCV, especially when conducting the generic algorithm itself is computationally expensive. We note that the computation issue is not unique to the honest LOOCV. The state-of-the-art .632+ method is also computationally demanding. For example, the standard LOOCV implementation of the kernel SVM has computational complexity O(n⁴). Thus, the standard implementation of the honest LOOCV estimator has computational complexity O(n⁵), while the .632+ method has computational complexity O(Bn⁴). Usually, B = 200, and hence the cost of the .632+ estimator is at least of the same order as that of the honest LOOCV when n is in the range 100-500. To handle the computational issue, we have employed the algorithm developed in <ref type="bibr">Wang and Zou (2021)</ref> to efficiently compute the LOOCV-tuned SVM and obtain the honest LOOCV error. 
As shown in <ref type="bibr">Wang and Zou (2021)</ref>, the exact LOOCV computation for the kernel margin classifiers has roughly O(n³) complexity, which reduces the cost of the honest LOOCV estimator and the .632+ estimator down to O(n⁴) and O(Bn³), respectively. The reduction in computation also applies to the kernel logistic regression.</p><p>Finally, we point out that our numerical results suggest that it is a difficult task to estimate the generalization error of a tuned predictive model when the sample size is not large. Although the honest LOOCV estimator has nearly zero bias, as predicted by theory, all the methods included in the study exhibit comparable but large variance. It would be very interesting to have an estimator that can further reduce the estimation variance without sacrificing the bias property of the honest LOOCV, and the new estimator should at least be computationally manageable in order to be considered a practical solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>T A B L E 1 Comparisons of five estimators for the post-tuning generalization error of the kernel SVM and the kernel logistic regression</head><table>
<row role="label"><cell>n</cell><cell>Method</cell><cell>Truth</cell><cell>Criteria</cell><cell>HLOOCV</cell><cell>PLOOCV</cell><cell>TT09</cell><cell>.632</cell><cell>.632+</cell></row>
<row><cell>100</cell><cell>SVM</cell><cell>27.06</cell><cell>est (%)</cell><cell>27.11</cell><cell>24.37</cell><cell>33.96</cell><cell>18.67</cell><cell>24.18</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>0.06</cell><cell>−2.69</cell><cell>6.91</cell><cell>−8.38</cell><cell>−2.88</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>6.13</cell><cell>5.30</cell><cell>8.63</cell><cell>3.11</cell><cell>5.16</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>6.13</cell><cell>5.95</cell><cell>11.05</cell><cell>8.94</cell><cell>5.91</cell></row>
<row><cell/><cell>Logistic</cell><cell>26.79</cell><cell>est (%)</cell><cell>26.64</cell><cell>24.79</cell><cell>31.04</cell><cell>19.06</cell><cell>24.27</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.15</cell><cell>−1.99</cell><cell>4.25</cell><cell>−7.73</cell><cell>−2.51</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>5.63</cell><cell>5.08</cell><cell>7.31</cell><cell>3.12</cell><cell>4.99</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>5.63</cell><cell>5.46</cell><cell>8.45</cell><cell>8.33</cell><cell>5.59</cell></row>
<row><cell>200</cell><cell>SVM</cell><cell>24.40</cell><cell>est (%)</cell><cell>24.39</cell><cell>22.69</cell><cell>31.63</cell><cell>17.00</cell><cell>20.75</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.00</cell><cell>−1.71</cell><cell>7.24</cell><cell>−7.40</cell><cell>−3.64</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>3.70</cell><cell>3.11</cell><cell>5.50</cell><cell>1.92</cell><cell>2.82</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.70</cell><cell>3.55</cell><cell>9.09</cell><cell>7.64</cell><cell>4.61</cell></row>
<row><cell/><cell>Logistic</cell><cell>24.10</cell><cell>est (%)</cell><cell>24.13</cell><cell>22.90</cell><cell>28.25</cell><cell>18.62</cell><cell>21.29</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>0.03</cell><cell>−1.20</cell><cell>4.15</cell><cell>−5.48</cell><cell>−2.81</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>3.47</cell><cell>3.08</cell><cell>4.78</cell><cell>2.12</cell><cell>2.77</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.47</cell><cell>3.31</cell><cell>6.33</cell><cell>5.88</cell><cell>3.95</cell></row>
<row><cell>300</cell><cell>SVM</cell><cell>23.47</cell><cell>est (%)</cell><cell>23.36</cell><cell>22.19</cell><cell>30.26</cell><cell>17.18</cell><cell>19.91</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.11</cell><cell>−1.28</cell><cell>6.79</cell><cell>−6.29</cell><cell>−3.56</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>2.99</cell><cell>2.63</cell><cell>4.70</cell><cell>1.71</cell><cell>2.29</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.00</cell><cell>2.92</cell><cell>8.26</cell><cell>6.52</cell><cell>4.24</cell></row>
<row><cell/><cell>Logistic</cell><cell>23.24</cell><cell>est (%)</cell><cell>23.16</cell><cell>22.33</cell><cell>27.09</cell><cell>19.12</cell><cell>20.74</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.07</cell><cell>−0.91</cell><cell>3.85</cell><cell>−4.11</cell><cell>−2.49</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>2.71</cell><cell>2.51</cell><cell>3.76</cell><cell>1.87</cell><cell>2.23</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>2.71</cell><cell>2.67</cell><cell>5.38</cell><cell>4.52</cell><cell>3.35</cell></row>
</table><p>Note: The classifiers are trained on training data generated from the mixture Gaussian distributions. The five estimators are the honest LOOCV (HLOOCV), the plug-in LOOCV (PLOOCV), the bias-adjusted method proposed in Tibshirani and Tibshirani (2009) (TT09), the .632 estimator and the .632+ estimator. The term "Truth" stands for the true generalization error computed via Monte Carlo. Reported numbers are the estimates (in the row labelled est) as well as the corresponding bias, standard deviation (√var) and root mean squared error (RMSE), based on 500 replicates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>T A B L E 2 Estimation of the post-tuning generalization error of the kernel SVM and the kernel logistic regression on the benchmark data set covtype</head><table>
<row role="label"><cell>n</cell><cell>Method</cell><cell>Truth</cell><cell>Criteria</cell><cell>HLOOCV</cell><cell>PLOOCV</cell><cell>TT09</cell><cell>.632</cell><cell>.632+</cell></row>
<row><cell>100</cell><cell>SVM</cell><cell>30.38</cell><cell>est (%)</cell><cell>30.34</cell><cell>26.73</cell><cell>38.24</cell><cell>22.50</cell><cell>28.90</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.04</cell><cell>−3.65</cell><cell>7.86</cell><cell>−7.88</cell><cell>−1.47</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>6.95</cell><cell>5.35</cell><cell>8.78</cell><cell>3.27</cell><cell>5.18</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>6.95</cell><cell>6.48</cell><cell>11.79</cell><cell>8.53</cell><cell>5.39</cell></row>
<row><cell/><cell>Logistic</cell><cell>29.95</cell><cell>est (%)</cell><cell>30.04</cell><cell>27.56</cell><cell>35.91</cell><cell>23.42</cell><cell>28.76</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>0.09</cell><cell>−2.39</cell><cell>5.96</cell><cell>−6.52</cell><cell>−1.19</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>5.92</cell><cell>5.02</cell><cell>7.40</cell><cell>3.43</cell><cell>4.98</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>5.92</cell><cell>5.55</cell><cell>9.50</cell><cell>7.37</cell><cell>5.12</cell></row>
<row><cell>200</cell><cell>SVM</cell><cell>27.26</cell><cell>est (%)</cell><cell>26.97</cell><cell>25.16</cell><cell>36.66</cell><cell>22.35</cell><cell>25.62</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.28</cell><cell>−2.09</cell><cell>9.40</cell><cell>−4.91</cell><cell>−1.64</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>4.07</cell><cell>3.57</cell><cell>6.34</cell><cell>2.65</cell><cell>3.45</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>4.08</cell><cell>4.14</cell><cell>11.34</cell><cell>5.58</cell><cell>3.82</cell></row>
<row><cell/><cell>Logistic</cell><cell>27.02</cell><cell>est (%)</cell><cell>26.94</cell><cell>25.59</cell><cell>34.50</cell><cell>23.34</cell><cell>25.77</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.08</cell><cell>−1.43</cell><cell>7.48</cell><cell>−3.68</cell><cell>−1.24</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>3.78</cell><cell>3.38</cell><cell>5.24</cell><cell>2.67</cell><cell>3.27</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.78</cell><cell>3.67</cell><cell>9.13</cell><cell>4.54</cell><cell>3.49</cell></row>
<row><cell>300</cell><cell>SVM</cell><cell>25.95</cell><cell>est (%)</cell><cell>25.77</cell><cell>24.38</cell><cell>35.69</cell><cell>22.42</cell><cell>24.42</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.17</cell><cell>−1.57</cell><cell>9.75</cell><cell>−3.52</cell><cell>−1.52</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>3.14</cell><cell>2.77</cell><cell>5.04</cell><cell>2.19</cell><cell>2.59</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.14</cell><cell>3.18</cell><cell>10.97</cell><cell>4.15</cell><cell>3.01</cell></row>
<row><cell/><cell>Logistic</cell><cell>25.81</cell><cell>est (%)</cell><cell>25.62</cell><cell>24.67</cell><cell>33.61</cell><cell>23.27</cell><cell>24.70</cell></row>
<row><cell/><cell/><cell/><cell>Bias (%)</cell><cell>−0.18</cell><cell>−1.14</cell><cell>7.80</cell><cell>−2.53</cell><cell>−1.11</cell></row>
<row><cell/><cell/><cell/><cell>√var (%)</cell><cell>2.99</cell><cell>2.64</cell><cell>4.28</cell><cell>2.20</cell><cell>2.49</cell></row>
<row><cell/><cell/><cell/><cell>RMSE (%)</cell><cell>3.00</cell><cell>2.87</cell><cell>8.90</cell><cell>3.35</cell><cell>2.73</cell></row>
</table><p>Note: The original data set has 495,141 observations and is randomly split into training and test data. The generalization error is assessed on the test data (denoted by Truth) and estimated on the training data using the honest LOOCV (HLOOCV), the plug-in LOOCV (PLOOCV), the bias-adjusted estimator proposed in Tibshirani and Tibshirani (2009) (TT09), the .632 estimator and the .632+ estimator. Reported numbers are the estimates (in the row labelled est) as well as the corresponding bias, standard deviation (√var) and root mean squared error (RMSE), based on 500 replicates.</p></div></body>
		</text>
</TEI>
