Title: Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions
We study the problem of out-of-sample risk estimation in the high dimensional regime where both the sample size n and number of features p are large, and n/p can be less than one. Extensive empirical evidence confirms the accuracy of leave-one-out cross validation (LO) for out-of-sample risk estimation. Yet, a unifying theoretical evaluation of the accuracy of LO in high-dimensional problems has remained an open problem. This paper aims to fill this gap for penalized regression in the generalized linear family. With minor assumptions about the data generating process, and without any sparsity assumptions on the regression coefficients, our theoretical analysis obtains finite sample upper bounds on the expected squared error of LO in estimating the out-of-sample error. Our bounds show that the error goes to zero as n,p→∞, even when the dimension p of the feature vectors is comparable with or greater than the sample size n. One technical advantage of the theory is that it can be used to clarify and connect some results from the recent literature on scalable approximate LO.
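As a rough illustration of the quantity the abstract discusses, the sketch below computes the LO estimate for ridge regression (squared loss with an L2 penalty, one simple member of the penalized generalized linear family) with p > n, and compares it with the out-of-sample error measured on a large fresh test set. The simulation design (Gaussian features, dense coefficients, the values of `n`, `p` and `lam`) and all function names are assumptions made for this example; they are not taken from the paper, and the code is not the authors' method.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate: argmin_b ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def loo_risk_brute_force(X, y, lam):
    """LO estimate obtained by refitting n times, each time leaving one row out."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = ridge_fit(X[keep], y[keep], lam)
        errs[i] = (y[i] - X[i] @ beta_i) ** 2
    return errs.mean()


def loo_risk_shortcut(X, y, lam):
    """Same LO estimate from a single fit, using the ridge hat matrix.

    For ridge with fixed lam, the leave-one-out residual equals the
    in-sample residual divided by (1 - H_ii), where
    H = X (X^T X + lam I)^{-1} X^T.
    """
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)


def simulate(n=60, p=90, lam=5.0, n_test=20000, seed=0):
    """Assumed toy design: Gaussian features, dense (non-sparse) coefficients."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p) / np.sqrt(p)
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)

    # Out-of-sample error approximated on a large independent test set.
    X_test = rng.normal(size=(n_test, p))
    y_test = X_test @ beta + rng.normal(size=n_test)
    beta_hat = ridge_fit(X, y, lam)
    out_err = np.mean((y_test - X_test @ beta_hat) ** 2)

    return loo_risk_brute_force(X, y, lam), loo_risk_shortcut(X, y, lam), out_err


if __name__ == "__main__":
    brute, shortcut, out_err = simulate()
    print(f"LO (brute force): {brute:.4f}")
    print(f"LO (shortcut):    {shortcut:.4f}")
    print(f"Out-of-sample:    {out_err:.4f}")
```

The brute-force and shortcut versions should agree up to floating-point error; the gap between either of them and the test-set error is the kind of estimation error the paper's bounds address. For other losses in the generalized linear family the exact shortcut is not available, which is where approximate LO methods come in.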
Award ID(s):
1810880
NSF-PAR ID:
10183414
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  2. A generic out-of-sample error estimate is proposed for $M$-estimators regularized with a convex penalty in high-dimensional linear regression where $(\boldsymbol{X},\boldsymbol{y})$ is observed and the dimension $p$ and sample size $n$ are of the same order. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n \le \gamma$ or asymptotically in the high-dimensional asymptotic regime $p/n \to \gamma' \in (0,\infty)$. General differentiable loss functions $\rho$ are allowed provided that the derivative of the loss is 1-Lipschitz; this includes the least-squares loss as well as robust losses such as the Huber loss and its smoothed versions. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the L1-penalized Huber M-estimator and the Lasso under a sparsity assumption and a bound on the number of contaminated observations. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalty and arbitrary covariance, estimates that were previously known for the Lasso.
  3. This paper considers the problem of kernel regression and classification with possibly unobservable response variables in the data, where the mechanism that causes the absence of information can depend on both predictors and the response variables. Our proposed approach involves two steps: First we construct a family of models (possibly infinite dimensional) indexed by the unknown parameter of the missing probability mechanism. In the second step, a search is carried out to find the empirically optimal member of an appropriate cover (or subclass) of the underlying family in the sense of minimizing the mean squared prediction error. The main focus of the paper is to look into some of the theoretical properties of these estimators. The issue of identifiability is also addressed. Our methods use a data-splitting approach which is quite easy to implement. We also derive exponential bounds on the performance of the resulting estimators in terms of their deviations from the true regression curve in general $L_p$ norms, where we allow the size of the cover or subclass to diverge as the sample size n increases. These bounds immediately yield various strong convergence results for the proposed estimators. As an application of our findings, we consider the problem of statistical classification based on the proposed regression estimators and also look into their rates of convergence under different settings. Although this work is mainly stated for kernel-type estimators, it can also be extended to other popular local-averaging methods such as nearest-neighbor and histogram estimators. 
  4. We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ with $d \ll D$. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function was defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.
  5. Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $\varOmega\left(\frac{d \log n}{n}\right)$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.