Title: More efficient approximation of smoothing splines via space-filling basis selection
Summary We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^3)$ computational time when the number $d$ of predictors is two or more. Such a sizeable computational cost hinders the broad applicability of smoothing splines. In practice, the full-sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^2)$. It is known that these two estimators converge at the same rate when $q$ is of order $O\{n^{2/(pr+1)}\}$, where $p\in [1,2]$ depends on the true function and $r > 1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to approximately equally spaced observations, the proposed method chooses a set of basis functions with great diversity. The asymptotic analysis shows that the proposed smoothing spline estimator can decrease $q$ to around $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.
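To make the two rates in the summary concrete, the following is a minimal sketch and not the authors' implementation. The function names `essential_q` and `select_equally_spaced` are hypothetical, constants hidden by the $O\{\cdot\}$ notation are ignored, and the selection step assumes a single predictor with an equally spaced grid standing in for the space-filling design. With the illustrative values $n = 10^5$, $p = 2$ and $r = 2$, the two orders come out at about 100 and 10 basis functions respectively.

```python
import numpy as np


def essential_q(n, p, r, space_filling=False):
    """Order of the number of basis functions quoted in the summary.

    Returns n^{2/(pr+1)} for random selection and n^{1/(pr+1)} for the
    space-filling selection. Constants are ignored (assumption).
    """
    exponent = (1.0 if space_filling else 2.0) / (p * r + 1.0)
    return n ** exponent


def select_equally_spaced(x, q):
    """Hypothetical 1-D illustration of space-filling basis selection.

    Lays q equally spaced grid points over the range of x and returns the
    indices of the observations nearest to each grid point.
    """
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), q)
    # Distance from every observation to every grid point (n x q matrix).
    dist = np.abs(x[:, None] - grid[None, :])
    # One nearest observation per grid point; drop repeated indices.
    return np.unique(dist.argmin(axis=0))


if __name__ == "__main__":
    n, p, r = 10**5, 2, 2  # illustrative values, not taken from the paper
    print(essential_q(n, p, r), essential_q(n, p, r, space_filling=True))

    rng = np.random.default_rng(0)
    x = rng.uniform(size=1000)
    idx = select_equally_spaced(x, 10)
    print(np.sort(x[idx]))
```

The basis functions attached to the returned indices would then replace the randomly chosen ones in the approximate estimator, whose cost stays at $O(nq^2)$ but with a smaller $q$.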
Award ID(s):
1903226
NSF-PAR ID:
10230055
Journal Name:
Biometrika
Volume:
107
Issue:
3
ISSN:
0006-3444
Page Range / eLocation ID:
723 to 735
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Summary Large samples are generated routinely from various sources. Classic statistical models, such as smoothing spline ANOVA models, are not well equipped to analyse such large samples because of high computational costs. In particular, the daunting computational cost of selecting smoothing parameters renders smoothing spline ANOVA models impractical. In this article, we develop an asympirical, i.e., asymptotic and empirical, smoothing parameters selection method for smoothing spline ANOVA models in large samples. The idea of our approach is to use asymptotic analysis to show that the optimal smoothing parameter is a polynomial function of the sample size and an unknown constant. The unknown constant is then estimated through empirical subsample extrapolation. The proposed method significantly reduces the computational burden of selecting smoothing parameters in high-dimensional and large samples. We show that smoothing parameters chosen by the proposed method tend to the optimal smoothing parameters that minimize a specific risk function. In addition, the estimator based on the proposed smoothing parameters achieves the optimal convergence rate. Extensive simulation studies demonstrate the numerical advantage of the proposed method over competing methods in terms of relative efficacy and running time. In an application to molecular dynamics data containing nearly one million observations, the proposed method has the best prediction performance. 
  2. We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.

     
  3. Summary

    The paper introduces free-knot regression spline estimators for the mean and the variance components of a sample of curves. The asymptotic distribution of the mean estimator is derived, and asymptotic confidence bands are constructed. A comparative simulation study shows that free-knot splines estimate salient features of the functions (such as sharp peaks) more accurately than smoothing splines. This adaptive behaviour is also illustrated by an analysis of weather data.

     
  4.
    Abstract Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $\varOmega \left (\frac{d \log n}{n}\right )$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points. 
  5. Krause, Andreas ; Brunskill, Emma ; Cho, Kyunghyun ; Engelhardt, Barbara ; Sabato, Sivan ; Scarlett, Jonathan (Ed.)
    We consider the problem of estimating the optimal transport map between two probability distributions, P and Q in R^d, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution Q is a discrete measure supported on a finite number of points in R^d. We study a computationally efficient estimator initially proposed by Pooladian & Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate n^{−1/2}, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems. 