

Title: More efficient approximation of smoothing splines via space-filling basis selection
Summary We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^3)$ computational time when the number $d$ of predictors is two or more. Such a sizeable computational cost hinders the broad applicability of smoothing splines. In practice, the full-sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^2)$. It is known that these two estimators converge at the same rate when $q$ is of order $O\{n^{2/(pr+1)}\}$, where $p\in [1,2]$ depends on the true function and $r > 1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to approximately equally spaced observations, the proposed method chooses a set of basis functions with great diversity. The asymptotic analysis shows that the proposed smoothing spline estimator can decrease $q$ to around $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.
Award ID(s):
1903226
NSF-PAR ID:
10230055
Author(s) / Creator(s):
Date Published:
Journal Name:
Biometrika
Volume:
107
Issue:
3
ISSN:
0006-3444
Page Range / eLocation ID:
723 to 735
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.
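As a caricature of the first (selection) step, assuming a truncated-power basis in place of the polynomial spline basis and scikit-learn's Lasso for the adaptive-lasso reweighting; all tuning values are ad hoc:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def truncated_power_basis(x, knots, degree=3):
    """Polynomial terms plus truncated power functions: one nonlinear block."""
    cols = [x**k for k in range(1, degree + 1)]
    cols += [np.clip(x - t, 0, None)**degree for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 300
Z = rng.standard_normal((n, 5))            # linear covariates, most irrelevant
x = rng.uniform(size=n)                    # one nonlinear covariate
y = 2 * Z[:, 0] + np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

B = truncated_power_basis(x, knots=np.linspace(0.1, 0.9, 8))
D = np.column_stack([Z, B])
D = (D - D.mean(0)) / D.std(0)

# Step 1a: initial estimate to build adaptive-lasso weights
beta0 = LinearRegression().fit(D, y).coef_
w = 1.0 / (np.abs(beta0) + 1e-8)
# Step 1b: adaptive lasso = ordinary lasso on reweighted columns
fit = Lasso(alpha=0.02).fit(D / w, y)
beta = fit.coef_ / w
selected = np.flatnonzero(np.abs(beta) > 1e-6)
```

The paper's second step would then apply local linear smoothing to the data restricted to `selected` to obtain asymptotic distributions for the nonparametric components.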

     
  2.
    Summary Large samples are generated routinely from various sources. Classic statistical models, such as smoothing spline ANOVA models, are not well equipped to analyse such large samples because of high computational costs. In particular, the daunting computational cost of selecting smoothing parameters renders smoothing spline ANOVA models impractical. In this article, we develop an asympirical, i.e., asymptotic and empirical, smoothing parameters selection method for smoothing spline ANOVA models in large samples. The idea of our approach is to use asymptotic analysis to show that the optimal smoothing parameter is a polynomial function of the sample size and an unknown constant. The unknown constant is then estimated through empirical subsample extrapolation. The proposed method significantly reduces the computational burden of selecting smoothing parameters in high-dimensional and large samples. We show that smoothing parameters chosen by the proposed method tend to the optimal smoothing parameters that minimize a specific risk function. In addition, the estimator based on the proposed smoothing parameters achieves the optimal convergence rate. Extensive simulation studies demonstrate the numerical advantage of the proposed method over competing methods in terms of relative efficacy and running time. In an application to molecular dynamics data containing nearly one million observations, the proposed method has the best prediction performance. 
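A toy version of the extrapolation idea in Python, with the simplifying assumption that both the exponent and the constant of the polynomial relation are fit from subsamples (the paper derives the exponent from asymptotic theory and estimates only the constant); the smoother, kernel, and grids below are illustrative:

```python
import numpy as np

def gcv_lambda(x, y, lams):
    """Pick lambda by GCV for a small Gaussian-kernel ridge smoother."""
    K = np.exp(-50.0 * (x[:, None] - x[None, :])**2)
    n = len(x)
    best, best_lam = np.inf, lams[0]
    for lam in lams:
        A = K @ np.linalg.inv(K + n * lam * np.eye(n))   # smoother matrix
        resid = y - A @ y
        gcv = n * (resid @ resid) / (n - np.trace(A))**2
        if gcv < best:
            best, best_lam = gcv, lam
    return best_lam

rng = np.random.default_rng(0)
def sample(n):
    x = rng.uniform(size=n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

lams = np.geomspace(1e-8, 1e-1, 30)
sizes = np.array([100, 200, 400])
lam_hat = [gcv_lambda(*sample(n), lams) for n in sizes]

# lambda(n) ~ c * n^{-alpha}: fit a line on the log scale over the
# subsamples, then extrapolate to the full sample size
slope, logc = np.polyfit(np.log(sizes), np.log(lam_hat), 1)
n_full = 100_000
lam_full = np.exp(logc) * n_full**slope
```

The payoff is that the expensive smoothing-parameter search runs only on small subsamples, never on the full sample.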
  3. Nonparametric estimation of multivariate functions is an important problem in statistical machine learning with many applications, ranging from nonparametric regression to nonparametric graphical models. Several authors have proposed to estimate multivariate functions under the smoothing spline analysis of variance (SSANOVA) framework, which assumes that the multivariate function can be decomposed into the summation of main effects, two-way interaction effects, and higher order interaction effects. However, existing methods are not scalable to the dimension of the random variables and the order of interactions. We propose a LAyer-wiSE leaRning strategy (LASER) to estimate multivariate functions under the SSANOVA framework. The main idea is to approximate the multivariate function sequentially starting from a model with only the main effects. Conditioned on the support of the estimated main effects, we estimate the two-way interaction effects only when the corresponding main effects are estimated to be non-zero. This process is continued until no more higher order interaction effects are identified. The proposed strategy provides a data-driven approach for estimating multivariate functions under the SSANOVA framework. Our proposal yields a sequence of estimators. To study the theoretical properties of the sequence of estimators, we establish the notion of post-selection persistency. Extensive numerical studies are performed to evaluate the performance of LASER.
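The layer-wise logic, fitting main effects first and admitting a two-way interaction only when both parent main effects are nonzero, can be sketched as follows; the polynomial bases and penalty level are illustrative stand-ins for the SSANOVA machinery:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.uniform(size=(n, d))
y = (np.sin(2 * np.pi * X[:, 0]) + X[:, 1]**2
     + X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n))

def main_basis(x):
    # a few polynomial basis functions per variable (spline-basis stand-in)
    return np.column_stack([x, x**2, x**3])

# Layer 1: main effects only
D1 = np.column_stack([main_basis(X[:, j]) for j in range(d)])
D1s = (D1 - D1.mean(0)) / D1.std(0)
fit1 = Lasso(alpha=0.01).fit(D1s, y)
var_of_col = np.repeat(np.arange(d), 3)
active = sorted({int(j) for j, c in zip(var_of_col, fit1.coef_)
                 if abs(c) > 1e-6})

# Layer 2: two-way interactions only among variables with nonzero main effects
pairs = [(a, b) for i, a in enumerate(active) for b in active[i + 1:]]
if pairs:
    inter = np.column_stack([X[:, a] * X[:, b] for a, b in pairs])
    D2 = np.column_stack([D1s, (inter - inter.mean(0)) / inter.std(0)])
else:
    D2 = D1s
fit2 = Lasso(alpha=0.01).fit(D2, y)
```

Higher-order layers would repeat the same conditioning step, adding a three-way term only when its parent two-way terms survive, until no new effects are identified.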
  4. SUMMARY

    We present investigations of rapidly rotating convection in a thick spherical shell geometry relevant to planetary cores, comparing results from quasi-geostrophic (QG), 3-D and hybrid QG-3D models. The 170 reported calculations span Ekman numbers, Ek, between $10^{-4}$ and $10^{-10}$, Rayleigh numbers, Ra, between 2 and 150 times supercritical and Prandtl numbers, Pr, between $10$ and $10^{-2}$. The default boundary conditions are no-slip at both the ICB and the CMB for the velocity field, with fixed temperatures at the ICB and the CMB. Cases driven by both homogeneous and inhomogeneous CMB heat flux patterns are also explored, the latter including lateral variations, as measured by Q*, the peak-to-peak amplitude of the pattern divided by its mean, taking values up to 5. The QG model is based on the open-source pizza code. We extend this in a hybrid approach to include the temperature field on a 3-D grid. In general, we find convection is dominated by zonal jets at mid-depths in the shell, with thermal Rossby waves prominent close to the outer boundary when the driving is weaker. For the thick spherical shell geometry studied here the hybrid method is best suited for studying convection at modest forcing, $Ra \le 10 \, Ra_c$ when Pr = 1, and departs from the 3-D model results at higher Ra, displaying systematically lower heat transport characterized by lower Nusselt and Reynolds numbers. We find that the lack of equatorially-antisymmetric motions and z-correlations between temperature and velocity in the buoyancy force contributes to the weaker flows in the hybrid formulation. On the other hand, the QG models yield broadly similar results to the 3-D models, for the specific aspect ratio and range of Rayleigh numbers explored here. We cannot point to major disagreements between these two data sets at Pr ≥ 0.1, with the QG model effectively more strongly driven than the hybrid case due to its cylindrically averaged thermal boundary conditions.
When Pr is decreased, the range of agreement between the hybrid and 3-D models expands, for example up to $Ra \le 15 \, Ra_c$ at Pr = 0.1, indicating the hybrid method may be better suited to study convection in the low Pr regime. We thus observe a transition between two regimes: (i) at Pr ≥ 0.1 the QG and 3-D models agree in the studied range of Ra/Rac while the hybrid model fails when $Ra\gt 15\, Ra_c$ and (ii) at Pr = 0.01 the QG and 3-D models disagree for $Ra\gt 10\, Ra_c$ while the hybrid and 3-D models agree fairly well up to $Ra \sim 20\, Ra_c$. Models that include laterally varying heat flux at the outer boundary reproduce regional convection patterns that compare well with those found in similarly forced 3-D models. Previously proposed scaling laws for rapidly rotating convection are tested; our simulations are overall well described by a triple balance between Coriolis, inertia and Archimedean forces with the length-scale of the convection following the diffusion-free Rhines-scaling. The magnitude of Pr affects the number and the size of the jets with larger structures obtained at lower Pr. Higher velocities and lower heat transport are seen on decreasing Pr with the scaling behaviour of the convective velocity displaying a strong dependence on Pr. This study is an intermediate step towards a hybrid model of core convection also including 3-D magnetic effects.

     
  5. Abstract

    We consider estimating average treatment effects (ATE) of a binary treatment in observational data when data‐driven variable selection is needed to select relevant covariates from a moderately large number of available covariates. To leverage covariates predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction based on fitting both working PS and outcome models using adaptive LASSO. A novel PS estimator, the Double‐index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors from both the initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting estimator, which is found to maintain double robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly robust estimators. These results are extended to the case where p diverges with sample size and working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record study and the effect of smoking on C‐reactive protein in the Framingham Offspring Study.
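A stripped-down sketch of the double-index construction, substituting plain logistic and linear working fits for the adaptive LASSO step and a Nadaraya-Watson smoother over the two indices; the bandwidth, clipping, and data-generating choices are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.standard_normal((n, p))
ps_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, ps_true)
y = 1.0 * T + X[:, 0] + 0.5 * X[:, 2] + rng.standard_normal(n)  # true ATE = 1

# Working models (the paper fits these with adaptive LASSO; plain fits here)
u = LogisticRegression().fit(X, T).decision_function(X)   # PS linear predictor
v = LinearRegression().fit(X, y).predict(X)               # outcome linear predictor

def smooth_ps(u, v, T, h=0.5):
    """Double-index idea: smooth treatment status over the two indices
    with a Gaussian product kernel (Nadaraya-Watson)."""
    du = (u[:, None] - u[None, :]) / (h * u.std())
    dv = (v[:, None] - v[None, :]) / (h * v.std())
    W = np.exp(-0.5 * (du**2 + dv**2))
    return (W @ T) / W.sum(1)

pi = np.clip(smooth_ps(u, v, T), 0.01, 0.99)

# Normalized (Hajek-style) inverse probability weighting estimator of the ATE
w1, w0 = T / pi, (1 - T) / (1 - pi)
ate = (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()
```

Because the smoothing step conditions only on the two scalar indices rather than all of X, it remains feasible when p is large, which is what drives the robustness gains described above.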

     