We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure based on the adaptive lasso is proposed to select the nonzero linear and non‐linear components. In the second step, local linear smoothing is applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.
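The two-step selection idea can be sketched in toy form. The snippet below is only an illustration, not the paper's procedure: it swaps the polynomial-spline basis for a raw cubic polynomial in z, uses a bare-bones coordinate-descent lasso, and picks the penalty level and weight offset (`lam=0.05`, `1e-3`) by hand rather than by any data-driven rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def lasso_cd(X, y, lam, w=None, n_iter=300):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    w = np.ones(p) if w is None else w
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n        # per-coordinate curvature
    r = y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]              # partial residual for coordinate j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam * w[j], 0.0) / col_ss[j]
            r -= X[:, j] * b[j]
    return b

# Toy APLM: y = 2*x1 + f(z) + noise, with f(z) = z^3 - z and 9 pure-noise covariates.
n = 400
x = rng.standard_normal((n, 10))             # x[:, 0] carries the true linear signal
z = rng.uniform(-1, 1, n)
y = 2.0 * x[:, 0] + (z ** 3 - z) + 0.1 * rng.standard_normal(n)

# Step 1a: approximate the non-linear component with a cubic polynomial basis
# (a crude stand-in for the paper's polynomial-spline basis).
B = np.column_stack([z, z ** 2, z ** 3])
D = np.column_stack([x, B])

# Step 1b: adaptive-lasso selection. An initial lasso fit supplies data-driven
# weights; noise columns get huge weights and are pushed to exactly zero.
b_init = lasso_cd(D, y, lam=0.05)
weights = 1.0 / (np.abs(b_init) + 1e-3)
b_adapt = lasso_cd(D, y, lam=0.05, w=weights)

selected = np.flatnonzero(np.abs(b_adapt) > 1e-8)
# column 0 should survive; noise columns 1-9 should be dropped
```

The paper's second step (local linear smoothing on the selected variables to get pointwise asymptotics) is omitted here; the sketch only shows the selection layer.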
- Award ID(s): 1903226
- NSF-PAR ID: 10230055
- Date Published:
- Journal Name: Biometrika
- Volume: 107
- Issue: 3
- ISSN: 0006-3444
- Page Range / eLocation ID: 723 to 735
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Summary: Large samples are generated routinely from various sources. Classic statistical models, such as smoothing spline ANOVA models, are not well equipped to analyse such large samples because of high computational costs. In particular, the daunting computational cost of selecting smoothing parameters renders smoothing spline ANOVA models impractical. In this article, we develop an asympirical, i.e., asymptotic and empirical, smoothing parameters selection method for smoothing spline ANOVA models in large samples. The idea of our approach is to use asymptotic analysis to show that the optimal smoothing parameter is a polynomial function of the sample size and an unknown constant. The unknown constant is then estimated through empirical subsample extrapolation. The proposed method significantly reduces the computational burden of selecting smoothing parameters in high-dimensional and large samples. We show that smoothing parameters chosen by the proposed method tend to the optimal smoothing parameters that minimize a specific risk function. In addition, the estimator based on the proposed smoothing parameters achieves the optimal convergence rate. Extensive simulation studies demonstrate the numerical advantage of the proposed method over competing methods in terms of relative efficacy and running time. In an application to molecular dynamics data containing nearly one million observations, the proposed method has the best prediction performance.
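The extrapolation step described above can be sketched numerically. In the snippet below the "subsample-optimal" smoothing parameters are synthetic (generated from a made-up ground truth c = 2.0, alpha = 0.8 with small noise) as a stand-in for values a CV or GCV search would produce on real subsamples; the point is only the log-linear fit and the extrapolation to a sample size no search was ever run on.

```python
import numpy as np

# Asympirical idea: the optimal smoothing parameter behaves like
# lam_opt(n) ~ c * n**(-alpha), where alpha comes from asymptotic theory
# and the unknown constant c is recovered from small subsamples.
rng = np.random.default_rng(1)
sub_sizes = np.array([200, 400, 800, 1600])
# Synthetic subsample-optimal values (stand-ins for a CV/GCV search).
lam_sub = 2.0 * sub_sizes ** -0.8 * np.exp(0.02 * rng.standard_normal(4))

# Fit log(lam) = log(c) - alpha * log(n) by least squares.
A = np.column_stack([np.ones(4), np.log(sub_sizes)])
coef, *_ = np.linalg.lstsq(A, np.log(lam_sub), rcond=None)
c_hat, alpha_hat = np.exp(coef[0]), -coef[1]

# Extrapolate to the full sample size without a full-sample search.
n_full = 10 ** 6
lam_full = c_hat * n_full ** -alpha_hat
```

This is why the method scales: the expensive parameter search runs only on subsamples whose sizes are fixed in advance, while the full-sample smoothing parameter comes from a two-coefficient regression.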
Nonparametric estimation of multivariate functions is an important problem in statistical machine learning with many applications, ranging from nonparametric regression to nonparametric graphical models. Several authors have proposed to estimate multivariate functions under the smoothing spline analysis of variance (SSANOVA) framework, which assumes that the multivariate function can be decomposed into the summation of main effects, two-way interaction effects, and higher order interaction effects. However, existing methods are not scalable to the dimension of the random variables and the order of interactions. We propose a LAyer-wiSE leaRning strategy (LASER) to estimate multivariate functions under the SSANOVA framework. The main idea is to approximate the multivariate function sequentially starting from a model with only the main effects. Conditioned on the support of the estimated main effects, we estimate the two-way interaction effects only when the corresponding main effects are estimated to be non-zero. This process is continued until no more higher order interaction effects are identified. The proposed strategy provides a data-driven approach for estimating multivariate functions under the SSANOVA framework. Our proposal yields a sequence of estimators. To study the theoretical properties of the sequence of estimators, we establish the notion of post-selection persistency. Extensive numerical studies are performed to evaluate the performance of LASER.
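The layer-wise hierarchy can be illustrated with a deliberately simplified stand-in: hard-thresholded least squares in place of the penalized SSANOVA component fits, and an arbitrary threshold of 0.3. Only the conditional structure (interactions tried solely among selected main effects) reflects the strategy described above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Toy data: 6 covariates; signal in x0, x1, and their interaction.
n = 500
X = rng.standard_normal((n, 6))
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)

def fit_and_select(F, y, thresh=0.3):
    """Least squares + hard thresholding: a crude stand-in for a penalized
    SSANOVA component fit. Returns coefficients and selected column indices."""
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return coef, np.flatnonzero(np.abs(coef) > thresh)

# Layer 1: main effects only.
_, mains = fit_and_select(X, y)

# Layer 2: two-way interactions are tried ONLY among the selected mains,
# mirroring the conditional hierarchy.
pairs = [(int(i), int(j)) for i, j in combinations(mains, 2)]
F2 = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs])
coef2, sel2 = fit_and_select(F2, y)
picked_pairs = [pairs[k - X.shape[1]] for k in sel2 if k >= X.shape[1]]
# A further layer would try three-way terms only among picked_pairs;
# here none remain, so the process stops.
```

The payoff is combinatorial: with p covariates there are p-choose-2 possible interactions, but only pairs drawn from the (typically small) selected main-effect set are ever fit.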
SUMMARY: We present investigations of rapidly rotating convection in a thick spherical shell geometry relevant to planetary cores, comparing results from quasi-geostrophic (QG), 3-D and hybrid QG-3D models. The 170 reported calculations span Ekman numbers, Ek, between 10⁻⁴ and 10⁻¹⁰, Rayleigh numbers, Ra, between 2 and 150 times supercritical and Prandtl numbers, Pr, between 10 and 10⁻². The default boundary conditions are no-slip at both the ICB and the CMB for the velocity field, with fixed temperatures at the ICB and the CMB. Cases driven by both homogeneous and inhomogeneous CMB heat flux patterns are also explored, the latter including lateral variations, as measured by Q*, the peak-to-peak amplitude of the pattern divided by its mean, taking values up to 5. The QG model is based on the open-source pizza code. We extend this in a hybrid approach to include the temperature field on a 3-D grid. In general, we find convection is dominated by zonal jets at mid-depths in the shell, with thermal Rossby waves prominent close to the outer boundary when the driving is weaker. For the thick spherical shell geometry studied here the hybrid method is best suited for studying convection at modest forcing, Ra ≤ 10 Ra_c when Pr = 1, and departs from the 3-D model results at higher Ra, displaying systematically lower heat transport characterized by lower Nusselt and Reynolds numbers. We find that the lack of equatorially-antisymmetric motions and z-correlations between temperature and velocity in the buoyancy force contributes to the weaker flows in the hybrid formulation. On the other hand, the QG models yield broadly similar results to the 3-D models, for the specific aspect ratio and range of Rayleigh numbers explored here. We cannot point to major disagreements between these two data sets at Pr ≥ 0.1, with the QG model effectively more strongly driven than the hybrid case due to its cylindrically averaged thermal boundary conditions.
When Pr is decreased, the range of agreement between the hybrid and 3-D models expands, for example up to Ra ≤ 15 Ra_c at Pr = 0.1, indicating the hybrid method may be better suited to study convection in the low Pr regime. We thus observe a transition between two regimes: (i) at Pr ≥ 0.1 the QG and 3-D models agree in the studied range of Ra/Ra_c while the hybrid model fails when Ra > 15 Ra_c and (ii) at Pr = 0.01 the QG and 3-D models disagree for Ra > 10 Ra_c while the hybrid and 3-D models agree fairly well up to Ra ∼ 20 Ra_c. Models that include laterally varying heat flux at the outer boundary reproduce regional convection patterns that compare well with those found in similarly forced 3-D models. Previously proposed scaling laws for rapidly rotating convection are tested; our simulations are overall well described by a triple balance between Coriolis, inertia and Archimedean forces, with the length-scale of the convection following the diffusion-free Rhines scaling. The magnitude of Pr affects the number and the size of the jets, with larger structures obtained at lower Pr. Higher velocities and lower heat transport are seen on decreasing Pr, with the scaling behaviour of the convective velocity displaying a strong dependence on Pr. This study is an intermediate step towards a hybrid model of core convection also including 3-D magnetic effects.
Abstract: We consider estimating the average treatment effect (ATE) of a binary treatment in observational data when data‐driven variable selection is needed to select relevant covariates from a moderately large number of available covariates. To leverage covariates predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction based on fitting both working PS and outcome models using the adaptive LASSO. A novel PS estimator, the Double‐index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors from both initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting estimator, which is found to maintain double robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of the working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly robust estimators. These results are extended to the case where p diverges with the sample size and the working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record study and the effect of smoking on C‐reactive protein in the Framingham Offspring Study.
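On a rough reading of the double-index construction, a minimal sketch might look as follows. Everything here is an assumption for illustration, not the paper's implementation: two covariates (so no LASSO step is needed), a Newton-fitted logistic working PS model, a controls-only OLS working outcome model, Gaussian-kernel Nadaraya-Watson smoothing of treatment status over the two standardized linear predictors, and an ad hoc bandwidth n^(-1/6) with clipping at [0.01, 0.99].

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_logistic(X, t, iters=25):
    """Newton-Raphson logistic regression (working PS model)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        b += np.linalg.solve(H, X.T @ (t - p))
    return b

# Simulated observational data with a constant treatment effect of 2.
n = 2000
X = rng.standard_normal((n, 2))
ps_true = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
T = (rng.uniform(size=n) < ps_true).astype(float)
Y = 2.0 * T + X[:, 0] + X[:, 1] + rng.standard_normal(n)

Xd = np.column_stack([np.ones(n), X])
alpha = fit_logistic(Xd, T)                           # working PS model
ctrl = T == 0                                         # working outcome model (controls only)
gamma, *_ = np.linalg.lstsq(Xd[ctrl], Y[ctrl], rcond=None)

# Smooth treatment status over the TWO linear predictors (the double index).
Z = np.column_stack([Xd @ alpha, Xd @ gamma])
Z = (Z - Z.mean(0)) / Z.std(0)
h = n ** (-1 / 6)                                     # ad hoc bandwidth
D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / (2 * h ** 2))
pi = np.clip(K @ T / K.sum(axis=1), 0.01, 0.99)       # smoothed propensity score

# Normalized (Hajek) inverse probability weighting with the smoothed PS.
w1, w0 = T / pi, (1 - T) / (1 - pi)
ate = (w1 @ Y) / w1.sum() - (w0 @ Y) / w0.sum()
```

The intuition the sketch tries to capture is that the kernel step only needs the two working linear predictors, not the full covariate vector, so the nonparametric smoothing stays two-dimensional however many covariates enter the working models.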