Summary We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method.
A note on centering in subsample selection for linear regression
Centring is a commonly used technique in linear regression analysis. When both the responses and the covariates are centred, the ordinary least squares estimator of the slope parameter can be calculated from a model without the intercept. If a subsample is selected from a centred full data set, the subsample is typically uncentred. In this case, is it still appropriate to fit a model without the intercept? The answer is yes: we show that the least squares estimator of the slope parameter obtained from a model without the intercept is unbiased, and that its variance-covariance matrix is smaller in the Loewner order than that of the estimator obtained from a model with the intercept. We further show that, for noninformative weighted subsampling with a weighted least squares estimator, using the full data weighted means to relocate the subsample improves the estimation efficiency.
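The comparison described in the abstract can be explored with a small simulation. The following is a minimal sketch, not the authors' code: the sample sizes, the Gaussian covariates and errors, the uniform subsampling rule, and the function names `slope_estimates` and `simulate` are all assumptions made for illustration. It centres the full data, draws one subsample without looking at the responses, and compares the slope estimates from the no-intercept and with-intercept fits over repeated error draws.

```python
import numpy as np


def slope_estimates(x_sub, y_sub):
    """OLS slope estimates on a subsample: (no-intercept fit, with-intercept fit)."""
    b_no, *_ = np.linalg.lstsq(x_sub, y_sub, rcond=None)
    design = np.column_stack([np.ones(len(y_sub)), x_sub])
    b_with, *_ = np.linalg.lstsq(design, y_sub, rcond=None)
    return b_no, b_with[1:]  # drop the fitted intercept


def simulate(n=2000, r=30, p=2, reps=1000, seed=0):
    """Monte Carlo comparison with covariates and subsample held fixed (assumed setup)."""
    rng = np.random.default_rng(seed)
    beta = np.arange(1, p + 1, dtype=float)  # hypothetical true slopes
    x = rng.normal(size=(n, p))
    x_c = x - x.mean(axis=0)  # centre covariates with full-data means
    # Noninformative selection: indices are drawn without using the responses.
    idx = rng.choice(n, size=r, replace=False)
    x_sub = x_c[idx]  # typically no longer has mean zero

    est_no = np.empty((reps, p))
    est_with = np.empty((reps, p))
    for k in range(reps):
        y = 3.0 + x @ beta + rng.normal(size=n)  # hypothetical intercept 3.0
        y_c = y - y.mean()  # centre responses with the full-data mean
        est_no[k], est_with[k] = slope_estimates(x_sub, y_c[idx])
    return beta, est_no, est_with


if __name__ == "__main__":
    beta, est_no, est_with = simulate()
    print("true slopes:            ", beta)
    print("mean, no intercept:     ", est_no.mean(axis=0))
    print("mean, with intercept:   ", est_with.mean(axis=0))
    print("total var, no intercept:  ", np.trace(np.cov(est_no.T)))
    print("total var, with intercept:", np.trace(np.cov(est_with.T)))
```

Both sets of estimates should average close to the true slopes, and the total variance of the no-intercept estimates should not exceed that of the with-intercept estimates beyond Monte Carlo noise. The trace is only a scalar summary; checking the Loewner order itself would require examining the eigenvalues of the difference of the two covariance matrices.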
- Award ID(s): 2105571
- PAR ID: 10387850
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: Stat
- Volume: 11
- Issue: 1
- ISSN: 2049-1573
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
For massive survival data, we propose a subsampling algorithm to efficiently approximate the estimates of regression parameters in the additive hazards model. We establish consistency and asymptotic normality of the subsample‐based estimator given the full data. The optimal subsampling probabilities are obtained via minimizing asymptotic variance of the resulting estimator. The subsample‐based procedure can largely reduce the computational cost compared with the full data method. In numerical simulations, our method has low bias and satisfactory coverage probabilities. We provide an illustrative example on the survival analysis of patients with lymphoma cancer from the Surveillance, Epidemiology, and End Results Program.
-
This work derives a residual-based a posteriori error estimator for reduced models learned with non-intrusive model reduction from data of high-dimensional systems governed by linear parabolic partial differential equations with control inputs. It is shown that quantities that are necessary for the error estimator can be either obtained exactly as the solutions of least-squares problems in a non-intrusive way from data such as initial conditions, control inputs, and high-dimensional solution trajectories or bounded in a probabilistic sense. The computational procedure follows an offline/online decomposition. In the offline (training) phase, the high-dimensional system is judiciously solved in a black-box fashion to generate data and to set up the error estimator. In the online phase, the estimator is used to bound the error of the reduced-model predictions for new initial conditions and new control inputs without recourse to the high-dimensional system. Numerical results demonstrate the workflow of the proposed approach from data to reduced models to certified predictions.
-
Abstract This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and nonvanishing noise. A generalized least-squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, which requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of the latter procedure can be retained, but surprisingly depends on the way the labels are encoded.
-
In this paper, we propose an improved estimation method for logistic regression based on subsamples taken according to the optimal subsampling probabilities developed in Wang et al. (2018). Both asymptotic results and numerical results show that the new estimator has a higher estimation efficiency. We also develop a new algorithm based on Poisson subsampling, which does not require approximating the optimal subsampling probabilities all at once. This is computationally advantageous when the available random-access memory is not enough to hold the full data. Interestingly, asymptotic distributions also show that Poisson subsampling produces a more efficient estimator if the sampling ratio, the ratio of the subsample size to the full data sample size, does not converge to zero. We also obtain the unconditional asymptotic distribution for the estimator based on Poisson subsampling. Pilot estimators are required to calculate subsampling probabilities and to correct biases in unweighted estimators; interestingly, even if the pilot estimators are inconsistent, the proposed method still produces consistent and asymptotically normal estimators.
