Title: A note on centering in subsample selection for linear regression

Centring is a commonly used technique in linear regression analysis. With both the responses and the covariates centred, the ordinary least squares estimator of the slope parameter can be calculated from a model without an intercept. If a subsample is selected from the centred full data, the subsample is typically uncentred. Is it then still appropriate to fit a model without an intercept? The answer is yes: we show that the least squares estimator of the slope parameter obtained from a model without an intercept is unbiased and has a smaller variance-covariance matrix, in the Loewner order, than the one obtained from a model with an intercept. We further show that, for noninformative weighted subsampling with a weighted least squares estimator, using the full-data weighted means to relocate the subsample improves the estimation efficiency.
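As a minimal numerical sketch of the abstract's claim (the simulation setup and all variable names are invented for illustration and are not taken from the paper): centring the full data absorbs the intercept, and a no-intercept least squares fit on an uncentred subsample of that centred data still estimates the slope without systematic bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
X = rng.normal(size=(n, d)) + 2.0        # covariates with nonzero means
beta = np.array([1.0, -0.5, 2.0])
y = 1.5 + X @ beta + rng.normal(size=n)  # true model has intercept 1.5

# Centre the full data: the intercept is absorbed, so a no-intercept
# least squares fit on (Xc, yc) recovers the slope.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
slope_full = np.linalg.lstsq(Xc, yc, rcond=None)[0]

# A uniform subsample of the centred data is no longer centred, but a
# no-intercept fit on it is still unbiased for the slope.
idx = rng.choice(n, size=500, replace=False)
slope_sub = np.linalg.lstsq(Xc[idx], yc[idx], rcond=None)[0]

print(np.round(slope_full, 2), np.round(slope_sub, 2))
```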

 
Award ID(s):
2105571
NSF-PAR ID:
10387850
Author(s) / Creator(s):
 
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Stat
Volume:
11
Issue:
1
ISSN:
2049-1573
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method. 
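The optimal weighted combination above can be illustrated with a toy simulation (a hypothetical scalar setup, not the paper's doubly censored likelihoods): two independent unbiased component estimators combined with inverse-variance weights yield an ensemble whose variance falls below either component's.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0            # true scalar regression parameter
v1, v2 = 1.0, 0.25     # variances of the two component estimators
reps = 20_000

# Simulate two independent, unbiased component estimators.
t1 = theta + rng.normal(scale=np.sqrt(v1), size=reps)
t2 = theta + rng.normal(scale=np.sqrt(v2), size=reps)

# Under uncorrelatedness, optimal weights are proportional to the
# inverse variances; the ensemble variance is 1/(1/v1 + 1/v2) = 0.2.
w1 = (1 / v1) / (1 / v1 + 1 / v2)
w2 = (1 / v2) / (1 / v1 + 1 / v2)
ensemble = w1 * t1 + w2 * t2

print(ensemble.mean(), ensemble.var())
```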
  2.
    In power systems, where real-time monitoring underpins cyber-physical security, false data injection attacks on wide-area measurements are of major concern. However, the database of network parameters is just as crucial to the state estimation process: maintaining the accuracy of the system model is the other part of the equation, since almost all applications in power systems depend heavily on the state estimator outputs. While much effort has been devoted to false data injection in measurements, little work has been reported on the broader theme of false data injection in the network parameter database. State-of-the-art physics-based model solutions correct false data injection in the network parameter database considering only the available wide-area measurements; in addition, deterministic models are used for correction. In this paper, an overdetermined physics-based correction model for parameter false data injection is presented. The overdetermined model uses a parameter database correction Jacobian matrix and a Taylor series expansion approximation. The method further applies the concept of synthetic measurements, i.e. measurements that do not exist in the real-life system. A machine learning linear regression-based model for measurement prediction is integrated into the framework by deriving weights for the creation of synthetic measurements. The presented model is validated on the IEEE 118-bus system. Numerical results show that the approximation error is lower than the state of the art, while robustness is added to the correction process. The model is easy to implement on top of the classical weighted-least-squares solution, highlighting its potential for real-life deployment.
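The classical weighted-least-squares solution that such correction schemes build on can be sketched as follows (a generic linearised WLS solve on synthetic numbers, not the paper's 118-bus model; all dimensions and values are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 12, 4                    # measurements, states (overdetermined)
H = rng.normal(size=(m, n))     # measurement Jacobian (linearised model)
x_true = rng.normal(size=n)
sigma = np.full(m, 0.01)        # measurement standard deviations
z = H @ x_true + rng.normal(scale=sigma)

# Classical WLS state estimate: solve (H' W H) x = H' W z
# with W = diag(1 / sigma^2).
W = np.diag(1.0 / sigma**2)
x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ z)
print(np.round(x_hat - x_true, 3))
```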
  3. In this paper, we study the impact of the presence of Byzantine sensors on the reduced-rank linear least squares (LS) estimator. A sensor network with N sensors makes observations of the physical phenomenon and transmits them to a fusion center, which computes the LS estimate of the parameter of interest. It is well known that rank reduction exploits the bias-variance trade-off in the full-rank estimator by putting higher priority on the highly informative content of the data. The low-rank LS estimator is constructed from this highly informative content, while the remaining data can be discarded without affecting the overall performance of the estimator. We consider the scenario where a fraction of the N sensors are subject to data falsification by Byzantine sensors, wherein an intruder injects a higher noise power (compared with the unattacked sensors) into the measurements of the attacked sensors. Our main contribution is an analytical characterization of the impact of a data falsification attack of the above type on the performance of the reduced-rank LS estimator. In particular, we show how optimally prioritizing the highly informative content of the data is affected in the presence of attacks. A surprising result is that, under sensor attacks, when the elements of the data matrix are all positive, the error performance of the low-rank estimator experiences a phenomenon wherein the estimate of the mean-squared error comprises negative components. A complex nonlinear programming-based recipe is known to resolve this undesirable effect; however, the phenomenon is often considered very objectionable in the statistical literature. On the other hand, this effect can serve to our advantage in detecting cyber attacks on sensor systems. Numerical results are presented to complement the theoretical findings of the paper.
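A minimal sketch of the reduced-rank LS construction described above (a generic truncated-SVD estimator on simulated data; it does not reproduce the paper's attack analysis, and all dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, r = 200, 10, 3            # observations, parameters, retained rank
A = rng.normal(size=(N, p))     # data matrix
theta = rng.normal(size=p)
y = A @ theta + rng.normal(scale=0.1, size=N)

# Reduced-rank LS: keep only the r most informative singular directions
# of the data matrix and discard the rest (bias-variance trade-off).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
theta_r = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])

# Full-rank LS for comparison; it attains the smallest residual.
theta_full = np.linalg.lstsq(A, y, rcond=None)[0]
print(theta_r.shape, theta_full.shape)
```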
  4. ABSTRACT

    We have measured the angular autocorrelation function of near-infrared galaxies in SERVS + DeepDrill, the Spitzer Extragalactic Representative Volume Survey and its follow-up survey of the Deep Drilling Fields, in three large fields totalling over 20 deg2 on the sky, observed in two bands centred on 3.6 and 4.5 μm. We performed this analysis on the full sample as well as on sources selected by [3.6]–[4.5] colour in order to probe clustering for different redshift regimes. We estimated the spatial correlation strength as well, using the redshift distribution from S-COSMOS with the same source selection. The strongest clustering was found for our bluest subsample, with 〈z〉 ∼ 0.7, which has the narrowest redshift distribution of all our subsamples. We compare these estimates to previous results from the literature, but also to estimates derived from mock samples, selected in the same way as the observational data, using deep light-cones generated from the SHARK semi-analytical model of galaxy formation. For all simulated (sub)samples, we find a slightly steeper slope than for the corresponding observed ones, but the spatial clustering length is comparable in most cases.
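Angular autocorrelation measurements of this kind are typically made with a pair-count estimator. The sketch below applies the standard Landy-Szalay estimator to mock uniform catalogues on a small flat patch (an illustration of the technique, not the survey pipeline; because both catalogues are uniform, the estimate is consistent with zero clustering):

```python
import numpy as np

def pair_counts(a, b, bins):
    """Histogram of pairwise separations between point sets a and b."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return np.histogram(d.ravel(), bins=bins)[0].astype(float)

rng = np.random.default_rng(4)
nd = nr = 300
data = rng.uniform(size=(nd, 2))  # mock galaxy positions, flat-sky patch
rand = rng.uniform(size=(nr, 2))  # uniform random comparison catalogue
bins = np.linspace(0.01, 0.3, 8)  # separation bins

# Auto-counts: self-pairs (zero separation) fall below the first bin
# edge, and each distinct pair appears twice in the full grid.
dd = pair_counts(data, data, bins) / 2
rr = pair_counts(rand, rand, bins) / 2
dr = pair_counts(data, rand, bins)

# Normalise by the number of pairs, then apply Landy-Szalay:
# w(theta) = (DD - 2 DR + RR) / RR with normalised counts.
ddn = dd / (nd * (nd - 1) / 2)
drn = dr / (nd * nr)
rrn = rr / (nr * (nr - 1) / 2)
w = (ddn - 2 * drn + rrn) / rrn
print(np.round(w, 2))
```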

     
  5. Abstract

    This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and nonvanishing noise. A generalized least-squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, which requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of the latter procedure can be retained, but surprisingly depends on the way the labels are encoded.
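The interpolation property mentioned above can be seen in a tiny overparametrised example (minimum-norm least squares via the pseudoinverse on hypothetical data; this only illustrates interpolation, not the paper's estimator or its intercept correction): with more features than samples, the least squares fit reproduces every training label exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 200                       # overparametrised: p > n
X = rng.normal(size=(n, p))
y = rng.choice([-1.0, 1.0], size=n)  # binary labels encoded as +/-1

# Minimum-norm least squares direction: w = pinv(X) @ y. With p > n and
# X of full row rank, it interpolates: X @ w equals y on every sample.
w = np.linalg.pinv(X) @ y
print(np.abs(X @ w - y).max())
```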

     