Title: Estimating the error variance in a high-dimensional linear model
Summary: The lasso has been studied extensively as a tool for estimating the coefficient vector in the high-dimensional linear model; however, considerably less is known about estimating the error variance in this context. In this paper, we propose the natural lasso estimator for the error variance, which maximizes a penalized likelihood objective. A key aspect of the natural lasso is that the likelihood is expressed in terms of the natural parameterization of the multi-parameter exponential family of a Gaussian with unknown mean and variance. The result is a remarkably simple estimator of the error variance with provably good performance in terms of mean squared error. These theoretical results do not require placing any assumptions on the design matrix or the true regression coefficients. We also propose a companion estimator, called the organic lasso, which theoretically does not require tuning of the regularization parameter. Both estimators do well empirically compared to pre-existing methods, especially in settings where successful recovery of the true support of the coefficient vector is hard. Finally, we show that existing methods can do well under fewer assumptions than previously known, thus providing a fuller story about the problem of estimating the error variance in high-dimensional linear models.
Award ID(s): 1748166
PAR ID: 10146537
Author(s) / Creator(s): ;
Date Published:
Journal Name: Biometrika
Volume: 106
Issue: 3
ISSN: 0006-3444
Page Range / eLocation ID: 533 to 546
Format(s): Medium: X
Sponsoring Org: National Science Foundation
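The abstract describes the natural lasso as a "remarkably simple" variance estimator. A minimal numpy sketch of the idea follows, assuming (consistent with the published paper, but not taken from the authors' code) that the natural lasso estimate of the error variance is the minimized value of the penalized objective (1/n)||y − Xβ||² + 2λ||β||₁; the ISTA solver, the simulated data, and the choice of λ below are illustrative assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/n)||y - X b||_2^2 + 2*lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    b = np.zeros(p)
    # step size from the Lipschitz constant of the smooth part's gradient
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ b - y))
        z = b - step * grad
        # soft-thresholding: proximal operator of 2*lam*||.||_1 with this step size
        b = np.sign(z) * np.maximum(np.abs(z) - step * 2.0 * lam, 0.0)
    return b

def natural_lasso_variance(X, y, lam):
    """Error-variance estimate: the minimized value of the lasso objective (assumed form)."""
    b = lasso_ista(X, y, lam)
    return np.mean((y - X @ b) ** 2) + 2.0 * lam * np.sum(np.abs(b))

# small simulated high-dimensional example: n = 50 < p = 100, true sigma^2 = 1
rng = np.random.default_rng(0)
n, p, s = 50, 100, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0
y = X @ beta + rng.standard_normal(n)
lam = np.sqrt(np.log(p) / n)  # a common theoretical choice of tuning parameter
sigma2_hat = natural_lasso_variance(X, y, lam)
print(sigma2_hat)
```

Note that the estimate comes for free once a lasso solution is available: no separate variance model is fit, which is what makes the construction simple.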
More Like this
  1. Abstract: We consider high‐dimensional inference for potentially misspecified Cox proportional hazard models based on low‐dimensional results by Lin and Wei (1989). A desparsified Lasso estimator is proposed based on the log partial likelihood function and shown to converge to a pseudo‐true parameter vector. Interestingly, the sparsity of the true parameter can be inferred from that of the above limiting parameter. Moreover, each component of the above (nonsparse) estimator is shown to be asymptotically normal with a variance that can be consistently estimated even under model misspecifications. In some cases, this asymptotic distribution leads to valid statistical inference procedures, whose empirical performances are illustrated through numerical examples.
  2. Summary: In this paper, we develop a systematic theory for high-dimensional analysis of variance in multivariate linear regression, where the dimension and the number of coefficients can both grow with the sample size. We propose a new U-type statistic to test linear hypotheses and establish a high-dimensional Gaussian approximation result under fairly mild moment assumptions. Our general framework and theory can be used to deal with the classical one-way multivariate analysis of variance, and the nonparametric one-way multivariate analysis of variance in high dimensions. To implement the test procedure, we introduce a sample-splitting-based estimator of the second moment of the error covariance and discuss its properties. A simulation study shows that our proposed test outperforms some existing tests in various settings.
  3. Estimation and inference in statistics pose significant challenges when data are collected adaptively. Even in linear models, the Ordinary Least Squares (OLS) estimator may fail to exhibit asymptotic normality for single coordinate estimation and have inflated error. This issue is highlighted by a recent minimax lower bound, which shows that the error of estimating a single coordinate can be enlarged by a multiple of $$\sqrt{d}$$ when data are allowed to be arbitrarily adaptive, compared with the case when they are i.i.d. Our work explores this striking difference in estimation performance between utilizing i.i.d. and adaptive data. We investigate how the degree of adaptivity in data collection impacts the performance of estimating a low-dimensional parameter component in high-dimensional linear models. We identify conditions on the data collection mechanism under which the estimation error for a low-dimensional parameter component matches its counterpart in the i.i.d. setting, up to a factor that depends on the degree of adaptivity. We show that OLS or OLS on centered data can achieve this matching error. In addition, we propose a novel estimator for single coordinate inference via solving a Two-stage Adaptive Linear Estimating equation (TALE). Under a weaker form of adaptivity in data collection, we establish an asymptotic normality property of the proposed estimator. 
  4. The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regressions and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such a model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results for regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch of samples in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample splitting, and can be extended to multivariate response cases. The proposed methods also show encouraging performance in numerical studies.
  5. Multivariate adaptive regression splines (MARS) is a popular method for nonparametric regression introduced by Friedman in 1991. MARS fits simple nonlinear and non-additive functions to regression data. We propose and study a natural lasso variant of the MARS method. Our method is based on least squares estimation over a convex class of functions obtained by considering infinite-dimensional linear combinations of functions in the MARS basis and imposing a variation-based complexity constraint. Although our estimator is defined as a solution to an infinite-dimensional optimization problem, it can be computed via finite-dimensional convex optimization. Under a few standard design assumptions, we prove that our estimator achieves a rate of convergence that depends only logarithmically on dimension and thus avoids the usual curse of dimensionality to some extent. We also show that our method is naturally connected to nonparametric estimation techniques based on smoothness constraints. We implement our method with a cross-validation scheme for the selection of the involved tuning parameter and compare it to the usual MARS method in various simulation and real data settings.