Title: Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression
Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have been applied to reduce the number of candidate predictors. We propose a stepwise procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure provides a final model; controls both false negatives and false positives; and yields consistent estimates, which are useful for gauging the actual effect sizes of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
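For intuition about how a stepwise fit of a generalized linear model proceeds, here is a minimal sketch of forward selection for a logistic model. It is not the procedure proposed in the paper: the BIC-based stopping rule, the synthetic data, and all parameter values are assumptions chosen only to make the example self-contained and runnable.

```python
# Sketch of forward stepwise selection for a logistic GLM (illustrative only;
# the BIC stopping rule and toy data are assumptions, not the paper's method).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 500                          # n << p: high-dimensional toy data
X = rng.standard_normal((n, p))
eta = 1.5 * X[:, 0] - 2.0 * X[:, 1]      # only two truly active predictors
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

def bic(cols):
    """BIC of a logistic GLM on the given predictor columns plus an intercept."""
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    fit = sm.GLM(y, design, family=sm.families.Binomial()).fit()
    return -2.0 * fit.llf + design.shape[1] * np.log(n)

selected, current = [], bic([])
while True:
    best, best_j = min((bic(selected + [j]), j)
                       for j in range(p) if j not in selected)
    if best >= current:                  # stop when no candidate improves the BIC
        break
    selected.append(best_j)
    current = best

print("selected predictors:", selected)
```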
Award ID(s):
1915099
PAR ID:
10249956
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Entropy
Volume:
22
Issue:
9
ISSN:
1099-4300
Page Range / eLocation ID:
965
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Fine‐Gray proportional sub‐distribution hazards (PSH) model is among the most popular regression models for competing-risks time‐to‐event data. This article develops a fast safe feature elimination method, named PSH‐SAFE, for fitting the penalized Fine‐Gray PSH model with a Lasso (or adaptive Lasso) penalty. Our PSH‐SAFE procedure is straightforward to implement, fast, and scales well to ultrahigh-dimensional data. We also show that, as a feature screening procedure, PSH‐SAFE is safe in the sense that the eliminated features are guaranteed to be inactive in the original Lasso (or adaptive Lasso) estimator for the penalized PSH model. We evaluate the performance of the PSH‐SAFE procedure in terms of computational efficiency, screening efficiency and safety, run‐time, and prediction accuracy on multiple simulated datasets and a real bladder cancer dataset. Our empirical results show that the PSH‐SAFE procedure possesses desirable screening efficiency and safety properties and can offer substantially improved computational efficiency as well as similar or better prediction performance in comparison to baseline competitors.
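PSH‐SAFE itself is tailored to the Fine‐Gray partial likelihood, but the flavor of a safe screening rule can be conveyed with the classical SAFE rule for the squared-error Lasso, sketched below. The rule shown is the generic least-squares one, not the PSH variant, and the toy data and penalty level are assumptions.

```python
# Classical SAFE rule for the squared-error Lasso: features flagged here are
# guaranteed to have zero coefficients at penalty lam (generic rule, not PSH-SAFE).
import numpy as np

def safe_screen(X, y, lam):
    """Return a boolean mask of features that can be safely discarded at penalty lam."""
    corr = np.abs(X.T @ y)                 # |x_j' y| for each feature
    lam_max = corr.max()                   # smallest penalty giving an all-zero fit
    col_norms = np.linalg.norm(X, axis=0)
    thresh = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < thresh

# Toy usage: screen an ultrahigh-dimensional design before fitting the Lasso.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5000))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(100)
lam = 0.5 * np.abs(X.T @ y).max()
discard = safe_screen(X, y, lam)
print(f"eliminated {discard.sum()} of {discard.size} features")
```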
  2. Large datasets make it possible to build predictive models that can capture heterogeneous relationships between the response variable and features. The mixture of high-dimensional linear experts model posits that observations come from a mixture of high-dimensional linear regression models, where the mixture weights are themselves feature-dependent. In this article, we show how to construct valid prediction sets for an ℓ1-penalized mixture of experts model in the high-dimensional setting. We make use of a debiasing procedure to account for the bias induced by the penalization and propose a novel strategy for combining intervals to form a prediction set with coverage guarantees in the mixture setting. Synthetic examples and an application to the prediction of critical temperatures of superconducting materials show our method to have reliable practical performance.
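A heavily simplified sketch of the "combine intervals across experts" idea appears below: per-expert Gaussian intervals are merged into a union over experts with nonnegligible gating weight. The paper's actual construction (debiasing plus a coverage-calibrated combination) is more involved; the means, standard deviations, weights, and the weight floor here are all assumptions made for illustration.

```python
# Simplified illustration of forming a prediction set from a mixture of experts;
# this is NOT the paper's calibrated procedure, only the combination idea.
import numpy as np
from scipy.stats import norm

def union_prediction_set(means, sds, weights, alpha=0.1, weight_floor=0.05):
    """Union of per-expert Gaussian (1 - alpha) intervals, keeping only experts
    whose gating weight exceeds a floor."""
    z = norm.ppf(1 - alpha / 2)
    intervals = []
    for m, s, w in zip(means, sds, weights):
        if w >= weight_floor:              # ignore experts with negligible weight
            intervals.append((m - z * s, m + z * s))
    return intervals                       # a union of intervals (may overlap)

# Toy usage: two experts predict very different outcomes for the same input.
print(union_prediction_set(means=[1.0, 6.0], sds=[0.5, 0.8], weights=[0.6, 0.4]))
```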
  3. High‐dimensional multinomial regression models are very useful in practice but have received less research attention than logistic regression models, especially from the perspective of statistical inference. In this work, we analyze the estimation and prediction error of the contrast‐based ℓ1‐penalized multinomial regression model and extend the debiasing method to the multinomial case, providing a valid confidence interval and p-value for the individual hypothesis test on each coefficient. We also examine cases of model misspecification and non‐identically distributed data to demonstrate the robustness of our method when some assumptions are violated. We apply the debiasing method to identify important predictors of progression into different subtypes of dementia. Results from extensive simulations show the superiority of the debiasing method compared to other inference methods.
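To show what debiasing buys in the simplest setting, here is a sketch of the debiased (desparsified) Lasso for a plain linear model, the idea the paper extends to the multinomial case. The inverse-covariance surrogate is a crude ridge-regularized inverse rather than the nodewise Lasso used in the literature, and the penalty levels and toy data are assumptions.

```python
# Sketch of the debiased Lasso for a linear model (not the multinomial method itself).
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, lam=0.1, ridge=0.05):
    """One-step bias correction of a Lasso fit, with per-coordinate 95% intervals."""
    n, p = X.shape
    beta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    sigma_hat = X.T @ X / n
    theta = np.linalg.inv(sigma_hat + ridge * np.eye(p))   # crude precision estimate
    resid = y - X @ beta
    beta_d = beta + theta @ X.T @ resid / n                 # debiasing step
    sigma2 = resid @ resid / n
    se = np.sqrt(sigma2 * np.diag(theta @ sigma_hat @ theta.T) / n)
    return beta_d, beta_d - 1.96 * se, beta_d + 1.96 * se

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 50))
y = 2.0 * X[:, 0] + rng.standard_normal(200)
bd, lo, hi = debiased_lasso(X, y)
print("coef 0 CI:", (round(lo[0], 2), round(hi[0], 2)))
```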
  4. Predictive modeling of a rare event using an unbalanced data set leads to poor prediction sensitivity. Although this obstacle is often accompanied by other analytical issues, such as a large number of predictors and multicollinearity, little has been done to address these issues simultaneously. The objective of this study is to compare several predictive modeling techniques in this setting. The unbalanced data set is addressed using four resampling methods: undersampling, oversampling, hybrid sampling, and ROSE synthetic data generation. The large number of predictors is addressed using penalized regression methods and ensemble methods. The predictive models are evaluated in terms of sensitivity and F1 score via simulation studies and applied to the prediction of food deserts in North Carolina. Our results show that balancing the data via resampling methods leads to improved prediction sensitivity for every classifier. The application analysis shows that resampling also increases the F1 score for every classifier, whereas in the simulated data the F1 score tended to decrease slightly in most cases. Our findings may help improve classification performance for unbalanced rare-event data in many other applications.
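A minimal end-to-end sketch of one such pipeline is shown below: random oversampling of the rare class followed by an L1-penalized logistic regression, evaluated by sensitivity and F1. The synthetic data, the 1:1 resampling ratio, and the penalty strength are assumptions; the study compares several resampling schemes and classifiers beyond this one.

```python
# Toy pipeline: oversample the rare class, fit an L1-penalized logistic regression,
# and report sensitivity (recall) and F1. Data and settings are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 2000, 100
X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n) > 2.5).astype(int)  # rare event

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Randomly oversample the minority class in the training set to a 1:1 ratio.
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]
boost = rng.choice(minority, size=len(majority), replace=True)
idx = np.concatenate([majority, boost])

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_tr[idx], y_tr[idx])

pred = clf.predict(X_te)
print("sensitivity:", round(recall_score(y_te, pred), 3),
      "F1:", round(f1_score(y_te, pred), 3))
```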
  5. Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative. 
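The core "estimate covariances from all available data, then solve a penalized problem on those summaries" idea can be sketched as below: an available-case covariance estimate followed by a coordinate-descent Lasso that runs entirely on the covariance matrices, so no imputation is needed. This is a single-response simplification with made-up block-missing data and penalty level; the paper's method additionally estimates the response precision matrix and handles multiple correlated responses.

```python
# Sketch of covariance-based estimation with block-missing predictors
# (single-response simplification; assumes columns are centered).
import numpy as np

def pairwise_cov(X):
    """Available-case covariance: each entry uses all rows where both variables
    are observed (NaNs mark block-missing modalities)."""
    p = X.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            S[j, k] = S[k, j] = np.mean(X[ok, j] * X[ok, k])
    return S

def cov_lasso(S, s_xy, lam, iters=200):
    """Coordinate-descent Lasso driven only by covariance summaries S and s_xy."""
    p = len(s_xy)
    beta = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = s_xy[j] - S[j] @ beta + S[j, j] * beta[j]   # partial residual
            beta[j] = np.sign(r) * max(abs(r) - lam, 0.0) / S[j, j]
    return beta

# Toy usage with one block-missing modality.
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 20))
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(300)
X[:150, 10:] = np.nan                      # second modality missing for half the subjects
s_xy = np.array([np.mean(X[~np.isnan(X[:, j]), j] * y[~np.isnan(X[:, j])])
                 for j in range(X.shape[1])])
print(np.round(cov_lasso(pairwise_cov(X), s_xy, lam=0.05), 2))
```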