Title: Lazy Estimation of Variable Importance for Large Neural Networks
As opaque predictive models increasingly impact many areas of modern life, interest in quantifying the importance of a given input variable for making a specific prediction has grown. Recently, there has been a proliferation of model-agnostic methods to measure variable importance (VI) that analyze the difference in predictive power between a full model trained on all variables and a reduced model that excludes the variable(s) of interest. A bottleneck common to these methods is the estimation of the reduced model for each variable (or subset of variables), which is an expensive process that often does not come with theoretical guarantees. In this work, we propose a fast and flexible method for approximating the reduced model with important inferential guarantees. We replace the need for fully retraining a wide neural network by a linearization initialized at the full model parameters. By adding a ridge-like penalty to make the problem convex, we prove that when the ridge penalty parameter is sufficiently large, our method estimates the variable importance measure with an error rate of O(1/n) where n is the number of training samples. We also show that our estimator is asymptotically normal, enabling us to provide confidence bounds for the VI estimates. We demonstrate through simulations that our method is fast and accurate under several data-generating regimes, and we illustrate its real-world applicability on a seasonal climate forecasting example.
Award ID(s):
1811767
PAR ID:
10409995
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 39th International Conference on Machine Learning
Page Range / eLocation ID:
7122-7143
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
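To make the mechanics of the abstract concrete, here is a minimal NumPy sketch of the lazy idea: linearize a wide network's predictions around the full-model parameters and fit a ridge-penalized correction in place of retraining. The one-hidden-layer network, the untrained stand-in weights, mean-imputation as the way of removing a variable, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a one-hidden-layer "full model" (weights here are random
# stand-ins for trained parameters; purely illustrative).
n, p, width = 200, 5, 64
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
W1 = rng.normal(size=(p, width)) / np.sqrt(p)
b1 = np.zeros(width)
w2 = rng.normal(size=width) / np.sqrt(width)

def predict(X):
    return np.tanh(X @ W1 + b1) @ w2

def param_jacobian(X):
    """Per-sample gradient of predictions w.r.t. (W1, b1, w2), flattened."""
    H = np.tanh(X @ W1 + b1)                # (n, width) hidden activations
    dH = (1.0 - H ** 2) * w2                # d prediction / d pre-activation
    J_W1 = (X[:, :, None] * dH[:, None, :]).reshape(len(X), -1)
    return np.hstack([J_W1, dH, H])         # columns: W1, b1, w2 blocks

def lazy_reduced_predictions(X, y, drop, lam=10.0):
    """Approximate the retrained reduced model by a single ridge-penalized
    linearization around the full-model parameters (the "lazy" step)."""
    Xr = X.copy()
    Xr[:, drop] = Xr[:, drop].mean(axis=0)  # mean-impute the dropped variable
    f0, J = predict(Xr), param_jacobian(Xr)
    delta = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ (y - f0))
    return f0 + J @ delta

# VI estimate: gap in predictive performance (MSE) with variable 0 removed.
mse = lambda resid: float(np.mean(resid ** 2))
vi = mse(y - lazy_reduced_predictions(X, y, drop=[0])) - mse(y - predict(X))
print(f"estimated importance of variable 0: {vi:.3f}")
```

Per the abstract, the ridge term (lam above) is what makes the correction problem convex and, when large enough, yields the O(1/n) error rate and asymptotic normality; in practice, one linear solve per dropped variable replaces one full retrain.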
More Like this
  1. Subdata selection from big data is an active area of research that facilitates inferences based on big data with limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata that has excellent statistical properties. But the method can only be used if the subdata size, k, is at least twice the number of regression variables, p. In addition, even when k ≥ 2p, under the assumption of effect sparsity one can expect to obtain subdata with better statistical properties by focusing on active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables p, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as we show, improves on other proposed methods in terms of variable selection and building a predictive model based on subdata when the full data size n is very large and the number of variables p is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of n, but the roles reverse under effect sparsity for extremely large values of n.
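A compact sketch of how the two ingredients could be combined: a lasso on a pilot subsample screens for active variables, then IBOSS-style extreme-value selection is restricted to those variables. The pilot size, the cross-validated lasso, and the OLS refit are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def iboss_rows(X, k, cols):
    """IBOSS-style subdata: for each active column, keep the rows with the
    r smallest and r largest values, where r = k / (2 * #active columns)."""
    r = max(1, k // (2 * len(cols)))
    chosen = set()
    for j in cols:
        order = np.argsort(X[:, j])
        chosen.update(order[:r].tolist())
        chosen.update(order[-r:].tolist())
    return np.array(sorted(chosen))

def class_subdata(X, y, k, n_pilot=1000, seed=0):
    """Sketch of the CLASS idea: lasso screening on a pilot subsample,
    IBOSS-style row selection on the active variables, then an OLS refit
    on the resulting subdata."""
    rng = np.random.default_rng(seed)
    pilot = rng.choice(len(X), size=min(n_pilot, len(X)), replace=False)
    active = np.flatnonzero(LassoCV(cv=5).fit(X[pilot], y[pilot]).coef_)
    if active.size == 0:                      # fall back to all variables
        active = np.arange(X.shape[1])
    rows = iboss_rows(X, k, active)
    fit = LinearRegression().fit(X[np.ix_(rows, active)], y[rows])
    return rows, active, fit
```

Restricting the extreme-value scan to the lasso's active set is what exploits effect sparsity: inactive columns no longer consume any of the subdata budget k.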
  2. Tree taper has been of interest for over a century, yet questions remain regarding the effects of silvicultural treatments and forest health on recoverable volume. This work utilizes data from Douglas-fir (Pseudotsuga menziesii (Mirb.)) (n = 608) and red alder (Alnus rubra (Bong.)) (n = 495) trees to assess the influences of fertilization, pruning, thinning, regeneration origin, and defoliation caused by Swiss Needle Cast (SNC; Nothophaeocryptopus gaeumannii) on stem taper in the Pacific Northwest. The Kozak (2004; For. Chron. 80: 507–515) variable-exponent equation was used to test the addition of treatment and crown variables, as the model is widely regarded for its flexibility in application. Using a mixed-effects framework, results reveal that thinning of Douglas-fir can result in a 3.5% increase in upper-stem diameter inside bark, while pruning may lead to a 4.1% decrease. SNC-induced defoliation of Douglas-fir reduced mean diameter above breast height by 11.5%. Total volume of artificially regenerated red alder was 16% greater than that of naturally regenerated stems. Overall, thinning of healthy Douglas-fir and planting red alder may increase recoverable volume and carbon captured in long-term timber products in the region, and the inclusion of crown variables can increase the predictive power of taper estimates for some species.
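For readers unfamiliar with variable-exponent taper equations, the display below is a simplified form in the spirit of Kozak (2004), not the paper's exact specification; the b_3 and b_4 terms are hypothetical placements showing one way crown ratio and a treatment indicator could enter the exponent.

```latex
% Simplified variable-exponent taper sketch (not Kozak's exact 2004 model);
% b_3 and b_4 are hypothetical crown and treatment terms.
\[
  d_{ij}(z) \;=\; a_0\, D_{ij}^{\,a_1}\, H_{ij}^{\,a_2}\,
      X_{ij}(z)^{\,b_1 + b_2 z + b_3\,\mathrm{CR}_{ij} + b_4\,T_{ij}},
  \qquad
  X_{ij}(z) \;=\; \frac{1 - z^{1/3}}{1 - p^{1/3}}
\]
```

Here d_{ij}(z) is inside-bark diameter of tree i in stand j at relative height z, D is diameter at breast height, H is total height, CR is crown ratio, T is a treatment indicator (e.g., thinned), and p is a fixed inflection parameter; in a mixed-effects fit, a_0 typically carries the tree-level random effect.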
  3.
    Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have reduced the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure can provide a final model, control both false negatives and false positives, and yield consistent estimates, which are useful for gauging the actual effect size of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
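As a rough illustration of the stepwise idea only (the paper's specific entry and stopping rules, and its screening-free guarantees, are its contribution), a generic forward-stepwise GLM with a Bonferroni-style entry threshold looks like this; statsmodels is assumed.

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise_glm(X, y, alpha=0.05, family=None, max_steps=20):
    """Generic forward-stepwise GLM sketch: at each step, add the candidate
    whose Wald p-value in the refit model is smallest; stop when no
    candidate clears a Bonferroni-style entry threshold (one simple way
    to cap false positives)."""
    family = family or sm.families.Binomial()
    p = X.shape[1]
    selected = []
    for _ in range(max_steps):
        best = None
        for j in set(range(p)) - set(selected):
            design = sm.add_constant(X[:, selected + [j]])
            fit = sm.GLM(y, design, family=family).fit()
            pval = fit.pvalues[-1]               # p-value of the newest term
            if best is None or pval < best[1]:
                best = (j, pval)
        if best is None or best[1] > alpha / p:  # Bonferroni-style entry rule
            break
        selected.append(best[0])
    final = sm.GLM(y, sm.add_constant(X[:, selected]), family=family).fit()
    return selected, final
```

Because each step refits only a low-dimensional model, the scan stays feasible even when p is far larger than n, which is the regime the abstract targets.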
  4. Martelli, Pier Luigi (Ed.)
    Abstract. Motivation: There is a growing interest in longitudinal omics data paired with some longitudinal clinical outcome. Given a large set of continuous omics variables and a continuous clinical outcome, each measured for a few subjects at only a few time points, we seek to identify those variables that co-vary over time with the outcome. To motivate this problem, we study a dataset with hundreds of urinary metabolites along with tuberculosis mycobacterial load as our clinical outcome, with the objective of identifying potential biomarkers for disease progression. For such data, clinicians usually apply simple linear mixed-effects models, which often lack power given the low number of replicates and time points. We propose a penalized regression approach on the first differences of the data that extends the lasso + Laplacian method [Li and Li, Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 2008;24:1175–82] to a longitudinal group lasso + Laplacian approach. Our method, PROLONG, leverages the first differences of the data to increase power by pairing the consecutive time points. The Laplacian penalty incorporates the dependence structure of the variables, and the group lasso penalty induces sparsity while grouping together all contemporaneous and lag terms for each omic variable in the model. Results: With an automated selection of model hyperparameters, PROLONG correctly selects target metabolites with high specificity and sensitivity across a wide range of scenarios. PROLONG selects a set of metabolites from the real data that includes interesting targets identified during exploratory data analysis. Availability and implementation: An R package implementing the described methods, called "prolong", is available at https://github.com/stevebroll/prolong. A code snapshot is available at 10.5281/zenodo.14804245.
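The Laplacian part of such a penalty can be handled with the data-augmentation reformulation from Li and Li (2008), which the sketch below pairs with first differencing; a plain lasso stands in for PROLONG's group lasso, and all names and defaults are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def first_differences(Y):
    """Pair consecutive time points: Y is (subjects, time points, variables);
    differencing drops one time point per subject."""
    return np.diff(Y, axis=1)

def laplacian_lasso(X, y, L, lam1=0.1, lam2=1.0):
    """min ||y - Xb||^2 + lam1*|b|_1 + lam2*b'Lb, solved as a plain lasso
    on augmented data: X* = [X; sqrt(lam2)*S] with S'S = L and y* = [y; 0]
    (the Li and Li 2008 reformulation)."""
    evals, evecs = np.linalg.eigh(L)
    S = np.sqrt(np.clip(evals, 0.0, None))[:, None] * evecs.T  # S'S = L
    X_aug = np.vstack([X, np.sqrt(lam2) * S])
    y_aug = np.concatenate([y, np.zeros(L.shape[0])])
    # Note: sklearn's Lasso objective carries a 1/(2n) factor, so lam1 is
    # not on exactly the same scale as in the formula above.
    return Lasso(alpha=lam1, fit_intercept=False).fit(X_aug, y_aug)
```

Stacking the differenced outcome and the differenced omics trajectories across subjects and consecutive time-point pairs yields the (X, y) fed to laplacian_lasso; L encodes the dependence structure among the variables.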
    Summary: Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
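A bare-bones sketch of the residualize-then-reintegrate idea, with PCA standing in for the latent-variation estimate and a simple probability average standing in for the paper's adjustment-and-ensemble procedure; every choice here (factor count, logistic classifiers, equal weights) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def crc_sketch(X, y, n_factors=5):
    """Bare-bones cross-residualization sketch: estimate dense latent
    variation with PCA, classify on the residualized features, classify
    on the latent scores, and average the two (the CRC's actual
    adjustment/ensemble procedure is more careful than this)."""
    pca = PCA(n_components=n_factors).fit(X)
    scores = pca.transform(X)                    # latent-variation estimate
    resid = X - pca.inverse_transform(scores)    # residualized features
    clf_resid = LogisticRegression(max_iter=1000).fit(resid, y)
    clf_latent = LogisticRegression(max_iter=1000).fit(scores, y)

    def predict_proba(X_new):
        s = pca.transform(X_new)
        r = X_new - pca.inverse_transform(s)
        # equal-weight probability average; an assumption, not the paper's rule
        return 0.5 * (clf_resid.predict_proba(r) + clf_latent.predict_proba(s))

    return predict_proba
```

Keeping both component classifiers in the ensemble reflects the abstract's point that latent variation is accounted for without discarding any potentially predictive information.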