Title: Explaining the Shortcomings of Log‐Transforming the Dependent Variable in Regression Models and Recommending a Better Alternative: Evidence From Soil CO2 Emission Studies
Abstract

Log‐transforming the dependent variable of a regression model, though convenient and frequently used, is accompanied by an under‐prediction problem. We found that this under‐prediction can reach 20%, which is significant in studies that aim to estimate annual budgets. The fundamental reason for this problem is simply that the log function is concave, and it has nothing to do with whether the dependent variable has a log‐normal distribution or not. Using field‐observed data of soil CO2 emission, soil temperature and soil moisture in a saturated specification of a regression model for predicting emissions, we revealed that the under‐predictions of the log‐transformed approach were pervasive and systematically biased. The key determinant of the problem's severity was the coefficient of variation in the dependent variable, which differed among combinations of the values of the explanatory factors. By applying a parsimonious (Gaussian‐Gamma) specification of the regression model to data from four different ecosystems, we found that this under‐prediction problem was serious to various extents, and that for a relatively weak explanatory factor, the log‐transformed approach is prone to yield a physically nonsensical estimated coefficient. Finally, we showed and concluded that the problem can be avoided by switching to the nonlinear approach, which does not require the assumption of homoscedasticity for the error term in computing the standard errors of the estimated coefficients.
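
The concavity argument in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or data: the exponential temperature response, the parameter values, and the names `log_ols_prediction_ratio`, `temp` and `flux` are all hypothetical. It assumes multiplicative Gamma noise with mean 1 (deliberately not log‐normal), fits ordinary least squares to the log of the response, back‐transforms the fitted values, and compares the predicted total with the observed total. Because the log function is concave, the mean of log(y) is below the log of the mean of y (Jensen's inequality), so the ratio should fall below 1, and further below as the coefficient of variation grows.

```python
import numpy as np


def log_ols_prediction_ratio(cv, n=20000, seed=0):
    """Return (sum of back-transformed predictions) / (sum of observations).

    All names and parameter values are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical explanatory factor (e.g. a soil temperature in deg C).
    temp = rng.uniform(5.0, 25.0, size=n)

    # Assumed true conditional mean: a simple exponential response.
    true_mean = 1.5 * np.exp(0.08 * temp)

    # Multiplicative Gamma noise with mean 1 and coefficient of variation cv.
    # Gamma(shape=k, scale=1/k) has mean 1 and CV = 1/sqrt(k).
    shape = 1.0 / cv**2
    noise = rng.gamma(shape, 1.0 / shape, size=n)
    flux = true_mean * noise

    # Log-transformed approach: OLS of log(flux) on [1, temp].
    design = np.column_stack([np.ones(n), temp])
    beta, *_ = np.linalg.lstsq(design, np.log(flux), rcond=None)

    # Naive back-transformation of the fitted values.
    predicted = np.exp(design @ beta)
    return predicted.sum() / flux.sum()


if __name__ == "__main__":
    for cv in (0.2, 0.5, 0.8):
        ratio = log_ols_prediction_ratio(cv)
        print(f"CV = {cv:.1f}: predicted total / observed total = {ratio:.3f}")
```

In this setup the noise is not log‐normal, yet the back‐transformed predictions still fall short of the observed total, which is consistent with the abstract's point that the bias stems from concavity rather than from the distribution of the dependent variable.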

 
PAR ID:
10360699
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
DOI PREFIX: 10.1029
Date Published:
Journal Name:
Journal of Geophysical Research: Biogeosciences
Volume:
126
Issue:
5
ISSN:
2169-8953
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. To link a clinical outcome with compositional predictors in microbiome analysis, the linear log‐contrast model is a popular choice, and the inference procedure for assessing the significance of each covariate is also available. However, given multiple potentially interrelated outcomes and information on the taxonomic hierarchy of bacteria, a multivariate analysis method that considers the group structure of compositional covariates and an accompanying group inference method are still lacking. Motivated by a study for identifying the microbes in the gut microbiome of preterm infants that impact their later neurobehavioral outcomes, we formulate a constrained integrative multi‐view regression. The neurobehavioral scores form multivariate responses, the log‐transformed sub‐compositional microbiome data form multi‐view feature matrices, and a set of linear constraints on their corresponding sub‐coefficient matrices ensures the sub‐compositional nature. We assume that all the sub‐coefficient matrices may be of low rank to enable joint selection and inference of sub‐compositions/views. We propose a scaled composite nuclear norm penalization approach for model estimation and develop a hypothesis testing procedure through de‐biasing to assess the significance of different views. Simulation studies confirm the effectiveness of the proposed procedure. We apply the method to the preterm infant study, and the identified microbes are mostly consistent with existing studies and biological understanding.

     
  2. Abstract

    This work examines methods for predicting the partition coefficient (logP) for a dataset of small molecules. Here, we use atomic attributes such as radius and partial charge, which are typically used as force field parameters in classical molecular dynamics simulations. These atomic attributes are transformed into index‐invariant molecular features using a recently developed method called geometric scattering for graphs (GSG). We call this approach “ClassicalGSG” and examine its performance under a broad range of conditions and hyperparameters. We train ClassicalGSG logP predictors with neural networks using 10,722 molecules from the OpenChem dataset and apply them to predict the logP values from four independent test sets. The ClassicalGSG method's performance is compared to a baseline model that employs graph convolutional networks. Our results show that the best prediction accuracies are obtained using atomic attributes generated with the CHARMM generalized force field and 2D molecular structures.

     
  3. Abstract

    There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.

     
  4. Abstract

    Cover crops improve soil health and reduce the risk of soil erosion. However, their impact on the carbon dioxide equivalence (CO2e) is unknown. Therefore, the objective of this 2‐yr study was to quantify the effect of cover crop‐induced differences in soil moisture, temperature, organic C, and microorganisms on CO2e, and to develop machine learning algorithms that predict daily N2O–N and CO2–C emissions. The prediction models tested were multiple linear regression, partial least square regression, support vector machine, random forest (RF), and artificial neural network. Model performance was assessed using R2, RMSE, and mean absolute error. Rye (Secale cereale L.) was dormant seeded in mid‐October, and in the following spring it was terminated at corn's (Zea mays L.) V4 growth stage. Soil temperature, moisture, and N2O–N and CO2–C emissions were measured near continuously from soil thaw to harvest in 2019 and 2020. Prior to termination, the cover crop decreased N2O–N emissions by 34% (p = .05), and over the entire season, N2O–N emissions from cover crop and no cover crop treatments were similar (p = .71). Based on N2O–N and CO2–C emissions over the entire season and the estimated fixed cover crop‐C remaining in the soil, the partial CO2e were −1,061 and 496 kg CO2e ha–1 in the cover crop and no cover crop treatments, respectively. The RF algorithm explained more of the daily N2O–N (73%) and CO2–C (85%) emissions variability during validation than the other models. Across models, the most important variables were temperature and the amount of cover crop‐C added to the soil.

     
  5. Summary We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide theoretical support to the proposed methodology by proving minimax optimality of the posterior mean under the prediction risk in ultra-high-dimensional settings where the number of predictors can grow subexponentially relative to the sample size. A one-step post-processing scheme induced by group lasso penalties on the rows of the estimated coefficient matrix is proposed for variable selection, with default choices of tuning parameters. We additionally provide an estimate of the rank using a novel optimization function achieving dimension reduction in the covariate space. We exhibit the performance of the proposed methodology in an extensive simulation study and a real data example. 