Title: Leveraging independence in high-dimensional mixed linear regression
ABSTRACT We address the challenge of estimating regression coefficients and selecting relevant predictors in the context of mixed linear regression in high dimensions, where the number of predictors greatly exceeds the sample size. Recent advancements in this field have centered on incorporating sparsity-inducing penalties into the expectation-maximization (EM) algorithm, which seeks to maximize the conditional likelihood of the response given the predictors. However, existing procedures often treat predictors as fixed or overlook their inherent variability. In this paper, we leverage the independence between the predictor and the latent indicator variable of mixtures to facilitate efficient computation and also achieve synergistic variable selection across all mixture components. We establish the non-asymptotic convergence rate of the proposed fast group-penalized EM estimator to the true regression parameters. The effectiveness of our method is demonstrated through extensive simulations and an application to the Cancer Cell Line Encyclopedia dataset for the prediction of anticancer drug sensitivity.
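As a rough illustration of the general idea of a group-penalized EM for mixed linear regression, the sketch below alternates an E-step (posterior component weights) with a proximal-gradient M-step whose group soft-thresholding zeroes a predictor's coefficients in all mixture components at once. This is a generic generalized-EM sketch, not the authors' fast estimator: it does not use the independence between predictor and latent indicator, it assumes a known common noise variance, and the names `group_soft_threshold`, `group_penalized_em`, `lam`, `sigma2`, `n_iter` and `n_prox` are hypothetical.

```python
import numpy as np


def group_soft_threshold(B, t):
    """Shrink each row of B toward zero by t in Euclidean norm.

    Rows index predictors and columns index mixture components, so a row
    that is set to zero removes that predictor from every component.
    """
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * B


def group_penalized_em(X, y, B0, lam, sigma2=1.0, n_iter=50, n_prox=20):
    """Generic group-lasso-penalized EM for a K-component mixed linear regression.

    X: (n, p) predictors, y: (n,) responses, B0: (p, K) initial coefficients.
    sigma2 is an assumed known noise variance shared by all components.
    Returns coefficients B (p, K), mixing weights pi (K,), and
    responsibilities gamma (n, K).
    """
    n, p = X.shape
    K = B0.shape[1]
    B = B0.astype(float).copy()
    pi = np.full(K, 1.0 / K)
    # Responsibilities are at most 1, so ||X||_2^2 / n bounds the gradient's
    # Lipschitz constant; its inverse is a safe step size.
    step = n / (np.linalg.norm(X, 2) ** 2)

    for _ in range(n_iter):
        # E-step: posterior probability that observation i belongs to component k.
        resid = y[:, None] - X @ B
        logw = np.log(np.clip(pi, 1e-12, None)) - 0.5 * resid**2 / sigma2
        logw -= logw.max(axis=1, keepdims=True)  # stabilize the exponentials
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: update mixing weights in closed form, then take a few
        # proximal-gradient steps on the weighted least-squares loss.
        pi = gamma.mean(axis=0)
        for _ in range(n_prox):
            resid = y[:, None] - X @ B
            grad = -(X.T @ (gamma * resid)) / n
            B = group_soft_threshold(B - step * grad, step * lam)

    return B, pi, gamma
```

Calling `group_penalized_em(X, y, B0, lam=0.1)` with a small random `B0` returns a coefficient matrix whose zero rows mark predictors dropped from every component; the penalty level and the initialization are left to the user in this sketch.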
Award ID(s):
2113590 2053697 1908969
PAR ID:
10543721
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
80
Issue:
3
ISSN:
0006-341X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regressions and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such a model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results for regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample-splitting, and can be extended to multivariate response cases. The proposed methods also perform encouragingly in numerical studies.
  2. ABSTRACT In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates, including sex and dietary intake variables, to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.
  3. Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative. 
  4. Abstract. Aim: Intraspecific genetic variation is key for adaptation and survival in changing environments and is known to be influenced by many factors, including population size, dispersal and life‐history traits. We investigated genetic variation within Neotropical amphibian species to provide insights into how natural history traits, phylogenetic relatedness, climatic and geographic characteristics can explain intraspecific genetic diversity. Location: Neotropics. Taxon: Amphibians. Methods: We assembled data sets using open‐access databases for natural history traits, genetic sequences, phylogenetic trees, climatic and geographic data. For each species, we calculated overall nucleotide diversity (π) and tested for isolation by distance (IBD) and isolation by environment (IBE). We then identified predictors of π, IBD and IBE using random forest (RF) regression or RF classification. We also fitted phylogenetic generalized linear mixed models (PGLMMs) to predict π, IBD and IBE. Results: We compiled 4052 mitochondrial DNA sequences from 256 amphibian species (230 frogs and 26 salamanders), georeferencing 2477 sequences from 176 species that were not linked to occurrence data. RF regressions and PGLMMs were congruent in identifying range size and precipitation (σ) as the most important predictors of π, influencing it positively. RF classification and PGLMMs identified minimum elevation as an important predictor of IBD; most species without IBD tended to occur at higher elevations. Maximum latitude and precipitation (σ) were the best predictors of IBE, and most species without IBE occur at lower latitudes and in areas with more variable precipitation. Main Conclusions: This study identified predictors of genetic variation in Neotropical amphibians using both machine learning and phylogenetic methods. This approach was valuable to determine which predictors were congruent between methods. We found that species with small ranges or living in zones with less variable precipitation tended to have low genetic diversity. We also showed that Western Mesoamerica, Andes and Atlantic Forest biogeographic units harbour high diversity across many species that should be prioritized for protection. These results could play a key role in the development of conservation strategies for Neotropical amphibians.
  5. Abstract Private lands provide key habitat for imperiled species and are core components of functional protected area networks; yet, their incorporation into national and regional conservation planning has been challenging. Identifying locations where private landowners are likely to participate in conservation initiatives can help avoid conflict and clarify trade‐offs between ecological benefits and sociopolitical costs. Empirical, spatially explicit assessment of the factors associated with conservation on private land is an emerging tool for identifying future conservation opportunities. However, most data on private land conservation are voluntarily reported and incomplete, which complicates these assessments. We used a novel application of occupancy models to analyze the occurrence of conservation easements on private land. We compared multiple formulations of occupancy models with a logistic regression model to predict the locations of conservation easements based on a spatially explicit social–ecological systems framework. We combined a simulation experiment with a case study of easement data in Idaho and Montana (United States) to illustrate the utility of the occupancy framework for modeling conservation on private land. Occupancy models that explicitly accounted for variation in reporting produced estimates of predictors that were substantially less biased than estimates produced by logistic regression under all simulated conditions. Occupancy models produced estimates for the 6 predictors we evaluated in our case study that were larger in magnitude, but less certain than those produced by logistic regression. These results suggest that occupancy models result in qualitatively different inferences regarding the effects of predictors on conservation easement occurrence than logistic regression and highlight the importance of integrating variable and incomplete reporting of participation in empirical analysis of conservation initiatives. Failure to do so can lead to emphasizing the wrong social, institutional, and environmental factors that enable conservation and underestimating conservation opportunities in landscapes where social norms or institutional constraints inhibit reporting.