Title: Leveraging independence in high-dimensional mixed linear regression
ABSTRACT We address the challenge of estimating regression coefficients and selecting relevant predictors in the context of mixed linear regression in high dimensions, where the number of predictors greatly exceeds the sample size. Recent advancements in this field have centered on incorporating sparsity-inducing penalties into the expectation-maximization (EM) algorithm, which seeks to maximize the conditional likelihood of the response given the predictors. However, existing procedures often treat predictors as fixed or overlook their inherent variability. In this paper, we leverage the independence between the predictor and the latent indicator variable of mixtures to facilitate efficient computation and also achieve synergistic variable selection across all mixture components. We establish the non-asymptotic convergence rate of the proposed fast group-penalized EM estimator to the true regression parameters. The effectiveness of our method is demonstrated through extensive simulations and an application to the Cancer Cell Line Encyclopedia dataset for the prediction of anticancer drug sensitivity.
Award ID(s):
2113590 2053697 1908969
PAR ID:
10543721
Publisher / Repository:
Oxford University Press
Journal Name:
Biometrics
Volume:
80
Issue:
3
ISSN:
0006-341X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regressions, and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such a model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results for regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample splitting and can be extended to multivariate response cases. The proposed methods also perform encouragingly in numerical studies.
  2. ABSTRACT In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates, including sex and dietary intake variables, to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.
  3. Abstract In small area estimation, different data sources are integrated in order to produce reliable estimates of target parameters (e.g., a mean or a proportion) for a collection of small subsets (areas) of a finite population. Regression models such as the linear mixed effects model or M-quantile regression are often used to improve the precision of survey sample estimates by leveraging auxiliary information for which means or totals are known at the area level. In many applications, the unit-level linkage of records from different sources is probabilistic and potentially error-prone. In this article, we present adjustments of the small area predictors that are based on either the linear mixed effects model or M-quantile regression to account for the presence of linkage error. These adjustments are developed from a two-component mixture model that hinges on the assumption of independence of the target and auxiliary variable given incorrect linkage. Estimation and inference are based on composite likelihoods and machinery built around the Expectation-Maximization algorithm. For each of the two regression methods, we propose modified small area predictors and approximations for their mean squared errors. The empirical performance of the proposed approaches is studied in both design-based and model-based simulations that include comparisons to a variety of baselines.
  4. Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative. 
  5. Abstract Aim: Intraspecific genetic variation is key for adaptation and survival in changing environments and is known to be influenced by many factors, including population size, dispersal and life‐history traits. We investigated genetic variation within Neotropical amphibian species to provide insights into how natural history traits, phylogenetic relatedness, climatic and geographic characteristics can explain intraspecific genetic diversity. Location: Neotropics. Taxon: Amphibians. Methods: We assembled data sets using open‐access databases for natural history traits, genetic sequences, phylogenetic trees, climatic and geographic data. For each species, we calculated overall nucleotide diversity (π) and tested for isolation by distance (IBD) and isolation by environment (IBE). We then identified predictors of π, IBD and IBE using random forest (RF) regression or RF classification. We also fitted phylogenetic generalized linear mixed models (PGLMMs) to predict π, IBD and IBE. Results: We compiled 4052 mitochondrial DNA sequences from 256 amphibian species (230 frogs and 26 salamanders), georeferencing 2477 sequences from 176 species that were not linked to occurrence data. RF regressions and PGLMMs were congruent in identifying range size and precipitation (σ) as the most important predictors of π, influencing it positively. RF classification and PGLMMs identified minimum elevation as an important predictor of IBD; most species without IBD tended to occur at higher elevations. Maximum latitude and precipitation (σ) were the best predictors of IBE, and most species without IBE occur at lower latitudes and in areas with more variable precipitation. Main Conclusions: This study identified predictors of genetic variation in Neotropical amphibians using both machine learning and phylogenetic methods. This approach was valuable to determine which predictors were congruent between methods. We found that species with small ranges or living in zones with less variable precipitation tended to have low genetic diversity. We also showed that Western Mesoamerica, Andes and Atlantic Forest biogeographic units harbour high diversity across many species that should be prioritized for protection. These results could play a key role in the development of conservation strategies for Neotropical amphibians.