
Title: Debiased lasso for generalized linear models with a diverging number of covariates

Modeling and drawing inference on the joint associations between single‐nucleotide polymorphisms and a disease have sparked interest in genome‐wide association studies. In the motivating Boston Lung Cancer Survival Cohort (BLCSC) data, the presence of a large number of single‐nucleotide polymorphisms of interest, though smaller than the sample size, challenges inference on their joint associations with the disease outcome. In similar settings, we find that neither the debiased lasso approach (van de Geer et al., 2014), which assumes sparsity on the inverse information matrix, nor the standard maximum likelihood method can yield confidence intervals with satisfactory coverage probabilities for generalized linear models. Under this “large n, diverging p” scenario, we propose an alternative debiased lasso approach that directly inverts the Hessian matrix without imposing the matrix sparsity assumption, which further reduces bias compared to the original debiased lasso and ensures valid confidence intervals with nominal coverage probabilities. We establish the asymptotic distributions of any linear combinations of the parameter estimates, which lays the theoretical groundwork for drawing inference. Simulations show that the proposed refined debiased estimating method performs well in removing bias and yields honest confidence interval coverage. We use the proposed method to analyze the aforementioned BLCSC data, a large‐scale hospital‐based epidemiology cohort study investigating the joint effects of genetic variants on lung cancer risks.
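
The idea of debiasing by direct Hessian inversion can be illustrated with a minimal sketch for logistic regression when p < n. This is not the authors' implementation: the function names, the proximal-gradient lasso solver, the penalty level `lam`, and the simulated data are all assumptions made for illustration. The sketch adds a one-step correction, the inverse Hessian times the score, to the lasso estimate, and uses the diagonal of the inverse Hessian for Wald-type standard errors, which is the usual form for a canonical-link GLM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logistic(X, y, lam, n_iter=5000):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4n) bounds the gradient's Lipschitz constant.
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        z = beta - step * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def debiased_lasso_direct_inverse(X, y, lam):
    """One-step debiasing that inverts the Hessian directly (requires p < n)."""
    n, p = X.shape
    beta_hat = lasso_logistic(X, y, lam)
    mu = sigmoid(X @ beta_hat)
    score = X.T @ (y - mu) / n                 # gradient of the average log-likelihood
    w = mu * (1.0 - mu)                        # logistic variance weights
    hessian = (X * w[:, None]).T @ X / n       # average negative Hessian at beta_hat
    theta = np.linalg.inv(hessian)             # no sparsity assumption on the inverse
    b = beta_hat + theta @ score               # debiased estimate
    se = np.sqrt(np.diag(theta) / n)           # Wald-type standard errors
    return b, se

# Hypothetical simulated data, used only to show the call pattern.
rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [1.0, -1.0]
y = rng.binomial(1, sigmoid(X @ beta_true)).astype(float)

b, se = debiased_lasso_direct_inverse(X, y, lam=0.05)
lower, upper = b - 1.96 * se, b + 1.96 * se    # approximate 95% confidence limits
```

The correction term is one Newton step from the lasso estimate on the unpenalized likelihood, which is why the approach needs the Hessian to be invertible and therefore p smaller than n.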

Publisher / Repository:
Oxford University Press
Medium: X Size: p. 344-357
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    For statistical inference on regression models with a diverging number of covariates, the existing literature typically makes sparsity assumptions on the inverse of the Fisher information matrix. Such assumptions, however, are often violated under Cox proportional hazards models, leading to biased estimates and confidence intervals with under‐coverage. We propose a modified debiased lasso method, which solves a series of quadratic programming problems to approximate the inverse information matrix without imposing sparse matrix assumptions. We establish asymptotic results for the estimated regression coefficients when the dimension of covariates diverges with the sample size. As demonstrated by extensive simulations, our proposed method provides consistent estimates and confidence intervals with nominal coverage probabilities. The utility of the method is further demonstrated by assessing the effects of genetic markers on patients' overall survival with the Boston Lung Cancer Survival Cohort, a large‐scale epidemiology study investigating mechanisms underlying lung cancer.

  2. The social and financial costs associated with Alzheimer's disease (AD) result in significant burdens on our society. In order to understand the causes of this disease, public-private partnerships such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) release data into the scientific community. These data are organized into various modalities (genetic, brain-imaging, cognitive scores, diagnoses, etc.) for analysis. Many statistical learning approaches used in medical image analysis do not explicitly take advantage of this multimodal data structure. In this work we propose a novel objective function and optimization algorithm that is designed to handle multimodal information for the prediction and analysis of AD. Our approach relies on robust matrix factorization and row-wise sparsity provided by the ℓ2,1-norm in order to integrate multimodal data provided by the ADNI. These techniques are jointly optimized with a classification task to guide the feature selection in our proposed Task Balanced Multimodal Feature Selection method. Our results, when compared against some widely used machine learning algorithms, show improved balanced accuracies, precision, and Matthews correlation coefficients for identifying cognitive decline. In addition to the improved prediction performance, our method is able to identify brain and genetic biomarkers that are of interest to the clinical research community. Our experiments validate existing brain biomarkers and single nucleotide polymorphisms located on chromosome 11 and detail novel polymorphisms on chromosome 10 that, to the best of the authors' knowledge, have not previously been reported. We anticipate that our method will be of interest to the greater research community and have released our method's code online.
  3. We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.

  4. Abstract Background

    Alzheimer's disease (AD), the most prevalent form of dementia, affects 6.5 million Americans and over 50 million people globally. Clinical, genetic, and phenotypic studies of dementia provide some insights into the observed progressive neurodegenerative processes; however, the mechanisms underlying AD onset remain enigmatic.


    This paper examines late‐onset dementia‐related cognitive impairment utilizing neuroimaging‐genetics biomarker associations.

    Materials and Methods

    The participants, ages 65–85, included 266 healthy controls (HC), 572 volunteers with mild cognitive impairment (MCI), and 188 Alzheimer's disease (AD) patients. Genotype dosage data for AD‐associated single nucleotide polymorphisms (SNPs) were extracted from the imputed ADNI genetics archive using sample‐major additive coding. Twenty-nine such SNPs were selected, representing a subset of independent SNPs reported to be highly associated with AD in a recent AD meta‐GWAS study by Jansen and colleagues.


    We identified significant correlations between the 29 genomic markers (GMs) and the 200 neuroimaging markers (NIMs). The odds ratios and relative risks for AD and MCI (relative to HC) were predicted using multinomial linear models.


    In the HC and MCI cohorts, mainly cortical thickness measures were associated with GMs, whereas the AD cohort exhibited different GM‐NIM relations. Network patterns within the HC and AD groups were distinct in cortical thickness, volume, and proportion of White to Gray Matter (pct), but not in the MCI cohort. Multinomial linear models of clinical diagnosis identified the specific NIMs and GMs that were most impactful in discriminating between AD and HC, and between MCI and HC.


    This study suggests that advanced analytics provide mechanisms for exploring the interrelations between morphometric indicators and GMs. The findings may facilitate further clinical investigations of phenotypic associations that support deep systematic understanding of AD pathogenesis.

  5. Background

    Markov chains (MC) have been widely used to model molecular sequences. The estimation of the MC transition matrix and of confidence intervals for the transition probabilities from long sequence data has been intensively studied in the past decades. In next generation sequencing (NGS), a large number of short reads are generated. These short reads can overlap, and some regions of the genome may not be sequenced, resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.


    In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data have approximately the same distributions as the corresponding statistics for long sequences.


    We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method.


    We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
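
As a rough illustration of estimating MC transition probabilities from short reads, the sketch below pools adjacent-letter counts across reads and normalizes each row. It is an assumed, simplified count-based estimator, not the authors' method: it does not apply the effective coverage scaling or build confidence regions, and the function name and toy reads are hypothetical.

```python
from collections import Counter

def transition_estimates(reads, alphabet="ACGT"):
    """Estimate first-order transition probabilities by pooling counts over reads."""
    counts = Counter()
    for read in reads:
        # Count each adjacent pair of letters within a read.
        for a, b in zip(read, read[1:]):
            counts[(a, b)] += 1
    probs = {}
    for a in alphabet:
        row_total = sum(counts[(a, b)] for b in alphabet)
        for b in alphabet:
            # A state that is never left has no estimable row; mark it NaN.
            probs[(a, b)] = counts[(a, b)] / row_total if row_total else float("nan")
    return probs

# Hypothetical toy reads, used only to show the call pattern.
probs = transition_estimates(["ACGT", "ACCA"])
```

Because overlapping reads count the same genomic transition more than once, the pooled counts are not independent, which is the reason the classical long-sequence theory needs adjusting.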
