

Title: Penalized Composite Quasi-Likelihood for Ultrahigh Dimensional Variable Selection
Summary

In high dimensional model selection problems, penalized least squares approaches have been extensively used. The paper addresses the question of both robustness and efficiency of penalized model selection methods and proposes a data-driven weighted linear combination of convex loss functions, together with a weighted L1-penalty. The method is completely data adaptive and does not require prior knowledge of the error distribution. The weighted L1-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias that is caused by the L1-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method, which achieves both model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust composite L1–L2 method and an optimal composite quantile method, and evaluate their performance on both simulated and real data examples.
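As a rough illustration of the kind of objective the summary describes, the sketch below evaluates a weighted combination of an L1 and an L2 loss plus a weighted L1 penalty. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `composite_l1_l2_objective`, the argument names, and the scaling of the penalty are hypothetical, and the loss weights are fixed by hand here, whereas the paper chooses them in a data-driven way that is not reproduced.

```python
import numpy as np


def composite_l1_l2_objective(beta, X, y, loss_weights, penalty_weights, lam):
    """Evaluate a composite L1-L2 loss with a weighted L1 penalty.

    loss_weights:    (w1, w2), hand-picked weights on the L1 and L2 losses
                     (assumption: the paper selects these from the data).
    penalty_weights: one non-negative weight per coefficient.
    lam:             overall penalty level (hypothetical scaling).
    """
    residuals = y - X @ beta
    w1, w2 = loss_weights
    # Weighted linear combination of two convex losses.
    loss = w1 * np.sum(np.abs(residuals)) + w2 * np.sum(residuals ** 2)
    # Weighted L1 penalty: coefficients with small weights are shrunk less.
    penalty = lam * np.sum(penalty_weights * np.abs(beta))
    return loss + penalty


# Tiny hand-checkable example with an identity design matrix.
X_demo = np.eye(2)
y_demo = np.array([1.0, 2.0])
obj_at_zero = composite_l1_l2_objective(
    np.zeros(2), X_demo, y_demo, (0.5, 0.5), np.array([1.0, 2.0]), 0.1
)
obj_at_fit = composite_l1_l2_objective(
    np.array([1.0, 2.0]), X_demo, y_demo, (0.5, 0.5), np.array([1.0, 2.0]), 0.1
)
print(obj_at_zero, obj_at_fit)
```

Both terms are convex in `beta`, so any general-purpose convex solver could in principle minimize this objective; the choice of solver is left open here.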

 
NSF-PAR ID:
10396630
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
73
Issue:
3
ISSN:
1369-7412
Page Range / eLocation ID:
p. 325-349
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We introduce a general formulation for dimension reduction and coefficient estimation in the multivariate linear model. We argue that many of the existing methods that are commonly used in practice can be formulated in this framework and have various restrictions. We then propose a new method that is more flexible and more generally applicable. The proposed method can be formulated as a novel penalized least squares estimate. The penalty that we employ is the Ky Fan norm of the coefficient matrix. Such a penalty encourages sparsity among the singular values and at the same time gives shrinkage coefficient estimates, and thus performs dimension reduction and coefficient estimation simultaneously in the multivariate linear model. We also propose a generalized cross-validation type of criterion for the selection of the tuning parameter in the penalized least squares. Simulations and an application in financial econometrics demonstrate competitive performance of the new method. An extension to the non-parametric factor model is also discussed.

     
  2. A popular method for flexible function estimation in nonparametric models is the smoothing spline. When applying the smoothing spline method, the nonparametric function is estimated via penalized least squares, where the penalty imposes a soft constraint on the function to be estimated. The specification of the penalty functional is usually based on a set of assumptions about the function. Choosing a reasonable penalty function is the key to the success of the smoothing spline method. In practice, there may exist multiple sets of widely accepted assumptions, leading to different penalties, which then yield different estimates. We refer to this problem as the problem of ambiguous penalties. Neglecting the underlying ambiguity and proceeding to the model with one of the candidate penalties may produce misleading results. In this article, we adopt a Bayesian perspective and propose a fully Bayesian approach that takes into consideration all the penalties as well as the ambiguity in choosing them. We also propose a sampling algorithm for drawing samples from the posterior distribution. Data analysis based on simulated and real‐world examples is used to demonstrate the efficiency of our proposed method.

     
  3. Summary

    We consider the problem of selecting covariates in a spatial regression model when the response is binary. The penalized likelihood-based approach has proved effective for simultaneous variable selection and estimation. In the context of a spatially dependent binary variable, a uniquely interpretable likelihood is not available; a quasi-likelihood might be more suitable. We develop a penalized quasi-likelihood with spatial dependence for simultaneous variable selection and parameter estimation, along with an efficient computational algorithm. The theoretical properties, including asymptotic normality and consistency, are studied under an increasing-domain asymptotics framework. An extensive simulation study is conducted to validate the methodology. Real data examples are provided to illustrate its applicability. Although a theoretical justification has not been given, we also investigate the empirical performance of the proposed penalized quasi-likelihood approach for spatial count data to explore the suitability of this method for a general exponential family of distributions.

     
  4. ABSTRACT

    A genome‐wide association study (GWAS) correlates marker and trait variation in a study sample. Each subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here, we assume that subjects are randomly collected unrelateds and that trait values are normally distributed or can be transformed to normality. Over the past decade, geneticists have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with the ℓ1 penalty (LASSO) or the minimax concave penalty (MCP) is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Here, we compare LASSO and MCP penalized regression to iterative hard thresholding (IHT). On GWAS regression data, IHT is better at model selection and comparable in speed to both methods of penalized regression. This conclusion holds for both simulated and real GWAS data. IHT fosters parallelization and scales well in problems with large numbers of causal markers. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage commodity desktop computers in GWAS analysis and to avoid supercomputing. Availability: Source code is freely available at https://github.com/klkeys/IHT.jl.

     
  5. Abstract

    Identifying the underlying trajectory pattern in spatial‐temporal data analysis is a fundamental but challenging task. In this paper, we study the problem of simultaneously identifying temporal trends and spatial clusters of spatial‐temporal trajectories. To achieve this goal, we propose a novel method named spatial clustered and sparse nonparametric regression. Our method uses a B‐spline model to fit the temporal data and penalty terms on the spline coefficients to reveal the underlying spatial‐temporal patterns. In particular, our method estimates the model by solving a doubly‐penalized least squares problem, in which we use a group sparse penalty for trend detection and a spanning tree‐based fusion penalty for spatial cluster recovery. We also develop an algorithm based on the alternating direction method of multipliers (ADMM) to efficiently minimize the penalized least squares loss. The statistical consistency properties of the estimator are established in our work. Finally, we conduct thorough numerical experiments to verify our theoretical findings and to show that our method outperforms existing competitive approaches.

     
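Item 4 in the list above compares penalized regression with iterative hard thresholding (IHT). For readers unfamiliar with the technique, the following is a minimal NumPy sketch of textbook IHT for sparse least squares: take a gradient step, then keep only the k largest coefficients in magnitude. It is not the IHT.jl implementation; the function names, the fixed step size (the reciprocal of the squared spectral norm of the design matrix), the iteration count, and the synthetic noiseless data are all assumptions made for illustration.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out


def iht(X, y, k, n_iter=300):
    """Textbook IHT for least squares with at most k non-zero coefficients."""
    # Conservative fixed step size (assumption): 1 / largest eigenvalue of X^T X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient_step = beta + step * X.T @ (y - X @ beta)
        beta = hard_threshold(gradient_step, k)
    return beta


# Synthetic, noiseless example (hypothetical data, not GWAS genotypes).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = X @ beta_true

beta_hat = iht(X, y, k=3)
print(np.flatnonzero(beta_hat))
```

Unlike LASSO or MCP, this scheme fixes the model size k directly instead of tuning a penalty level, which is one reason the two families of methods can select different models.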