

Title: Asymptotics of Selective Inference
Abstract

In this paper, we establish asymptotic results for selective inference procedures without the assumption of Gaussianity. The class of selection procedures we consider is determined by affine inequalities, and we refer to these as affine selection procedures. Examples of affine selection procedures include selective inference along the solution path of the least absolute shrinkage and selection operator (LASSO), as well as selective inference after fitting the LASSO at a fixed value of the regularization parameter. We also consider some tests in penalized generalized linear models. Our result proves asymptotic convergence in the high-dimensional setting where n < p; for some procedures, n may be only a logarithmic factor of the dimension p.
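In this literature, an affine selection procedure is usually formalized as conditioning on an event cut out by affine inequalities. A minimal sketch, in notation assumed here rather than taken from the abstract:

```latex
% Observation y \in \mathbb{R}^n; the selection event is affine:
%   \hat{E} = \{\, y : A y \le b \,\}
% for a matrix A and vector b determined by the procedure
% (e.g., the active set and signs of a LASSO fit).
% Selective inference studies the law of a contrast \eta^\top y
% conditional on selection; under Gaussian errors this is a
% truncated Gaussian, and the paper's results give asymptotic
% analogues without Gaussianity:
\Pr\!\left( \eta^\top y \le t \;\middle|\; A y \le b \right)
```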

 
NSF-PAR ID: 10214258
Publisher / Repository: Wiley-Blackwell
Journal Name: Scandinavian Journal of Statistics
Volume: 44
Issue: 2
ISSN: 0303-6898
Page Range / eLocation ID: p. 480-499
Sponsoring Org: National Science Foundation
More Like This
  1. Abstract

    The prognosis of hepatocellular carcinoma (HCC) after R0 resection is unsatisfactory due to the high rate of recurrence. In this study, we investigated recurrence-related RNAs and the underlying mechanism. The long noncoding RNA (lncRNA), microRNA (miRNA), and messenger RNA (mRNA) expression data and clinical information of 247 patients with HCC who underwent R0 resection were obtained from The Cancer Genome Atlas. Comparing the 1-year recurrence group (n = 56) with the nonrecurrence group (n = 60), we detected 34 differentially expressed lncRNAs (DElncRNAs), five DEmiRNAs, and 216 DEmRNAs. Of these, three DElncRNAs, hsa-mir-150-5p, and 11 DEmRNAs were selected for constructing the competing endogenous RNA (ceRNA) network. Next, two nomogram models were constructed based separately on the lncRNAs and mRNAs that were further selected by Cox and least absolute shrinkage and selection operator (LASSO) regression analyses. The two nomogram models showed high prediction accuracy for disease-free survival, with concordance indexes of 0.725 and 0.639. Further functional enrichment analysis of the DEmRNAs showed that the mRNAs in the ceRNA network and nomogram models were associated with immune pathways. Hence, we constructed an hsa-mir-150-5p-centric ceRNA network and two effective nomogram prognostic models, and the related RNAs may be useful as potential biomarkers for predicting recurrence in patients with HCC.
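The concordance index reported above (0.725 and 0.639) measures how often a model's predicted risk ranks patient pairs consistently with their observed outcomes. A minimal sketch of the statistic on toy data; the function name and inputs are illustrative, not from the study:

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked concordantly.

    A pair (i, j) is comparable when patient i has the shorter
    follow-up time and experienced the event; it is concordant
    when that patient was also assigned the higher risk score.
    Ties in risk count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# A perfectly ranked model scores 1; a perfectly anti-ranked one scores 0.
print(concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))  # → 1.0
```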

     
  2. We consider the high-dimensional linear regression problem, where the algorithmic goal is to efficiently infer an unknown feature vector $\beta^*\in\mathbb{R}^p$ from its linear measurements, using a small number $n$ of samples. Unlike most of the literature, we make no sparsity assumption on $\beta^*$, but instead adopt a different structural assumption: in the noiseless setting, the entries of $\beta^*$ are either rational numbers with a common denominator $Q\in\mathbb{Z}^+$ (referred to as $Q$-rationality), or irrational numbers taking values in a rationally independent set of bounded cardinality known to the learner; we collectively call this the mixed-range assumption. Using a novel combination of the PSLQ integer relation detection algorithm and the Lenstra-Lenstra-Lov\'asz (LLL) lattice basis reduction algorithm, we propose a polynomial-time algorithm which provably recovers a $\beta^*\in\mathbb{R}^p$ satisfying the mixed-range assumption from its linear measurements $Y=X\beta^*\in\mathbb{R}^n$, for a large class of distributions of the random entries of $X$, even with a single measurement ($n=1$). In the noisy setting, we propose a polynomial-time, lattice-based algorithm which recovers a $\beta^*\in\mathbb{R}^p$ satisfying the $Q$-rationality property from its noisy measurements $Y=X\beta^*+W\in\mathbb{R}^n$, even from a single sample ($n=1$). We further establish that for large $Q$ and normal noise, this algorithm tolerates an information-theoretically optimal level of noise. We then apply these ideas to develop a polynomial-time, single-sample algorithm for the phase retrieval problem. Our methods address the single-sample ($n=1$) regime, where sparsity-based methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) and Basis Pursuit are known to fail.
Furthermore, our results also reveal algorithmic connections between the high-dimensional linear regression problem, and the integer relation detection, randomized subset-sum, and shortest vector problems. 
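The Q-rationality assumption can be made concrete on a toy instance. The sketch below is not the paper's lattice-based algorithm; it simply brute-forces the bounded numerators to show how a single generic measurement (n = 1) can pin down a Q-rational β. All names and bounds here are illustrative:

```python
import itertools

def recover_q_rational(X, Y, Q, bound, tol=1e-9):
    """Search numerators k_j in [-bound, bound] with beta_j = k_j / Q
    such that X @ beta reproduces Y. Exponential in p: a toy stand-in
    for the paper's polynomial-time, LLL-based recovery."""
    p = len(X[0])
    for nums in itertools.product(range(-bound, bound + 1), repeat=p):
        beta = [k / Q for k in nums]
        if all(abs(sum(xi[j] * beta[j] for j in range(p)) - yi) < tol
               for xi, yi in zip(X, Y)):
            return beta
    return None

# One measurement (n = 1) of a Q-rational vector with Q = 4.
X = [[0.123456, 0.654321, 0.918273]]
beta_true = [0.25, -0.5, 0.75]
Y = [sum(x * b for x, b in zip(X[0], beta_true))]
print(recover_q_rational(X, Y, Q=4, bound=3))  # recovers beta_true
```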
  3. Abstract Background

    Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.

    Methods

    Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.
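Of the four classifiers, K-Nearest Neighbors is simple enough to sketch in a few lines. The toy data below stands in for the transcript features; nothing here reproduces the study's actual pipeline:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points nearest to x
    in Euclidean distance."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Two synthetic 'expression' clusters: label 1 = IA, 0 = control.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, [0.95, 0.9], k=3))  # → 1
```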

    Results

    Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance.

    Conclusions

    We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and to further investigate the effect of covariates.

     
  4. Abstract

    Large dams are a leading cause of river ecosystem degradation. Although dams have cumulative effects as water flows downstream in a river network, most flow alteration research has focused on local impacts of single dams. Here we examined the highly regulated Colorado River Basin (CRB) to understand how flow alteration propagates in river networks, as influenced by the location and characteristics of dams as well as the structure of the river network, including the presence of tributaries. We used a spatial Markov network model informed by 117 upstream-downstream pairs of monthly flow series (2003-2017) to estimate flow alteration from 84 intermediate-to-large dams representing >83% of the total storage in the CRB. Using Least Absolute Shrinkage and Selection Operator regression, we then investigated how flow alteration was influenced by local dam properties (e.g., purpose, storage capacity) and network-level attributes (e.g., position, upstream cumulative storage). Flow alteration was highly variable across the network, but tended to accumulate downstream and remained high in the main stem. Dam impacts were explained by network-level attributes (63%) more than by local dam properties (37%), underscoring the need to consider network context when assessing dam impacts. High-impact dams were often located in sub-watersheds with high levels of native fish biodiversity, fish imperilment, or species requiring seasonal flows that are no longer present. These three biodiversity dimensions, as well as the amount of dam-free downstream habitat, indicate potential to restore river ecosystems via controlled flow releases. Our methods are transferable and could guide screening for dam reoperation in other highly regulated basins.

     
  5. Abstract

    In statistics, the least absolute shrinkage and selection operator (Lasso) is a regression method that performs both variable selection and regularization. A large body of literature discusses the statistical properties of the regression coefficients estimated by the Lasso, but a comprehensive review of the algorithms for solving the underlying optimization problem has been lacking. In this review, we summarize five representative algorithms for optimizing the Lasso objective: the iterative shrinkage-thresholding algorithm (ISTA), the fast iterative shrinkage-thresholding algorithm (FISTA), the coordinate gradient descent algorithm (CGDA), the smooth L1 algorithm (SLA), and the path following algorithm (PFA). We also compare their convergence rates, as well as their potential strengths and weaknesses.
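ISTA, the first algorithm in the list, alternates a gradient step on the least-squares term with soft-thresholding. A minimal pure-Python sketch on a tiny problem; the step size and iteration count are illustrative, and a real implementation would set the step from the Lipschitz constant of the gradient:

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(X, y, lam, step, iters=500):
    """Minimize 0.5 * ||X beta - y||^2 + lam * ||beta||_1 by ISTA."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n))
                for j in range(p)]
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam)
                for j in range(p)]
    return beta

# Orthogonal design: the Lasso solution is soft_threshold(y_j, lam).
print(ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0))
# → [2.0, 0.0]
```

On this orthogonal design the iteration reaches the closed-form solution immediately; on general designs it converges at the well-known O(1/k) rate, which FISTA improves to O(1/k²).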

    This article is categorized under:

    Statistical Models > Linear Models

    Algorithms and Computational Methods > Numerical Methods

    Algorithms and Computational Methods > Computational Complexity

     