In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables, and determine which variables are truly associated with the response. In Foygel Barber & Candès (2015, Ann. Statist., 43, 2055–2085), the authors introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we propose a prototype knockoff filter for group selection by extending the Reid–Tibshirani (2016, Biostatistics, 17, 364–376) prototype method. Our prototype knockoff filter improves the computational efficiency and statistical power of the Reid–Tibshirani prototype method when it is applied for group selection. In some cases when the group features are spanned by one or a few hidden factors, we demonstrate that the Principal Component Analysis (PCA) prototype knockoff filter outperforms the Dai–Foygel Barber (2016, 33rd International Conference on Machine Learning (ICML 2016)) group knockoff filter. We present several numerical experiments to compare our prototype knockoff filter with the Reid–Tibshirani prototype method and the group knockoff filter. We have also conducted some analysis of the knockoff filter. Our analysis reveals that some knockoff path method statistics, including the Lasso path statistic, may lead to loss of power for certain design matrices and a specially designed response even if their signal strengths are still relatively strong.
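As background for the knockoff filter referred to above, the selection step of the knockoff+ procedure of Barber and Candès can be sketched as follows. This is a minimal illustration, not the prototype filter proposed in the abstract: it assumes the knockoff statistics `w` have already been computed by some method (large positive values suggest a true signal), and the function names are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    The threshold is the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q,
    or infinity if no such t exists.
    """
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        neg = sum(1 for v in w if v <= -t)  # estimate of false discoveries
        pos = sum(1 for v in w if v >= t)   # number of selections at t
        if (1 + neg) / max(1, pos) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


if __name__ == "__main__":
    # Made-up statistics, for illustration only.
    w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, 2.0, -0.2, 1.5]
    print(knockoff_plus_threshold(w, 0.2), knockoff_select(w, 0.2))
```

The count of large negative statistics serves as an estimate of the number of false positives among the large positive ones, which is what makes the ratio an estimate of the false discovery proportion.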
In this paper, we develop a drift homotopy implicit particle filter method. Our approach adopts the concept of drift homotopy in the resampling procedure of the particle filter for solving the nonlinear filtering problem, and introduces an implicit particle filter to improve the efficiency of the drift homotopy resampling procedure. Numerical experiments are carried out to demonstrate the effectiveness and efficiency of our drift homotopy implicit particle filter.
- Journal Name: Discrete & Continuous Dynamical Systems - S
- Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The thermal radiative transfer (TRT) equations form an integro-differential system that describes the propagation and collisional interactions of photons. Computing accurate and efficient numerical solutions of the TRT equations is challenging for several reasons, the first of which is that TRT is defined on a high-dimensional phase space that includes the independent variables of time, space, and velocity. In order to reduce the dimensionality of the phase space, classical approaches such as the P$_N$ (spherical harmonics) or the S$_N$ (discrete ordinates) ansatz are often used in the literature. In this work, we introduce a novel approach: the hybrid discrete (H$^T_N$) approximation to the thermal radiative transfer equations. This approach acquires desirable properties of both P$_N$ and S$_N$, and indeed reduces to each of these approximations in various limits: H$^1_N$ $\equiv$ P$_N$ and H$^T_0$ $\equiv$ S$_T$. We prove that H$^T_N$ results in a system of hyperbolic partial differential equations for all $T\ge 1$ and $N\ge 0$. Another challenge in solving the TRT system is the inherent stiffness due to the large timescale separation between propagation and collisions, especially in the diffusive (i.e., highly collisional) regime. This stiffness challenge can be partially overcome via implicit time integration, although fully implicit methods may become computationally expensive due to the strong nonlinearity and system size. On the other hand, explicit time-stepping schemes that are not also asymptotic-preserving in the highly collisional limit require resolving the mean free path between collisions, making such schemes prohibitively expensive. In this work we develop a numerical method that is based on a nodal discontinuous Galerkin discretization in space, coupled with a semi-implicit discretization in time. In particular, we make use of a second-order explicit Runge-Kutta scheme for the streaming term and an implicit Euler scheme for the material coupling term.
Furthermore, in order to solve the material energy equation implicitly after each predictor and corrector step, we linearize the temperature term using a Taylor expansion; this avoids the need for an iterative procedure, and therefore improves efficiency. In order to reduce unphysical oscillations, we apply a slope limiter after each time step. Finally, we conduct several numerical experiments to verify the accuracy, efficiency, and robustness of the H$^T_N$ ansatz and the numerical discretizations.
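The idea of linearizing the temperature term so that the implicit material coupling step needs no iteration can be illustrated on a zero-dimensional toy problem. The sketch below is not the scheme of the abstract: it assumes a single radiation energy `E` and material temperature `T` exchanging energy at a rate `sigma * (E - T**4)` with all physical constants normalized to one, and the names `sigma`, `cv`, and both functions are hypothetical. The emission term is replaced by its first-order Taylor expansion, $T_{\text{new}}^4 \approx T^4 + 4T^3 (T_{\text{new}} - T)$, so the implicit Euler update reduces to a closed-form linear solve.

```python
def linearized_implicit_step(E, T, dt, sigma, cv=1.0):
    """One implicit Euler step of the toy coupling
        dE/dt = -sigma * (E - T^4),   cv * dT/dt = sigma * (E - T^4),
    with T_new^4 linearized about the old temperature T.
    """
    beta = 4.0 * T**3  # derivative of T^4 at the old temperature
    # Linearized residual r = E_new - T_new^4, obtained in closed form.
    r = (E - T**4) / (1.0 + dt * sigma * (1.0 + beta / cv))
    E_new = E - dt * sigma * r
    T_new = T + dt * sigma * r / cv
    return E_new, T_new


def relax(E, T, dt, sigma, steps, cv=1.0):
    """Apply the linearized implicit step repeatedly."""
    for _ in range(steps):
        E, T = linearized_implicit_step(E, T, dt, sigma, cv)
    return E, T


if __name__ == "__main__":
    # Stiff coupling: dt * sigma = 100, far beyond an explicit stability limit.
    E, T = relax(E=1.0, T=0.5, dt=0.1, sigma=1.0e3, steps=50)
    print(E, T, E - T**4)
```

By construction each step leaves `E + cv * T` unchanged, so the toy update conserves total energy even though the emission term is only approximated.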
Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single‐nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow‐up studies. Current multi‐SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA‐dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single‐marker analysis; when additive‐only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive‐only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.
Summary: Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample size, making it difficult to apply them in the presence of massive sample sizes. Moreover, resampling is usually necessary to evaluate the statistical significance of the resulting test statistics at finite sample sizes, further worsening the computational burden. We introduce a scalable, resampling-free approach to testing the independence between two random vectors by breaking down the task into simple univariate tests of independence on a collection of $2\times 2$ contingency tables constructed through sequential coarse-to-fine discretization of the sample space, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We show that our approach can achieve strong control of the level of the testing procedure at any sample size without resampling or asymptotic approximation and establish its large-sample consistency. We demonstrate through an extensive simulation study its substantial computational advantage in comparison to existing approaches while achieving robust statistical power under various dependency scenarios, and illustrate how its divide-and-conquer nature can be exploited not just to test independence, but also to learn the nature of the underlying dependency. Finally, we demonstrate the use of our method by analysing a dataset from a flow cytometry experiment.
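A single building block of such an approach, one univariate test on one $2\times 2$ table, can be sketched as below. This is only an illustration under stated assumptions: the table is formed by splitting two scalar samples at their medians (the coarsest possible discretization), Fisher's exact test is used as the univariate test (the abstract does not name the test), and the sequential coarse-to-fine construction and multiple-testing correction are not reproduced. Both function names are hypothetical.

```python
from math import comb
from statistics import median


def median_split_table(x, y):
    """2x2 table of counts after splitting x and y at their sample medians."""
    mx, my = median(x), median(y)
    table = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        table[int(xi > mx)][int(yi > my)] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table of counts.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    x = list(range(20))
    y = list(range(20))  # perfectly dependent on x, for illustration
    print(fisher_exact_two_sided(median_split_table(x, y)))
```

Because the exact test conditions on the table margins, its p-value is valid at any sample size without resampling, which is the property the abstract relies on.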
Motivated by gene coexpression pattern analysis, we propose a novel sample quantile contingency (SQUAC) statistic to infer quantile associations conditioning on covariates. It features enhanced flexibility in handling variables with both arbitrary distributions and complex association patterns conditioning on covariates. We first derive its asymptotic null distribution, and then develop a multiple-testing procedure based on the SQUAC statistic to test simultaneously the independence between one pair of variables conditioning on covariates for all p(p − 1)/2 pairs. Here, p is the length of the outcomes and could exceed the sample size. The testing procedure does not require resampling or perturbation and thus is computationally efficient. We show, by theory and numerical experiments, that this testing method asymptotically controls the false discovery rate. It outperforms all alternative methods when complex association patterns exist. Applied to a gastric cancer data set, this testing method successfully inferred the gene coexpression networks of early and late stage patients. It identified more changes in the networks that are associated with cancer survival. We extend our method to the case in which both the length of the outcomes and the length of the covariates exceed the sample size, and show that the asymptotic theory still holds.