Title: High-dimensional randomization-based inference capitalizing on classical design and modern computing

A common complication that can arise with analyses of high-dimensional data is the repeated use of hypothesis tests. A second complication, especially with small samples, is the reliance on asymptotic p-values. Our proposed approach for addressing both complications uses a scientifically motivated scalar summary statistic, and although not entirely novel, seems rarely used. The method is illustrated using a crossover study of seventeen participants examining the effect of exposure to ozone versus clean air on the DNA methylome, where the multivariate outcome involved 484,531 genomic locations. Our proposed test yields a single null randomization distribution, and thus a single Fisher-exact p-value that is statistically valid whatever the structure of the data. However, the relevance and power of the resultant test require the careful a priori selection of a single test statistic. The common practice of using asymptotic p-values or meaningless thresholds for “significance” is inapposite in general.
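
The idea of a single randomization distribution for one scalar summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each participant contributes a vector of within-person differences (ozone minus clean air) across locations, that under the sharp null of no effect the alternative treatment order would simply flip the sign of that participant's difference vector, and that each participant's order was assigned independently. The summary statistic used here (the mean over locations of the absolute average difference) and the function names are hypothetical choices made for the example; the abstract does not specify the statistic used in the study.

```python
import itertools


def summary_statistic(diffs, signs):
    """Hypothetical scalar summary: mean over locations of |average signed difference|."""
    n = len(diffs)
    n_loc = len(diffs[0])
    total = 0.0
    for j in range(n_loc):
        col_mean = sum(s * d[j] for s, d in zip(signs, diffs)) / n
        total += abs(col_mean)
    return total / n_loc


def exact_randomization_p_value(diffs):
    """Enumerate every sign-flip assignment and return (observed statistic, exact p-value).

    Assumes independent order assignment per participant, so all 2**n
    sign patterns are equally likely under the sharp null hypothesis.
    """
    n = len(diffs)
    observed = summary_statistic(diffs, [1] * n)
    at_least_as_extreme = 0
    n_assignments = 0
    for signs in itertools.product((1, -1), repeat=n):
        n_assignments += 1
        # Small tolerance guards against floating-point ties being missed.
        if summary_statistic(diffs, signs) >= observed - 1e-12:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_assignments


if __name__ == "__main__":
    # Toy data: 4 participants, 3 locations (made-up numbers for illustration only).
    toy = [[1.0, 2.0, 0.5], [0.8, 1.5, 0.7], [1.2, 1.8, 0.4], [0.9, 2.2, 0.6]]
    obs, p = exact_randomization_p_value(toy)
    print(obs, p)
```

With seventeen participants there are 2^17 = 131,072 possible assignments under this assumption, so full enumeration is feasible in principle; with hundreds of thousands of locations a vectorized implementation or a random sample of assignments would be the practical choice. However many locations enter the summary, the procedure produces one null distribution and one p-value.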

Publisher / Repository: Springer Science + Business Media
Page Range / eLocation ID: p. 9-26
Sponsoring Org: National Science Foundation
More Like this
  1. Summary

    Penalized regression spline models afford a simple mixed model representation in which variance components control the degree of non-linearity in the smooth function estimates. This motivates the study of lack-of-fit tests based on the restricted maximum likelihood ratio statistic which tests whether variance components are 0 against the alternative of taking on positive values. For this one-sided testing problem a further complication is that the variance component belongs to the boundary of the parameter space under the null hypothesis. Conditions are obtained on the design of the regression spline models under which asymptotic distribution theory applies, and finite sample approximations to the asymptotic distribution are provided. Test statistics are studied for simple as well as multiple-regression models.

  2. Summary

    Principal component analysis has become a fundamental tool of functional data analysis. It represents the functional data as $X_i(t) = \mu(t) + \sum_{1 \le l < \infty} \eta_{i,l} v_l(t)$, where $\mu$ is the common mean, the $v_l$ are the eigenfunctions of the covariance operator and the $\eta_{i,l}$ are the scores. Inferential procedures assume that the mean function $\mu(t)$ is the same for all values of $i$. If, in fact, the observations do not come from one population, but rather their mean changes at some point(s), the results of principal component analysis are confounded by the change(s). It is therefore important to develop a methodology to test the assumption of a common functional mean. We develop such a test using quantities which can be readily computed in the R package fda. The null distribution of the test statistic is asymptotically pivotal with a well-known asymptotic distribution. The asymptotic test has excellent finite sample performance. Its application is illustrated on temperature data from England.

  3. Abstract

    We examine the accuracy of p values obtained using the asymptotic mean and variance (MV) correction to the distribution of the sample standardized root mean squared residual (SRMR) proposed by Maydeu-Olivares to assess the exact fit of SEM models. In a simulation study, we found that under normality, the MV-corrected SRMR statistic provides reasonably accurate Type I errors even in small samples and for large models, clearly outperforming the current standard, that is, the likelihood ratio (LR) test. When data show excess kurtosis, MV-corrected SRMR p values are only accurate in small models (p = 10), or in medium-sized models (p = 30) if no skewness is present and sample sizes are at least 500. Overall, when data are not normal, the MV-corrected LR test seems to outperform the MV-corrected SRMR. We elaborate on these findings by showing that the asymptotic approximation to the mean of the SRMR sampling distribution is quite accurate, while the asymptotic approximation to the standard deviation is not.
  4. Abstract

    Copula modeling is a popular method for capturing the dependence among marginal distributions in multivariate censored data. As many copula models are available, it is essential to check whether the chosen copula model fits the data well. Existing approaches to testing the fit of copula models are mainly for complete or right-censored data. No formal goodness-of-fit (GOF) test exists for interval-censored or recurrent events data. We develop a general GOF test for copula-based survival models using the information ratio (IR) to address this research gap. It can be applied to any copula family with a parametric form, such as the frequently used Archimedean, Gaussian, and D-vine families. The test statistic is easy to calculate, and the test procedure is straightforward to implement. We establish the asymptotic properties of the test statistic. The simulation results show that the proposed test controls the type-I error well and achieves adequate power when the dependence strength is moderate to high. Finally, we apply our method to test various copula models in analyzing multiple real datasets. Our method consistently separates different copula models for all these datasets in terms of model fit.

  5. Abstract

    Genome‐wide association studies (GWAS) are used to investigate genetic variants contributing to complex traits. Despite discovering many loci, a large proportion of “missing” heritability remains unexplained. Gene–gene interactions may help explain some of this gap. Traditionally, gene–gene interactions have been evaluated using parametric statistical methods such as linear and logistic regression, with multifactor dimensionality reduction (MDR) used to address sparseness of data in high dimensions. We propose a method for the analysis of gene–gene interactions across independent single‐nucleotide polymorphisms (SNPs) in two genes. Typical methods for this problem use statistics based on an asymptotic chi‐squared mixture distribution, which is not easy to use. Here, we propose a Kullback–Leibler‐type statistic, which follows an asymptotic, positive, normal distribution under the null hypothesis of no relationship between SNPs in the two genes, and is normally distributed under the alternative hypothesis. The performance of the proposed method is evaluated by simulation studies, which show promising results. The method is also used to analyze real data and identifies gene–gene interactions among RAB3A, MADD, and PTPRN on type 2 diabetes (T2D) status.
