This article introduces a comprehensive framework to adjust a discrete test statistic for improving its hypothesis testing procedure. The adjustment minimizes the Wasserstein distance to a null‐approximating continuous distribution, tackling some fundamental challenges inherent in combining statistical significances derived from discrete distributions. The related theory justifies Lancaster's mid‐p and mean‐value chi‐squared statistics for Fisher's combination as special cases. To counter the conservative nature of Lancaster's testing procedures, we propose an updated null‐approximating distribution. It is achieved by further minimizing the Wasserstein distance to the adjusted statistics within an appropriate distribution family. Specifically, in the context of Fisher's combination, we propose an optimal gamma distribution as a substitute for the traditionally used chi‐squared distribution. This new approach yields an asymptotically consistent test that significantly improves Type I error control and enhances statistical power.
more »
« less
Testing high-dimensional multinomials with applications to text analysis
Abstract Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts.
more »
« less
- Award ID(s):
- 1943902
- PAR ID:
- 10531755
- Publisher / Repository:
- Oxford Academic
- Date Published:
- Journal Name:
- Journal of the Royal Statistical Society Series B: Statistical Methodology
- ISSN:
- 1369-7412
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper considers the problem of testing whether there exists a non‐negative solution to a possibly under‐determined system of linear equations with known coefficients. This hypothesis testing problem arises naturally in a number of settings, including random coefficient, treatment effect, and discrete choice models, as well as a class of linear programming problems. As a first contribution, we obtain a novel geometric characterization of the null hypothesis in terms of identified parameters satisfying an infinite set of inequality restrictions. Using this characterization, we devise a test that requires solving only linear programs for its implementation, and thus remains computationally feasible in the high‐dimensional applications that motivate our analysis. The asymptotic size of the proposed test is shown to equal at most the nominal level uniformly over a large class of distributions that permits the number of linear equations to grow with the sample size.more » « less
-
Abstract In many applications of hierarchical models, there is often interest in evaluating the inherent heterogeneity in view of observed data. When the underlying hypothesis involves parameters resting on the boundary of their support space such as variances and mixture proportions, it is a usual practice to entertain testing procedures that rely on common heterogeneity assumptions. Such procedures, albeit omnibus for general alternatives, may entail a substantial loss of power for specific alternatives such as heterogeneity varying with covariates. We introduce a novel and flexible approach that uses covariate information to improve the power to detect heterogeneity, without imposing unnecessary restrictions. With continuous covariates, the approach does not impose a regression model relating heterogeneity parameters to covariates or rely on arbitrary discretizations. Instead, a scanning approach requiring continuous dichotomizations of the covariates is proposed. Empirical processes resulting from these dichotomizations are then used to construct the test statistics, with limiting null distributions shown to be functionals of tight random processes. We illustrate our proposals and results on a popular class of two‐component mixture models, followed by simulation studies and applications to two real datasets in cancer and caries research.more » « less
-
Abstract Regression models are often used to analyze discrete outcomes, but classical goodness‐of‐fit tests such as those based on the deviance or Pearson's statistic can be misleading or have little power in this context. To address this issue, we propose a new test, inspired by the work of Czado et al. (Biometrics, 65(4):1254–1261, 2009), which involves no randomization, tuning parameter, or binning of covariates. The statistic's large‐sample distribution under the null hypothesis is determined; as it involves unknown parameter values, one must resort to a bootstrap procedure to compute ‐values. Simulations are conducted to investigate the ability of the test to detect a broad range of model misspecifications commonly seen in practice. The proposed procedure is seen to perform well in all the scenarios considered as well as on real data.more » « less
-
Abstract Mediation analysis aims to assess if, and how, a certain exposure influences an outcome of interest through intermediate variables. This problem has recently gained a surge of attention due to the tremendous need for such analyses in scientific fields. Testing for the mediation effect (ME) is greatly challenged by the fact that the underlying null hypothesis (i.e. the absence of MEs) is composite. Most existing mediation tests are overly conservative and thus underpowered. To overcome this significant methodological hurdle, we develop an adaptive bootstrap testing framework that can accommodate different types of composite null hypotheses in the mediation pathway analysis. Applied to the product of coefficients test and the joint significance test, our adaptive testing procedures provide type I error control under the composite null, resulting in much improved statistical power compared to existing tests. Both theoretical properties and numerical examples of the proposed methodology are discussed.more » « less
An official website of the United States government

