skip to main content


This content will become publicly available on February 28, 2025

Title: Testing high-dimensional multinomials with applications to text analysis
Abstract

Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts.

 
more » « less
Award ID(s):
1943902
PAR ID:
10531755
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford Academic
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
ISSN:
1369-7412
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper considers the problem of testing whether there exists a non‐negative solution to a possibly under‐determined system of linear equations with known coefficients. This hypothesis testing problem arises naturally in a number of settings, including random coefficient, treatment effect, and discrete choice models, as well as a class of linear programming problems. As a first contribution, we obtain a novel geometric characterization of the null hypothesis in terms of identified parameters satisfying an infinite set of inequality restrictions. Using this characterization, we devise a test that requires solving only linear programs for its implementation, and thus remains computationally feasible in the high‐dimensional applications that motivate our analysis. The asymptotic size of the proposed test is shown to equal at most the nominal level uniformly over a large class of distributions that permits the number of linear equations to grow with the sample size. 
    more » « less
  2. Summary

    The problem of detecting heterogeneous and heteroscedastic Gaussian mixtures is considered. The focus is on how the parameters of heterogeneity, heteroscedasticity and proportion of non-null component influence the difficulty of the problem. We establish an explicit detection boundary which separates the detectable region where the likelihood ratio test is shown to detect the presence of non-null effects reliably from the undetectable region where no method can do so. In particular, the results show that the detection boundary changes dramatically when the proportion of non-null component shifts from the sparse regime to the dense regime. Furthermore, it is shown that the higher criticism test, which does not require specific information on model parameters, is optimally adaptive to the unknown degrees of heterogeneity and heteroscedasticity in both the sparse and the dense cases.

     
    more » « less
  3. Summary

    Penalized regression spline models afford a simple mixed model representation in which variance components control the degree of non-linearity in the smooth function estimates. This motivates the study of lack-of-fit tests based on the restricted maximum likelihood ratio statistic which tests whether variance components are 0 against the alternative of taking on positive values. For this one-sided testing problem a further complication is that the variance component belongs to the boundary of the parameter space under the null hypothesis. Conditions are obtained on the design of the regression spline models under which asymptotic distribution theory applies, and finite sample approximations to the asymptotic distribution are provided. Test statistics are studied for simple as well as multiple-regression models.

     
    more » « less
  4. Summary

    Motivated by an imaging study, the paper develops a non-parametric testing procedure for testing the null hypothesis that two samples of curves observed at discrete grids and with noise have the same underlying distribution. The objective is to compare formally white matter tract profiles between healthy individuals and multiple-sclerosis patients, as assessed by conventional diffusion tensor imaging measures. We propose to decompose the curves by using functional principal component analysis of a mixture process, which we refer to as marginal functional principal component analysis. This approach reduces the dimension of the testing problem in a way that enables the use of traditional non-parametric univariate testing procedures. The procedure is computationally efficient and accommodates different sampling designs. Numerical studies are presented to validate the size and power properties of the test in many realistic scenarios. In these cases, the test proposed has been found to be more powerful than its primary competitor. Application to the diffusion tensor imaging data reveals that all the tracts studied are associated with multiple sclerosis and the choice of the diffusion tensor image measurement is important when assessing axonal disruption.

     
    more » « less
  5. Abstract

    When the observed data are contaminated with errors, the standard two‐sample testing approaches that ignore measurement errors may produce misleading results, including a higher type‐I error rate than the nominal level. To tackle this inconsistency, a nonparametric test is proposed for testing equality of two distributions when the observed contaminated data follow the classical additive measurement error model. The proposed test takes into account the presence of errors in the observed data, and the test statistic is defined in terms of the (deconvoluted) characteristic functions of the latent variables. Proposed method is applicable to a wide range of scenarios as no parametric restrictions are imposed either on the distribution of the underlying latent variables or on the distribution of the measurement errors. Asymptotic null distribution of the test statistic is derived, which is given by an integral of a squared Gaussian process with a complicated covariance structure. For data‐based calibration of the test, a new nonparametric Bootstrap method is developed under the two‐sample measurement error framework and its validity is established. Finite sample performance of the proposed test is investigated through simulation studies, and the results show superior performance of the proposed method than the standard tests that exhibit inconsistent behavior. Finally, the proposed method was applied to real data sets from the National Health and Nutrition Examination Survey. AnRpackageMEtestis available through CRAN.

     
    more » « less