skip to main content


Title: Empirical likelihood method for complete independence test on high-dimensional data
Given a random sample of size n from a p dimensional random vector, we are interested in testing whether the p components of the random vector are mutually independent. This is the so-called complete independence test. In the multivariate normal case, it is equivalent to testing whether the correlation matrix is an identity matrix. In this paper, we propose a one-sided empirical likelihood method for the complete independence test based on squared sample correlation coefficients. The limiting distribution for our one-sided empirical likelihood test statistic is proved to be Z^2I(Z > 0) when both n and p tend to infinity, where Z is a standard normal random variable. In order to improve the power of the empirical likelihood test statistic, we also introduce a rescaled empirical likelihood test statistic. We carry out an extensive simulation study to compare the performance of the rescaled empirical likelihood method and two other statistics.  more » « less
Award ID(s):
1916014
NSF-PAR ID:
10355249
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Journal of statistical computation and simulation
ISSN:
0094-9655
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We introduce an L2-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed on the basis of the pairwise distance covariance and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small albeit the dimension is high and is shown to identify non-linear dependence in empirical data analysis successfully. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of good power properties for our distance-covariance-based test, we further show that an infeasible version of our test statistic has the rate optimality in the class of Gaussian distributions with equal correlation.

     
    more » « less
  2. null (Ed.)
    Summary This paper is concerned with empirical likelihood inference on the population mean when the dimension $p$ and the sample size $n$ satisfy $p/n\rightarrow c\in [1,\infty)$. As shown in Tsao (2004), the empirical likelihood method fails with high probability when $p/n>1/2$ because the convex hull of the $n$ observations in $\mathbb{R}^p$ becomes too small to cover the true mean value. Moreover, when $p> n$, the sample covariance matrix becomes singular, and this results in the breakdown of the first sandwich approximation for the log empirical likelihood ratio. To deal with these two challenges, we propose a new strategy of adding two artificial data points to the observed data. We establish the asymptotic normality of the proposed empirical likelihood ratio test. The proposed test statistic does not involve the inverse of the sample covariance matrix. Furthermore, its form is explicit, so the test can easily be carried out with low computational cost. Our numerical comparison shows that the proposed test outperforms some existing tests for high-dimensional mean vectors in terms of power. We also illustrate the proposed procedure with an empirical analysis of stock data. 
    more » « less
  3. Abstract The $p$-tensor Ising model is a one-parameter discrete exponential family for modeling dependent binary data, where the sufficient statistic is a multi-linear form of degree $p \geqslant 2$. This is a natural generalization of the matrix Ising model that provides a convenient mathematical framework for capturing, not just pairwise, but higher-order dependencies in complex relational data. In this paper, we consider the problem of estimating the natural parameter of the $p$-tensor Ising model given a single sample from the distribution on $N$ nodes. Our estimate is based on the maximum pseudolikelihood (MPL) method, which provides a computationally efficient algorithm for estimating the parameter that avoids computing the intractable partition function. We derive general conditions under which the MPL estimate is $\sqrt N$-consistent, that is, it converges to the true parameter at rate $1/\sqrt N$. Our conditions are robust enough to handle a variety of commonly used tensor Ising models, including spin glass models with random interactions and models where the rate of estimation undergoes a phase transition. In particular, this includes results on $\sqrt N$-consistency of the MPL estimate in the well-known $p$-spin Sherrington–Kirkpatrick model, spin systems on general $p$-uniform hypergraphs and Ising models on the hypergraph stochastic block model (HSBM). In fact, for the HSBM we pin down the exact location of the phase transition threshold, which is determined by the positivity of a certain mean-field variational problem, such that above this threshold the MPL estimate is $\sqrt N$-consistent, whereas below the threshold no estimator is consistent. Finally, we derive the precise fluctuations of the MPL estimate in the special case of the $p$-tensor Curie–Weiss model, which is the Ising model on the complete $p$-uniform hypergraph. An interesting consequence of our results is that the MPL estimate in the Curie–Weiss model saturates the Cramer–Rao lower bound at all points above the estimation threshold, that is, the MPL estimate incurs no loss in asymptotic statistical efficiency in the estimability regime, even though it is obtained by minimizing only an approximation of the true likelihood function for computational tractability. 
    more » « less
  4. Abstract

    When testing hypotheses about which of two competing models is better, say A and B, the difference is often not significant. An alternative, complementary approach, is to measure how often model A is better than model B regardless of how slight or large the difference. The hypothesis concerns whether or not the percentage of time that model A is better than model B is larger than 50%. One generalized test statistic that can be used is the power-divergence test, which encompasses many familiar goodness-of-fit test statistics, such as the loglikelihood-ratio and PearsonX2tests. Theoretical results justify using thedistribution for the entire family of test statistics, wherekis the number of categories. However, these results assume that the underlying data are independent and identically distributed, which is often violated. Empirical results demonstrate that the reduction to two categories (i.e., model A is better than model B versus model B is better than A) results in a test that is reasonably robust to even severe departures from temporal independence, as well as contemporaneous correlation. The test is demonstrated on two different example verification sets: 6-h forecasts of eddy dissipation rate (m2/3s−1) from two versions of the Graphical Turbulence Guidance model and for 12-h forecasts of 2-m temperature (°C) and 10-m wind speed (m s−1) from two versions of the High-Resolution Rapid Refresh model. The novelty of this paper is in demonstrating the utility of the power-divergence statistic in the face of temporally dependent data, as well as the emphasis on testing for the “frequency-of-better” alongside more traditional measures.

     
    more » « less
  5. Summary

    Penalized regression spline models afford a simple mixed model representation in which variance components control the degree of non-linearity in the smooth function estimates. This motivates the study of lack-of-fit tests based on the restricted maximum likelihood ratio statistic which tests whether variance components are 0 against the alternative of taking on positive values. For this one-sided testing problem a further complication is that the variance component belongs to the boundary of the parameter space under the null hypothesis. Conditions are obtained on the design of the regression spline models under which asymptotic distribution theory applies, and finite sample approximations to the asymptotic distribution are provided. Test statistics are studied for simple as well as multiple-regression models.

     
    more » « less