NSF PAR Search | NSF Public Access Repository


Generalized kernel two-sample tests

https://doi.org/10.1093/biomet/asad068

Song, Hoseung; Chen, Hao (November 2023, Biometrika)

Summary: Kernel two-sample tests have been widely used for multivariate data to test equality of distributions. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and do not work well for some scenarios when the dimension of the data is moderate to high due to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and nonmusks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in an R package kerTests.
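For orientation, the kind of existing kernel two-sample test that the summary refers to can be sketched as a maximum mean discrepancy (MMD) permutation test. This is a minimal illustration of that classical baseline, not the new statistic proposed in the paper and not the kerTests API. The Gaussian kernel, the median-heuristic bandwidth, and all function names are assumptions of this sketch.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding


def mmd2_unbiased(k, m):
    """Unbiased MMD^2 estimate from a pooled kernel matrix; the first m rows are sample X."""
    n = k.shape[0] - m
    kxx, kyy, kxy = k[:m, :m], k[m:, m:], k[:m, m:]
    within_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    within_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return within_x + within_y - 2.0 * kxy.mean()


def mmd_permutation_test(x, y, n_perm=500, seed=0):
    """Return (MMD^2 estimate, permutation p-value) for samples x and y (rows = observations)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    m, total = x.shape[0], pooled.shape[0]
    sq = pairwise_sq_dists(pooled, pooled)
    # Assumption: median heuristic for the Gaussian kernel bandwidth.
    bandwidth = np.sqrt(np.median(sq[np.triu_indices(total, k=1)]))
    k = np.exp(-sq / (2.0 * bandwidth**2))
    observed = mmd2_unbiased(k, m)
    exceed = 0
    for _ in range(n_perm):
        # Relabel observations by permuting rows and columns of the kernel matrix together.
        idx = rng.permutation(total)
        if mmd2_unbiased(k[np.ix_(idx, idx)], m) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Computing the kernel matrix once and permuting its indices keeps each permutation cheap, although the overall cost still grows quadratically with the pooled sample size.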
New multivariate tests for assessing covariate balance in matched observational studies

https://doi.org/10.1111/biom.13395

Chen, Hao; Small, Dylan_S (November 2020, Biometrics)

Abstract: We propose new tests for assessing whether covariates in a treatment group and matched control group are balanced in observational studies. The tests exhibit high power under a wide range of multivariate alternatives, some of which existing tests have little power for. The asymptotic permutation null distributions of the proposed tests are studied and the P-values calculated through the asymptotic results work well in simulation studies, facilitating the application of the test to large data sets. The tests are illustrated in a study of the effect of smoking on blood lead levels. The proposed tests are implemented in an R package BalanceCheck.
Asymptotic Distribution-Free Change-Point Detection for Modern Data Based on a New Ranking Scheme

https://doi.org/10.1109/TIT.2025.3575858

Zhou, Doudou; Chen, Hao (June 2025, IEEE Transactions on Information Theory)

Free, publicly-accessible full text available June 2, 2026
On the tightness of graph-based statistics

https://doi.org/10.1214/25-EJS2367

Chu, Lynna; Chen, Hao (January 2025, Electronic Journal of Statistics)

Full Text Available
Practical and Powerful Kernel-Based Change-Point Detection

https://doi.org/10.1109/TSP.2024.3479274

Song, Hoseung; Chen, Hao (October 2024, IEEE Transactions on Signal Processing)

Full Text Available
Limiting distributions of graph-based test statistics on sparse and dense graphs

https://doi.org/10.3150/23-BEJ1616

Zhu, Yejiong; Chen, Hao (February 2024, Bernoulli)

Full Text Available
A new ranking scheme for modern data and its application to two-sample hypothesis testing

Zhou, Doudou; Chen, Hao (July 2023, Proceedings of Machine Learning Research)

Rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing due to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as component-wise, spatial, and depth-based ranks do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of dealing with the ranks of observations, we propose two types of ranks applicable to complex data based on a similarity graph constructed on observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their use, both new ranks are used to construct test statistics for two-sample hypothesis testing, which converge to the $$\chi_2^2$$ distribution under the permutation null distribution and some mild conditions on the ranks, enabling easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.
Full Text Available
Likelihood Scores for Sparse Signal and Change-Point Detection

https://doi.org/10.1109/TIT.2023.3242297

Hu, Shouri; Huang, Jingyan; Chen, Hao; Chan, Hock Peng (June 2023, IEEE Transactions on Information Theory)

Full Text Available
Graph-Based Change-Point Analysis

https://doi.org/10.1146/annurev-statistics-122121-033817

Chen, Hao; Chu, Lynna (March 2023, Annual Review of Statistics and Its Application)

Recent technological advances allow for the collection of massive data in the study of complex phenomena over time and/or space in various fields. Many of these data involve sequences of high-dimensional or non-Euclidean measurements, where change-point analysis is a crucial early step in understanding the data. Segmentation, or offline change-point analysis, divides data into homogeneous temporal or spatial segments, making subsequent analysis easier; its online counterpart detects changes in sequentially observed data, allowing for real-time anomaly detection. This article reviews a nonparametric change-point analysis framework that utilizes graphs representing the similarity between observations. This framework can be applied to data as long as a reasonable dissimilarity distance among the observations can be defined. Thus, this framework can be applied to a wide range of applications, from high-dimensional data to non-Euclidean data, such as imaging data or network data. In addition, analytic formulas can be derived to control the false discoveries, making them easy off-the-shelf data analysis tools.
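The core idea of the graph-based framework described above can be illustrated with a deliberately simplified sketch: build a similarity graph on the observations, then for each candidate split time count the edges that connect an observation before the split to one after it. Under the permutation null, each edge straddles a split at t with probability 2t(n−t)/(n(n−1)), so an unusually low ratio of observed to expected crossing edges suggests a change. The k-nearest-neighbor graph, the unstandardized ratio, the `margin` parameter, and all function names are assumptions of this sketch; the reviewed methods use standardized statistics with analytic false-discovery control, which is not reproduced here.

```python
import numpy as np


def knn_edges(data, k):
    """Undirected k-nearest-neighbor graph as a sorted list of (i, j) pairs with i < j."""
    sq = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbor
    edges = set()
    for i in range(data.shape[0]):
        for j in np.argsort(sq[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)


def edge_count_scan(data, k=5, margin=10):
    """Scan candidate change points t (the first t observations form the left segment).

    Returns the t that minimizes observed / expected crossing edges, and that ratio.
    """
    n = data.shape[0]
    edges = knn_edges(data, k)
    lo = np.array([e[0] for e in edges])
    hi = np.array([e[1] for e in edges])
    best_t, best_ratio = None, np.inf
    for t in range(margin, n - margin + 1):
        # An edge crosses the split when one endpoint is in 0..t-1 and the other in t..n-1.
        crossing = np.sum((lo < t) & (hi >= t))
        expected = len(edges) * 2.0 * t * (n - t) / (n * (n - 1))
        ratio = crossing / expected
        if ratio < best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, best_ratio
```

Because the scan only needs pairwise dissimilarities to build the graph, the same outline applies to non-Euclidean observations once `knn_edges` is given a suitable distance.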
Full Text Available
