Rank-based approaches are among the most popular nonparametric methods for univariate data in statistical problems such as hypothesis testing, owing to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks, such as component-wise, spatial, and depth-based ranks, do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of ranking observations directly, we propose two types of ranks applicable to complex data based on a similarity graph constructed on the observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weights of edges in the graph. To illustrate their use, both new ranks are used to construct test statistics for two-sample hypothesis testing; these converge to the $\chi^2_2$ distribution under the permutation null distribution and some mild conditions on the ranks, enabling easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on New York City taxi data, comparing travel patterns in consecutive months, and on a brain network dataset, comparing male and female subjects.
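The paper's graph-induced and overall ranks are not reproduced here, but the general idea of similarity-graph-based two-sample testing can be sketched with a simpler, well-known statistic: build a k-nearest-neighbor graph on the pooled sample and count edges joining the two samples, calibrated by a permutation null. All function names and the edge-count statistic below are illustrative assumptions, not the authors' test.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph_edges(X, k=5):
    """Edge list of an undirected k-nearest-neighbor similarity graph."""
    tree = cKDTree(X)
    _, idx = tree.query(X, k=k + 1)       # nearest neighbor of a point is itself
    edges = {tuple(sorted((i, j))) for i, row in enumerate(idx) for j in row[1:]}
    return np.array(sorted(edges))

def graph_two_sample_pvalue(X, Y, k=5, n_perm=500, seed=0):
    """Permutation p-value for a between-sample edge-count statistic:
    well-separated samples yield unusually FEW edges joining the groups."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))
    edges = knn_graph_edges(pooled, k)

    def cross_edges(lab):
        return int(np.sum(lab[edges[:, 0]] != lab[edges[:, 1]]))

    observed = cross_edges(labels)
    perm = [cross_edges(rng.permutation(labels)) for _ in range(n_perm)]
    return (1 + sum(t <= observed for t in perm)) / (n_perm + 1)
```

The key design point carries over to the paper's setting: the graph, not the raw coordinates, is what the test consumes, so any data type admitting a similarity measure can be handled.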
This content will become publicly available on February 1, 2026
Understanding and mitigating the impact of passing ships on underwater environmental estimation from ambient sound
We investigate the impact of low-rank interference on the problem of distinguishing between two seabed types using ambient sound as an acoustic source. The resulting frequency-domain snapshots follow a zero-mean, circularly-symmetric Gaussian distribution, where each seabed type has a unique covariance matrix. Detecting changes in the seabed type across distinct spatial locations can be formulated as a two-sample hypothesis test for equality of covariance, for which Box's M-test is the classical solution. Interference sources such as passing ships result in additive noise with a low-rank covariance that can reduce the performance of hypothesis testing. We first present a method to construct a worst-case interference field, making hypothesis testing as difficult as possible. We then provide an alternating optimization procedure to recover the interference-free covariance matrix. Experiments on synthetic data show that the optimized interferer can greatly reduce hypothesis testing performance, while our recovery method perfectly eliminates this interference for a sufficiently small interference rank. On real data from the New England Shelf Break Acoustics experiment, we show that our approach successfully mitigates interference, allowing for accurate hypothesis testing and improving bottom loss estimation.
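Box's M-test, cited above as the classical solution for testing equality of covariance, can be sketched for the two-group case as follows. This is the standard textbook form with the asymptotic chi-square approximation, not code from the paper, and the variable names are ours.

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(X1, X2):
    """Box's M-test for equality of two covariance matrices.
    Returns the chi-square statistic and asymptotic p-value."""
    n1, n2 = len(X1), len(X2)
    p = X1.shape[1]
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    # pooled covariance, weighted by degrees of freedom
    Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    _, ldp = np.linalg.slogdet(Sp)
    M = (n1 + n2 - 2) * ldp - (n1 - 1) * ld1 - (n2 - 1) * ld2
    # Box's small-sample correction factor (two-group case)
    c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1))) * (
        1 / (n1 - 1) + 1 / (n2 - 1) - 1 / (n1 + n2 - 2))
    stat = (1 - c) * M
    df = p * (p + 1) / 2
    return stat, chi2.sf(stat, df)
```

In the paper's setting the snapshots are complex Gaussian and the interference adds a low-rank term to both covariances, which is exactly what degrades a plain M-test and motivates the interference-removal step.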
- Award ID(s): 2046175
- PAR ID: 10615457
- Publisher / Repository: The Journal of the Acoustical Society of America
- Date Published:
- Journal Name: The Journal of the Acoustical Society of America
- Volume: 157
- Issue: 2
- ISSN: 0001-4966
- Page Range / eLocation ID: 811 to 823
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Mateu, Jorge (Ed.)
When dealing with very high-dimensional and functional data, rank deficiency of the sample covariance matrix often complicates tests for the population mean. To alleviate this rank-deficiency problem, Munk et al. (J Multivar Anal 99:815–833, 2008) proposed a neighborhood hypothesis testing procedure that tests whether the population mean is within a small, pre-specified neighborhood of a known quantity, M. How could we objectively specify a reasonable neighborhood, particularly when the sample space is unbounded? What should be the size of the neighborhood? In this article, we develop a modified neighborhood hypothesis testing framework to answer these two questions. We define the neighborhood as a proportion of the total amount of variation present in the population of functions under study and proceed to derive the asymptotic null distribution of the appropriate test statistic. Power analyses suggest that our approach is appropriate when the sample space is unbounded and is robust against error structures with nonzero mean. We then apply this framework to assess whether the near-default sigmoidal specification of dose-response curves is adequate for the widely used CCLE database. Results suggest that our methodology could be used as a pre-processing step before using conventional efficacy metrics obtained from sigmoid models (for example, IC50 or AUC) as downstream predictive targets.
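The asymptotic test in the article is not reproduced here; as a rough bootstrap analogue of the same idea, the sketch below declares the mean "within the neighborhood of M" when an upper confidence bound on the distance ||x̄ − M|| falls below a radius set as a proportion of the total variation. The function name, the bootstrap calibration, and the default proportion are all our illustrative assumptions.

```python
import numpy as np

def neighborhood_test(X, M, prop=0.05, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap sketch of a neighborhood test for the mean of X (n x p).
    The neighborhood radius is a proportion `prop` of total variation
    (trace of the sample covariance). Returns True when the upper
    confidence bound on ||mean - M|| lies inside the radius."""
    rng = np.random.default_rng(seed)
    n = len(X)
    radius = np.sqrt(prop * np.trace(np.cov(X, rowvar=False)))
    boot = [np.linalg.norm(X[rng.integers(0, n, n)].mean(axis=0) - M)
            for _ in range(n_boot)]
    upper = np.quantile(boot, 1 - alpha)   # upper bound on the distance
    return bool(upper < radius)
```

Tying the radius to the data's own variation is what removes the arbitrariness of a hand-picked neighborhood size, which is the article's central point.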
-
Robust PCA is a widely used statistical procedure for recovering an underlying low-rank matrix from grossly corrupted observations. This work treats robust PCA as a nonconvex optimization problem on the manifold of low-rank matrices and proposes two algorithms based on manifold optimization. It is shown that, with a properly designed initialization, the proposed algorithms are guaranteed to converge to the underlying low-rank matrix linearly. Compared with a previous approach based on factorization of low-rank matrices (Yi et al., 2016), the proposed algorithms reduce the theoretical dependence on the condition number of the underlying low-rank matrix. Simulations and real data examples confirm the competitive performance of our method.
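The manifold algorithms themselves are not reproduced here; a minimal alternating-projections sketch shows the decomposition the problem targets, splitting M into a rank-r part L and a sparse part S. The fixed threshold zeta and the stopping rule are our simplifying assumptions, not the paper's method.

```python
import numpy as np

def robust_pca_alt(M, r, zeta, n_iter=50):
    """Alternating-projections sketch of robust PCA, M ≈ L + S:
    L is rank-r, S keeps only entries whose residual exceeds zeta."""
    L = np.zeros_like(M)
    for _ in range(n_iter):
        R = M - L
        S = np.where(np.abs(R) > zeta, R, 0.0)  # large residuals = corruptions
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]         # best rank-r fit of cleaned matrix
    return L, S
```

The manifold view in the paper replaces the repeated full SVD with retractions on the rank-r manifold, which is where the improved condition-number dependence comes from.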
-
This paper expands the analysis of randomized low-rank approximation beyond the Gaussian distribution to four classes of random matrices: (1) independent sub-Gaussian entries, (2) independent sub-Gaussian columns, (3) independent bounded columns, and (4) independent columns with bounded second moment. Using a novel interpretation of the low-rank approximation error involving sample covariance matrices, we provide insight into what makes a good random matrix for randomized low-rank approximation. Although our bounds involve unspecified absolute constants (a consequence of the underlying nonasymptotic theory of random matrices), they allow for qualitative comparisons across distributions. The analysis offers some details on the minimal number of samples (the number of columns of the random matrix) and the error in the resulting low-rank approximation. We illustrate our analysis in the context of the randomized subspace iteration method as a representative algorithm for low-rank approximation; however, all the results are broadly applicable to other low-rank approximation techniques. We conclude our discussion with numerical examples using both synthetic and real-world test matrices.
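Randomized subspace iteration, the representative algorithm named above, can be sketched in a few lines. This follows the standard sketch-then-solve pattern with a Gaussian test matrix; the parameter defaults are our choices for illustration.

```python
import numpy as np

def randomized_low_rank(A, r, q=2, oversample=10, seed=0):
    """Rank-r approximation of A via randomized subspace iteration.
    q power iterations sharpen the singular-value decay; oversampling
    makes capturing the top-r subspace more reliable."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], r + oversample))  # test matrix
    Y = A @ Omega                     # sample the range of A
    for _ in range(q):
        Y, _ = np.linalg.qr(Y)        # re-orthonormalize for stability
        Y = A @ (A.T @ Y)             # one power iteration: (A A^T) Y
    Q, _ = np.linalg.qr(Y)            # orthonormal basis for the sampled range
    B = Q.T @ A                       # small (r+oversample) x n matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U[:, :r]) @ np.diag(s[:r]) @ Vt[:r]
```

The paper's question is what happens when the Gaussian `Omega` above is replaced by the other three classes of random matrices; the algorithm itself is unchanged.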
-
Esposito, Lauren (Ed.)
Abstract: This article investigates a form of rank deficiency in phenotypic covariance matrices derived from geometric morphometric data and its impact on measures of phenotypic integration. We first define a type of rank deficiency based on information theory and then demonstrate that this deficiency impairs the performance of phenotypic integration metrics in a model system. Lastly, we propose methods to treat this information rank deficiency. Our first goal is to establish how the rank of a typical geometric morphometric covariance matrix relates to the information entropy of its eigenvalue spectrum. This requires clear definitions of matrix rank, of which we define three: the full matrix rank (equal to the number of input variables), the mathematical rank (the number of nonzero eigenvalues), and the information rank or "effective rank" (equal to the number of nonredundant eigenvalues). We demonstrate that effective rank deficiency arises from a combination of methodological factors (Generalized Procrustes analysis, use of the correlation matrix, and insufficient sample size) as well as phenotypic covariance. Secondly, we use dire wolf jaws to document how differences in effective rank deficiency bias two metrics used to measure phenotypic integration. The eigenvalue variance characterizes the integration change incorrectly, and the standardized generalized variance lacks the sensitivity needed to detect subtle changes in integration. Both metrics are affected by the inclusion of many small but nonzero eigenvalues arising from a lack of information in the covariance matrix, a problem that usually becomes more pronounced as the number of landmarks increases. We propose a new metric for phenotypic integration that combines the standardized generalized variance with information entropy. This metric is equivalent to the standardized generalized variance but calculated only from those eigenvalues that carry nonredundant information: it is the standardized generalized variance scaled to the effective rank of the eigenvalue spectrum. We demonstrate that this metric successfully detects the shift of integration in our dire wolf sample. Our third goal is to generalize the new metric to compare data sets with different sample sizes and numbers of variables. We develop a standardization for matrix information based on data permutation and then demonstrate that Smilodon jaws are more integrated than dire wolf jaws. Finally, we describe how our information entropy-based measure allows phenotypic integration to be compared in dense semilandmark data sets without bias, allowing characterization of the information content of any given shape, a quantity we term "latent dispersion". [Canis dirus; dire wolf; effective dispersion; effective rank; geometric morphometrics; information entropy; latent dispersion; modularity and integration; phenotypic integration; relative dispersion.]
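A common formalization of the "effective rank" idea described above (due to Roy and Vetterli, and consistent with the entropy-of-eigenvalues notion in the abstract, though not necessarily the article's exact metric) is the exponential of the Shannon entropy of the normalized eigenvalue spectrum:

```python
import numpy as np

def effective_rank(C, eps=1e-12):
    """Effective rank of a covariance matrix: exp of the Shannon entropy
    of its normalized eigenvalue spectrum. Equals p for an isotropic
    p x p covariance and 1 when a single eigenvalue dominates."""
    lam = np.linalg.eigvalsh(C)
    lam = lam[lam > eps]          # drop numerically zero eigenvalues
    p = lam / lam.sum()           # eigenvalues as a probability vector
    H = -np.sum(p * np.log(p))    # Shannon entropy of the spectrum
    return float(np.exp(H))
```

Unlike the mathematical rank, this quantity discounts the many tiny but nonzero eigenvalues that the article identifies as the source of bias in integration metrics.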
