Title: On Information Rank Deficiency in Phenotypic Covariance Matrices
Abstract: This article investigates a form of rank deficiency in phenotypic covariance matrices derived from geometric morphometric data, and its impact on measures of phenotypic integration. We first define a type of rank deficiency based on information theory, then demonstrate that this deficiency impairs the performance of phenotypic integration metrics in a model system. Lastly, we propose methods to treat this information rank deficiency. Our first goal is to establish how the rank of a typical geometric morphometric covariance matrix relates to the information entropy of its eigenvalue spectrum. This requires clear definitions of matrix rank, of which we define three: the full matrix rank (equal to the number of input variables), the mathematical rank (the number of nonzero eigenvalues), and the information rank or “effective rank” (equal to the number of nonredundant eigenvalues). We demonstrate that effective rank deficiency arises from a combination of methodological factors (Generalized Procrustes analysis, use of the correlation matrix, and insufficient sample size) as well as phenotypic covariance. Secondly, we use dire wolf jaws to document how differences in effective rank deficiency bias two metrics used to measure phenotypic integration. The eigenvalue variance characterizes the integration change incorrectly, and the standardized generalized variance lacks the sensitivity needed to detect subtle changes in integration. Both metrics are impacted by the inclusion of many small, but nonzero, eigenvalues arising from a lack of information in the covariance matrix, a problem that usually becomes more pronounced as the number of landmarks increases. We propose a new metric for phenotypic integration that combines the standardized generalized variance with information entropy. This metric is equivalent to the standardized generalized variance but calculated only from those eigenvalues that carry nonredundant information. It is the standardized generalized variance scaled to the effective rank of the eigenvalue spectrum. We demonstrate that this metric successfully detects the shift of integration in our dire wolf sample. Our third goal is to generalize the new metric to compare data sets with different sample sizes and numbers of variables. We develop a standardization for matrix information based on data permutation, then demonstrate that Smilodon jaws are more integrated than dire wolf jaws. Finally, we describe how our information entropy-based measure allows phenotypic integration to be compared in dense semilandmark data sets without bias, allowing characterization of the information content of any given shape, a quantity we term “latent dispersion”. [Canis dirus; Dire wolf; effective dispersion; effective rank; geometric morphometrics; information entropy; latent dispersion; modularity and integration; phenotypic integration; relative dispersion.]
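To make the quantities in the abstract concrete, the sketch below computes an entropy-based effective rank and the standardized generalized variance directly from a list of eigenvalues. It rests on two assumptions that the abstract does not confirm: that effective rank can be taken as the exponential of the Shannon entropy of the normalized eigenvalue spectrum, and that the standardized generalized variance is the geometric mean of the eigenvalues. The function names and the toy spectra are hypothetical, and the combined metric proposed in the article is not attempted here because its formula is not given.

```python
import math


def effective_rank(eigenvalues):
    """Exponential of the Shannon entropy of the normalized spectrum.

    Assumed definition for illustration; zero eigenvalues contribute nothing.
    """
    positive = [v for v in eigenvalues if v > 0]
    total = sum(positive)
    probs = [v / total for v in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def standardized_generalized_variance(eigenvalues):
    """Geometric mean of the eigenvalues, i.e. det(S) ** (1 / p).

    Assumes every eigenvalue is strictly positive.
    """
    p = len(eigenvalues)
    return math.exp(sum(math.log(v) for v in eigenvalues) / p)


# Hypothetical spectra: four informative eigenvalues, with and without
# twenty tiny but nonzero eigenvalues appended.
balanced = [1.0, 1.0, 1.0, 1.0]
padded = balanced + [1e-6] * 20

print(effective_rank(balanced), standardized_generalized_variance(balanced))
print(effective_rank(padded), standardized_generalized_variance(padded))
```

In this toy case the effective rank stays close to 4 after padding, while the geometric mean of the eigenvalues drops by several orders of magnitude. That is the kind of sensitivity to small, nonzero eigenvalues the abstract describes.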
Award ID(s):
1758108
NSF-PAR ID:
10318475
Author(s) / Creator(s):
; ;
Editor(s):
Esposito, Lauren
Date Published:
Journal Name:
Systematic Biology
ISSN:
1063-5157
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The quantification of Hutchinson's n‐dimensional hypervolume has enabled substantial progress in community ecology, species niche analysis and beyond. However, most existing methods do not support a partitioning of the different components of hypervolume. Such a partitioning is crucial to address the ‘curse of dimensionality’ in hypervolume measures and interpret the metrics on the original niche axes instead of principal components. Here, we propose the use of multivariate normal distributions for the comparison of niche hypervolumes and introduce this as the multivariate‐normal hypervolume (MVNH) framework (R package available at https://github.com/lvmuyang/MVNH).

    The framework provides parametric measures of the size and dissimilarity of niche hypervolumes, each of which can be partitioned into biologically interpretable components. Specifically, the determinant of the covariance matrix (i.e. the generalized variance) of a MVNH is a measure of total niche size, which can be partitioned into univariate niche variance components and a correlation component (a measure of dimensionality, i.e. the effective number of independent niche axes standardized by the number of dimensions). The Bhattacharyya distance (BD; a function of the geometric mean of two probability distributions) between two MVNHs is a measure of niche dissimilarity. The BD partitions total dissimilarity into the components of Mahalanobis distance (standardized Euclidean distance with correlated variables) between hypervolume centroids and the determinant ratio which measures hypervolume size difference. The Mahalanobis distance and determinant ratio can be further partitioned into univariate divergences and a correlation component.

    We use empirical examples of community‐ and species‐level analysis to demonstrate the new insights provided by these metrics. We show that the newly proposed framework enables us to quantify the relative contributions of different hypervolume components and to connect these analyses to the ecological drivers of functional diversity and environmental niche variation.

    Our approach overcomes several operational and computational limitations of popular nonparametric methods and provides a partitioning framework that has wide implications for understanding functional diversity, niche evolution, niche shifts and expansion during biotic invasions, etc.

     
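    The two partitions described in this abstract can be written out with the standard identities for multivariate normal distributions. This is a sketch only: the symbols (means $\mu_1, \mu_2$, covariance matrices $\Sigma_1, \Sigma_2$, per-axis standard deviations $\sigma_j$, correlation matrix $R$, and dimension $p$) are notation assumed here, not taken from the abstract, and the MVNH package may arrange or rescale the terms differently.

    ```latex
    % Niche size: the generalized variance factors into univariate variances
    % and a correlation component.
    \[
      \det(\Sigma) \;=\; \Bigl(\prod_{j=1}^{p} \sigma_j^{2}\Bigr)\,\det(R)
    \]

    % Niche dissimilarity: Bhattacharyya distance between two multivariate normals,
    % with \bar{\Sigma} the average of the two covariance matrices.
    \[
      D_B \;=\; \frac{1}{8}\,(\mu_1-\mu_2)^{\top}\,\bar{\Sigma}^{-1}(\mu_1-\mu_2)
      \;+\; \frac{1}{2}\,\ln\frac{\det\bar{\Sigma}}{\sqrt{\det\Sigma_1\,\det\Sigma_2}},
      \qquad \bar{\Sigma} = \tfrac{1}{2}\,(\Sigma_1+\Sigma_2)
    \]
    ```

    The first term of $D_B$ is the Mahalanobis component between the centroids, and the second is the determinant ratio that captures the difference in hypervolume size.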
  2. Abstract The field of comparative morphology has entered a new phase with the rapid generation of high-resolution three-dimensional (3D) data. With freely available 3D data of thousands of species, methods for quantifying morphology that harness this rich phenotypic information are quickly emerging. Among these techniques, high-density geometric morphometric approaches provide a powerful and versatile framework to robustly characterize shape and phenotypic integration, the covariances among morphological traits. These methods are particularly useful for analyses of complex structures and across disparate taxa, which may share few landmarks of unambiguous homology. However, high-density geometric morphometrics also brings challenges, for example, with statistical, but not biological, covariances imposed by placement and sliding of semilandmarks and registration methods such as Procrustes superimposition. Here, we present simulations and case studies of high-density datasets for squamates, birds, and caecilians that exemplify the promise and challenges of high-dimensional analyses of phenotypic integration and modularity. We assess: (1) the relative merits of “big” high-density geometric morphometrics data over traditional shape data; (2) the impact of Procrustes superimposition on analyses of integration and modularity; and (3) differences in patterns of integration between analyses using high-density geometric morphometrics and those using discrete landmarks. We demonstrate that for many skull regions, 20–30 landmarks and/or semilandmarks are needed to accurately characterize their shape variation, and landmark-only analyses do a particularly poor job of capturing shape variation in vault and rostrum bones. Procrustes superimposition can mask modularity, especially when landmarks covary in parallel directions, but this effect decreases with more biologically complex covariance patterns. 
The directional effect of landmark variation on the position of the centroid affects recovery of covariance patterns more than landmark number does. Landmark-only and landmark-plus-sliding-semilandmark analyses of integration are generally congruent in overall pattern of integration, but landmark-only analyses tend to show higher integration between adjacent bones, especially when landmarks placed on the sutures between bones introduce a boundary bias. Allometry may be a stronger influence on patterns of integration in landmark-only analyses, which show stronger integration prior to removal of allometric effects compared to analyses including semilandmarks. High-density geometric morphometrics has its challenges and drawbacks, but our analyses of simulated and empirical datasets demonstrate that these potential issues are unlikely to obscure genuine biological signal. Rather, high-density geometric morphometric data exceed traditional landmark-based methods in characterization of morphology and allow more nuanced comparisons across disparate taxa. Combined with the rapid increases in 3D data availability, high-density morphometric approaches have immense potential to propel a new class of studies of comparative morphology and phenotypic integration.
  3. Mateu, Jorge (Ed.)
    When dealing with very high-dimensional and functional data, rank deficiency of the sample covariance matrix often complicates tests for the population mean. To alleviate this rank deficiency problem, Munk et al. (J Multivar Anal 99:815–833, 2008) proposed a neighborhood hypothesis testing procedure that tests whether the population mean is within a small, pre-specified neighborhood of a known quantity, M. How could we objectively specify a reasonable neighborhood, particularly when the sample space is unbounded? What should be the size of the neighborhood? In this article, we develop a modified neighborhood hypothesis testing framework to answer these two questions. We define the neighborhood as a proportion of the total amount of variation present in the population of functions under study and proceed to derive the asymptotic null distribution of the appropriate test statistic. Power analyses suggest that our approach is appropriate when the sample space is unbounded and is robust against error structures with nonzero mean. We then apply this framework to assess whether the near-default sigmoidal specification of dose-response curves is adequate for the widely used CCLE database. Results suggest that our methodology could be used as a pre-processing step before using conventional efficacy metrics obtained from sigmoid models (for example, IC50 or AUC) as downstream predictive targets.
  4. In the context of principal components analysis (PCA), the bootstrap is commonly applied to solve a variety of inference problems, such as constructing confidence intervals for the eigenvalues of the population covariance matrix $\Sigma$. However, when the data are high-dimensional, there are relatively few theoretical guarantees that quantify the performance of the bootstrap. Our aim in this paper is to analyze how well the bootstrap can approximate the joint distribution of the leading eigenvalues of the sample covariance matrix $\hat{\Sigma}$, and we establish non-asymptotic rates of approximation with respect to the multivariate Kolmogorov metric. Under certain assumptions, we show that the bootstrap can achieve a dimension-free rate of $r(\Sigma)/\sqrt{n}$ up to logarithmic factors, where $r(\Sigma)$ is the effective rank of $\Sigma$, and $n$ is the sample size. From a methodological standpoint, we show that applying a transformation to the eigenvalues of $\hat{\Sigma}$ before bootstrapping is an important consideration in high-dimensional settings.
  5. Abstract

    We study the volume growth of metric balls as a function of the radius in discrete spaces and focus on the relationship between volume growth and discrete curvature. We improve volume growth bounds under a lower bound on the so-called Ollivier curvature and discuss similar results under other types of discrete Ricci curvature.

    Following recent work in the continuous setting of Riemannian manifolds (by the 1st author), we then bound the eigenvalues of the Laplacian of a graph under bounds on the volume growth. In particular, $\lambda_2$ of the graph can be bounded using a weighted discrete Hardy inequality and the higher eigenvalues of the graph can be bounded by the eigenvalues of a tridiagonal matrix times a multiplicative factor, both of which only depend on the volume growth of the graph. As a direct application, we relate the eigenvalues to the Cheeger isoperimetric constant. Using these methods, we describe classes of graphs for which the Cheeger inequality is tight on the 2nd eigenvalue (i.e. the 1st nonzero eigenvalue). We also describe a method for proving Buser’s Inequality in graphs, particularly under a lower bound assumption on curvature.

     
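The relation between the 2nd eigenvalue and the Cheeger constant mentioned in the last abstract can be checked numerically on a small graph. The sketch below assumes the normalized Laplacian convention, under which the Cheeger inequality reads $\lambda_2/2 \le h \le \sqrt{2\lambda_2}$; the abstract does not state which normalization it uses. The function name `cheeger_constant` and the 6-cycle example are illustrative choices, and the brute-force search is only practical for very small graphs.

```python
import math
from itertools import combinations


def cheeger_constant(n_vertices, edges):
    """Brute-force conductance of an undirected graph.

    Minimizes |boundary(S)| / vol(S) over vertex subsets S with
    vol(S) <= vol(V) / 2, where vol is the sum of degrees.
    """
    degree = [0] * n_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total_volume = sum(degree)

    best = math.inf
    for size in range(1, n_vertices):
        for subset in combinations(range(n_vertices), size):
            members = set(subset)
            volume = sum(degree[v] for v in members)
            if volume == 0 or 2 * volume > total_volume:
                continue
            # Count edges with exactly one endpoint inside the subset.
            boundary = sum(1 for u, v in edges if (u in members) != (v in members))
            best = min(best, boundary / volume)
    return best


# Cycle graph on 6 vertices. For a cycle, the normalized Laplacian has the
# closed-form eigenvalues 1 - cos(2*pi*k/n), so the first nonzero one is k = 1.
n = 6
cycle_edges = [(i, (i + 1) % n) for i in range(n)]
h = cheeger_constant(n, cycle_edges)
lambda_2 = 1.0 - math.cos(2.0 * math.pi / n)

print(h, lambda_2)
print(lambda_2 / 2 <= h <= math.sqrt(2 * lambda_2))
```

The closed-form eigenvalue is used so that the example needs no linear algebra library. For a general graph, $\lambda_2$ would have to be computed numerically.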