
Title: Plücker coordinates of the best-fit Stiefel tropical linear space to a mixture of Gaussian distributions

In this research, we investigate tropical principal component analysis (PCA), formulated as the best-fit Stiefel tropical linear space to a given sample over the tropical projective torus, for dimensionality reduction and visualization. In particular, we characterize the best-fit Stiefel tropical linear space to a sample generated from a mixture of Gaussian distributions as the variances of the Gaussians go to zero. For a single Gaussian distribution, we show that the sum of residuals, measured in the tropical metric with the max-plus algebra, from a given sample to a fitted Stiefel tropical linear space converges to zero, and we give an upper bound on its convergence rate. Meanwhile, for a mixture of Gaussian distributions, we show that the best-fit tropical linear space can be determined uniquely when we send the variances to zero. We briefly consider the best-fit tropical polynomial as an extension for a mixture of more than two Gaussians over the tropical projective space of dimension three. We show some geometric properties of these tropical linear spaces and polynomials.
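As a small illustration of the tropical metric mentioned in the abstract, the sketch below assumes the commonly used definition on the tropical projective torus, where the distance between two points is the largest coordinate difference minus the smallest. The function name `tropical_distance` is hypothetical and is not taken from the paper.

```python
def tropical_distance(u, v):
    """Tropical metric (assumed definition): max_i(u_i - v_i) - min_i(u_i - v_i).

    Points of the tropical projective torus are vectors taken modulo adding
    the same constant to every coordinate, so the distance must not change
    when either argument is shifted that way.
    """
    if len(u) != len(v):
        raise ValueError("points must have the same number of coordinates")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


origin = [0, 0, 0]
point = [0, 1, 3]
shifted = [x + 5 for x in point]  # same point of the torus, different representative

d_point = tropical_distance(point, origin)
d_shifted = tropical_distance(shifted, origin)
```

Both `d_point` and `d_shifted` equal 3 here, which reflects the invariance under adding a constant to all coordinates. A sum of residuals in the sense of the abstract would add up such distances between each sample point and its projection onto the fitted space; the projection itself is not sketched here.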

Author(s) / Creator(s):
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Information Geometry
Medium: X Size: p. 171-201
Sponsoring Org:
National Science Foundation
More Like this
  1. Context. Galaxy clusters are an important tool for cosmology, and their detection and characterization are key goals for current and future surveys. Using data from the Wide-field Infrared Survey Explorer (WISE), the Massive and Distant Clusters of WISE Survey (MaDCoWS) located 2839 significant galaxy overdensities at redshifts 0.7 ≲ z ≲ 1.5, which included extensive follow-up imaging from the Spitzer Space Telescope to determine cluster richnesses. Concurrently, the Atacama Cosmology Telescope (ACT) has produced large area millimeter-wave maps in three frequency bands along with a large catalog of Sunyaev-Zeldovich (SZ)-selected clusters as part of its Data Release 5 (DR5). Aims. We aim to verify and characterize MaDCoWS clusters using measurements of, or limits on, their thermal SZ effect signatures. We also use these detections to establish the scaling relation between SZ mass and the MaDCoWS-defined richness. Methods. Using the maps and cluster catalog from DR5, we explore the scaling between SZ mass and cluster richness. We do this by comparing cataloged detections and extracting individual and stacked SZ signals from the MaDCoWS cluster locations. We use complementary radio survey data from the Very Large Array, submillimeter data from Herschel, and ACT 224 GHz data to assess the impact of contaminating sources on the SZ signals from both ACT and MaDCoWS clusters. We use a hierarchical Bayesian model to fit the mass-richness scaling relation, allowing for clusters to be drawn from two populations: one, a Gaussian centered on the mass-richness relation, and the other, a Gaussian centered on zero SZ signal. Results. We find that MaDCoWS clusters have submillimeter contamination that is consistent with a gray-body spectrum, while the ACT clusters are consistent with no submillimeter emission on average.
Additionally, the intrinsic radio intensities of ACT clusters are lower than those of MaDCoWS clusters, even when the ACT clusters are restricted to the same redshift range as the MaDCoWS clusters. We find the best-fit ACT SZ mass versus MaDCoWS richness scaling relation has a slope of $p_1 = 1.84^{+0.15}_{-0.14}$, where the slope is defined as $M \propto \lambda_{15}^{p_1}$ and $\lambda_{15}$ is the richness. We also find that the ACT SZ signals for a significant fraction (∼57%) of the MaDCoWS sample can statistically be described as being drawn from a noise-like distribution, indicating that the candidates are possibly dominated by low-mass and unvirialized systems that are below the mass limit of the ACT sample. Further, we note that a large portion of the optically confirmed ACT clusters located in the same volume of the sky as MaDCoWS are not selected by MaDCoWS, indicating that the MaDCoWS sample is not complete with respect to SZ selection. Finally, we find that the radio loud fraction of MaDCoWS clusters increases with richness, while we find no evidence that the submillimeter emission of the MaDCoWS clusters evolves with richness. Conclusions. We conclude that the original MaDCoWS selection function is not well defined and, as such, reiterate the MaDCoWS collaboration’s recommendation that the sample is suited for probing cluster and galaxy evolution, but not cosmological analyses. We find a best-fit mass-richness relation slope that agrees with the published MaDCoWS preliminary results. Additionally, we find that while the approximate level of infill of the ACT and MaDCoWS cluster SZ signals (1–2%) is subdominant to other sources of uncertainty for current generation experiments, characterizing and removing this bias will be critical for next-generation experiments hoping to constrain cluster masses at the sub-percent level.
  2. We consider the problem of spherical Gaussian mixture models with $k \ge 3$ components when the components are well separated. A fundamental previous result established that separation of $\Omega(\sqrt{\log k})$ is necessary and sufficient for identifiability of the parameters with polynomial sample complexity (Regev and Vijayaraghavan, 2017). In the same context, we show that $\tilde{O}(kd/\epsilon^2)$ samples suffice for any $\epsilon \lesssim 1/k$, closing the gap from polynomial to linear, and thus giving the first optimal sample upper bound for the parameter estimation of well-separated Gaussian mixtures. We accomplish this by proving a new result for the Expectation-Maximization (EM) algorithm: we show that EM converges locally under separation $\Omega(\sqrt{\log k})$. The previous best-known guarantee required $\Omega(\sqrt{k})$ separation (Yan et al., 2017). Unlike prior work, our results do not assume or use prior knowledge of the (potentially different) mixing weights or variances of the Gaussian components. Furthermore, our results show that the finite-sample error of EM does not depend on non-universal quantities such as pairwise distances between means of Gaussian components.
  3. We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $N(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $N(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $N(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only $\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)$ samples if the algorithm is allowed to be computationally inefficient, and only $\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)$ samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{-1, 1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne et al. [CanonneKMUZ20] for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal non-private sample complexity of $O(\sqrt{d}/\alpha^2)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ [AcharyaSZ18], which refutes a conjectured lower bound of [CanonneKMUZ20].
  4. Summary

    It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich–Rubinstein, or earth mover’s, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich–Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop $L^p$ Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis ‘no difference between two communities’ can be approximated by using a Gaussian process functional. We relate the $L^2$ case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent $\chi_1^2$ random variables.
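The "readily computable integral over the tree" can be sketched for the simplest case, assuming all probability mass sits on tree nodes: the distance is then the sum over edges of the edge length times the absolute difference in mass lying below that edge. The function name, the parent-array encoding, and the toy tree are illustrative assumptions, not the authors' implementation.

```python
def tree_kr_distance(parent, length, p, q):
    """Kantorovich-Rubinstein (earth mover's) distance on a rooted tree.

    Assumed encoding: node 0 is the root, parent[i] < i for every i > 0,
    and length[i] is the length of the edge joining node i to parent[i].
    p and q list the probability mass placed on each node.
    """
    n = len(parent)
    # Net mass difference (p minus q) accumulated in the subtree below each node.
    diff = [p[i] - q[i] for i in range(n)]
    total = 0.0
    # Visiting nodes in reverse order handles every child before its parent.
    for i in range(n - 1, 0, -1):
        total += length[i] * abs(diff[i])
        diff[parent[i]] += diff[i]
    return total


# Toy tree: root 0 with two leaves, node 1 (edge length 1.0) and node 2 (edge length 2.0).
parent = [-1, 0, 0]
length = [0.0, 1.0, 2.0]
all_on_leaf_1 = [0.0, 1.0, 0.0]
all_on_leaf_2 = [0.0, 0.0, 1.0]
half_half = [0.0, 0.5, 0.5]

d_leaves = tree_kr_distance(parent, length, all_on_leaf_1, all_on_leaf_2)
d_half = tree_kr_distance(parent, length, half_half, all_on_leaf_1)
```

In this toy case `d_leaves` is 3.0, the path length between the two leaves, and `d_half` is 1.5, since half of the mass has to travel that same path.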

  5. Abstract

    Linear mixed‐effects models are powerful tools for analysing complex datasets with repeated or clustered observations, a common data structure in ecology and evolution. Mixed‐effects models involve complex fitting procedures and make several assumptions, in particular about the distribution of residual and random effects. Violations of these assumptions are common in real datasets, yet it is not always clear how much these violations matter to accurate and unbiased estimation.

    Here we address the consequences of violations in distributional assumptions and the impact of missing random effect components on model estimates. In particular, we evaluate the effects of skewed, bimodal and heteroscedastic random effect and residual variances, of missing random effect terms and of correlated fixed effect predictors. We focus on bias and prediction error on estimates of fixed and random effects.

    Model estimates were usually robust to violations of assumptions, with the exception of slight upward biases in estimates of random effect variance if the generating distribution was bimodal but was modelled by Gaussian error distributions. Further, estimates for (random effect) components that violated distributional assumptions became less precise but remained unbiased. However, this particular problem did not affect other parameters of the model. The same pattern was found for strongly correlated fixed effects, which led to imprecise, but unbiased estimates, with uncertainty estimates reflecting imprecision.

    Unmodelled sources of random effect variance had predictable effects on variance component estimates. The pattern is best viewed as a cascade of hierarchical grouping factors. Variances trickle down the hierarchy such that missing higher‐level random effect variances pool at lower levels and missing lower‐level and crossed random effect variances manifest as residual variance.

    Overall, our results show remarkable robustness of mixed‐effects models that should allow researchers to use mixed‐effects models even if the distributional assumptions are objectively violated. However, this does not free researchers from careful evaluation of the model. Estimates that are based on data that show clear violations of key assumptions should be treated with caution because individual datasets might give highly imprecise estimates, even if they will be unbiased on average across datasets.
