skip to main content


Title: A Dirichlet Model of Alignment Cost in Mixed-Membership Unsupervised Clustering
Mixed-membership unsupervised clustering is widely used to extract informative patterns from data in many application areas. For a shared dataset, the stochasticity and unsupervised nature of clustering algorithms can cause difficulties in comparing clustering results produced by different algorithms, or even multiple runs of the same algorithm, as outcomes can differ owing to permutation of the cluster labels or genuine differences in clustering results. Here, with a focus on inference of individual genetic ancestry in population-genetic studies, we study the cost of misalignment of mixed-membership unsupervised clustering replicates under a theoretical model of cluster memberships. Using Dirichlet distributions to model membership coefficient vectors, we provide theoretical results quantifying the alignment cost as a function of the Dirichlet parameters and the Hamming permutation difference between replicates. For fixed Dirichlet parameters, the alignment cost is seen to increase with the Hamming distance between permutations. Datasets with low variance across individuals of membership coefficients for specific clusters generally produce high misalignment costs—so that a single optimal permutation has far lower cost than suboptimal permutations. Higher variability in data, as represented by greater variance of membership coefficients, generally results in alignment costs that are similar between the optimal permutation and suboptimal permutations. We demonstrate the application of the theoretical results to data simulated under the Dirichlet model, as well as to membership estimates from inference of human-genetic ancestry. The results can contribute to improving cluster alignment algorithms that seek to find optimal permutations of replicates. Supplementary materials for this article are available online.  more » « less
Award ID(s):
2116322
NSF-PAR ID:
10522048
Author(s) / Creator(s):
; ;
Publisher / Repository:
Taylor & Francis
Date Published:
Journal Name:
Journal of Computational and Graphical Statistics
Volume:
32
Issue:
3
ISSN:
1061-8600
Page Range / eLocation ID:
1145 to 1159
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Clustering is a fundamental unsupervised learning problem where a data-set is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the “price of fairness,” can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  2. Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the `''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  3. We provide a new bi-criteria O(log2k) competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable k-means clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bi-criteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal. 
    more » « less
  4. Abstract Objectives

    Most research in human dental age estimation has focused on point estimates of age, and most research on dental development theories has focused on morphology or eruption. Correlations between developing teeth using ordinal staging have received less attention. The effect of demographic variables on these correlations is unknown. I tested the effect of reference sample demographic variables on the residual correlation matrix using the lens of cooperative genetic interaction (CGI).

    Materials and Methods

    The sample consisted of Moorrees et al.,Journal of Dental Research, 1963, 42, 1490–1502, scores of left mandibular permanent teeth from panoramic radiographs of 880 London children 3–22.99 years of age stratified by year of age, sex, and Bangladeshi or European ancestry. A multivariate cumulative probit model was fit to each sex/ancestry group (n = 220), each sex or ancestry (n = 440), and all individuals (n = 880). Residual correlation matrices from nine reference sample configurations were compared using Bartlett's tests of between‐sample difference matrices against the identity matrix, hierarchical cluster analysis, and dendrogram cophenetic correlations.

    Results

    Bartlett's test results were inconclusive. Cluster analysis showed clustering by tooth class, position within class, and developmental timing. Clustering patterns and dendrogram correlations showed similarity by sex but not ancestry.

    Discussion

    Expectations of CGI were supported for developmental staging. This supports using CGI as a model for explaining patterns of variation within the dentition. Sex was found to produce consistent patterns of dental correlations, whereas ancestry did not. Clustering by timing of development supports phenotypic plasticity in the dentition and suggests shared environment over genetic ancestry to explain population differences.

     
    more » « less
  5. We consider random permutations which are spherically symmetric with respect to a metric on the symmetric groupSnand are consistent asnvaries. The extreme infinitely spherically symmetric permutation‐valued processes are identified for the Hamming, Kendall‐tau and Cayley metrics. The proofs in all three cases are based on a unified approach through stochastic monotonicity.

     
    more » « less