- Award ID(s):
- 2116322
- NSF-PAR ID:
- 10522048
- Publisher / Repository:
- Taylor & Francis
- Date Published:
- Journal Name:
- Journal of Computational and Graphical Statistics
- Volume:
- 32
- Issue:
- 3
- ISSN:
- 1061-8600
- Page Range / eLocation ID:
- 1145 to 1159
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Clustering is a fundamental unsupervised learning problem where a data-set is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the “price of fairness,” can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.more » « less
-
Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the `''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.more » « less
-
We provide a new bi-criteria O(log2k) competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable k-means clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bi-criteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.more » « less
-
Abstract Objectives Most research in human dental age estimation has focused on point estimates of age, and most research on dental development theories has focused on morphology or eruption. Correlations between developing teeth using ordinal staging have received less attention. The effect of demographic variables on these correlations is unknown. I tested the effect of reference sample demographic variables on the residual correlation matrix using the lens of cooperative genetic interaction (CGI).
Materials and Methods The sample consisted of Moorrees et al.,
Journal of Dental Research , 1963, 42, 1490–1502, scores of left mandibular permanent teeth from panoramic radiographs of 880 London children 3–22.99 years of age stratified by year of age, sex, and Bangladeshi or European ancestry. A multivariate cumulative probit model was fit to each sex/ancestry group (n = 220), each sex or ancestry (n = 440), and all individuals (n = 880). Residual correlation matrices from nine reference sample configurations were compared using Bartlett's tests of between‐sample difference matrices against the identity matrix, hierarchical cluster analysis, and dendrogram cophenetic correlations.Results Bartlett's test results were inconclusive. Cluster analysis showed clustering by tooth class, position within class, and developmental timing. Clustering patterns and dendrogram correlations showed similarity by sex but not ancestry.
Discussion Expectations of CGI were supported for developmental staging. This supports using CGI as a model for explaining patterns of variation within the dentition. Sex was found to produce consistent patterns of dental correlations, whereas ancestry did not. Clustering by timing of development supports phenotypic plasticity in the dentition and suggests shared environment over genetic ancestry to explain population differences.
-
We consider random permutations which are spherically symmetric with respect to a metric on the symmetric group
S n and are consistent asn varies. The extreme infinitely spherically symmetric permutation‐valued processes are identified for the Hamming, Kendall‐tau and Cayley metrics. The proofs in all three cases are based on a unified approach through stochastic monotonicity.