Title: A Pairwise Fair and Community-preserving Approach to k-Center Clustering
Clustering is a foundational problem in machine learning with numerous applications. As machine learning increases in ubiquity as a back-end for automated systems, concerns about fairness arise. Much of the current literature on fairness deals with discrimination against protected classes in supervised learning (group fairness). We define a different notion of fair clustering wherein the probability that two points (or a community of points) become separated is bounded by an increasing function of their pairwise distance (or community diameter). We capture the situation where data points represent people who gain some benefit from being clustered together. Unfairness arises when certain points are deterministically separated, either arbitrarily or by someone who intends to harm them as in the case of gerrymandering election districts. In response, we formally define two new types of fairness in the clustering setting, pairwise fairness and community preservation. To explore the practicality of our fairness goals, we devise an approach for extending existing k-center algorithms to satisfy these fairness constraints. Analysis of this approach proves that reasonable approximations can be achieved while maintaining fairness. In experiments, we compare the effectiveness of our approach to classical k-center algorithms/heuristics and explore the tradeoff between optimal clustering and fairness.
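To make the pairwise-fairness condition concrete: a randomized clustering procedure is pairwise fair if, for every pair of points u and v, Pr[u and v are separated] ≤ f(d(u, v)) for some increasing function f. Below is a minimal Python sketch that empirically checks this condition for a given randomized clustering routine; the routine passed in as cluster_fn and the linear bound f(d) = min(1, d/D) used in the usage comment are illustrative assumptions, not the paper's construction.

    import numpy as np

    def separation_frequencies(X, cluster_fn, trials=200):
        """Estimate, for every pair of points, how often cluster_fn separates them."""
        n = len(X)
        sep = np.zeros((n, n))
        for _ in range(trials):
            labels = cluster_fn(X)                      # one random clustering (label per point)
            sep += labels[:, None] != labels[None, :]   # 1 wherever the pair is split
        return sep / trials

    def is_pairwise_fair(X, cluster_fn, bound, trials=200, slack=0.05):
        """Check Pr[separated] <= bound(distance) for all pairs, up to sampling noise."""
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        sep = separation_frequencies(X, cluster_fn, trials)
        return bool(np.all(sep <= bound(dist) + slack))

    # Usage (hypothetical routine and bound; D is a normalizing diameter):
    #   is_pairwise_fair(X, my_randomized_kcenter, lambda d: np.minimum(1.0, d / D))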
Award ID(s):
1918749
NSF-PAR ID:
10212313
Date Published:
Journal Name:
International Conference on Machine Learning (ICML)
Page Range / eLocation ID:
1178-1189
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Metric clustering is fundamental in areas ranging from Combinatorial Optimization and Data Mining to Machine Learning and Operations Research. However, in a variety of situations we may have additional requirements or knowledge, distinct from the underlying metric, regarding which pairs of points should be clustered together. To capture and analyze such scenarios, we introduce a novel family of stochastic pairwise constraints, which we incorporate into several essential clustering objectives (radius/median/means). Moreover, we demonstrate that these constraints can succinctly model an intriguing collection of applications, including, among others, Individual Fairness in clustering and Must-link constraints in semi-supervised learning. Our main result is a general framework that yields approximation algorithms with provable guarantees for important clustering objectives while producing solutions that respect the stochastic pairwise constraints. Furthermore, for certain objectives we devise improved results in the case of Must-link constraints, which are also the best possible from a theoretical perspective. Finally, we present experimental evidence that validates the effectiveness of our algorithms.
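    One concrete way to read these stochastic pairwise constraints, sketched below in Python under our own assumptions (the triple encoding and the lower-bound semantics are illustrative, not necessarily the paper's formalization): each constraint (u, v, p) asks that a distribution over clusterings place u and v in the same cluster with probability at least p, with Must-link recovered at p = 1.

        def satisfies_constraints(clusterings, weights, constraints):
            """clusterings: list of label arrays; weights: the probability of each one.
            constraints: triples (u, v, p) requiring Pr[u, v co-clustered] >= p."""
            for u, v, p in constraints:
                together = sum(w for labels, w in zip(clusterings, weights)
                               if labels[u] == labels[v])
                if together < p:
                    return False
            return True

        # A Must-link between points 0 and 1 is the deterministic special case:
        #   satisfies_constraints(solutions, probs, [(0, 1, 1.0)])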
  2. Clustering is a fundamental problem in unsupervised machine learning, and due to its numerous societal implications, fair variants of it have recently received significant attention. In this work we introduce a novel definition of individual fairness for clustering problems. Specifically, in our model, each point j has a set of other points S(j) that it perceives as similar to itself, and it feels fairly treated if the quality of service it receives in the solution is α-close (in a multiplicative sense, for some given α ≥ 1) to that of the points in S(j). We begin our study by answering questions about the combinatorial structure of the problem, namely for which values of α the problem is well-defined and how the Price of Fairness (PoF) behaves. For the well-defined region of α, we provide efficient and easily implementable approximation algorithms for the k-center objective, which in certain cases also enjoy bounded-PoF guarantees. Finally, we complement our analysis with an extensive suite of experiments that validates the effectiveness of our theoretical results.
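    The α-fairness condition is easy to state operationally: in a k-center solution, point j is fairly treated if its distance to the nearest open center is at most α times that of every point in S(j). The Python sketch below checks this condition for a candidate solution; it is a verification aid under our reading of the definition, not the paper's approximation algorithm.

        import numpy as np

        def service_cost(X, centers):
            """Distance from each point to its nearest open center (given as indices)."""
            return np.linalg.norm(X[:, None] - X[centers][None, :], axis=-1).min(axis=1)

        def alpha_fair(X, centers, S, alpha):
            """True iff each point j's cost is within alpha of every point in S[j]."""
            cost = service_cost(X, centers)
            return all(cost[j] <= alpha * cost[i]
                       for j in range(len(X)) for i in S[j])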
  3. Motivated by concerns surrounding the fairness effects of sharing and transferring fair machine learning tools, we propose two algorithms: Fairness Warnings and Fair-MAML. The first is a model-agnostic algorithm that provides interpretable boundary conditions for when a fairly trained model may not behave fairly on similar but slightly different tasks within a given domain. The second is a fair meta-learning approach for training models that can be quickly fine-tuned to specific tasks from only a few sample instances while balancing fairness and accuracy. We experimentally demonstrate the individual utility of each algorithm against relevant baselines and provide, to our knowledge, the first experiment on K-shot fairness, i.e., training a fair model on a new task with only K data points. We then illustrate the usefulness of both algorithms as a combined method for training models from a few data points on new tasks, with Fairness Warnings supplying interpretable boundary conditions under which the newly trained model may not be fair.
  4. Hierarchical clustering is a fundamental tool in data mining, machine learning, and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on single-linkage. So, as datasets grow larger every day, there is a pressing need to scale the other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center, as well as centroid-linkage. In particular, we first formalize a notion of closeness for hierarchical clustering algorithms, and then use this notion to design new scalable distributed methods with strong worst-case bounds on running time and solution quality. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
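    As a reference point for the top-down divisive approaches named above, here is a minimal sequential bisecting k-means in Python (the textbook heuristic, not the paper's distributed variant): repeatedly split the largest remaining cluster in two with 2-means until k clusters exist.

        import numpy as np
        from sklearn.cluster import KMeans

        def bisecting_kmeans(X, k):
            """Top-down divisive clustering: returns a list of k index arrays.
            Assumes k <= len(X) and that the largest cluster always has >= 2 points."""
            clusters = [np.arange(len(X))]      # start with everything in one cluster
            while len(clusters) < k:
                i = max(range(len(clusters)), key=lambda t: len(clusters[t]))
                idx = clusters.pop(i)           # split the largest cluster with 2-means
                labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
                clusters += [idx[labels == 0], idx[labels == 1]]
            return clusters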