skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A scalable and unbiased discordance metric with H +
Summary A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the “scale-agnostic” $$G_{+}$$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $$k$$ groups, we show that $$G_{+}$$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $$G_{+}$$, referred to as $$H_{+}$$, and demonstrate that $$H_{+}$$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $$H_{+}$$, which are available in the $$\mathtt{fasthplus}$$ R package.  more » « less
Award ID(s):
2244899
PAR ID:
10508705
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Oxford Academic
Date Published:
Journal Name:
Biostatistics
Volume:
25
Issue:
1
ISSN:
1465-4644
Page Range / eLocation ID:
188 to 202
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Clustering is a fundamental unsupervised learning problem where a data-set is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the “price of fairness,” can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  2. Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the `''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms. 
    more » « less
  3. We initiate the study of X-ray tomography on sub-Riemannian manifolds, for which the Heisenberg group exhibits the simplest nontrivial example. With the language of the group Fourier transform, we prove an operator-valued incarnation of the Fourier Slice Theorem, and apply this new tool to show that a sufficiently regular function on the Heisenberg group is determined by its line integrals over sub-Riemannian geodesics. We also consider the family of taming metrics $$g_\epsilon$$ approximating the sub-Riemannian metric, and show that the associated X-ray transform is injective for all $$\epsilon > 0$$. This result gives a concrete example of an injective X-ray transform in a geometry with an abundance of conjugate points. 
    more » « less
  4. We address the following natural extension problem for group actions: Given a group [Formula: see text], a subgroup [Formula: see text], and an action of [Formula: see text] on a metric space, when is it possible to extend it to an action of the whole group [Formula: see text] on a (possibly different) metric space? When does such an extension preserve interesting properties of the original action of [Formula: see text]? We begin by formalizing this problem and present a construction of an induced action which behaves well when [Formula: see text] is hyperbolically embedded in [Formula: see text]. Moreover, we show that induced actions can be used to characterize hyperbolically embedded subgroups. We also obtain some results for elementary amenable groups. 
    more » « less
  5. Problem. Extant measures of students’ cybersecurity self-efficacy lack sufficient evidence of validity based on internal structure. Such evidence of validity is needed to enhance confidence in conclusions drawn from use of self-efficacy measures in the cybersecurity domain. Research Question. To address this identified problem, we sought to answer our research question: What is the underlying factor structure of a new self-efficacy for Information Security measure? Method. We leveraged exploratory factor analysis (EFA) to deter- mine the number of factors underlying a new measure of student self-efficacy to conduct information security. This measure was created to align with the five elements of the information security section of the K-12 Cybersecurity Education framework. Participants were 190 undergraduate students recruited from computer science courses across the U.S. Findings. Results from the EFA indicated that a four-factor solution best fit the data while maximizing interpretability of the factors. The internal reliability of the measure was quite strong (𝛼 = .99). Implications. The psychometric quality of this measure was demonstrated, and thus evidence of validity based on internal structure has been established. Future work will conduct a confirmatory factor analysis (CFA) and assess measurement invariance across sub- groups of interest (e.g., over- vs. under-represented race/ethnicity groups, gender). 
    more » « less