

Title: Measuring Group Advantage: A Comparative Study of Fair Ranking Metrics
Ranking evaluation metrics play an important role in information retrieval, providing optimization objectives during development and means of assessment of deployed performance. Recently, fairness of rankings has been recognized as crucial, especially as automated systems are increasingly used for high impact decisions. While numerous fairness metrics have been proposed, a comparative analysis to understand their interrelationships is lacking. Even for fundamental statistical parity metrics which measure group advantage, it remains unclear whether metrics measure the same phenomena, or when one metric may produce different results than another. To address these open questions, we formulate a conceptual framework for analytical comparison of metrics. We prove that under reasonable assumptions, popular metrics in the literature exhibit the same behavior and that optimizing for one optimizes for all. However, our analysis also shows that the metrics vary in the degree of unfairness measured, in particular when one group has a strong majority. Based on this analysis, we design a practical statistical test to identify whether observed data is likely to exhibit predictable group bias. We provide a set of recommendations for practitioners to guide the choice of an appropriate fairness metric.
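As a rough illustration of what a statistical parity metric measuring group advantage can look like, the sketch below computes the fraction of mixed-group item pairs in which the item from one group is ranked above the item from the other. The function name `pairwise_group_advantage` and the group labels are hypothetical, and this pairwise measure is an assumption chosen for illustration; it is not necessarily one of the metrics analyzed in the paper.

```python
from typing import Sequence


def pairwise_group_advantage(ranking: Sequence[str], group: str) -> float:
    """Fraction of mixed-group pairs in which the `group` item is ranked higher.

    `ranking` holds the group label of each item, from top (index 0) to bottom.
    A value of 0.5 means neither side is favored; 1.0 means every `group`
    item outranks every item that does not belong to `group`.
    """
    n_group = sum(1 for g in ranking if g == group)
    n_other = len(ranking) - n_group
    if n_group == 0 or n_other == 0:
        raise ValueError("ranking must contain items from both groups")

    wins = 0          # mixed pairs won by `group`
    others_seen = 0   # non-`group` items ranked above the current position
    for g in ranking:
        if g == group:
            # This item beats every non-`group` item ranked below it.
            wins += n_other - others_seen
        else:
            others_seen += 1
    return wins / (n_group * n_other)


if __name__ == "__main__":
    # Hypothetical ranking: group "a" mostly occupies the top positions.
    print(pairwise_group_advantage(["a", "a", "b", "a", "b", "b"], "a"))
```

A value far from 0.5 indicates that one group is systematically placed higher; how far is "too far" depends on group sizes, which is the kind of question the statistical test mentioned in the abstract addresses.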
Award ID(s):
2007932
NSF-PAR ID:
10251285
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’21)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Cell motility is a critical aspect of several processes, such as wound healing and immunity; however, it is dysregulated in cancer. Current limitations of imaging tools make it difficult to study cell migration in vivo. To overcome this, and to identify drivers from the microenvironment that regulate cell migration, bioengineers have developed 2D (two‐dimensional) and 3D (three‐dimensional) tissue model systems in which to study cell motility in vitro, with the aim of mimicking elements of the environments in which cells move in vivo. However, there has been no systematic study to explicitly relate and compare cell motility measurements between these geometries or systems. Here, we provide such analysis on our own data, as well as across data in existing literature to understand whether, and which, metrics are conserved across systems. To our surprise, only one metric of cell movement on 2D surfaces significantly and positively correlates with cell migration in 3D environments (percent migrating cells), and cell invasion in 3D has a weak, negative correlation with glioblastoma invasion in vivo. Finally, to compare across complex model systems, in vivo data, and data from different labs, we suggest that groups report an effect size, a statistical tool that is most translatable across experiments and labs, when conducting experiments that affect cellular motility.

     
  2. Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user’s information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system’s behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analysis of these metrics, showing their applicability to specific scenarios or real data and their conceptual similarities and differences, is still lacking. We aim to bridge the gap between the theoretical and practical application of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go into these metrics and point to additional work needed to improve fairness measurement.
  3. A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias. Their impacts have been largely evaluated for different groups of populations corresponding to a set of sensitive attributes, such as race or gender. Nonetheless, there has been little exploration of how imposing fairness constraints fares at an instance level. Building on the concept of influence function, a measure that characterizes the impact of a training example on the target model and its predictive performance, this work studies the influence of training examples when fairness constraints are imposed. We find that, under certain assumptions, the influence function with respect to fairness constraints can be decomposed into a kernelized combination of training examples. One promising application of the proposed fairness influence function is to identify suspicious training examples that may cause model discrimination by ranking their influence scores. We demonstrate with extensive experiments that training on a subset of weighty data examples leads to lower fairness violations, at some cost in accuracy.
  4. Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance. 
  5. INTRODUCTION: Quadriceps tendon autografts have experienced a rapid rise in popularity for anterior cruciate ligament (ACL) reconstruction due to advantages in graft sizing and potential improvement in biomechanics. While there is a growing body of literature on use of quadriceps tendon grafts, deeper investigation into the biomechanical properties of stitch techniques in this construct has been limited. The purpose of this study was to evaluate the performance of a novel suture needle against different conventional suture needles by comparing the biomechanical properties of two commonly used stitch methods, a whip stitch, and a locking stitch in quadriceps tendon. It was hypothesized that the new device would be capable of creating both whip stitches and locking stitches that are biomechanically equivalent to similar stitch techniques performed with conventional needle products. METHODS: This was a controlled biomechanical study. A total of 24 matched pair cadaveric knees were dissected and a total of 48 quadriceps tendons were harvested and tested. All tendon grafts were standardized to the same size. Samples were then randomized into the following groups, keeping the matched pairs together: (Group 1, n=16) consisted of Company W’s novel two-part suture needle design, (Group 2, n=16) consisted of Company A suture, and (Group 3, n=16) consisted of Company B suture. For each group, the matched pairs were categorized into subgroups to be instrumented with either a whip stitch or a locking stitch. Two fellowship-trained surgeons performed all stitching, where they each instrumented 8 tendon grafts per group. For instrumentation, the grafts were clamped to a preparation stand in accordance with the manufacturer’s recommendations for passing each suture needle. A skin marker was used to identify and mark five evenly spaced points, 0.5 cm apart, as a guide to create a 5-stitch series. 
For Group 1, the whip stitch as well as the locking whip stitch were performed with a novel 2-part needle. For Group 2, the whip stitch was performed with a loop suture needle and the locking stitch was a Krackow stitch performed with a curved needle. Similarly, for Group 3, the whip stitch was performed with a loop suture needle and the locking stitch was a Krackow stitch performed with a curved needle (Figure 1). Cyclical testing was performed using a servohydraulic testing machine (MTS Bionix) equipped with a 5 kN load cell. A standardized length of tendon, 7 cm, was coupled to the MTS actuator by passing it through a cryoclamp cooled by dry ice to a temperature of -5°C (Figure 2). All testing samples were then pre-conditioned to normalize viscoelastic effects and testing variability through application of cyclical loading to 25-100 N for three cycles. The samples were then held at 89 N for 15 minutes. Thereafter, the samples were loaded to 50-200 N for 500 cycles at 1 Hz. If samples survived, they were ramped to failure at 20 mm/min. Displacement and force data were collected throughout testing. Metrics of interest were total elongation (mm), stiffness (N/mm), ultimate failure load (N) and failure mode. Data are presented as averages plus/minus standard deviation. A one-way analysis of variance (ANOVA) with a Tukey pairwise comparison post hoc analysis was used to evaluate differences between the various stitching methods. Statistical significance was set at P = .05. RESULTS SECTION: For the whip stitch methods, the total elongation was found to be equivalent across all methods (W: 36 ± 10 mm; A: 32 ± 18 mm; B: 33 ± 8 mm). The stiffness of the Company A method (103 ± 11 N/mm) was significantly larger than that of Company W (64 ± 8 N/mm; p=.001), whereas the stiffness of the whip stitch by Company W was equivalent to that of Company B (80 ± 32 N/mm). The ultimate failure load was equivalent across all whip stitch methods (W: 379 ± 31 N; A: 412 ± 103 N; B: 438 ± 63 N).
For the locking stitch method, the total elongation (W: 26 ± 10 mm; A: 14 ± 2 mm; B: 29 ± 5 mm), stiffness (W: 75 ± 11 N/mm; A: 104 ± 23 N/mm; B: 79 ± 10 N/mm) and ultimate load (W: 343 ± 22 N; A: 369 ± 30 N; B: 438 ± 63 N) were found to be equivalent across all methods. The failure mode for all groups is in Table 1. The common mode of failure across study groups and stitch configurations was suture breakage. However, the whip stitch from Company A and Company B had varied failure modes. DISCUSSION: Products from the three manufacturers were found to produce biomechanically equivalent whip stitches and locking stitches with respect to elongation and ultimate failure load. The only significant difference observed was that the whip stitch created with Company A’s product had a higher stiffness than Company W’s product, which could have been due to differences in the suture material. In this cadaveric quadriceps tendon model, it was shown that when using Company W’s novel two-part suture needle, users were capable of creating whip stitches and locking stitches that achieved equivalent biomechanical performance compared to similar stitch techniques performed with conventional needle products. A failure mode limited solely to suture breakage for methods completed with Company W’s needle product suggests a reliable suture construct with limited tissue damage. SIGNIFICANCE/CLINICAL RELEVANCE: Having a suture needle device with the versatility to easily perform different stitching constructs may provide surgeons an advantage needed to improve clinical outcomes. The data presented illustrate a strong new suture technique that has equivalent performance when compared to conventional needle devices and has promising applications in graft preparation for ligament and tendon reconstruction.
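The xAUC disparity defined in item 4 of the list above (the probability that a random positive example from one group is ranked above a random negative example from another group, minus the same probability with the groups swapped) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `xauc` and `xauc_disparity` are hypothetical, labels are assumed to be 1 for positive and 0 for negative, and counting ties as half a win is an assumption borrowed from the usual AUC convention.

```python
from itertools import product
from typing import Hashable, Sequence


def xauc(scores: Sequence[float], labels: Sequence[int],
         groups: Sequence[Hashable], pos_group: Hashable,
         neg_group: Hashable) -> float:
    """Estimate P(score of a positive from pos_group > score of a negative from neg_group)."""
    pos = [s for s, y, g in zip(scores, labels, groups) if y == 1 and g == pos_group]
    neg = [s for s, y, g in zip(scores, labels, groups) if y == 0 and g == neg_group]
    if not pos or not neg:
        raise ValueError("need at least one positive in pos_group and one negative in neg_group")
    # Assumption: a tie counts as half a win, as in the usual empirical AUC.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def xauc_disparity(scores: Sequence[float], labels: Sequence[int],
                   groups: Sequence[Hashable], a: Hashable, b: Hashable) -> float:
    """xAUC(a, b) - xAUC(b, a); zero means neither group is favored across groups."""
    return xauc(scores, labels, groups, a, b) - xauc(scores, labels, groups, b, a)


if __name__ == "__main__":
    # Hypothetical toy data: two examples per group.
    scores = [0.9, 0.7, 0.6, 0.2]
    labels = [1, 0, 1, 0]
    groups = ["a", "a", "b", "b"]
    print(xauc_disparity(scores, labels, groups, "a", "b"))
```

In the toy data, the positive from group "a" outscores the negative from group "b", while the positive from group "b" is outscored by the negative from group "a", so the disparity is as large as it can be even though each group's own positive outranks its own negative.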