Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers’ resumes or social media profiles. We interrogate the validity of such systems using the stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. Crucially, rather than challenging or affirming the assumptions made in psychometric testing (that personality is a meaningful and measurable construct, and that personality traits are indicative of future success on the job), we frame our audit methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. Our main contribution is the development of a socio-technical framework for auditing the stability of algorithmic systems. This contribution is supplemented with an open-source software library that implements the technical components of the audit and can be used to conduct similar stability audits of algorithmic systems. We instantiate our framework with the audit of two real-world personality prediction systems, namely, Humantic AI and Crystal. The application of our audit framework demonstrates that both of these systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments.
Counterfactuals are often described as 'retrospective,' focusing on hypothetical alternatives to a realized past. This description relates to an often implicit assumption about the structure and stability of exogenous variables in the system being modeled, an assumption that is reasonable in many settings where counterfactuals are used. In this work, we consider cases where we might reasonably make a different assumption about exogenous variables; namely, that the exogenous noise terms of each unit do exhibit some unit-specific structure and/or stability. This leads us to a different use of counterfactuals: a forward-looking rather than retrospective counterfactual. We introduce "counterfactual treatment choice," a type of treatment choice problem that motivates using forward-looking counterfactuals. We then explore how mismatches between interventional and forward-looking counterfactual approaches to treatment choice, consistent with different assumptions about exogenous noise, can lead to counterintuitive results.
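To make the distinction concrete, the following is a minimal Python sketch of interventional versus forward-looking counterfactual treatment choice in a toy structural causal model Y = f(T, U). The model, its numbers, and all function names are illustrative assumptions, not material from the paper: the interventional rule optimizes the expected outcome over the population distribution of the exogenous noise U, while the forward-looking counterfactual rule first abducts a unit's noise from an observed (treatment, outcome) pair and assumes that noise remains stable for the next decision.

```python
# Toy sketch (illustrative assumptions, not the paper's model or code):
# a structural causal model Y = f(T, U) with binary treatment T and
# binary exogenous noise U, used to contrast two treatment-choice rules.

P_U1 = 0.3  # assumed population probability that a unit is a "responder" (U = 1)

def f(t, u):
    """Structural equation for the outcome Y."""
    if t == 0:
        return 0.0                       # no treatment: baseline outcome
    return 5.0 if u == 1 else -1.0       # treatment helps responders, slightly hurts others

# Interventional choice: pick t maximizing E_U[ Y | do(T = t) ] over the population.
def expected_outcome(t):
    return (1 - P_U1) * f(t, 0) + P_U1 * f(t, 1)

interventional_choice = max([0, 1], key=expected_outcome)           # -> 1 (treat)

# Forward-looking counterfactual choice for one specific unit:
# abduct the unit's noise U from an observed (T, Y) pair, assume it is
# stable going forward, and pick t maximizing the counterfactual f(t, U).
def abduct_u(t_obs, y_obs):
    matches = [u for u in (0, 1) if abs(f(t_obs, u) - y_obs) < 1e-9]
    return matches[0] if matches else None                          # None = noise not identified

u_hat = abduct_u(t_obs=1, y_obs=-1.0)                               # unit was treated and did poorly -> U = 0
counterfactual_choice = max([0, 1], key=lambda t: f(t, u_hat))      # -> 0 (do not treat this unit)

print(interventional_choice, counterfactual_choice)                 # 1 0: the two rules disagree
```

The disagreement (treat at the population level, withhold treatment for this particular unit) is exactly the kind of mismatch described above, and it arises only because the forward-looking rule assumes the unit's exogenous noise is stable across decisions.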
It remains an open question how to determine the winner of an election given incomplete or uncertain voter preferences. One solution is to assume some probability space for the voting profile and declare that the candidates having the best chance of winning are the (co-)winners. We refer to this interpretation as the Most Probable Winner (MPW). In this paper, we focus on elections that use positional scoring rules, and propose an alternative winner interpretation, the Most Expected Winner (MEW), according to the expected performance of the candidates. We separate the uncertainty in voter preferences into the generation step and the observation step, which gives rise to a unified voting profile combining both incomplete and probabilistic voting profiles. We use this framework to establish the theoretical hardness of MEW over incomplete voter preferences, and then identify a collection of tractable cases for a variety of voting profiles, including those based on the popular Repeated Insertion Model (RIM) and its special case, the Mallows model. We develop solvers customized for various voter preference types to quantify candidate performance for individual voters, and propose a pruning strategy that optimizes computation. The performance of the proposed solvers and pruning strategy is evaluated extensively on real and synthetic benchmarks, showing that our methods are practical.
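As a rough illustration of the difference between the two interpretations, the following Python sketch estimates both MPW and MEW by Monte Carlo for a probabilistic voting profile in which each voter's ranking is an independent draw from a Mallows model, sampled with the Repeated Insertion Model (RIM). The candidates, reference ranking, dispersion parameter, number of voters, and scoring vector are illustrative assumptions, and brute-force sampling stands in for the customized solvers and pruning strategy described above.

```python
# Monte Carlo sketch (illustrative assumptions, not the paper's solvers):
# estimate the Most Probable Winner (MPW) and Most Expected Winner (MEW)
# for a positional scoring rule under a Mallows model sampled via RIM.
import random
from collections import Counter, defaultdict

def sample_mallows_rim(reference, phi, rng):
    """Draw one ranking from Mallows(reference, phi) via repeated insertion."""
    ranking = []
    for i, candidate in enumerate(reference, start=1):
        # insert at 0-indexed position j with probability proportional to phi**(i-1-j)
        weights = [phi ** (i - 1 - j) for j in range(i)]
        j = rng.choices(range(i), weights=weights, k=1)[0]
        ranking.insert(j, candidate)
    return ranking

def positional_score(ranking, scores):
    """Apply a positional scoring rule (e.g., Borda) to one ranking."""
    return {c: scores[pos] for pos, c in enumerate(ranking)}

def mpw_and_mew(reference, phi, scores, n_voters, n_samples=10_000, seed=0):
    rng = random.Random(seed)
    win_counts = Counter()
    expected = defaultdict(float)
    for _ in range(n_samples):
        totals = defaultdict(float)
        for _ in range(n_voters):             # each voter's ranking is an independent Mallows draw
            r = sample_mallows_rim(reference, phi, rng)
            for c, s in positional_score(r, scores).items():
                totals[c] += s
        win_counts[max(totals, key=totals.get)] += 1   # MPW: who wins most often
        for c, s in totals.items():
            expected[c] += s / n_samples               # MEW: expected total score
    return max(win_counts, key=win_counts.get), max(expected, key=expected.get)

reference = ["a", "b", "c", "d"]
borda = [3, 2, 1, 0]                                   # Borda scores for 4 candidates
print(mpw_and_mew(reference, phi=0.8, scores=borda, n_voters=5))
```

Depending on the profile, the two winners may or may not coincide; the point of the sketch is only to make the two definitions operational, not to reproduce the tractability or hardness results described above.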
In the past few years, there has been much work on incorporating fairness requirements into the design of algorithmic rankers, with contributions from the data management, algorithms, information retrieval, and recommender systems communities. In this tutorial, we give a systematic overview of this work, offering a broad perspective that connects formalizations and algorithmic approaches across subfields. During the first part of the tutorial, we present a classification framework for fairness-enhancing interventions, along which we then relate the technical methods. This framework allows us to unify the presentation of mitigation objectives and of algorithmic techniques that help meet those objectives or identify trade-offs. Next, we discuss fairness in score-based ranking and in supervised learning-to-rank. We conclude with recommendations for practitioners, to help them select a fair ranking method based on the requirements of their specific application domain.
Increasingly, laws are being proposed and passed by governments around the world to regulate artificial intelligence (AI) systems implemented in the public and private sectors. Many of these regulations address the transparency of AI systems, and related citizen-aware issues such as giving individuals the right to an explanation of how an AI system makes a decision that impacts them. Yet, almost all AI governance documents to date have a significant drawback: they have focused on what to do (or what not to do) with respect to making AI systems transparent, but have left the brunt of the work to technologists to figure out how to build transparent systems. We fill this gap by proposing a stakeholder-first approach that assists technologists in designing transparent, regulatory-compliant systems. We also describe a real-world case study that illustrates how this approach can be used in practice.
In this work we use Equal Opportunity (EO) doctrines from political philosophy to make explicit the normative judgements embedded in different conceptions of algorithmic fairness. We contrast formal EO approaches that narrowly focus on fair contests at discrete decision points with substantive EO doctrines that look at people’s fair life chances more holistically over the course of a lifetime. We use this taxonomy to provide a moral interpretation of the impossibility results as the incompatibility between different conceptions of a fair contest (forward-facing versus backward-facing) when people do not have fair life chances. We use this result to motivate substantive conceptions of algorithmic fairness and outline two plausible fair decision procedures based on the luck-egalitarian doctrine of EO and Rawls’s principle of fair equality of opportunity.
Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the reliability of such systems using the stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. We develop a methodology for an external audit of the stability of algorithmic personality tests, and instantiate this methodology in an audit of two systems, Humantic AI and Crystal. Rather than challenging or affirming the assumptions made in psychometric testing (that personality traits are meaningful and measurable constructs, and that they are indicative of future success on the job), we frame our methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. In our audit of Humantic AI and Crystal, we find that both systems show substantial instability on key facets of measurement, and so cannot be considered valid testing instruments. For example, Crystal frequently computes different personality scores when the same resume is submitted as a PDF versus as raw text, violating the assumption that the output of an algorithmic personality test is stable across job-irrelevant input variations. Among other notable findings is evidence of persistent, and often incorrect, data linkage by Humantic AI. An open-source implementation of our auditing methodology, and of the audits of Humantic AI and Crystal, is available at https://github.com/DataResponsibly/hiring-stability-audit.
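The following is a minimal Python sketch of the kind of stability check this audit describes for one facet: whether predicted traits change under a job-irrelevant input variation (the same resume submitted as raw text versus as a PDF). It is not the code from the linked repository; `predict_personality`, its `fmt` parameter, and the trait-score dictionaries are hypothetical stand-ins for a vendor API.

```python
# Hypothetical harness for one stability facet: output invariance under a
# job-irrelevant input variation.  `predict_personality(resume, fmt)` is an
# assumed stand-in for a vendor API returning trait -> score dictionaries.
from statistics import mean

def score_difference(scores_a, scores_b):
    """Mean absolute per-trait difference between two score dictionaries."""
    traits = sorted(set(scores_a) & set(scores_b))
    return mean(abs(scores_a[t] - scores_b[t]) for t in traits)

def stability_across_formats(predict_personality, resumes, tolerance=0.0):
    """Flag candidates whose predicted traits change between file formats.

    resumes maps a candidate id to that candidate's resume content; a stable
    instrument should yield an empty result for any tolerance >= 0.
    """
    unstable = {}
    for candidate_id, resume in resumes.items():
        from_text = predict_personality(resume, fmt="txt")   # raw-text submission
        from_pdf = predict_personality(resume, fmt="pdf")    # PDF submission of the same resume
        diff = score_difference(from_text, from_pdf)
        if diff > tolerance:
            unstable[candidate_id] = diff
    return unstable
```

A full audit along the lines described above would repeat comparisons of this kind across the other facets of measurement, rather than this single input variation; the repository linked above contains the actual implementation used in the audits of Humantic AI and Crystal.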