<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Provable detection of propagating sampling bias in prediction models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10392150</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 37th AAAI Conference on Artificial Intelligence</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Pavan Ravishankar</author><author>Qingyu Mo</author><author>Edward McFowland III</author><author>Daniel B. Neill</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets (the well-known COMPAS dataset and historical data from NYPD’s stop and frisk policy), we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Machine learning models are being used in numerous applications such as healthcare <ref type="bibr">(De Fauw et al. 2018)</ref>, online advertising <ref type="bibr">(Perlich et al. 2014)</ref>, and finance <ref type="bibr">(Malekipirbazari and Aksakalli 2015)</ref>. Due to their increasing proliferation, there is growing concern in the machine learning community about deploying fair machine learning models <ref type="bibr">(Barocas, Hardt, and Narayanan 2017;</ref><ref type="bibr">Mehrabi et al. 2021)</ref>. Since decision-making in machine learning comprises various stages such as the data stage, modeling stage, and prediction stage <ref type="bibr">(Suresh and Guttag 2019)</ref>, it becomes imperative to look at the fairness problem across stages, rather than limiting the discussion to a single stage. For instance, the data stage could be biased due to members of a subgroup being systematically selected with a higher or a lower probability than others (Medical-Dictionary 2016), also known as sample selection bias. Such biases could propagate to the prediction stage, and the resulting biases in prediction could be compounded by other sources such as model misspecification <ref type="bibr">(Gajane and Pechenizkiy 2017)</ref>. However, it is unclear precisely how and to what extent the data bias would affect the predictions, and when the resulting prediction biases would be detectable by some auditing approach. Such biases, once detected and precisely characterized, could then be corrected, e.g., by resampling to de-bias the data.</p><p>In this paper, we analyze the propagation of differential sampling bias from the data stage to the prediction stage. 
Differential sampling bias is a form of sample selection bias in which some subpopulation S is sampled non-uniformly, such that the distribution of an outcome variable Y given predictor variables X in the sampled data for S differs from the true (population) distribution of Y given X for S. <ref type="foot">1</ref>This bias can arise in many different circumstances. For example, in criminal justice, both the organizational biases of police departments (e.g., a policy of conducting large numbers of pedestrian stops in predominantly minority neighborhoods) and the perceptual biases of individual police officers (e.g., higher likelihood of stopping and frisking Black individuals) led to much higher proportions of Black individuals being arrested for marijuana possession, despite similar rates of use in the population as a whole <ref type="bibr">(Edwards et al. 2020)</ref>. In our analysis of NYPD stop and frisk data, we consider the race of the stopped individual as our outcome variable, and observe that Pr(race = "Black") is significantly increased as compared to a "less biased" alternative policing strategy.</p><p>Differential sampling bias can also result from concept shift: a model meant for prediction of outcome variable Y in one setting is learned using data from a different setting where the relationship between Y and the predictor variables X differs. For example, if criminal justice data from one jurisdiction is used to predict a defendant's risk of reoffending in a different jurisdiction, or if historical data is used and reoffending patterns have changed over time, the training data will exhibit differential sampling bias: the proportion of reoffenders for certain demographics may be higher or lower in the training data as compared to the true probabilities for the jurisdiction and time period of interest. 
In our experimental analysis of the COMPAS dataset, we inject simulated differential sampling bias (assuming concept shift) by weighted resampling of the training data.</p><p>[Figure 1, Step 4: Bias Scan finds the most biased subgroup S * and its log-likelihood ratio score F * based on the predictions.]</p><p>Here we introduce the first formal analysis of how differential sampling bias induced in the data stage (i.e., biased training data) propagates through the modeling and prediction stages, leading to significant biases in prediction. These propagated biases can then be detected by an auditor that compares the model predictions with the observed outcomes.</p><p>Our problem setup is shown in Figure <ref type="figure">1</ref>: first, in the data stage, we assume initially unbiased training and test data records drawn i.i.d. from some joint probability distribution f X,Y (x, y) of the predictor variables X and binary outcome variable Y. Then differential sampling bias &#8710; is injected into the "true" subgroup S T for the training data only. Without loss of generality, we define Y such that the differential sampling bias increases the probability P(Y = 1|X), thus over-sampling records with Y = 1 in subgroup S T . We parameterize the multiplicative increase in the odds of Y = 1 by &#8710; &gt; 1. Second, in the modeling stage, a classification model is trained using the biased data. Third, in the prediction stage, the classifier makes predictions p i (the estimated probability that Y = 1 for each data record) for the test data. 
Fourth, Bias Scan <ref type="bibr">(Zhang and Neill 2016</ref>) is used to assess whether the predictions p i are systematically biased as compared to the test outcomes y i for any intersectional subgroup.</p><p>Given this problem setup, we present theoretical and empirical results showing (a) the amount of bias that propagates from the data stage to the prediction stage, as measured by the log-likelihood ratio (LLR) score found by Bias Scan; and (b) when the bias will exceed a threshold for significance, assuming a fixed false positive rate &#945;, thus enabling detection by Bias Scan. Our specific contributions are as follows:</p><p>1. We define and quantify the differential sampling bias &#8710; induced into subgroup S T in the binary outcome Y . 2. We derive a new closed-form expression for the LLR score of Bias Scan, used to audit a consistent classifier trained on large data with differential sampling bias. 3. We present a new asymptotic result for the null distribution of the Bias Scan score, which leads to a threshold score h(&#945;) for detection at a fixed false positive rate &#945;. 4. We demonstrate detection with full asymptotic power, P H1 (Reject H 0 ) &#8594; 1, as the data size becomes large. 5. Using the threshold h(&#945;), we find the minimum amount of bias &#8710; that needs to be induced in subgroup S T for it to be provably detectable in the finite sample case. 6. We evaluate our theoretical results empirically on two different criminal justice datasets. On the well-known COMPAS dataset, we compare the empirical and theoretical relationships between the Bias Scan score F * and the amount of injected bias &#8710;, across two different classification models and two types of bias injection (marginal and intersectional). We also analyze historical data from the NYPD's "stop-question-frisk" (SQF) policy, estimating the amount of differential sampling bias &#8710; in the data as compared to a "less biased" alternative policing strategy. 7. 
For both datasets, we observe that the empirical relationship between the propagated bias in predictions (as measured by the Bias Scan score F * ) and the differential sampling bias in data (as measured by &#8710;) corresponds well to the theoretical values. We also confirm that, if enough bias is present in the data stage, then the affected subgroup is detectable by the auditor in the prediction stage with high accuracy. These two conclusions demonstrate the validity of the theoretical assumptions and provide reasoning when theoretical and empirical results differ.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>Stage-specific notions of fairness and bias: The machine learning community has typically centered the fairness problem in either the data stage or the prediction stage <ref type="bibr">(Barocas, Hardt, and Narayanan 2017)</ref>  <ref type="bibr">and Goel (2018)</ref> discuss the limitations of these fairness definitions; <ref type="bibr">Kleinberg, Mullainathan, and Raghavan (2016)</ref> and <ref type="bibr">Chouldechova (2017)</ref> prove that, except in special cases, these definitions are incompatible; <ref type="bibr">Zadrozny (2004)</ref> proposes a framework to correct bias in model predictions; and <ref type="bibr">Pedreschi, Ruggieri, and Turini (2009)</ref> propose novel measures of discrimination to correct discriminatory patterns. None of the aforementioned works have analyzed how bias propagates downstream, across different stages of the pipeline.</p><p>Bias propagation pipelines: Suresh and Guttag (2019) discuss the bias problem holistically, rather than centering it on a particular stage, by laying out a framework comprising biases originating at different stages of the pipeline. Similarly, an opinion article by <ref type="bibr">Hooker (2021)</ref> proposes that bias should be viewed and analyzed as an aggregation of the biases arising in different stages. However, neither of these works provides any formal, quantitative analysis of how bias propagates between stages. <ref type="bibr">Rambachan and Roth (2019)</ref> quantitatively analyze how selection bias propagates from the data stage to the prediction stage. 
However, the study makes a strong assumption about the form of the selection process, and does not discuss whether the propagated bias is detectable or how it can be detected in the prediction stage.</p><p>Frameworks for detection of intersectional biases: Several recent approaches have been proposed to detect biases affecting a subpopulation defined along multiple data dimensions <ref type="bibr">(Zhang and Neill 2016;</ref><ref type="bibr">Kearns et al. 2018</ref>). Here we apply Bias Scan <ref type="bibr">(Zhang and Neill 2016)</ref> to assess models learned from biased data, detecting intersectional subgroups where the model predictions p i most significantly overestimate P(Y = 1 | X = x i ). Bias Scan builds on previous univariate and multivariate subset scan approaches <ref type="bibr">(Neill 2012;</ref><ref type="bibr">Neill, McFowland III, and Zheng 2013)</ref>. <ref type="bibr">Additionally, McFowland III, Somanchi, and Neill (2018)</ref> use a similar multidimensional scan framework to discover the subgroups that are most significantly affected by a treatment in a randomized experiment, and provide statistical guarantees on detection. However, all of these approaches focus on a single pipeline stage (predictions or outcomes), while our work examines the propagation of data biases into model predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preliminaries Notations</head><p>Here Y is a binary outcome variable, and thus we can write f Y |X (y|x) and f Y |X (y|x) as the probabilities P(Y = y | X = x) and</p><p>, for test and training data respectively. Let p i = P(Y = 1 | X = x i ) be the true probability that Y = 1 for test record s i = (x i , y i ), and let p i and pi be the estimated probabilities that Y = 1 for test record s i from classification models learned from training data with and without differential sampling bias. Note that p i = pi when &#8710; = 1 (under the null hypothesis of no bias).</p><p>We assume that X consists of a set of discrete-valued<ref type="foot">2</ref> predictor variables {X 1 , . . . , X Q } and that each variable X i takes on a set of values V i . An intersectional subgroup S is defined as a subset of the Cartesian product V 1 &#215; . . . &#215; V Q . A subgroup is rectangular if it can itself be written as a Cartesian product of subsets of the V j . For example, if X 1 = Gender, X 2 = Race, and V 2 = {Black, White, Other}, then {Male, Female} &#215; {Black, White} = {(Male, Black), (Female, Black), (Male, White), (Female, White)} is a rectangular subgroup, while {(Male, Black), (Female, White)} is non-rectangular. Let rect(X) denote the set of all rectangular subgroups of X. Finally, for test dataset D, we associate with any given subgroup S the subset of matching data records D S = {s i &#8712; D : x i &#8712; S}.</p></div>
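<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distinction between rectangular and non-rectangular subgroups can be checked mechanically. The following minimal Python sketch treats a subgroup as a set of attribute-value tuples and tests whether it equals the Cartesian product of its per-attribute projections. The function name is_rectangular is a hypothetical helper introduced here for illustration, and treating the empty subgroup as rectangular is a convention of this sketch only.</p><eg><![CDATA[
```python
from itertools import product


def is_rectangular(subgroup):
    """Return True if `subgroup` (a collection of equal-length tuples) equals
    the Cartesian product of its per-attribute projections."""
    subgroup = set(subgroup)
    if not subgroup:
        return True  # convention of this sketch: the empty subgroup is rectangular
    # Values observed for each attribute position across the subgroup.
    projections = [set(values) for values in zip(*subgroup)]
    return subgroup == set(product(*projections))


# The two examples given in the text.
rect = {("Male", "Black"), ("Female", "Black"), ("Male", "White"), ("Female", "White")}
non_rect = {("Male", "Black"), ("Female", "White")}
print(is_rectangular(rect), is_rectangular(non_rect))
```
]]></eg><p>The second example fails the check because the product of its projections, {Male, Female} and {Black, White}, also contains (Male, White) and (Female, Black).</p></div>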
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Bias Scan</head><p>Bias Scan <ref type="bibr">(Zhang and Neill 2016</ref>) is a multi-dimensional subset scanning algorithm used to detect intersectional subgroups for which a classifier's probabilistic predictions p i of a binary outcome y i are significantly biased as compared to the observed outcomes y i . More precisely, Bias Scan searches for the rectangular subgroup S * which maximizes a Bernoulli log-likelihood ratio (LLR) scan statistic,<ref type="foot">foot_1</ref> F * = max S&#8712;rect(X) F (S).</p><p>2 Sensitive covariates (e.g., race, ethnicity, and gender) are usually discrete. Continuous covariates can be discretized as a preprocessing step, using the observed covariate distribution or domain knowledge.</p><p>To obtain the score function for a given subgroup S, Bias Scan computes the generalized log-likelihood ratio F (S) = max q log [P (D | H1(S, q)) / P (D | H0)], assuming the following hypotheses:</p><p>Here we detect biases where the probabilities p i are overestimated, and thus 0 &lt; q &lt; 1. As derived in the Technical Appendix, the resulting log-likelihood ratio score F (S) is</p><p>(1)</p><p>The Bias Scan algorithm for optimizing F (S) over rectangular subgroups is provided in the Technical Appendix.</p></div>
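<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration of how the score of a single subgroup could be computed, the sketch below assumes the odds-multiplier form of the hypotheses from Zhang and Neill (2016): under H0 the odds of y i = 1 equal the predicted odds p i / (1 &#8722; p i ), and under H1(S, q) they are q times the predicted odds for records in S. Under that assumption, the log-likelihood ratio for a fixed q is log(q) &#8721; y i &#8722; &#8721; log(1 &#8722; p i + q p i ), and the maximizing q equates the observed number of positives with &#8721; q p i / (1 &#8722; p i + q p i ). The function name bias_scan_score and the use of bisection are illustrative choices rather than the authors' implementation, all p i are assumed to lie strictly between 0 and 1, and the search over rectangular subgroups is not shown.</p><eg><![CDATA[
```python
import math


def bias_scan_score(p, y, iters=200):
    """Return (F, q_mle) for one subgroup, where
    F = max over 0 < q <= 1 of sum(y) * log(q) - sum(log(1 - p_i + q * p_i)).
    Assumed form of the Bernoulli LLR score; see the accompanying text."""
    total_y = sum(y)
    if total_y == 0:
        # No positives observed: the supremum is approached as q -> 0.
        return -sum(math.log(1.0 - pi) for pi in p), 0.0

    def excess(q):
        # Observed positives minus expected positives once the odds are scaled by q.
        return total_y - sum(q * pi / (1.0 - pi + q * pi) for pi in p)

    if excess(1.0) >= 0:
        # Predictions are not overestimated, so the constrained maximum is at q = 1.
        return 0.0, 1.0

    lo, hi = 0.0, 1.0  # excess is decreasing in q, positive near 0, negative at 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    q = 0.5 * (lo + hi)
    score = total_y * math.log(q) - sum(math.log(1.0 - pi + q * pi) for pi in p)
    return score, q


# Four records predicted at 0.5 each, but only one positive outcome.
score, q_mle = bias_scan_score([0.5] * 4, [1, 0, 0, 0])
print(score, q_mle)
```
]]></eg><p>In this toy case the predictions imply two expected positives while only one is observed, so the fitted multiplier falls below 1 and the score is positive. When observed and expected positives agree, the sketch returns a score of 0.</p></div>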
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Differential Sampling Bias</head><p>In this section, we quantify differential sampling bias for a subgroup S, as follows: Definition 1. A subgroup S exhibits differential sampling bias &#8710; &gt; 1 towards the outcome Y = 1 if, for all x &#8712; S, the odds of Y = 1 given X = x in the sampled data are &#8710; times the corresponding odds in the population, P(Y = 1 | X = x) / (1 &#8722; P(Y = 1 | X = x)).</p><p>For example, differential sampling bias could be injected into unbiased training data by re-drawing data elements {(x i , y i )}, for x i &#8712; S, with replacement, with sampling weights w i = &#8710; for y i = 1 and w i = 1 for y i = 0. We use this approach to inject bias into the COMPAS dataset in our experiments below.</p></div>
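<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighted resampling described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions rather than the code used in the experiments: the helper names inject_bias and odds are hypothetical, the resampled subgroup is assumed to keep its original size, and the synthetic labels (base rate 0.3, subgroup defined by even indices) exist only to demonstrate the effect.</p><eg><![CDATA[
```python
import numpy as np


def inject_bias(y, in_subgroup, delta, rng):
    """Return row indices of the training data after injecting bias `delta`.

    Records inside the subgroup are redrawn with replacement, with weight
    `delta` when y == 1 and weight 1 when y == 0. Records outside the
    subgroup are kept unchanged. Keeping the subgroup size fixed is an
    assumption of this sketch."""
    y = np.asarray(y)
    in_subgroup = np.asarray(in_subgroup, dtype=bool)
    sub_idx = np.flatnonzero(in_subgroup)
    rest_idx = np.flatnonzero(~in_subgroup)
    if sub_idx.size == 0:
        return rest_idx
    weights = np.where(y[sub_idx] == 1, float(delta), 1.0)
    resampled = rng.choice(sub_idx, size=sub_idx.size, replace=True,
                           p=weights / weights.sum())
    return np.concatenate([rest_idx, resampled])


def odds(labels):
    """Odds of a positive label in a binary array."""
    rate = float(np.mean(labels))
    return rate / (1.0 - rate)


# Synthetic demonstration data (not from the paper).
rng = np.random.default_rng(0)
n = 200_000
y = (rng.random(n) < 0.3).astype(int)
in_subgroup = np.arange(n) % 2 == 0
idx = inject_bias(y, in_subgroup, delta=2.0, rng=rng)

biased_odds = odds(y[idx][in_subgroup[idx]])
original_odds = odds(y[in_subgroup])
print(biased_odds / original_odds)  # close to delta for large samples
```
]]></eg><p>Because positives are drawn with weight &#8710; and negatives with weight 1, the odds of Y = 1 inside the resampled subgroup are multiplied by approximately &#8710;, which matches the odds-based parameterization of the bias.</p></div>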
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theoretical Results</head><p>In this section, we derive theoretical results to understand the propagated effects of differential sampling bias and to provide statistical guarantees for detectability.</p><p>More precisely, we prove four main theorems. Given the problem setup described above and the assumptions listed below, Theorem 1 provides an asymptotic closed-form formulation of the Bias Scan log-likelihood ratio (LLR) score F (S T ) of the injected subgroup S T as a function of the amount of differential sampling bias &#8710;. If S T is a rectangular subgroup, S T &#8712; rect(X), this score is a lower bound on the overall Bias Scan score F * = max S&#8712;rect(X) F (S). Theorem 2 provides an upper bound for the null distribution of F * (i.e., assuming no bias is present), enabling us to compute a threshold score for detection. Finally, Theorems 3 and 4 combine these results to show asymptotic detection with full power for any &#8710; &gt; 1 as the sizes of the training and test data go to infinity, as well as computing the minimum amount of bias &#8710; needed for detection in finite test data.</p><p>These Theorems rely on three key assumptions: (A1) Consistency of the classifier used in the prediction stage, for learning the conditional distribution f Y |X .</p><p>(A2) Full support of the biased training data:</p><p>Given these assumptions, we first derive the relationship between the amount of differential sampling bias &#8710; injected into subgroup S, and the Bias Scan score F (S):</p><p>Theorem 1. Assume that a classifier is trained on data D with differential sampling bias &#8710; &gt; 1 for subgroup S and makes predictions p i for unbiased test data D = {(x i , y i )}. 
If Bias Scan is used to assess bias in p i as compared to y i , then under assumptions (A1)-(A3), as the number of training data records | D| &#8594; &#8734;, the Bias Scan score F (S) of subgroup S converges to:</p><p>if &#8710; &gt; qMLE , and F (S) &#8594; 0 otherwise, where qMLE is the maximum likelihood estimate of q for Bias Scan assuming no differential sampling bias (&#8710; = 1), satisfying</p><p>and</p><p>is the Bias Scan score of subgroup S assuming no differential sampling bias (&#8710; = 1).</p><p>The proof of Theorem 1 is provided in the Technical Appendix. Critically, under assumptions (A1)-(A3), as | D| &#8594; &#8734;, we have</p><p>and the corresponding predicted probabilities with no differential sampling bias, pi</p><p>We then show that the maximum likelihood estimate (MLE) of q for Bias Scan is qMLE /&#8710;, where qMLE is the corresponding MLE with no differential sampling bias. Finally, we plug in the expressions for p i , pi , and q M LE , and simplify. Corollary 1. Under the conditions of Theorem 1, as the number of test data records |D| &#8594; &#8734;, the normalized Bias Scan score F (S)/|D| of subgroup S converges to:</p><p>an increasing function of &#8710;.</p><p>Next, we provide statistical guarantees for the detection of bias. To do so, we first consider the distribution of the Bias Scan score F * = max S&#8712;rect(X) F (S) under the null hypothesis of no bias, H 0 . For a given false positive rate &#945;, we find a score threshold h(&#945;) such that P H0 (F * &gt; h(&#945;)) &#8804; &#945;.</p><p>To do so, we make the additional assumption:</p><p>(A4) The number of unique covariate profiles in the test data, M , is large enough so that Gaussian approximations hold (e.g., M &gt; 30) but finite (i.e., M remains constant as the number of test data records |D| &#8594; &#8734;).</p><p>Then we can show the following: Theorem 2. 
Assume that a classifier is trained on unbiased training data D and makes predictions pi for unbiased test data D = {(x i , y i )}, and Bias Scan is used to assess bias in pi as compared to y i . Let F * = max S&#8712;rect(X) F (S) be the Bias Scan score, maximized over all rectangular subgroups S. Then under assumptions (A1)-(A4), as the number of training data records | D| &#8594; &#8734; and the number of test data records |D| &#8594; &#8734;, for a given Type-I error rate &#945; &gt; 0, there exists a critical value h(&#945;) and constants k 1 &#8776; 0.202, k 2 &#8776; 0.523 such that P(F * &gt; h(&#945;)) &#8804; &#945;, where</p><p>and &#934; is the Gaussian cdf.</p><p>Critically, h(&#945;) does not depend on the number of test data records |D|, but only on the number of unique covariate profiles in the test data M . Now, we prove that under the presence of bias &#8710;, h(&#945;) serves as a threshold for rejecting the null hypothesis of no bias with full asymptotic power.</p><p>Theorem 3. Assume that a classifier is trained on data D with differential sampling bias &#8710; &gt; 1 for rectangular subgroup S T and makes predictions p i for unbiased test data D = {(x i , y i )}, and Bias Scan is used to assess bias in p i as compared to y i . Let F * = max S&#8712;rect(X) F (S) be the Bias Scan score, and let h(&#945;) be the score threshold for detection at a fixed Type-I error rate of &#945;, as given in Equation (3). Then for any &#945; &gt; 0 and &#8710; &gt; 1, under assumptions (A1)-(A4), as the number of training data records | D| &#8594; &#8734; and the number of test data records |D| &#8594; &#8734;, P(F * &gt; h(&#945;)) &#8594; 1.</p><p>We now find the minimum bias that needs to be induced into subgroup S to be detectable for a given Type-I error rate.</p><p>Theorem 4. 
Assume that a classifier is trained on data D with differential sampling bias &#8710; &gt; 1 for rectangular subgroup S T and makes predictions p i for unbiased test data D = {(x i , y i )}, and Bias Scan is used to assess bias in p i as compared to y i . Let F * = max S&#8712;rect(X) F (S) be the Bias Scan score, and let h(&#945;) be the score threshold for detection at a fixed Type-I error rate of &#945;, as given in Equation (3). Further, assume D S T is fixed, with finite size |D S T | and &#8721; s i &#8712; D S T y i &lt; |D S T |. Then for any &#945; &gt; 0, under assumptions (A1)-(A4), as the number of training data records | D| &#8594; &#8734;, there exists &#8710; thresh &#8805; 1 such that, if &#8710; &gt; &#8710; thresh , then P(F * &gt; h(&#945;)) &#8594; 1, where</p><p>and F old (S T ) is the Bias Scan score of subgroup S T assuming no differential sampling bias (&#8710; = 1). Proofs of Theorems 1-4 are provided in the Technical Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments</head><p>We perform experiments on two criminal justice datasets to validate our theoretical results: semi-synthetic predictions of recidivism risk derived from the well-known COMPAS dataset, and real-world "stop, question and frisk" (SQF) data from the New York Police Department (NYPD).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments on COMPAS/ProPublica Data</head><p>COMPAS is a commercial decision-support algorithm which has been applied in many jurisdictions to estimate a defendant's probability of reoffending, with impacts on criminal justice outcomes such as bail, sentencing, and parole. COMPAS gained notoriety when investigative journalists from ProPublica published a study arguing that COMPAS was racially biased against Black defendants <ref type="bibr">(Angwin et al. 2016)</ref>.</p><p>The public dataset compiled by ProPublica<ref type="foot">4</ref>, including COMPAS risk predictions for 7,214 defendants in Broward County, Florida, from 2013-2014, and a two-year follow-up to record which defendants were rearrested, has been studied by numerous algorithmic bias researchers <ref type="bibr">(Barenstein 2019)</ref>. While most of these analyses focus on assessing biases in the COMPAS risk predictions <ref type="bibr">(Chouldechova 2017;</ref><ref type="bibr">Kleinberg et al. 2018)</ref>, we instead utilize this dataset to learn predictive models for the binary outcome (rearrest within two years) as a function of five categorical predictor variables<ref type="foot">5</ref>, and use these models to study how differential sampling bias in the data propagates to the model predictions.</p><p>To do so, we consider differential sampling biases &#8710; &#8712; {1, 1.25, 1.5, . . . , 10} injected into one of two rectangular subgroups. Letting X 1 = Gender, X 2 = Race, and V j = the set of all possible values for attribute X j , we consider the subgroups S T = {Female} &#215; V 2 &#215; . . . &#215; V 5 and S T = {Female} &#215; {Caucasian} &#215; V 3 &#215; . . . &#215; V 5 . 
The first subgroup represents a marginal bias against females (since we are oversampling females who reoffended, as compared to females who did not reoffend, by a factor of &#8710; in the training data, thus leading to an overestimate of their reoffending risk), while the second subgroup represents an intersectional bias against white females. We also consider two different classifiers, random forest and logistic regression, and average results over 100 trials for each combination of classifier, injected subgroup S T , and amount of bias &#8710;.</p><p>4 <ref type="url">https://github.com/propublica/compas-analysis/compas-scores-two-years.csv</ref></p><p>5 Predictors include gender, race, charge degree, age &lt; 25, and number of prior offenses ("none", "1 to 5", or "more than 5").</p><p>For each trial, we randomly partition the data into 80% training and 20% testing data. If &#8710; &gt; 1, then differential sampling bias &#8710; is injected into subset S T for the training data D, resampling data records ( x i , y i ) &#8712; D S T with replacement (where records with y i = 1 have weight &#8710; and records with y i = 0 have weight 1), and leaving the test data D and the rest of the training data unchanged. The classifier is trained on the biased training data, and used to make predictions p i on the unbiased test data. Then Bias Scan is used to assess whether these predictions are biased, reporting the highest scoring subgroup S * = arg max S&#8712;rect(X) F (S) and its score F * = F (S * ). We then compare the values of the Bias Scan score F * , the score of the injected subgroup F (S T ) (calculated by Equation (<ref type="formula">1</ref>)), and the theoretical score of S T , which we denote as F theo (S T ). The value of F theo (S T ) is computed using only the unbiased training and test data, as defined in Theorem 1:</p><p>if &#8710; &gt; qMLE , and F theo (S T ) = 0 otherwise. 
We also compute the overlap (Jaccard coefficient) between the injected subset of test data records D S T and the detected subset D S * : overlap = |D S T &#8745; D S * | / |D S T &#8746; D S * |.</p><p>Finally, we use Theorems 2 and 4 to estimate the critical value h(&#945;) and the corresponding threshold value &#8710; thresh , for which we expect</p><p>Given these values for each amount of bias &#8710; (averaged over the 100 trials, for a given classifier and a given injected subgroup S T ), we form two plots: one comparing F * , F (S T ), and F theo (S T ) as a function of &#8710;, and one showing overlap between D S T and D S * as a function of &#8710;, as compared to &#8710; thresh .</p><p>If assumptions (A1)-(A4) hold, as the size of the training data grows to infinity, we expect perfect overlap between the curves for F theo (S T ) and F (S T ) by Theorem 1. As &#8710; becomes large compared to &#8710; thresh , we expect S * &#8776; S T , and thus F * &#8776; F (S T ) and overlap &#8776; 1, while for small &#8710;, we expect F * &gt; F (S T ) and overlap &#8810; 1. We now examine whether these expectations are met for the finite, real-world COMPAS dataset, for each classifier and each injected subgroup S T .</p></div>
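<div xmlns="http://www.tei-c.org/ns/1.0"><p>The overlap measure used in this evaluation is the standard Jaccard coefficient, which can be sketched as follows. The function name jaccard_overlap and the record identifiers in the example are hypothetical, and returning 1.0 when both sets are empty is a convention chosen for this sketch.</p><eg><![CDATA[
```python
def jaccard_overlap(injected_ids, detected_ids):
    """Jaccard coefficient between the injected and detected sets of test records."""
    injected, detected = set(injected_ids), set(detected_ids)
    union = injected | detected
    if not union:
        return 1.0  # convention of this sketch: two empty sets overlap perfectly
    return len(injected & detected) / len(union)


# Hypothetical record identifiers: two of the four distinct records are shared.
print(jaccard_overlap({1, 2, 3}, {2, 3, 4}))
```
]]></eg><p>An overlap of 1 means the detected subgroup covers exactly the injected test records, while values well below 1 indicate that the detected subgroup misses injected records, includes unaffected ones, or both.</p></div>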
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental results</head><p>For the logistic regression classifier learned from training data injected with marginal differential sampling bias (Figure <ref type="figure">2</ref>), we observe near-perfect overlap between the observed score F (S T ) and theoretical score F theo (S T ) for the injected subgroup S T , suggesting the validity of our theoretical results above. As expected, the Bias Scan score F * &#8776; F (S T ) and overlap &#8776; 1 for &#8710; &gt; &#8710; thresh , while F * &gt; F (S T ) and overlap &#8810; 1 for small &#8710;. For the random forest classifier learned from training data injected with marginal differential sampling bias (Figure <ref type="figure">3</ref>), we see similar results, but with F (S T ) slightly greater than F theo (S T ) for large &#8710;. This is likely due to data sparsity: the combination of finite training data and high bias may lead to few or no training data records with y i = 0 for some covariate profiles in the injected subgroup, leading to inaccurate estimation of P(Y = 1 | X). This pattern is repeated for the random forest classifier learned from training data injected with intersectional differential sampling bias (Figure <ref type="figure">4</ref>), with a larger gap between F (S T ) and F theo (S T ), most likely due to the smaller amount of training data in S T . Similarly, the smaller amount of test data in S T leads to some noise in the detected subgroup, resulting in overlap &#8776; 0.9 rather than 1, and thus F * = max S&#8712;rect(X) F (S) &gt; F (S T ). 
Nevertheless, these results suggest that the theoretical values of F theo (S T ) and &#916; thresh are good approximations even for finite data.</p><p>For the logistic regression classifier learned from training data injected with intersectional differential sampling bias (Figure <ref type="figure">5</ref>), however, we see a very different picture: as &#916; increases, the Bias Scan score F * and the score of the injected subgroup F (S T ) are both much smaller than the theoretical score F theo (S T ), and the overlap between S * and S T plateaus around 0.4 even for large &#916;. This is because assumption (A1) is violated: the logistic regression model is misspecified and cannot learn the intersectional bias against white females, instead learning separate (and much smaller) marginal biases against all females and all white individuals via the learned model coefficients on these terms. When an interaction term for white females is manually added to the logistic regression model specification (Figure <ref type="figure">6</ref>), we observe that this additional term resolves the problem, and we again have a near-perfect match between the theoretical and observed scores for the injected subgroup S T .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments on NYPD Stop and Frisk Data</head><p>The New York Police Department (NYPD) has long been plagued with accusations of racially discriminatory policing practices related to its "stop, question, and frisk" (SQF) policies. <ref type="bibr">Gelman, Fagan, and Kiss (2007)</ref> found that persons of color "were stopped more frequently than whites, even after controlling for precinct variability and race-specific estimates of crime participation". <ref type="bibr">Goel, Rao, and Shroff (2016)</ref> concluded that Black and Hispanic individuals were disproportionately impacted by "low hit rate" stops, where the officer suspected the stopped individual of criminal possession of a weapon (CPW) but the ex ante probability of recovering a weapon was low. Here we assess racial bias in NYPD policing practices by analyzing five years of SQF data during the peak of the stop and frisk policy, prior to a 2013 court ruling (Floyd v. City of New York) that NYPD stop-and-frisk tactics were unconstitutionally targeting New Yorkers of color. Thus our dataset consists of 760,489 pedestrian stops (made by NYPD officers for suspected CPW) from 2008-2012, downloaded from the city's web site<ref type="foot">foot_2</ref> . Following <ref type="bibr">Goel, Rao, and Shroff (2016)</ref>, we first fit a logistic regression model to predict the probability that each stopped individual was found to have a weapon, using location ("housing", "transit", or "neither"), precinct, and 18 binary variables describing the circumstances of the stop<ref type="foot">foot_3</ref> as predictors. Stops with ex ante probability of recovering a weapon at least 0.1 were marked as "high probability". 
If only high probability stops were conducted, 4.8% of stops would have been made, 46% of weapons would have been recovered, and the proportion of stopped individuals who were neither Black nor Hispanic would have more than doubled, from 9% to 23%.</p><p>Next we create a new dataset with the demographics of each stopped individual (borough, sex, race, and age decile, all of which were excluded from the predictive model above), and whether each was a high or low probability stop. We then assess racial bias by considering the race of the stopped individual as the outcome variable, and comparing the original, biased policing data to an alternative, "less biased"<ref type="foot">foot_4</ref> policing practice in which only high probability stops were made.</p><p>More precisely, we perform the following steps, for each value of k &#8712; {0, 10, . . . , 100}: Thus, for k &gt; 0, this process can be thought of as injecting differential sampling bias, increasing the odds that Race = Black by some factor &#916; &gt; 1, as compared to the alternative policing practice of only making high probability stops. However, this scenario poses several new challenges for our theoretical analysis: we do not know the injected subgroup S T or the amount of bias &#916;, and in fact the bias may be heterogeneous (different &#916; for different covariate profiles). Thus we make several simplifying assumptions. First, when auditing predictions from the model learned from the most biased training data (k = 100), Bias Scan identifies a large, high-scoring subgroup S * consisting of individuals with Gender &#8712; {Male, Female}, Age &lt; 70, and Borough &#8712; {Manhattan, Brooklyn, Queens, Staten Island} (excluding the Bronx). We assume that this S * is the true injected subgroup S T . 
Second, we assume that &#916; is constant over subgroup S T , and thus compute the odds ratio <formula notation="TeX">\Delta = \frac{p_k (1 - p_0)}{(1 - p_k)\, p_0}</formula>, where p k is the proportion of Black individuals in subgroup S T of the training dataset for a given value of k. Thus we have amounts of differential sampling bias ranging from &#916; = 1 for k = 0 to &#916; = 2.675 for k = 100. We then use these values of &#916; along with the "less biased" training and test data (k = 0) to plot F theo (S T ) as a function of k, and compare these theoretical values to the Bias Scan score F * and the subgroup score F (S T ). In Figure <ref type="figure">7</ref>, we observe that F * = F (S T ) except when k = 0, i.e., the same subgroup S * is detected for all k &gt; 0. Additionally, we see that F theo (S T ) is a relatively good approximation for F (S T ), with F (S T ) consistently about 16% lower than F theo (S T ) across all values of k. This difference can be explained by our approximation of the heterogeneous bias &#916; x , for covariate profiles x &#8712; S T , by estimating a single, constant &#916; value.</p></div>
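<div xmlns="http://www.tei-c.org/ns/1.0"><p>The odds-ratio computation used above to estimate a single, constant &#916; can be sketched in a few lines. This is an illustration under stated assumptions, not our experimental code: the helper names odds and odds_ratio are hypothetical, and the two proportions are made-up values chosen only to make the arithmetic easy to follow, not figures from the SQF data.</p>
<p><![CDATA[
```python
def odds(p):
    """Convert a proportion strictly between 0 and 1 to odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_k, p_0):
    """Odds ratio Delta = p_k (1 - p_0) / ((1 - p_k) p_0).

    p_k: proportion with the outcome in the subgroup at bias level k.
    p_0: the same proportion in the "less biased" baseline (k = 0).
    """
    return odds(p_k) / odds(p_0)


# Hypothetical proportions (NOT values from the SQF data).
p_0 = 0.50   # baseline proportion at k = 0
p_k = 0.75   # proportion after injecting bias

delta = odds_ratio(p_k, p_0)
print(delta)
```
]]></p>
<p>When p_k equals p_0 the ratio is exactly 1, matching the case k = 0 in which no differential sampling bias has been injected.</p></div>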
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>It is critical both to analyze the downstream impacts of biases as they propagate through the learning pipeline, and to create new analytical tools to detect and mitigate propagating biases. With this work, we take a step toward these goals by quantifying how a particular data bias, differential sampling bias, propagates into biased model predictions, and providing theoretical guarantees for detection of the propagated biases. We validate our theoretical results through experiments on real-world criminal justice data where our assumptions are relaxed. In future work, we plan to extend our theoretical analysis of propagating biases to other types of data bias (e.g., measurement bias) as well as biases in other pipeline stages. We are particularly interested in analyzing when model predictions are impacted by multiple, interacting biases, which we believe is often the case in complex, real-world settings.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Note that differential sampling bias would not be present if subpopulation S was under- or over-sampled but the distribution of Y given X for S remained unchanged. We do not address other forms of sample selection bias here.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>We assume that the bias is injected into a rectangular subgroup, a common formulation (e.g., used in decision trees), as it is representative of a cohesive and interpretable subpopulation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3"><p>These circumstances include suspicious object, fits description, casing, acting as lookout, suspicious clothing, drug transaction, furtive movements, actions of violent crime, suspicious bulge, witness report, ongoing investigation, proximity to crime scene, evasive response, associating with criminals, changed direction, high crime area, time of day, and sights and sounds of criminal activity.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4"><p>We refer to the high probability stop data as "less biased" rather than "unbiased" because it still contains biases based on which neighborhoods the NYPD officers chose to patrol, but eliminates the many low probability stops which predominantly and unfairly target racial minorities.</p></note>
		</body>
		</text>
</TEI>
