Title: A calibrated BISG for inferring race from surname and geolocation
Abstract: Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity using an individual’s geolocation and surname. Here we demonstrate that statistical dependence of surname and geolocation within racial/ethnic categories in the US results in biases for minority subpopulations, and we introduce a raking-based improvement. Our method augments the data used by BISG (distributions of race by geolocation and race by surname) with the distribution of surname by geolocation obtained from state voter files. We validate our algorithm on state voter registration lists that contain self-identified race/ethnicity.
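As a rough illustration of the two ideas in the abstract, the sketch below first computes a standard BISG posterior, which assumes surname and geolocation are independent given race, and then rakes (iterative proportional fitting) a surname-by-geolocation-by-race table so that it also matches a surname-by-geolocation margin. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the toy table, and the convergence settings are all hypothetical.

```python
import numpy as np

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """P(race | surname, geo), assuming surname and geo are independent given race."""
    unnorm = p_race_given_surname * p_geo_given_race
    return unnorm / unnorm.sum()

def independence_seed(p_sr, p_gr):
    """Joint table seed[s, g, r] implied by the BISG independence assumption.

    p_sr[s, r] and p_gr[g, r] are joint probabilities of (surname, race)
    and (geo, race); both must share the same race marginal.
    """
    p_r = p_sr.sum(axis=0)
    return p_sr[:, None, :] * p_gr[None, :, :] / p_r[None, None, :]

def _ratio(target, current):
    # Elementwise target / current, treating empty cells as zero.
    return np.divide(target, current, out=np.zeros_like(current), where=current > 0)

def rake_three_way(seed, m_sg, m_sr, m_gr, n_iter=500, tol=1e-12):
    """Rescale seed[s, g, r] until its three two-way margins match the targets."""
    t = seed.astype(float).copy()
    for _ in range(n_iter):
        prev = t.copy()
        t *= _ratio(m_sg, t.sum(axis=2))[:, :, None]  # surname x geo
        t *= _ratio(m_sr, t.sum(axis=1))[:, None, :]  # surname x race
        t *= _ratio(m_gr, t.sum(axis=0))[None, :, :]  # geo x race
        if np.abs(t - prev).max() < tol:
            break
    return t

# Hypothetical toy population: 3 surnames, 2 geolocations, 2 race categories.
rng = np.random.default_rng(0)
truth = rng.random((3, 2, 2)) + 0.1
truth /= truth.sum()

m_sg, m_sr, m_gr = truth.sum(axis=2), truth.sum(axis=1), truth.sum(axis=0)
seed = independence_seed(m_sr, m_gr)
raked = rake_three_way(seed, m_sg, m_sr, m_gr)

# Race probabilities for surname 0 in geolocation 1, before and after raking.
before = seed[0, 1] / seed[0, 1].sum()
after = raked[0, 1] / raked[0, 1].sum()
print(before, after)
```

The seed table already reproduces the race-by-surname and race-by-geolocation margins, so any change made by raking comes from the added surname-by-geolocation margin, which is the extra data source the abstract describes.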
Award ID(s):
2311354
PAR ID:
10567677
Publisher / Repository:
Oxford University Press
Journal Name:
Journal of the Royal Statistical Society Series A: Statistics in Society
ISSN:
0964-1998
Sponsoring Org:
National Science Foundation
More Like this
  1. Assessing the fairness of a decision making system with respect to a protected class, such as gender or race, is challenging when class membership labels are unavailable. Probabilistic models for predicting the protected class based on observable proxies, such as surname and geolocation for race, are sometimes used to impute these missing labels for compliance assessments. Empirically, these methods are observed to exaggerate disparities, but the reason why is unknown. In this paper, we decompose the biases in estimating outcome disparity via threshold-based imputation into multiple interpretable bias sources, allowing us to explain when over- or underestimation occurs. We also propose an alternative weighted estimator that uses soft classification, and show that its bias arises simply from the conditional covariance of the outcome with the true class membership. Finally, we illustrate our results with numerical simulations and a public dataset of mortgage applications, using geolocation as a proxy for race. We confirm that the bias of threshold-based imputation is generally upward, but its magnitude varies strongly with the threshold chosen. Our new weighted estimator tends to have a negative bias that is much simpler to analyze and reason about. 
  2. Levy, Morris (Ed.)
    Abstract: Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions (e.g., based on name and geography) and then, often, to discretize them by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial undercount of Black voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach, along with a tractable data-driven threshold heuristic, that can eliminate this bias with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias and show that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
  3. When registering to vote, Hispanic voters can register only as “Hispanic” in the “Race/Ethnicity” category, which makes it difficult to analyze voting trends within the Hispanic community. Motivated by the recent idea that not all Hispanic groups vote the same way, the goal is to build a model that can identify a voter’s Hispanic group from the information provided in the public Florida voter file. The model uses name and zip code data for all voters in Palm Beach, Florida. This paper explores the model, its findings, and its limitations. Classification confidence in Palm Beach is low, leaving a final sample of confidently classified active Hispanic voters that is 15% of the original sample. Further analysis of other counties is needed to gauge how much this limitation affects the rest of the state.
  4. Existing research demonstrates gender- and race/ethnicity-based inequities in college outcomes. Separately, recent research suggests a relationship between time poverty and college outcomes for student parents and online students. However, to date, no studies have empirically explored whether differential access to time as a resource for college may explain differential outcomes by gender or race/ethnicity. To address this, this study explored the relationship between time poverty, gender or race/ethnicity, and college outcomes at a large urban public university with two- and four-year campuses. Time poverty explained a significant proportion of differential outcomes (retention and credit accumulation) by gender and race/ethnicity. More time-poor groups also dedicated a larger proportion of their (relatively limited) discretionary time to their education, suggesting that inequitable distributions of time may contribute to other negative outcomes (e.g., reduced time for sleep, exercise, healthcare). This suggests that time poverty is a significant but understudied equity issue in higher education.
  5. Abstract. Background: We used an opportunity gap framework to analyze the pathways through which students enter into and depart from science, technology, engineering, and mathematics (STEM) degrees in an R1 higher education institution and to better understand the demographic disparities in STEM degree attainment. Results: We found disparities in 6-year STEM graduation rates on the basis of gender, race/ethnicity, and parental education level. Using mediation analysis, we showed that the gender disparity in STEM degree attainment was explained by disparities in aspiration: a gender disparity in students’ intent to pursue STEM at the beginning of college; women were less likely to graduate with STEM degrees because they were less likely to intend to pursue STEM degrees. However, disparities in STEM degree attainment across race/ethnicities and parental education level were largely explained by disparities in attrition: persons excluded because of their ethnicity or race (PEERs) and first generation students were less likely to graduate with STEM degrees due to fewer academic opportunities provided prior to college (estimated using college entrance exams scores) and more academic challenges during college as captured by first year GPAs. Conclusions: Our results reinforce the idea that patterns of departure from STEM pathways differ among marginalized groups. To promote and retain students in STEM, it is critical that we understand these differing patterns and consider structural efforts to support students at different stages in their education.
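The discretization bias described in item 2 of the list above can be seen in a small example. The sketch below compares a group's estimated share under argmax labeling with the share obtained by averaging the continuous probabilities. The probabilities are hypothetical toy values and the function name is an assumption. It illustrates the mechanism only, and is not the joint optimization approach or threshold heuristic that the item proposes.

```python
import numpy as np

def group_share_estimates(probs, group):
    """Return (hard, soft) estimates of the population share of `group`.

    probs[i, k] is the predicted probability that person i belongs to class k.
    hard: fraction of people whose argmax label equals `group`.
    soft: mean predicted probability of `group`.
    """
    hard = float((probs.argmax(axis=1) == group).mean())
    soft = float(probs[:, group].mean())
    return hard, soft

# Hypothetical predictions for five people and two classes.
# Four people have a 0.4 chance of belonging to class 1; one has a 0.9 chance.
probs = np.array([
    [0.6, 0.4],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.1, 0.9],
])

hard, soft = group_share_estimates(probs, group=1)
print(f"argmax share: {hard:.2f}, probability-weighted share: {soft:.2f}")
```

If these toy probabilities are taken to be calibrated, the probability-weighted share reflects the expected size of class 1, while argmax labeling assigns only one of the five people to it. Even well-calibrated scores can therefore undercount a group once they are discretized.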