  1. Abstract

    Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity from an individual’s geolocation and surname. Here we demonstrate that statistical dependence between surname and geolocation within racial/ethnic categories in the US biases BISG predictions for minority subpopulations, and we introduce a raking-based improvement. Our method augments the data used by BISG (the distributions of race by geolocation and race by surname) with the distribution of surname by geolocation obtained from state voter files. We validate our algorithm on state voter registration lists that contain self-identified race/ethnicity.
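
    For orientation, the standard BISG update can be sketched as follows. This is a minimal illustration of the commonly stated formulation, P(race | surname, geolocation) ∝ P(race | surname) · P(geolocation | race), which assumes surname and geolocation are independent within each racial/ethnic category. It is not the raking-based method introduced in this paper. The function name `bisg_posterior` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geolocation evidence under conditional independence.

    Both arguments are 1-D arrays indexed by the same race/ethnicity categories.
    Returns the normalized posterior P(race | surname, geolocation).
    """
    prior = np.asarray(p_race_given_surname, dtype=float)
    likelihood = np.asarray(p_geo_given_race, dtype=float)
    if prior.shape != likelihood.shape:
        raise ValueError("inputs must cover the same categories")

    # Elementwise product is valid only if surname and geolocation are
    # independent given race, which is the assumption the abstract questions.
    unnormalized = prior * likelihood
    total = unnormalized.sum()
    if total == 0:
        raise ValueError("no category has nonzero probability")
    return unnormalized / total


# Hypothetical inputs for three categories (not real census or voter data).
p_race_given_surname = np.array([0.6, 0.3, 0.1])
p_geo_given_race = np.array([0.001, 0.004, 0.002])

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
print(posterior)
```

    In this toy case the geolocation evidence shifts the most probable category from the first to the second. When surname and geolocation are dependent within a category, the elementwise product above misstates the joint evidence, which is the source of bias the abstract describes.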

