Title: Measuring Skin Color: Consistency, Comparability, and Meaningfulness of Rating Scale Scores and Handheld Device Readings
As US society continues to diversify and calls for better measurements of racialized appearance increase, survey researchers need guidance about effective strategies for assessing skin color in field research. This study examined the consistency, comparability, and meaningfulness of the two most widely used skin tone rating scales (Massey–Martin and PERLA) and two portable and inexpensive handheld devices for skin color measurement (Nix colorimeter and Labby spectrophotometer). We collected data in person using these four instruments from forty-six college students selected to reflect a wide range of skin tones across four racial-ethnic groups (Asian, Black, Latinx, White). These college students, five study staff, and 459 adults from an online sample also rated forty stock photos, again selected for skin tone diversity. Our results, based on data collected under controlled conditions, demonstrate high consistency across raters and readings. The Massey–Martin and PERLA scale scores were highly linearly related to each other, although PERLA better differentiated among people with the lightest skin tones. The Nix and Labby darkness-to-lightness (L*) readings were likewise linearly related to each other and to the Massey–Martin and PERLA scores, in addition to showing expected variation within and between race-ethnicities. In addition, darker Massey–Martin and PERLA ratings correlated with online raters’ expectations that a photographed person experienced greater discrimination. In contrast, the redness (a*) and yellowness (b*) undertones were highest in the mid-range of the rating scale scores and demonstrated greater overlap across race-ethnicities. Overall, each instrument showed sufficient consistency, comparability, and meaningfulness for use in field surveys when implemented soundly (e.g., not requiring memorization). However, PERLA might be preferred to Massey–Martin in studies representing individuals with the lightest skin tones, and handheld devices may be preferred to rating scales to reduce measurement error when studies can gather only a single rating.
Award ID(s):
1921526
PAR ID:
10352585
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Survey Statistics and Methodology
Volume:
10
Issue:
2
ISSN:
2325-0992
Page Range / eLocation ID:
337-364
Sponsoring Org:
National Science Foundation
More Like this
  1. Surveys that assess skin color support evidence building about colorism and related systemic inequalities that affect health and wellbeing. Methodologists have increasing choices for such assessments, including a growing array of digital images for rating scales and increasingly cost-effective handheld mechanical devices based on color science. Guidance is needed for choosing among these growing options. We used data from a diverse sample of 102 college students to produce new empirical evidence and practical guidance about various options. We compared three handheld devices that ranged in price, considering variations in their reliabilities and how their results differed by body location and device settings. We also offered evidence regarding how reliably interviewers and participants could choose from a large array of color swatches offering variation in skin undertone (redness, yellowness) in addition to skin shade (lightness-to-darkness). Overall, the results were promising, demonstrating that modern handheld devices and rating scales could be feasibly and reliably used. For instance, results demonstrated that just one or two device readings were needed at any given location, and that the device readings and rating scale scores similarly captured the relative darkness of skin. In other cases, recommendations were less certain. For instance, skin undertones of redness and yellowness were more sensitive to device choices and body locations. We encourage future studies that pursue why such variability exists and for which substantive questions it matters most.
  2. Ratings are present in many areas of assessment including peer review of research proposals and journal articles, teacher observations, university admissions and selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error, the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market. 
  3. To investigate the well-observed racial disparities in computer vision systems that analyze images of humans, researchers have turned to skin tone as a more objective annotation than race metadata for fairness performance evaluations. However, the current state of skin tone annotation procedures is highly varied. For instance, researchers use a range of untested scales and skin tone categories, have unclear annotation procedures, and provide inadequate analyses of uncertainty. In addition, little attention is paid to the positionality of the humans involved in the annotation process—both designers and annotators alike—and the historical and sociological context of skin tone in the United States. Our work is the first to investigate the skin tone annotation process as a sociotechnical project. We surveyed recent skin tone annotation procedures and conducted annotation experiments to examine how subjective understandings of skin tone are embedded in skin tone annotation procedures. Our systematic literature review revealed the uninterrogated association between skin tone and race and the limited effort to analyze annotator uncertainty in current procedures for skin tone annotation in computer vision evaluation. Our experiments demonstrated that design decisions in the annotation procedure such as the order in which the skin tone scale is presented or additional context in the image (i.e., presence of a face) significantly affected the resulting inter-annotator agreement and individual uncertainty of skin tone annotations. We call for greater reflexivity in the design, analysis, and documentation of procedures for evaluation using skin tone. 
  4. In documenting an undescribed language, tone can pose a significant challenge. In practically no other aspect of the phonology can such a small set of categories show such an overlapping range of pronunciation, especially in level-tone languages where f0 slope offers fewer clues to category. This paper demonstrates the unexpected tool offered by musical surrogate languages in the documentation of these tone systems. It draws on the case study of the Sambla balafon, a resonator xylophone played by many ethnicities in Burkina Faso and neighboring West African countries. The language of the Sambla people, Seenku (Northwestern Mande, Samogo), has a highly complex tonal system, whose four contrastive levels and multiple contour tones are encoded musically in the notes of the balafon, allowing musicians to communicate with each other and with spectators without ever opening their mouths. I show how the balafon data have shed light on a number of tonal contrasts and phenomena and raised questions about levels of the grammar and their mental representations. 
  5. Machine learning (ML) based skin cancer detection tools are an example of a transformative medical technology that could potentially democratize early detection of skin cancer for everyone. However, because they depend on datasets for training, ML based skin cancer detection tools suffer from a systemic racial bias. Racial and ethnic communities that are not well represented within the training datasets will not be able to use these tools, leading to health disparities being amplified. Based on empirical observations, we posit that skin cancer training data is biased because its datasets mostly represent communities with lighter skin tones, despite skin cancer being far more lethal for people of color. In this paper we use domain adaptation techniques, employing CycleGANs, to mitigate racial biases within state-of-the-art machine learning based skin cancer detection tools by adapting minority images to appear as the majority. Using our domain adaptation techniques to augment our minority datasets, we are able to improve the accuracy, precision, recall, and F1 score of typical image classification machine learning models for skin cancer classification from the biased 50% accuracy rate to a 79% accuracy rate when testing on minority skin tone images. We evaluate and demonstrate a proof-of-concept smartphone application.