Title: Measuring Skin Color: Consistency, Comparability, and Meaningfulness of Rating Scale Scores and Handheld Device Readings
As US society continues to diversify and calls for better measurements of racialized appearance increase, survey researchers need guidance about effective strategies for assessing skin color in field research. This study examined the consistency, comparability, and meaningfulness of the two most widely used skin tone rating scales (Massey–Martin and PERLA) and two portable and inexpensive handheld devices for skin color measurement (Nix colorimeter and Labby spectrophotometer). We collected data in person using these four instruments from forty-six college students selected to reflect a wide range of skin tones across four racial-ethnic groups (Asian, Black, Latinx, White). These college students, five study staff, and 459 adults from an online sample also rated forty stock photos, again selected for skin tone diversity. Our results, based on data collected under controlled conditions, demonstrate high consistency across raters and readings. The Massey–Martin and PERLA scale scores were highly linearly related to each other, although PERLA better differentiated among people with the lightest skin tones. The Nix and Labby darkness-to-lightness (L*) readings were likewise linearly related to each other and to the Massey–Martin and PERLA scores, in addition to showing expected variation within and between race-ethnicities. In addition, darker Massey–Martin and PERLA ratings correlated with online raters’ expectations that a photographed person experienced greater discrimination. In contrast, the redness (a*) and yellowness (b*) undertones were highest in the mid-range of the rating scale scores and demonstrated greater overlap across race-ethnicities. Overall, each instrument showed sufficient consistency, comparability, and meaningfulness for use in field surveys when implemented soundly (e.g., not requiring memorization). However, PERLA might be preferred to Massey–Martin in studies representing individuals with the lightest skin tones, and handheld devices may be preferred to rating scales to reduce measurement error when studies could gather only a single rating.
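The comparability results above rest on simple pairwise associations between instruments. As a minimal sketch, assuming a hypothetical per-participant file with averaged rater scores and device readings (all column names invented here, not the authors' data or code), the core check could look like:

```python
# Hypothetical sketch: pairwise agreement between rating-scale scores and
# handheld-device L* readings. Column names are illustrative only.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("skin_tone_measures.csv")  # one row per participant (assumed)

pairs = [("massey_mean", "perla_mean"),   # scale vs. scale
         ("massey_mean", "nix_L"),        # scale vs. colorimeter
         ("perla_mean", "labby_L"),       # scale vs. spectrophotometer
         ("nix_L", "labby_L")]            # device vs. device

for a, b in pairs:
    sub = df[[a, b]].dropna()             # keep rows with both measures present
    r, p = pearsonr(sub[a], sub[b])
    print(f"{a} vs {b}: r = {r:.2f} (p = {p:.3f})")
```

Because higher L* values indicate lighter skin while higher Massey–Martin and PERLA scores indicate darker skin, strong comparability would appear as large negative correlations in the scale-versus-device pairs.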
Award ID(s): 1921526
NSF-PAR ID: 10352585
Author(s) / Creator(s):
Date Published:
Journal Name: Journal of Survey Statistics and Methodology
Volume: 10
Issue: 2
ISSN: 2325-0992
Page Range / eLocation ID: 337-364
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Ratings are present in many areas of assessment including peer review of research proposals and journal articles, teacher observations, university admissions and selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error, the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market. 
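The paper's model-based IRR estimator accounts for measurement error, the hierarchical data structure, and covariate moderation, which plain agreement indices do not. As a point of reference only, here is a minimal sketch of the classical two-way random-effects baseline it generalizes, the Shrout–Fleiss ICC(2,1), computed on a hypothetical complete applicants-by-raters score matrix (not the authors' model or data):

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores : (n_targets, k_raters) array with every rater scoring every target.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-applicant means
    col_means = x.mean(axis=0)   # per-rater means

    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 5 applicants each scored by 3 raters
print(icc_2_1([[4, 4, 5], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]]))
```

A model-based version would replace these ANOVA mean squares with variance components estimated from a hierarchical model, which is what allows reliability to be compared across groups such as internal versus external applicants.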
  2. This evidence-based practices paper discusses the method employed in validating the use of a project-modified version of the PROCESS tool (Grigg, Van Dyken, Benson, & Morkos, 2013) for measuring student problem-solving skills. The PROCESS tool allows raters to score students’ ability in the domains of Problem definition, Representing the problem, Organizing information, Calculations, Evaluating the solution, Solution communication, and Self-assessment. Specifically, this research compares student performance on solving traditional textbook problems with novel, student-generated learning activities (i.e., reverse-engineering videos in order to then create their own homework problem and solution). The use of student-generated learning activities to assess student problem-solving skills has theoretical underpinning in Felder’s (1987) work on “creating creative engineers,” as well as the need to develop students’ abilities to transfer learning and solve problems in a variety of real-world settings. In this study, four raters used the PROCESS tool to score the performance of 70 students randomly selected from two undergraduate chemical engineering cohorts at two Midwest universities. Students from both cohorts solved 12 traditional textbook-style problems, and students from the second cohort solved an additional nine student-generated video problems. Any large-scale assessment where multiple raters use a rating tool requires the investigation of several aspects of validity. The many-facets Rasch measurement model (MFRM; Linacre, 1989) has the psychometric properties to determine whether any characteristics other than “student problem-solving skills” influence the scores assigned, such as rater bias, problem difficulty, or student demographics. Before implementing the full rating plan, MFRM was used to examine how raters interacted with the six items on the modified PROCESS tool to score a random selection of 20 students’ performance in solving one problem. An external evaluator led “inter-rater reliability” meetings where raters deliberated the rationale for their ratings, and differences were resolved by recourse to Pretz et al.’s (2003) problem-solving cycle, which informed the development of the PROCESS tool. To test the new understandings of the PROCESS tool, raters were assigned to score one new problem from a different randomly selected group of six students. Those results were then analyzed in the same manner as before. This iterative process resulted in substantial increases in reliability, which can be attributed to increased confidence that raters were operating with common definitions of the items on the PROCESS tool and rating with consistent and comparable severity. This presentation will include examples of the student-generated problems and a discussion of common discrepancies and solutions to the raters’ initial use of the PROCESS tool. Findings, as well as the adapted PROCESS tool used in this study, can be useful to engineering educators and engineering education researchers.
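For readers unfamiliar with MFRM, the standard Linacre-style formulation (a generic statement of the model, not a formula reproduced from this paper) expresses the log-odds of a response in adjacent rating categories as an additive combination of the facets named above:

$$\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k,$$

where $B_n$ is student $n$'s problem-solving ability, $D_i$ the difficulty of PROCESS item $i$, $C_j$ the severity of rater $j$, and $F_k$ the threshold for moving from score category $k-1$ to category $k$. Fitting this model is what lets the analysis separate rater severity and item difficulty from the student ability the ratings are meant to capture.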
  3. In documenting an undescribed language, tone can pose a significant challenge. In practically no other aspect of the phonology can such a small set of categories show such an overlapping range of pronunciation, especially in level-tone languages where f0 slope offers fewer clues to category. This paper demonstrates the unexpected tool offered by musical surrogate languages in the documentation of these tone systems. It draws on the case study of the Sambla balafon, a resonator xylophone played by many ethnicities in Burkina Faso and neighboring West African countries. The language of the Sambla people, Seenku (Northwestern Mande, Samogo), has a highly complex tonal system, whose four contrastive levels and multiple contour tones are encoded musically in the notes of the balafon, allowing musicians to communicate with each other and with spectators without ever opening their mouths. I show how the balafon data have shed light on a number of tonal contrasts and phenomena and raised questions about levels of the grammar and their mental representations. 
  4. Abstract

    Background

    We used an opportunity gap framework to analyze the pathways through which students enter into and depart from science, technology, engineering, and mathematics (STEM) degree programs at an R1 higher education institution and to better understand the demographic disparities in STEM degree attainment.

    Results

    We found disparities in 6-year STEM graduation rates on the basis of gender, race/ethnicity, and parental education level. Using mediation analysis, we showed that the gender disparity in STEM degree attainment was explained by disparities in aspiration: a gender disparity in students’ intent to pursue STEM at the beginning of college; women were less likely to graduate with STEM degrees because they were less likely to intend to pursue STEM degrees. However, disparities in STEM degree attainment across race/ethnicities and parental education levels were largely explained by disparities in attrition: persons excluded because of their ethnicity or race (PEERs) and first-generation students were less likely to graduate with STEM degrees due to fewer academic opportunities provided prior to college (estimated using college entrance exam scores) and more academic challenges during college, as captured by first-year GPAs.

    Conclusions

    Our results reinforce the idea that patterns of departure from STEM pathways differ among marginalized groups. To promote and retain students in STEM, it is critical that we understand these differing patterns and consider structural efforts to support students at different stages in their education.
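The mediation logic in this abstract (a graduation disparity explained through a disparity in initial intent) can be illustrated with a product-of-coefficients sketch. This is a minimal linear-probability simplification with invented variable names, not the study's actual models:

```python
# Illustrative mediation decomposition: how much of a gender gap in STEM
# graduation flows through intent to pursue STEM at college entry.
# All column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("stem_pathways.csv")  # assumed columns: grad_stem, woman, intent_stem

m_model = smf.ols("intent_stem ~ woman", data=df).fit()              # path a
y_model = smf.ols("grad_stem ~ intent_stem + woman", data=df).fit()  # paths b and c'

a = m_model.params["woman"]
b = y_model.params["intent_stem"]
indirect = a * b                        # disparity transmitted through intent
direct = y_model.params["woman"]        # disparity remaining net of intent
print(f"indirect (via intent): {indirect:.3f}, direct: {direct:.3f}")
```

With a binary graduation outcome, a full analysis would typically use models suited to binary outcomes and formal mediation inference; the linear version here only shows the structure of the decomposition.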

     
  5. Abstract

    Images depicting dark skin tones are significantly underrepresented in the educational materials used to teach primary care physicians and dermatologists to recognize skin diseases. This could contribute to disparities in skin disease diagnosis across different racial groups. Previously, domain experts have manually assessed textbooks to estimate the diversity in skin images. Manual assessment does not scale to many educational materials and introduces human errors. To automate this process, we present the Skin Tone Analysis for Representation in EDucational materials (STAR-ED) framework, which assesses skin tone representation in medical education materials using machine learning. Given a document (e.g., a textbook in PDF format), STAR-ED applies content parsing to extract text, images, and table entities in a structured format. Next, it identifies images containing skin, segments the skin-containing portions of those images, and estimates the skin tone using machine learning. STAR-ED was developed using the Fitzpatrick17k dataset. We then externally tested STAR-ED on four commonly used medical textbooks. Results show strong performance in detecting skin images (0.96 ± 0.02 AUROC and 0.90 ± 0.06 F1 score) and classifying skin tones (0.87 ± 0.01 AUROC and 0.91 ± 0.00 F1 score). STAR-ED quantifies the imbalanced representation of skin tones in four medical textbooks: images of brown and black skin tones (Fitzpatrick V–VI) constitute only 10.5% of all skin images. We envision this technology as a tool for medical educators, publishers, and practitioners to assess skin tone diversity in their educational materials.
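The detection and classification figures quoted above are standard threshold-free (AUROC) and thresholded (F1) metrics. A minimal sketch of how such numbers are computed on held-out predictions, using toy placeholder arrays rather than anything from STAR-ED:

```python
# Toy evaluation sketch: AUROC and F1 for a skin-image detector.
# The labels and scores below are placeholders, not STAR-ED output.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])                   # 1 = image contains skin
scores = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.95, 0.1, 0.3])  # model confidence

print("AUROC:", roc_auc_score(y_true, scores))
print("F1:   ", f1_score(y_true, (scores >= 0.5).astype(int)))
```

A multi-class analogue of the same computation, applied per tone class and averaged, would correspond to the tone-classification figures reported in the abstract.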

     