Previous studies have shown that artificial intelligence can be used to classify instruction-related activities in classroom videos. The automated classification of human activities, however, is vulnerable to biases in which the model performs substantially better or worse for different people groups. Although algorithmic bias has been highlighted as an important area for research in artificial intelligence in education, there have been few studies that empirically investigate potential bias in instruction-related activity recognition systems. In this paper, we report on an investigation of potential racial and skin tone biases in the automated classification of teachers’ activities in classroom videos. We examine whether a neural network’s classification of teachers’ activities differs with respect to teacher race and skin tone and whether differently balanced training datasets affect the performance of the neural network. Our results indicate that, under ordinary classroom lighting conditions, the neural network performs equally well regardless of teacher race or skin tone. Furthermore, our results suggest the balance of the training dataset with respect to teacher skin tone and race has a small (but not necessarily positive) effect on the neural network’s performance. Our study, however, also suggests the importance of quality lighting for accurate classification of teacher-related instructional activities for teachers of color. We conclude with a discussion of our mixed findings, the limitations of our study, and potential directions for future research.
Skin Deep: Investigating Subjectivity in Skin Tone Annotations for Computer Vision Benchmark Datasets
To investigate the well-observed racial disparities in computer vision systems that analyze images of humans, researchers have turned to skin tone as a more objective annotation than race metadata for fairness performance evaluations. However, the current state of skin tone annotation procedures is highly varied. For instance, researchers use a range of untested scales and skin tone categories, have unclear annotation procedures, and provide inadequate analyses of uncertainty. In addition, little attention is paid to the positionality of the humans involved in the annotation process—both designers and annotators alike—and the historical and sociological context of skin tone in the United States. Our work is the first to investigate the skin tone annotation process as a sociotechnical project. We surveyed recent skin tone annotation procedures and conducted annotation experiments to examine how subjective understandings of skin tone are embedded in skin tone annotation procedures. Our systematic literature review revealed the uninterrogated association between skin tone and race and the limited effort to analyze annotator uncertainty in current procedures for skin tone annotation in computer vision evaluation. Our experiments demonstrated that design decisions in the annotation procedure such as the order in which the skin tone scale is presented or additional context in the image (i.e., presence of a face) significantly affected the resulting inter-annotator agreement and individual uncertainty of skin tone annotations. We call for greater reflexivity in the design, analysis, and documentation of procedures for evaluation using skin tone.
- Award ID(s): 2120497
- PAR ID: 10426294
- Date Published:
- Journal Name: ACM Conference on Fairness, Accountability, and Transparency
- Page Range / eLocation ID: 1757 to 1771
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
As US society continues to diversify and calls for better measurements of racialized appearance increase, survey researchers need guidance about effective strategies for assessing skin color in field research. This study examined the consistency, comparability, and meaningfulness of the two most widely used skin tone rating scales (Massey–Martin and PERLA) and two portable and inexpensive handheld devices for skin color measurement (Nix colorimeter and Labby spectrophotometer). We collected data in person using these four instruments from forty-six college students selected to reflect a wide range of skin tones across four racial-ethnic groups (Asian, Black, Latinx, White). These college students, five study staff, and 459 adults from an online sample also rated forty stock photos, again selected for skin tone diversity. Our results, based on data collected under controlled conditions, demonstrate high consistency across raters and readings. The Massey–Martin and PERLA scale scores were highly linearly related to each other, although PERLA better differentiated among people with the lightest skin tones. The Nix and Labby darkness-to-lightness (L*) readings were likewise linearly related to each other and to the Massey–Martin and PERLA scores, in addition to showing expected variation within and between race-ethnicities. In addition, darker Massey–Martin and PERLA ratings correlated with online raters’ expectations that a photographed person experienced greater discrimination. In contrast, the redness (a*) and yellowness (b*) undertones were highest in the mid-range of the rating scale scores and demonstrated greater overlap across race-ethnicities. Overall, each instrument showed sufficient consistency, comparability, and meaningfulness for use in field surveys when implemented soundly (e.g., not requiring memorization). However, PERLA might be preferred to Massey–Martin in studies representing individuals with the lightest skin tones, and handheld devices may be preferred to rating scales to reduce measurement error when studies could gather only a single rating.
-
Discourse parsing has proven to be useful for a number of NLP tasks that require complex reasoning. However, over a decade since the advent of the Penn Discourse Treebank, predicting implicit discourse relations in text remains challenging. There are several possible reasons for this, and we hypothesize that models should be exposed to more context as it plays an important role in accurate human annotation; meanwhile adding uncertainty measures can improve model accuracy and calibration. To thoroughly investigate this phenomenon, we perform a series of experiments to determine 1) the effects of context on human judgments, and 2) the effect of quantifying uncertainty with annotator confidence ratings on model accuracy and calibration (which we measure using the Brier score (Brier et al, 1950)). We find that including annotator accuracy and confidence improves model accuracy, and incorporating confidence in the model’s temperature function can lead to models with significantly better-calibrated confidence measures. We also find some insightful qualitative results regarding human and model behavior on these datasets.
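For readers unfamiliar with the Brier score mentioned in this abstract, the following is a minimal sketch of the standard multi-class form: the mean, over examples, of the squared difference between the predicted probability vector and the one-hot true label. The function name `brier_score` and the input layout are illustrative assumptions, not taken from the paper, and the authors' exact variant may differ.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multi-class Brier score (hypothetical helper, not the paper's code).

    probs[i][k] is the predicted probability of class k for example i;
    labels[i] is the index of the true class. Lower is better; 0.0 means
    every prediction put probability 1 on the correct class.
    """
    if len(probs) == 0 or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and of equal length")

    total = 0.0
    for p, y in zip(probs, labels):
        # Squared error against the one-hot encoding of the true class.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)
```

Under this definition a confident wrong prediction is penalized more heavily than an uncertain one, which is why the score is commonly used as a calibration-sensitive measure.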
-
Face recognition (FR) systems are fast becoming ubiquitous. However, differential performance among certain demographics was identified in several widely used FR models. The skin tone of the subject is an important factor in addressing the differential performance. Previous work has used modeling methods to propose skin tone measures of subjects across different illuminations or utilized subjective labels of skin color and demographic information. However, such models heavily rely on consistent background and lighting for calibration, or utilize labeled datasets, which are time-consuming to generate or are unavailable. In this work, we have developed a novel and data-driven skin color measure capable of accurately representing subjects' skin tone from a single image, without requiring a consistent background or illumination. Our measure leverages the dichromatic reflection model in RGB space to decompose skin patches into diffuse and specular bases.
-
The evolution of skin pigmentation has been shaped by numerous biological and cultural shifts throughout human history. Vitamin D is considered a driver of depigmentation evolution in humans, given the deleterious health effects associated with vitamin D deficiency, which is often shaped by cultural factors. New advancements in genomics and epigenomics have opened the door to a deeper exploration of skin pigmentation evolution in both contemporary and ancient populations. Data from ancient Europeans has offered great context to the spread of depigmentation alleles via the evaluation of migration events and cultural shifts that occurred during the Neolithic. However, novel insights can further be gained via the inclusion of diverse ancient and contemporary populations. Here we present how potential biases and limitations in skin pigmentation research can be overcome with the integration of interdisciplinary data that includes both cultural and biological elements, which have shaped the evolutionary history of skin pigmentation in humans.

