skip to main content

This content will become publicly available on April 30, 2024

Title: Who Broke Amazon Mechanical Turk?: An Analysis of Crowdsourcing Data Quality over Time
We present the results of a survey fielded in June of 2022 as a lens to examine recent data reliability issues on Amazon Mechanical Turk. We contrast bad data from this survey with bad data from the same survey fielded among US workers in October 2013, April 2018, and February 2019. Application of an established data cleaning scheme reveals that unusable data has risen from a little over 2% in 2013 to almost 90% in 2022. Through symptomatic diagnosis, we attribute the data reliability drop not to an increase in bad faith work, but rather to a continuum of English proficiency levels. A qualitative analysis of workers’ responses to open-ended questions allows us to distinguish between low fluency workers, ultra-low fluency workers, satisficers, and bad faith workers. We go on to show the effects of the new low fluency work on Likert scale data and on the study’s qualitative results. Attention checks are shown to be much less effective than they once were at identifying survey responses that should be discarded.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
WebSci '23: Proceedings of the 15th ACM Web Science Conference 2023
Page Range / eLocation ID:
335 to 345
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The COVID-19 pandemic has dramatically altered family life in the United States. Over the long duration of the pandemic, parents had to adapt to shifting work conditions, virtual schooling, the closure of daycare facilities, and the stress of not only managing households without domestic and care supports but also worrying that family members may contract the novel coronavirus. Reports early in the pandemic suggest that these burdens have fallen disproportionately on mothers, creating concerns about the long-term implications of the pandemic for gender inequality and mothers’ well-being. Nevertheless, less is known about how parents’ engagement in domestic labor and paid work has changed throughout the pandemic, what factors may be driving these changes, and what the long-term consequences of the pandemic may be for the gendered division of labor and gender inequality more generally.

    The Study on U.S. Parents’ Divisions of Labor During COVID-19 (SPDLC) collects longitudinal survey data from partnered U.S. parents that can be used to assess changes in parents’ divisions of domestic labor, divisions of paid labor, and well-being throughout and after the COVID-19 pandemic. The goal of SPDLC is to understand both the short- and long-term impacts of the pandemic for the gendered division of labor, work-family issues, and broader patterns of gender inequality.

    Survey data for this study is collected using Prolifc (, an opt-in online platform designed to facilitate scientific research. The sample is comprised U.S. adults who were residing with a romantic partner and at least one biological child (at the time of entry into the study). In each survey, parents answer questions about both themselves and their partners. Wave 1 of SPDLC was conducted in April 2020, and parents who participated in Wave 1 were asked about their division of labor both prior to (i.e., early March 2020) and one month after the pandemic began. Wave 2 of SPDLC was collected in November 2020. Parents who participated in Wave 1 were invited to participate again in Wave 2, and a new cohort of parents was also recruited to participate in the Wave 2 survey. Wave 3 of SPDLC was collected in October 2021. Parents who participated in either of the first two waves were invited to participate again in Wave 3, and another new cohort of parents was also recruited to participate in the Wave 3 survey. This research design (follow-up survey of panelists and new cross-section of parents at each wave) will continue through 2024, culminating in six waves of data spanning the period from March 2020 through October 2024. An estimated total of approximately 6,500 parents will be surveyed at least once throughout the duration of the study.

    SPDLC data will be released to the public two years after data is collected; Waves 1 and 2 are currently publicly available. Wave 3 will be publicly available in October 2023, with subsequent waves becoming available yearly. Data will be available to download in both SPSS (.sav) and Stata (.dta) formats, and the following data files will be available: (1) a data file for each individual wave, which contains responses from all participants in that wave of data collection, (2) a longitudinal panel data file, which contains longitudinal follow-up data from all available waves, and (3) a repeated cross-section data file, which contains the repeated cross-section data (from new respondents at each wave) from all available waves. Codebooks for each survey wave and a detailed user guide describing the data are also available. Response Rates: Of the 1,157 parents who participated in Wave 1, 828 (72%) also participated in the Wave 2 study. Presence of Common Scales: The following established scales are included in the survey:
    • Self-Efficacy, adapted from Pearlin's mastery scale (Pearlin et al., 1981) and the Rosenberg self-esteem scale (Rosenberg, 2015) and taken from the American Changing Lives Survey
    • Communication with Partner, taken from the Marriage and Relationship Survey (Lichter & Carmalt, 2009)
    • Gender Attitudes, taken from the National Survey of Families and Households (Sweet & Bumpass, 1996)
    • Depressive Symptoms (CES-D-10)
    • Stress, measured using Cohen's Perceived Stress Scale (Cohen, Kamarck, & Mermelstein, 1983)
    Full details about these scales and all other items included in the survey can be found in the user guide and codebook
    The second wave of the SPDLC was fielded in November 2020 in two stages. In the first stage, all parents who participated in W1 of the SPDLC and who continued to reside in the United States were re-contacted and asked to participate in a follow-up survey. The W2 survey was posted on Prolific, and messages were sent via Prolific’s messaging system to all previous participants. Multiple follow-up messages were sent in an attempt to increase response rates to the follow-up survey. Of the 1,157 respondents who completed the W1 survey, 873 at least started the W2 survey. Data quality checks were employed in line with best practices for online surveys (e.g., removing respondents who did not complete most of the survey or who did not pass the attention filters). After data quality checks, 5.2% of respondents were removed from the sample, resulting in a final sample size of 828 parents (a response rate of 72%).

    In the second stage, a new sample of parents was recruited. New parents had to meet the same sampling criteria as in W1 (be at least 18 years old, reside in the United States, reside with a romantic partner, and be a parent living with at least one biological child). Also similar to the W1 procedures, we oversampled men, Black individuals, individuals who did not complete college, and individuals who identified as politically conservative to increase sample diversity. A total of 1,207 parents participated in the W2 survey. Data quality checks led to the removal of 5.7% of the respondents, resulting in a final sample size of new respondents at Wave 2 of 1,138 parents.

    In both stages, participants were informed that the survey would take approximately 20 minutes to complete. All panelists were provided monetary compensation in line with Prolific’s compensation guidelines, which require that all participants earn above minimum wage for their time participating in studies.
    To be included in SPDLC, respondents had to meet the following sampling criteria at the time they enter the study: (a) be at least 18 years old, (b) reside in the United States, (c) reside with a romantic partner (i.e., be married or cohabiting), and (d) be a parent living with at least one biological child. Follow-up respondents must be at least 18 years old and reside in the United States, but may experience changes in relationship and resident parent statuses. Smallest Geographic Unit: U.S. State

    This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit In accordance with this license, all users of these data must give appropriate credit to the authors in any papers, presentations, books, or other works that use the data. A suggested citation to provide attribution for these data is included below:            

    Carlson, Daniel L. and Richard J. Petts. 2022. Study on U.S. Parents’ Divisions of Labor During COVID-19 User Guide: Waves 1-2.  

    To help provide estimates that are more representative of U.S. partnered parents, the SPDLC includes sampling weights. Weights can be included in statistical analyses to make estimates from the SPDLC sample representative of U.S. parents who reside with a romantic partner (married or cohabiting) and a child aged 18 or younger based on age, race/ethnicity, and gender. National estimates for the age, racial/ethnic, and gender profile of U.S. partnered parents were obtained using data from the 2020 Current Population Survey (CPS). Weights were calculated using an iterative raking method, such that the full sample in each data file matches the nationally representative CPS data in regard to the gender, age, and racial/ethnic distributions within the data. This variable is labeled CPSweightW2 in the Wave 2 dataset, and CPSweightLW2 in the longitudinal dataset (which includes Waves 1 and 2). There is not a weight variable included in the W1-W2 repeated cross-section data file.
    more » « less
  2. Introduction and Theoretical Frameworks Our study draws upon several theoretical foundations to investigate and explain the educational experiences of Black students majoring in ME, CpE, and EE: intersectionality, critical race theory, and community cultural wealth theory. Intersectionality explains how gender operates together with race, not independently, to produce multiple, overlapping forms of discrimination and social inequality (Crenshaw, 1989; Collins, 2013). Critical race theory recognizes the unique experiences of marginalized groups and strives to identify the micro- and macro-institutional sources of discrimination and prejudice (Delgado & Stefancic, 2001). Community cultural wealth integrates an asset-based perspective to our analysis of engineering education to assist in the identification of factors that contribute to the success of engineering students (Yosso, 2005). These three theoretical frameworks are buttressed by our use of Racial Identity Theory, which expands understanding about the significance and meaning associated with students’ sense of group membership. Sellers and colleagues (1997) introduced the Multidimensional Model of Racial Identity (MMRI), in which they indicated that racial identity refers to the “significance and meaning that African Americans place on race in defining themselves” (p. 19). The development of this model was based on the reality that individuals vary greatly in the extent to which they attach meaning to being a member of the Black racial group. Sellers et al. (1997) posited that there are four components of racial identity: 1. Racial salience: “the extent to which one’s race is a relevant part of one’s self-concept at a particular moment or in a particular situation” (p. 24). 2. Racial centrality: “the extent to which a person normatively defines himself or herself with regard to race” (p. 25). 3. Racial regard: “a person’s affective or evaluative judgment of his or her race in terms of positive-negative valence” (p. 26). This element consists of public regard and private regard. 4. Racial ideology: “composed of the individual’s beliefs, opinions and attitudes with respect to the way he or she feels that the members of the race should act” (p. 27). The resulting 56-item inventory, the Multidimensional Inventory of Black Identity (MIBI), provides a robust measure of Black identity that can be used across multiple contexts. Research Questions Our 3-year, mixed-method study of Black students in computer (CpE), electrical (EE) and mechanical engineering (ME) aims to identify institutional policies and practices that contribute to the retention and attrition of Black students in electrical, computer, and mechanical engineering. Our four study institutions include historically Black institutions as well as predominantly white institutions, all of which are in the top 15 nationally in the number of Black engineering graduates. We are using a transformative mixed-methods design to answer the following overarching research questions: 1. Why do Black men and women choose and persist in, or leave, EE, CpE, and ME? 2. What are the academic trajectories of Black men and women in EE, CpE, and ME? 3. In what way do these pathways vary by gender or institution? 4. What institutional policies and practices promote greater retention of Black engineering students? Methods This study of Black students in CpE, EE, and ME reports initial results from in-depth interviews at one HBCU and one PWI. We asked students about a variety of topics, including their sense of belonging on campus and in the major, experiences with discrimination, the impact of race on their experiences, and experiences with microaggressions. For this paper, we draw on two methodological approaches that allowed us to move beyond a traditional, linear approach to in-depth interviews, allowing for more diverse experiences and narratives to emerge. First, we used an identity circle to gain a better understanding of the relative importance to the participants of racial identity, as compared to other identities. The identity circle is a series of three concentric circles, surrounding an “inner core” representing one’s “core self.” Participants were asked to place various identities from a provided list that included demographic, family-related, and school-related identities on the identity circle to reflect the relative importance of the different identities to participants’ current engineering education experiences. Second, participants were asked to complete an 8-item survey which measured the “centrality” of racial identity as defined by Sellers et al. (1997). Following Enders’ (2018) reflection on the MMRI and Nigrescence Theory, we chose to use the measure of racial centrality as it is generally more stable across situations and best “describes the place race holds in the hierarchy of identities an individual possesses and answers the question ‘How important is race to me in my life?’” (p. 518). Participants completed the MIBI items at the end of the interview to allow us to learn more about the participants’ identification with their racial group, to avoid biasing their responses to the Identity Circle, and to avoid potentially creating a stereotype threat at the beginning of the interview. This paper focuses on the results of the MIBI survey and the identity circles to investigate whether these measures were correlated. Recognizing that Blackness (race) is not monolithic, we were interested in knowing the extent to which the participants considered their Black identity as central to their engineering education experiences. Combined with discussion about the identity circles, this approach allowed us to learn more about how other elements of identity may shape the participants’ educational experiences and outcomes and revealed possible differences in how participants may enact various points of their identity. Findings For this paper, we focus on the results for five HBCU students and 27 PWI students who completed the MIBI and identity circle. The overall MIBI average for HBCU students was 43 (out of a possible 56) and the overall MIBI scores ranged from 36-51; the overall MIBI average for the PWI students was 40; the overall MIBI scores for the PWI students ranged from 24-51. Twenty-one students placed race in the inner circle, indicating that race was central to their identity. Five placed race on the second, middle circle; three placed race on the third, outer circle. Three students did not place race on their identity circle. For our cross-case qualitative analysis, we will choose cases across the two institutions that represent low, medium and high MIBI scores and different ranges of centrality of race to identity, as expressed in the identity circles. Our final analysis will include descriptive quotes from these in-depth interviews to further elucidate the significance of race to the participants’ identities and engineering education experiences. The results will provide context for our larger study of a total of 60 Black students in engineering at our four study institutions. Theoretically, our study represents a new application of Racial Identity Theory and will provide a unique opportunity to apply the theories of intersectionality, critical race theory, and community cultural wealth theory. Methodologically, our findings provide insights into the utility of combining our two qualitative research tools, the MIBI centrality scale and the identity circle, to better understand the influence of race on the education experiences of Black students in engineering. 
    more » « less
  3. Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels are collected from crowd-workers on an instance usually through the Internet. These labels are then aggregated as a single label to match the ground truth of the instance. Due to its open nature, human workers in crowdsourcing usually come with various levels of knowledge and socio-economic backgrounds. Effectively handling such human factors has been a focus in the study and applications of crowdsourcing. For example, Bi et al studied the impacts of worker's dedication, expertise, judgment, and task difficulty (Bi et al 2014). Qiu et al offered methods for selecting workers based on behavior prediction (Qiu et al 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa 2019). Checco et al studied adversarial attacks on crowdsourcing for quality control (Checco et al 2020). There are many more related works available in literature. In contrast to commonly used binary-valued labels, interval-valued labels (IVLs) have been introduced very recently (Hu et al 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al quantitatively defined worker's reliability in four measures: correctness, confidence, stability, and predictability (Spurling et al 2021). Calculating these measures, except correctness, does not require the ground truth of each instance but only worker’s IVLs. Applying these quantified reliability measures, people have significantly improved the overall quality of crowdsourcing (Spurling et al 2022). However, in real world applications, the reliability of a worker may vary from time to time rather than a constant. It is necessary to monitor worker’s reliability dynamically. Because a worker j labels instances sequentially, we treat j’s IVLs as an interval-valued time series in our approach. Assuming j’s reliability relies on the IVLs within a time window only, we calculate j’s reliability measures with the IVLs in the current time window. Moving the time window forward with our proposed practical strategies, we can monitor j’s reliability dynamically. Furthermore, the four reliability measures derived from IVLs are time varying too. With regression analysis, we can separate each reliability measure as an explainable trend and possible errors. To validate our approaches, we use four real world benchmark datasets in our computational experiments. Here are the main findings. The reliability weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently overperform the base schemes in terms of much higher accuracy, precision, recall, and F1-score. Note: the base schemes are majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP). Through monitoring worker’s reliability, our computational experiments have successfully identified possible attackers. By removing identified attackers, we have ensured the quality. We have also examined the impact of window size selection. It is necessary to monitor worker’s reliability dynamically, and our computational results evident the potential success of our approaches.This work is partially supported by the US National Science Foundation through the grant award NSF/OIA-1946391.

    more » « less
  4. The future of work is ambiguous at best. Despite widespread shifts to remote/hybrid work during the COVID-19 lockdown, there is a paucity of knowledge about changing job conditions in tandem with different work locales. Is the move to remote/hybrid work a disrupter or accentuator of existing norms and inequalities? Drawing on nationally representative, four-wave panel survey data (October 2020 to April 2022) collected from U.S. workers who spent at least some time working from home since the pandemic onset, we examine effects of within-person changes in where respondents work on changes in job conditions (psychological job demands, job control, coworker support, and monitoring). Estimates from fixed-effects models show that, compared with returning to working at work, ongoing remote and moving to hybrid work lead to greater reductions in psychological job demands, especially among older women and men. Black and Hispanic women moving back to the office experience the greatest loss of decision latitude and schedule control. While white workers see increased coworker support when returning to the office, returning Black and Hispanic men report a decline in coworker support. Family caregivers’ job conditions do not improve whether remote/hybrid or returning to work. Qualitative data collected from Amazon Mechanic Turk illuminate mechanisms leading to salutary effects of remote work, but also the stress of combining jobs with family carework. 
    more » « less
  5. One of the goals of undergraduate education is to prepare students to adapt to a challenging career that requires continual learning and application of knowledge. Working professionals should have deep conceptual knowledge that they can apply in a range of contexts and possess the attitudes and skills of lifelong learners. The literature suggests the concept of Adaptive Expertise (AE), which can be defined as the ability to apply and extend knowledge and skills to new situations, describes some of these characteristics. Survey data concerning the level of AE displayed by various populations is extremely limited in most contexts, be it education or working professionals. As such, data concerning the level of adaptiveness displayed among various groups needs to be measured if activities designed to promote the development of AE are to be created and then tested in terms of their efficacy. This investigation provides this critical baseline data for future studies as we track the AE development of individual, first-year college students through their undergraduate program of study, with a focus on low-income students as a means to support retention. In this work, we assessed adaptive expertise among low-income STEM students using surveys and interviews. Low-income STEM students from various stages of their four year undergraduate program (n=208) completed an adaptive expertise survey in spring 2022. Following the survey, 24 of the low-income students (6 per year, 3 male, 3 female) were selected for targeted qualitative interviews to better understand the differences displayed by low and high AE students. Survey results from prior studies were used to draw comparisons between adaptiveness of low-income and non-low-income students. Results of the AE survey indicated no statistically-significant differences between low-income and non-low-income first-year students in terms of their level of adaptiveness. In addition, the level of AE displayed by low-income students increased through the program in a manner similar to that of non-low-income STEM students. Themes that emerged from the interviews included a general understanding of the importance and likelihood of learning new concepts continually while working in a professional role, and that students expressed growth in understanding the acceptance of reaching out for assistance from other students and faculty after exploring information on their own as they work through challenges in their academic assignments. Two dominant and divergent metacognitive processes were also observed: teaching/explaining concepts to others (highly adaptive) or primarily relying on exam/course grades for feedback on learning (low adaptiveness). Data gathered from interviews demonstrate the need for a greater emphasis on metacognitive practice to promote various aspects of AE. 
    more » « less