

Title: The Expertise Involved in Deciding which HITs are Worth Doing on Amazon Mechanical Turk
Crowdworkers depend on Amazon Mechanical Turk (AMT) as an important source of income, and it is left to workers to determine which tasks on AMT are fair and worth completing. While existing tools assist workers in making these decisions, workers still spend significant amounts of time finding fair labor. Difficulties in this process may be a contributing factor in the imbalance between the median hourly earnings ($2.00/hour) and what the average requester pays ($11.00/hour). In this paper, we study how novices and experts select which tasks are worth doing, and we argue that differences between the two populations likely lead to this wage imbalance. For this purpose, we first examine workers' comments in TurkOpticon (a tool where workers share their experiences with requesters on AMT). We use this study to begin unraveling what fair labor means for workers. In particular, we identify the characteristics of labor that workers consider to be of "good quality" and of "poor quality" (e.g., work that pays too little). Armed with this knowledge, we then conduct an experiment to study how experts and novices rate tasks of both good and poor quality. Through our research we uncover that experts and novices treat good-quality labor in the same way. However, there are significant differences in how experts and novices rate poor-quality labor, and in whether they believe the poor-quality labor is worth doing. This points to several future directions, including machine learning models that support workers in detecting poor-quality labor, and paths for educating novice workers on how to make better labor decisions on AMT.
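An illustrative sketch of the kind of machine learning support the abstract names as a future direction: a small text classifier that flags likely poor-quality labor from worker comments. This is only a sketch under assumptions; the scikit-learn pipeline, the labels, and the example comments are hypothetical and are not the authors' model or data.

```python
# Illustrative only: flag HITs as likely "poor quality" from worker comments,
# in the spirit of the ML support tools proposed as future work above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical TurkOpticon-style comments with worker-assigned labels
# (1 = poor-quality labor, 0 = good-quality labor).
comments = [
    "Pays well and approved within a day",
    "Clear instructions, fair pay for the time required",
    "Took 30 minutes and paid 10 cents, avoid",
    "Rejected my work without explanation",
]
labels = [0, 0, 1, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(comments, labels)

# Probability that a new, unseen comment describes poor-quality labor.
print(model.predict_proba(["Underpaid and the requester never responds"])[0][1])
```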
Award ID(s):
1928528
NSF-PAR ID:
10276156
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Computer Supported Cooperative Work (CSCW)
ISSN:
1573-7551
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Crowdsourcing markets provide workers with a centralized place to find paid work. What may not be obvious at first glance is that, in addition to the work they do for pay, crowd workers also shoulder a variety of unpaid, invisible labor in these markets, which ultimately reduces their hourly wages. Invisible labor includes finding good tasks, messaging requesters, and managing payments. However, we currently know little about how much time crowd workers actually spend on invisible labor or how much it costs them economically. To ensure a fair and equitable future for crowd work, we need to be certain that workers are being paid fairly for ALL of the work they do. In this paper, we conduct a field study to quantify the invisible labor in crowd work. We build a plugin to record the amount of time that 100 workers on Amazon Mechanical Turk dedicate to invisible labor while completing 40,903 tasks. If we ignore the time workers spent on invisible labor, workers' median hourly wage was $3.76. But we estimated that crowd workers in our study spent 33% of their time daily on invisible labor, dropping their median hourly wage to $2.83. We found that invisible labor differentially impacts workers depending on their skill level and demographics. The invisible labor category that took the most time, and that was also the most common, revolved around workers having to manage their payments. The second most time-consuming category involved hyper-vigilance, where workers vigilantly watched requesters' profiles for newly posted work or searched for labor. We hope that through our paper the invisible labor in crowdsourcing becomes more visible, and that our results help reveal the larger implications of the continuing invisibility of labor in crowdsourcing.
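    One way to reproduce the drop from $3.76 to roughly $2.83 reported above is to treat the invisible labor as extra time added on top of paid task time. The sketch below shows that arithmetic under that assumption; it is an interpretation of the reported figures, not code from the study.

```python
# Sketch of the wage adjustment described above, assuming the 33% invisible
# labor is measured relative to paid task time (hypothetical interpretation).
paid_only_hourly_wage = 3.76          # median wage counting only paid task time
invisible_time_per_paid_hour = 0.33   # extra unpaid time per hour of paid work

effective_wage = paid_only_hourly_wage / (1 + invisible_time_per_paid_hour)
print(f"Effective median hourly wage: ${effective_wage:.2f}")  # about $2.83
```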
  2. Abstract (100 words): Jurors are increasingly exposed to scientific information in the courtroom. To determine whether providing jurors with gist information would assist in their ability to make well-informed decisions, the present experiment utilized a Fuzzy Trace Theory-inspired intervention and tested it against traditional legal safeguards (i.e., judge instructions) while varying the scientific quality of the evidence. The results indicate that jurors who viewed high-quality evidence rated the scientific evidence significantly higher than those who viewed low-quality evidence, but were unable to moderate the credibility of the expert witness and apply damages appropriately, resulting in poor calibration.
    Summary (<1000 words): Jurors and juries are increasingly exposed to scientific information in the courtroom, and it remains unclear when they will base their decisions on a reasonable understanding of the relevant scientific information. Without such knowledge, the ability of jurors and juries to make well-informed decisions may be at risk, increasing chances of unjust outcomes (e.g., false convictions in criminal cases). Therefore, there is a critical need to understand conditions that affect jurors' and juries' sensitivity to the qualities of scientific information and to identify safeguards that can assist with scientific calibration in the courtroom. The current project addresses these issues with an ecologically valid experimental paradigm, making it possible to assess causal effects of evidence quality and safeguards as well as the role of a host of individual difference variables that may affect perceptions of testimony by scientific experts as well as liability in a civil case. Our main goal was to develop a simple, theoretically grounded tool to enable triers of fact (individual jurors) with a range of scientific reasoning abilities to appropriately weigh scientific evidence in court. We did so by testing a Fuzzy Trace Theory-inspired intervention against traditional legal safeguards. Appropriate use of scientific evidence reflects good calibration, which we define as being influenced more by strong scientific information than by weak scientific information. Inappropriate use reflects poor calibration, defined as relative insensitivity to the strength of scientific information. Fuzzy Trace Theory (Reyna & Brainerd, 1995) predicts that techniques for improving calibration can come from presentation of an easy-to-interpret, bottom-line "gist" of the information. Our central hypothesis was that laypeople's appropriate use of scientific information would be moderated both by external situational conditions (e.g., quality of the scientific information itself, a decision aid designed to convey clearly the "gist" of the information) and by individual differences among people (e.g., scientific reasoning skills, cognitive reflection tendencies, numeracy, need for cognition, attitudes toward and trust in science). Identifying factors that promote jurors' appropriate understanding of and reliance on scientific information will contribute to general theories of reasoning based on scientific evidence, while also providing an evidence-based framework for improving the courts' use of scientific information. All hypotheses were preregistered on the Open Science Framework.
    Method: Participants completed six questionnaires (counterbalanced): the Need for Cognition Scale (NCS; 18 items), Cognitive Reflection Test (CRT; 7 items), Abbreviated Numeracy Scale (ABS; 6 items), Scientific Reasoning Scale (SRS; 11 items), Trust in Science (TIS; 29 items), and Attitudes towards Science (ATS; 7 items). Participants then viewed a video depicting a civil trial in which the defendant sought damages from the plaintiff for injuries caused by a fall. The defendant (bar patron) alleged that the plaintiff (bartender) pushed him, causing him to fall and hit his head on the hard floor. Participants were informed at the outset that the defendant was liable; therefore, their task was to determine whether the plaintiff should be compensated. Participants were randomly assigned to 1 of 6 experimental conditions: 2 (quality of scientific evidence: high vs. low) x 3 (safeguard to improve calibration: gist information, no-gist information [control], jury instructions). An expert witness (neuroscientist) hired by the court testified regarding the scientific strength of fMRI data (high [90-to-10 signal-to-noise ratio] vs. low [50-to-50 signal-to-noise ratio]) and presented gist or no-gist information both verbally (i.e., fairly high/about average) and visually (i.e., a graph). After viewing the video, participants were asked whether they would like to award damages; if they indicated yes, they were asked to enter a dollar amount. Participants then completed the Positive and Negative Affect Schedule-Modified Short Form (PANAS-MSF; 16 items), the Expert Witness Credibility Scale (WCS; 20 items), Witness Credibility and Influence on Damages ratings for each witness, manipulation check questions, and Understanding Scientific Testimony (UST; 10 items); 3 additional measures were collected but are beyond the scope of the current investigation. Finally, participants completed demographic questions, including questions about their scientific background and experience. The study was completed via Qualtrics, with participation from students (online vs. in-lab), MTurkers, and non-student community members. After removing those who failed attention check questions, 469 participants remained (243 men, 224 women, 2 did not specify gender) from a variety of racial and ethnic backgrounds (70.2% White, non-Hispanic).
    Results and Discussion: There were three primary outcomes: quality of the scientific evidence, expert credibility (WCS), and damages. During initial analyses, each dependent variable was submitted to a separate 3 Gist Safeguard (safeguard, no safeguard, judge instructions) x 2 Scientific Quality (high, low) Analysis of Variance (ANOVA). Consistent with hypotheses, there was a significant main effect of scientific quality on strength of evidence, F(1, 463)=5.099, p=.024; participants who viewed the high-quality evidence rated the scientific evidence significantly higher (M=7.44) than those who viewed the low-quality evidence (M=7.06). There were no significant main effects or interactions for witness credibility, indicating that the expert who provided scientific testimony was seen as equally credible regardless of scientific quality or gist safeguard. Finally, for damages, consistent with hypotheses, there was a marginally significant interaction between Gist Safeguard and Scientific Quality, F(2, 273)=2.916, p=.056.
    However, post hoc t-tests revealed that significantly higher damages were awarded for low (M=11.50) than high (M=10.51) scientific-quality evidence, F(1, 273)=3.955, p=.048, in the no-gist-with-judge-instructions safeguard condition, which was contrary to hypotheses. The data suggest that the judge instructions alone reversed the pattern: although the difference was nonsignificant, those who received the no-gist safeguard without judge instructions awarded higher damages in the high (M=11.34) than the low (M=10.84) scientific-quality evidence conditions, F(1, 273)=1.059, p=.30. Together, these provide promising initial results indicating that participants were able to effectively differentiate between high and low scientific quality of evidence, though they used the scientific evidence inappropriately, as shown by their inability to discern expert credibility and to apply damages accordingly, resulting in poor calibration. These results will provide the basis for more sophisticated analyses, including higher-order interactions with individual differences (e.g., need for cognition) as well as tests of mediation using path analyses. [References omitted but available by request] Learning Objective: Participants will be able to determine whether providing jurors with gist information would assist in their ability to award damages in a civil trial.
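    For illustration, the 3 (gist safeguard) x 2 (scientific quality) factorial analysis described above could be run as a standard two-way ANOVA; the sketch below assumes a pandas DataFrame with hypothetical file and column names and uses statsmodels, which is one common choice rather than necessarily the authors' software.

```python
# Illustrative two-way ANOVA mirroring the 3 x 2 design described above.
# The file name and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("juror_responses.csv")  # one row per participant

model = ols("damages ~ C(gist_safeguard) * C(scientific_quality)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and the interaction term
```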
  3. As K-12 engineering education becomes more ubiquitous in the U.S., increased attention has been paid to preparing the heterogeneous group of in-service teachers who have taken on the challenge of teaching engineering. Standards have emerged for professional development, along with research on teacher learning in engineering, that call for teachers to facilitate and support engineering learning environments. Given that many teachers may not have experienced engineering practice, calls have been made to engage K-12 teachers in the "doing" of engineering as part of their preparation. However, there is a need for research studying the more specific nature of this "doing" and the instructional implications of engaging teachers in "doing" engineering. To date, limited time and constrained resources have led many professional development programs for K-12 teachers to engage participants in the same engineering activities they will enact with their students. While this approach supports teachers' familiarity with curriculum and their ability to anticipate students' ideas, there is reason to believe that these experiences may not be authentic enough to support teachers in developing a rich understanding of the "doing" of engineering. K-12 teachers are often familiar with the materials and curricular solutions, given their experiences as adults, which means that engaging in the same tasks as their students may not be challenging enough to develop their understandings about engineering. This can then be consequential for their pedagogy: in our prior work, we found that teachers' linear conceptions of the engineering design process can prevent them from recognizing and supporting student engagement in productive design practices. Research on the development of engineering design practices with adults in undergraduate and professional engineering settings has shown significant differences in how adults approach and understand problems. Therefore, we conjectured that engaging teachers in more rigorous engineering challenges designed for adult engineering novices would more readily support them in developing rich understandings of the ways in which professional engineers move through the design process. We term this approach meaningful engineering for teachers; it is informed by work in science education that highlights the importance of learning environments that create a need for learners to develop and engage in disciplinary practices. We explored this approach to teachers' professional learning experiences in doing engineering in an online graduate program for in-service teachers in engineering education at Tufts University, the Teacher Engineering Education Program (teep.tufts.edu). In this exploratory study, we asked: 1. How did teachers respond to engaging in meaningful engineering for teachers in the TEEP program? 2. What did teachers identify as important things they learned about engineering content and pedagogy? This paper focuses on one theme that emerged from teachers' reflections. Our analysis found that teachers reported that meaningful engineering supported their development of epistemic empathy ("the act of understanding and appreciating someone's cognitive and emotional experience within an epistemic activity") as a result of their own affective experiences in doing engineering, which required significant iteration as well as the use of novel robotic materials.
We consider how epistemic empathy may be an important aspect of teacher learning in K-12 engineering education and the potential implications for designing engineering teacher education. 
  4. Ethical decision-making is difficult, certainly for robots let alone humans. If a robot's ethical decision-making process is going to be designed based on some approximation of how humans operate, then the assumption is that a good model of how humans make an ethical choice is readily available. Yet no single ethical framework seems sufficient to capture the diversity of human ethical decision making. Our work seeks to develop the computational underpinnings that will allow a robot to use multiple ethical frameworks that guide it toward doing the right thing. As a step towards this goal, we have collected data investigating how regular adults and ethics experts approach ethical decisions related to robot use in a healthcare scenario and a game-playing scenario. The decisions made by the former group are intended to represent an approximation of a folk-morality approach to these dilemmas. The experts, on the other hand, were asked to judge what decision would result if a person were using one of several different types of ethical frameworks. The resulting data may reveal which features of the pill-sorting and game-playing scenarios contribute to similarities and differences between expert and non-expert responses. This type of approach to programming a robot may one day be able to rely on specific features of an interaction to determine which ethical framework to use in the robot's decision making.
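    As a purely illustrative reading of the closing idea (using features of an interaction to pick an ethical framework), a robot's decision layer might route scenarios through a simple feature-based rule; the sketch below is hypothetical and is not the authors' system or their set of frameworks.

```python
# Hypothetical sketch: choose an ethical framework from coarse scenario features.
from typing import Dict

def choose_framework(features: Dict[str, bool]) -> str:
    """Pick a framework using simple, illustrative rules (not from the paper)."""
    if features.get("risk_of_physical_harm"):
        return "deontological"      # prefer hard constraints when harm is possible
    if features.get("competitive_game"):
        return "consequentialist"   # optimize outcomes in low-stakes play
    return "virtue_ethics"          # fall back to character-based reasoning

print(choose_framework({"risk_of_physical_harm": True}))  # -> "deontological"
```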
  5. Many AI system designers grapple with how best to collect human input for different types of training data. Online crowds provide a cheap, on-demand source of intelligence, but they often lack the expertise required in many domains. Experts offer tacit knowledge and more nuanced input, but they are harder to recruit. To explore this trade-off, we compared novices and experts in terms of performance and perceptions on human intelligence tasks in the context of designing a text-based conversational agent. We developed a preliminary chatbot that simulates conversations with someone seeking mental health advice, to help educate volunteer listeners at 7cups.com. We then recruited experienced listeners (domain experts) and MTurk novice workers (crowd workers) to conduct tasks of different levels of complexity to improve the chatbot. Novice crowds performed comparably to experts on tasks that only require natural language understanding, such as correcting how the system classifies a user statement. For more generative tasks, like creating new lines of chatbot dialogue, the experts demonstrated higher quality, novelty, and emotion. We also uncovered a motivational gap: crowd workers enjoyed the interactive tasks, while experts found the work to be tedious and repetitive. We offer design considerations for allocating crowd workers and experts on input tasks for AI systems, and for better motivating experts to participate in low-level data work for AI.
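    Acting on the allocation finding above (crowds for understanding-style corrections, experts for generative writing) could be as simple as a routing rule; the sketch below is an illustrative assumption with hypothetical task-type names, not a system from the paper.

```python
# Illustrative routing rule based on the finding above: understanding-style
# corrections go to crowd workers, generative writing goes to domain experts.
def route_task(task_type: str) -> str:
    generative = {"write_new_dialogue", "rephrase_response", "compose_empathic_reply"}
    return "domain_expert" if task_type in generative else "crowd_worker"

print(route_task("correct_intent_label"))  # -> "crowd_worker"
print(route_task("write_new_dialogue"))    # -> "domain_expert"
```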