Many AI system designers grapple with how best to collect human input for different types of training data. Online crowds provide a cheap, on-demand source of intelligence, but they often lack the expertise required in many domains. Experts offer tacit knowledge and more nuanced input, but they are harder to recruit. To explore this trade-off, we compared novices and experts in terms of performance and perceptions on human intelligence tasks in the context of designing a text-based conversational agent. We developed a preliminary chatbot that simulates conversations with someone seeking mental health advice to help educate volunteer listeners at 7cups.com. We then recruited experienced listeners (domain experts) and novice MTurk workers (crowd workers) to conduct tasks of varying complexity to improve the chatbot. Novice crowds performed comparably to experts on tasks that only require natural language understanding, such as correcting how the system classifies a user statement. On more generative tasks, like creating new lines of chatbot dialogue, the experts demonstrated higher quality, novelty, and emotion. We also uncovered a motivational gap: crowd workers enjoyed the interactive tasks, while experts found the work to be tedious and repetitive. We offer design considerations for allocating crowd workers and experts on input tasks for AI systems, and for better motivating experts to participate in low-level data work for AI.
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Dialog system developers need high-quality data to train, fine-tune, and assess their systems. They often turn to crowdsourcing because it provides large quantities of data from many workers, but the resulting data may not be of sufficiently good quality, often because of how the requester presents a task and interacts with the workers. This paper introduces DialCrowd 2.0, which helps requesters obtain higher-quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows currently used by developers and researchers.
- Award ID(s): 1924855
- PAR ID: 10346882
- Date Published:
- Journal Name: LREC proceedings
- ISSN: 2522-2686
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Sserwanga, I. (Ed.) Citizen scientists make valuable contributions to science but need to learn about the data they are working with to be able to perform more advanced tasks. We present a set of design principles for identifying the kinds of background knowledge that are important to support learning at different stages of engagement, drawn from a study of how free/libre open source software developers are guided to create and use documents. Specifically, we suggest that newcomers require help understanding the purpose, form, and content of the documents they engage with, while more advanced developers add understanding of information provenance as well as the boundaries, relevant participants, and work processes. We apply those principles in two separate but related studies. In study 1, we analyze the background knowledge presented to volunteers in the Gravity Spy citizen-science project, mapping the resources to the framework and identifying kinds of knowledge that were not initially provided. In study 2, we use the principles proactively to develop design suggestions for Gravity Spy 2.0, which will involve volunteers in analyzing more diverse sources of data. This new project extends the application of the principles by seeking to use them to support understanding of the relationships between documents, not just the documents individually. We conclude by discussing future work, including a planned evaluation of Gravity Spy 2.0 that will provide a further test of the design principles.
- Crowdworkers depend on Amazon Mechanical Turk (AMT) as an important source of income, and it is left to workers to determine which tasks on AMT are fair and worth completing. While existing tools assist workers in making these decisions, workers still spend significant amounts of time finding fair labor. Difficulties in this process may be a contributing factor in the imbalance between the median hourly earnings ($2.00/hour) and what the average requester pays ($11.00/hour). In this paper, we study how novices and experts select which tasks are worth doing. We argue that differences between the two populations likely lead to the wage imbalances. For this purpose, we first look at workers' comments in TurkOpticon (a tool where workers share their experience with requesters on AMT). We use this study to start to unravel what fair labor means for workers. In particular, we identify the characteristics of labor that workers consider to be of "good quality" and labor that is of "poor quality" (e.g., work that pays too little). Armed with this knowledge, we then conduct an experiment to study how experts and novices rate tasks of both good and poor quality. Through our research we uncover that experts and novices treat good quality labor in the same way. However, there are significant differences in how experts and novices rate poor quality labor, and in whether they believe the poor quality labor is worth doing. This points to several future directions, including machine learning models that support workers in detecting poor quality labor, and paths for educating novice workers on how to make better labor decisions on AMT.
- Crowdsourcing has become a popular means to solicit assistance for scientific research. From classifying images or texts to responding to surveys, tapping into the knowledge of crowds to complete complex tasks has become a common strategy in the social and information sciences. Although the timeliness and cost-effectiveness of crowdsourcing may provide desirable advantages to researchers, the data it generates may be of lower quality for some scientific purposes. The quality control mechanisms, if any, offered by common crowdsourcing platforms may not provide robust measures of data quality. This study explores whether research task participants engage in motivated misreporting, whereby participants cut corners to reduce their workload while performing various scientific tasks online. We conducted an experiment with three common crowdsourcing tasks: answering surveys, coding images, and classifying online social media content. The experiment recruited workers from a crowdsourcing platform (Amazon Mechanical Turk) and a commercial online survey panel. The analysis addresses two questions: (1) whether online panelists and crowd workers engage in motivated misreporting differently and (2) whether the patterns of misreporting vary by task type. The study focuses on the analysis of the survey-answering experiment and offers quality assurance practice guidelines for using crowdsourcing in social science research.
- It has been well documented that a large portion of the cost of any software lies in the time developers spend understanding a program's source code before any changes can be undertaken. One of the main contributors to software comprehension, by subsequent developers or by the authors themselves, is the quality of the lexicon (i.e., the identifiers and comments) that developers use to embed domain concepts and to communicate with their teammates. In fact, previous research shows a positive correlation between the quality of identifiers and the quality of a software project. Results suggest that a poor quality lexicon impairs program comprehension and consequently increases the effort that developers must spend to maintain the software. However, we do not yet have empirical evidence of the relationship between the quality of the lexicon and the cognitive load that developers experience when trying to understand a piece of software. Given the associated costs, there is a critical need to empirically characterize the impact of the quality of the lexicon on developers' ability to comprehend a program. In this study, we explore the effect of poor source code lexicon and readability on developers' cognitive load as measured by a cutting-edge and minimally invasive functional brain imaging technique called functional Near Infrared Spectroscopy (fNIRS). Additionally, while developers perform software comprehension tasks, we map cognitive load data to source code identifiers using an eye tracking device. Our results show that the presence of linguistic antipatterns in source code significantly increases developers' cognitive load.
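For readers unfamiliar with the term used in the last abstract, a linguistic antipattern is an identifier or comment whose wording contradicts what the code actually does. The snippet below is a hypothetical illustration, not taken from any of the papers above; the class and method names are invented for the example.

```python
# Hypothetical illustration of one linguistic antipattern: a name and docstring
# that promise a side-effect-free getter while the method silently mutates state.

class Cart:
    def __init__(self):
        self.items = []   # list of (name, price) tuples
        self.total = 0.0  # cached total

    def get_total(self):
        """Return the cart total."""
        # Antipattern: a "get" method that also rewrites self.total -- a reader
        # relying on the name and comment will not expect this side effect.
        self.total = sum(price for _, price in self.items)
        return self.total

    def compute_and_cache_total(self):
        """Recompute the cart total from the items and cache it."""
        # Clearer lexicon: the name states both the computation and the caching,
        # so the reader's expectation matches the behavior.
        self.total = sum(price for _, price in self.items)
        return self.total


if __name__ == "__main__":
    cart = Cart()
    cart.items.append(("book", 12.50))
    print(cart.compute_and_cache_total())  # 12.5
```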