In recent years crowdsourcing has become the method of choice for gathering labeled training
data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring
labeled data separately from the process of learning a classifier from the gathered data. This can
give rise to computational and statistical challenges. For example, in most cases there are no known
computationally efficient learning algorithms that are robust to the high level of noise that exists in
crowdsourced data, and efforts to eliminate noise through voting often require a large number of
queries per example.
In this paper, we show that by interleaving the processes of labeling and learning, we can attain computational efficiency with much less overhead in labeling cost. In particular, we consider
the realizable setting where there exists a true target function in F and consider a pool of labelers.
When a noticeable fraction of the labelers are perfect, and the rest behave arbitrarily, we show
that any F that can be efficiently learned in the traditional realizable PAC model can be learned
in a computationally efficient manner by querying the crowd, despite high amounts of noise in
the responses. Moreover, we show that this can be done while each labeler only labels a constant
number of examples and the number of labels requested per example, on average, is a constant.
When no perfect labelers exist, a natural related task is to identify the labelers that are good but not perfect. We show that we can identify all good labelers when at least a majority of the labelers are good.
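As a rough illustration of the labeling side of this setting, the sketch below queries a small random committee per example and takes a majority vote. This is a naive baseline, not the paper's (more query-efficient) algorithm, and it only succeeds when perfect labelers happen to form a majority; the labeler pool and fractions are made up for illustration.

```python
import random
from collections import Counter

def crowd_label(x, labelers, k, rng):
    """Query a random committee of k labelers on x and take the majority vote."""
    votes = [f(x) for f in rng.sample(labelers, k)]
    return Counter(votes).most_common(1)[0][0]

# Toy pool: 7 perfect labelers and 3 adversarial ones (fractions are illustrative).
target = lambda x: int(x >= 0)               # the true target function
perfect = [target] * 7
adversarial = [lambda x: 1 - target(x)] * 3  # always answer incorrectly
labelers = perfect + adversarial

rng = random.Random(0)
examples = [rng.uniform(-1, 1) for _ in range(200)]
labels = [crowd_label(x, labelers, k=5, rng=rng) for x in examples]
accuracy = sum(y == target(x) for x, y in zip(examples, labels)) / len(examples)
```

With an odd committee size there are no ties, and each example costs only a constant number of queries, matching the constant-overhead flavor of the result above.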
Crowdsourced PAC Learning under Classification Noise
In this paper, we analyze PAC learnability from labels produced by crowdsourcing. In our setting, unlabeled examples are drawn from a distribution and labels are crowdsourced from workers who operate under classification noise, each with their own noise parameter. We develop an end-to-end crowdsourced PAC learning algorithm that takes unlabeled data points as input and outputs a trained classifier. Our three-step algorithm incorporates majority voting, pure-exploration bandits, and noisy-PAC learning. We prove several guarantees on the number of tasks labeled by workers for PAC learning in this setting and show that our algorithm improves upon the baseline by reducing the total number of tasks given to workers. We demonstrate the robustness of our algorithm by exploring its application to additional realistic crowdsourcing settings.
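A minimal sketch of the first step of such a pipeline: majority-vote a small set of tasks to obtain reference labels, then screen workers by their disagreement rate with the consensus. The data layout and function names here are illustrative, not taken from the paper.

```python
from collections import Counter

def consensus_labels(annotations):
    """Majority vote per task; annotations[t] maps worker -> label."""
    return {t: Counter(votes.values()).most_common(1)[0][0]
            for t, votes in annotations.items()}

def disagreement_rates(annotations, consensus):
    """Fraction of answered tasks on which each worker disagrees with consensus."""
    stats = {}
    for t, votes in annotations.items():
        for w, y in votes.items():
            n, d = stats.get(w, (0, 0))
            stats[w] = (n + 1, d + (y != consensus[t]))
    return {w: d / n for w, (n, d) in stats.items()}

# Toy data: workers "a" and "b" agree; worker "c" is noisy.
ann = {
    0: {"a": 1, "b": 1, "c": 0},
    1: {"a": 0, "b": 0, "c": 0},
    2: {"a": 1, "b": 1, "c": 0},
}
cons = consensus_labels(ann)
rates = disagreement_rates(ann, cons)
```

Workers with high disagreement rates would then be deprioritized (e.g., via a pure-exploration bandit over workers) before the final noisy-PAC learning step.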
 Award ID(s):
 1848966
 NSF-PAR ID:
 10201451
 Editor(s):
 Law, Edith; Vaughan, Jennifer W
 Date Published:
 Journal Name:
 Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing
 Volume:
 7
 Issue:
 1
 Page Range / eLocation ID:
 41-49
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Crowdsourcing platforms emerged as popular venues for purchasing human intelligence at low cost for large volumes of tasks. As many low-paid workers are prone to give noisy answers, a common practice is to add redundancy by assigning multiple workers to each task and then simply averaging out their answers. However, to fully harness the wisdom of the crowd, one needs to learn the heterogeneous quality of each worker. We resolve this fundamental challenge in crowdsourced regression tasks, i.e., where the answer takes continuous labels, and identifying good or bad workers becomes much less trivial than in a classification setting with discrete labels. In particular, we introduce a Bayesian iterative scheme and show that it provably achieves the optimal mean squared error. Our evaluations on synthetic and real-world datasets support our theoretical results and show the superiority of the proposed scheme.
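To give intuition for iterative schemes of this kind, the sketch below alternates between precision-weighted task estimates and per-worker variance updates. This is a generic reweighting loop in the spirit of the Bayesian scheme described above, not the paper's exact update rules; the toy data and names are assumptions.

```python
def iterative_weighted_mean(answers, iters=10):
    """answers[t] maps worker -> continuous answer for task t.
    Alternate: estimate each task as a precision-weighted mean of answers,
    then re-estimate each worker's variance from residuals."""
    workers = {w for votes in answers.values() for w in votes}
    var = {w: 1.0 for w in workers}          # start with uniform trust
    est = {}
    for _ in range(iters):
        est = {t: sum(y / var[w] for w, y in votes.items()) /
                  sum(1 / var[w] for w in votes)
               for t, votes in answers.items()}
        resid = {w: [] for w in workers}
        for t, votes in answers.items():
            for w, y in votes.items():
                resid[w].append((y - est[t]) ** 2)
        var = {w: max(sum(r) / len(r), 1e-6) for w, r in resid.items()}
    return est, var

# Toy tasks with true values 1.0, 2.0, 3.0: "g1", "g2" answer exactly,
# "b" is off by 1.5 on every task.
answers = {
    0: {"g1": 1.0, "g2": 1.0, "b": 2.5},
    1: {"g1": 2.0, "g2": 2.0, "b": 0.5},
    2: {"g1": 3.0, "g2": 3.0, "b": 4.5},
}
est, var = iterative_weighted_mean(answers)
```

After a few iterations the noisy worker receives a large estimated variance and hence a small weight, so the task estimates move from the plain average toward the reliable workers' answers.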

In the problem of learning a class ratio from unlabeled data, which we call CR learning, the training data is unlabeled, and only the ratios, or proportions, of examples receiving each label are given. The goal is to learn a hypothesis that predicts the proportions of labels on the distribution underlying the sample. This model of learning is applicable to a wide variety of settings, including predicting the number of votes for candidates in political elections from polls. In this paper, we formally define this class and resolve foundational questions regarding the computational complexity of CR learning and characterize its relationship to PAC learning. Among our results, we show, perhaps surprisingly, that for finite VC classes what can be efficiently CR learned is a strict subset of what can be learned efficiently in PAC, under standard complexity assumptions. We also show that there exist classes of functions whose CR learnability is independent of ZFC, the standard set-theoretic axioms. This implies that CR learning cannot be easily characterized (as PAC learning is by VC dimension).
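For intuition, here is a toy one-dimensional instance of the CR objective: among threshold classifiers, pick the one whose predicted positive rate on the unlabeled sample best matches the given label proportion. This is an illustrative sketch under made-up data, not the paper's construction.

```python
def predicted_ratio(h, sample):
    """Fraction of the sample that hypothesis h labels positive."""
    return sum(h(x) for x in sample) / len(sample)

def fit_by_ratio(sample, target_ratio, thresholds):
    """Choose the threshold whose positive rate best matches target_ratio."""
    return min(thresholds,
               key=lambda t: abs(predicted_ratio(lambda x: int(x >= t), sample)
                                 - target_ratio))

sample = list(range(100))        # toy unlabeled pool: the points 0..99
t_star = fit_by_ratio(sample, target_ratio=0.3, thresholds=range(0, 101, 10))
```

Here the learner never sees an individual label, only the aggregate proportion 0.3, yet it recovers the threshold (70) under which exactly 30% of the pool is positive; the hardness results above show that this easy one-dimensional picture does not extend to general finite VC classes.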

Crowdsourcing provides an efficient label collection schema for supervised machine learning. However, to control annotation cost, each instance in the crowdsourced data is typically annotated by a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to directly learn a classifier by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, which is measured by a discriminator; 2) the generated annotations should have high mutual information with the ground-truth labels, which is measured by an auxiliary network. Extensive experiments and comparisons against an array of state-of-the-art learning-from-crowds methods on three real-world datasets demonstrate the effectiveness of our data augmentation framework and show the potential of our algorithm for low-budget crowdsourcing in general.

Crowdsourcing provides a practical way to obtain large amounts of labeled data at a low cost. However, the annotation quality of annotators varies considerably, which imposes new challenges in learning a high-quality model from the crowdsourced annotations. In this work, we provide a new perspective that decomposes annotation noise into common noise and individual noise and differentiates the source of confusion based on instance difficulty and annotator expertise on a per-instance-annotator basis. We realize this new crowdsourcing model with an end-to-end learning solution that has two types of noise adaptation layers: one is shared across annotators to capture their commonly shared confusions, and the other is specific to each annotator to capture individual confusion. To recognize the source of noise in each annotation, we use an auxiliary network to choose between the two noise adaptation layers with respect to both instances and annotators. Extensive experiments on both synthesized and real-world benchmarks demonstrate the effectiveness of our proposed common noise adaptation solution.
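A minimal numerical sketch of the two-layer idea: an annotator's label distribution is modeled by routing the clean class distribution through a gated convex combination of a shared confusion matrix and an annotator-specific one. The matrix values and gate below are made up; in the paper these quantities are learned end-to-end, with the gate produced by the auxiliary network.

```python
def apply_noise(p_clean, common, individual, gate):
    """Transform a clean class distribution through a convex combination of a
    shared confusion matrix and an annotator-specific one.
    Convention: T[i][j] = P(annotator says class j | true class i)."""
    K = len(p_clean)
    noisy = [0.0] * K
    for j in range(K):
        for i in range(K):
            T_ij = gate * common[i][j] + (1 - gate) * individual[i][j]
            noisy[j] += p_clean[i] * T_ij
    return noisy

common = [[0.9, 0.1], [0.1, 0.9]]        # mild confusion shared by all annotators
individual = [[0.6, 0.4], [0.3, 0.7]]    # this annotator's own confusion
p = apply_noise([1.0, 0.0], common, individual, gate=0.5)
```

Since each row of both matrices sums to 1, the output is again a valid distribution; a gate near 1 attributes an annotation's noise to the shared confusion, a gate near 0 to the individual one.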