Classifying images using supervised machine learning (ML) relies on labeled training data, for example classes or text descriptions associated with each image. Data-driven models are only as good as the data used for training, which points to the importance of high-quality labeled data for developing an ML model with predictive skill. Labeling data is typically a time-consuming, manual process. Here, we investigate the process of labeling data, with a specific focus on coastal aerial imagery captured in the wake of hurricanes that affected the Atlantic and Gulf Coasts of the United States. The imagery data set is a rich observational record of storm impacts and coastal change, but the imagery requires labeling to render that information accessible. We created an online interface that served labelers a stream of images and a fixed set of questions. A total of 1,600 images were labeled by at least two and as many as seven coastal scientists. We used the resulting data set to investigate interrater agreement: the extent to which labelers labeled each image similarly. Interrater agreement scores, assessed with percent agreement and Krippendorff's alpha, are higher when the questions posed to labelers are relatively simple, when the labelers are provided with a user manual, and when images are smaller. Experiments in interrater agreement point toward the benefit of multiple labelers for understanding the uncertainty in labeling data for machine learning research.
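The two agreement measures named above can be computed directly from per-image label lists. A minimal sketch for nominal (categorical) labels; the function names and the toy data are illustrative, not from the study:

```python
from collections import Counter
from itertools import permutations

def percent_agreement(ratings):
    """Fraction of rater pairs, across all items, that assigned the same label.
    ratings: list of per-item label lists (each item rated by >= 2 labelers)."""
    agree = total = 0
    for item in ratings:
        for a, b in permutations(range(len(item)), 2):
            total += 1
            agree += item[a] == item[b]
    return agree / total

def krippendorff_alpha(ratings):
    """Krippendorff's alpha for nominal data, via the coincidence matrix."""
    o = Counter()  # coincidence counts over ordered label pairs
    for item in ratings:
        m = len(item)
        if m < 2:
            continue
        for a, b in permutations(range(m), 2):
            o[(item[a], item[b])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_obs = sum(w for (c, k), w in o.items() if c != k)
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_obs / d_exp
```

For example, two labelers who agree on two of three images give a percent agreement of 2/3 but a chance-corrected alpha of only 4/9, which is why the two scores are reported together.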
Efficient PAC Learning from the Crowd
In recent years, crowdsourcing has become the method of choice for gathering labeled training data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring labeled data separately from the process of learning a classifier from the gathered data. This can give rise to computational and statistical challenges. For example, in most cases there are no known computationally efficient learning algorithms that are robust to the high level of noise present in crowdsourced data, and efforts to eliminate noise through voting often require a large number of queries per example.

In this paper, we show how, by interleaving the processes of labeling and learning, we can attain computational efficiency with much less overhead in labeling cost. In particular, we consider the realizable setting, where there exists a true target function in F, and consider a pool of labelers. When a noticeable fraction of the labelers are perfect and the rest behave arbitrarily, we show that any class F that can be efficiently learned in the traditional realizable PAC model can be learned in a computationally efficient manner by querying the crowd, despite high amounts of noise in the responses. Moreover, we show that this can be done while each labeler labels only a constant number of examples and the number of labels requested per example is, on average, a constant. When no perfect labelers exist, a related task is to find a set of labelers that are good but not perfect. We show that we can identify all good labelers when at least a majority of the labelers are good.
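As a point of contrast for the labeling overhead discussed above, the naive repeated-labeling baseline takes a majority vote over several labelers per example. The simulation below (binary labels, a pool model, and all names are illustrative assumptions, not the paper's algorithm) shows the baseline working when reliable labelers dominate; the paper's contribution is handling a merely noticeable fraction of perfect labelers at constant average cost:

```python
import random

def make_pool(n_labelers, frac_perfect):
    """Pool in which a frac_perfect share answer correctly; the rest guess."""
    pool = []
    for i in range(n_labelers):
        if i < frac_perfect * n_labelers:
            pool.append(lambda y, r: y)                # perfect labeler
        else:
            pool.append(lambda y, r: r.randint(0, 1))  # arbitrary (random) labeler
    return pool

def majority_label(y_true, pool, k, rng):
    """Label one example by majority vote over k randomly sampled labelers."""
    votes = [labeler(y_true, rng) for labeler in rng.sample(pool, k)]
    return int(2 * sum(votes) > len(votes))
```

With 70% perfect labelers and k = 5 votes, almost all recovered labels are correct, but every example costs k queries; when perfect labelers are a minority, this simple vote fails entirely.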
NSF-PAR ID: 10057814
Journal: COLT
Volume: 65
Pages: 127–150
Sponsoring Org: National Science Foundation
More Like this

Crowdsourcing has been widely adopted to perform large projects suitable for human participation, in which tasks are distributed to workers. Many such projects involve classifying/labeling collections of items through semi-supervised clustering, in which queries on small subsets of the items are assigned to workers in the crowd. The answers are collected by a taskmaster, and the goal is to fully recover the labels. This problem can be modeled as a sparsely encoded source coding problem, where each query answer, regarded as a code bit, is the XOR of a small number of labels, the source information bits. While the problem of designing compression/source coding schemes that achieve Shannon's optimal compression rate is very well studied, few have considered sparsely encoded schemes. In this paper, we leverage the connections between this problem and well-studied codes with sparse representations for the channel coding problem to provide querying schemes with an almost optimal number of queries, each involving only a constant number of labels. We also extend this scenario to the case where some workers can be unresponsive. For this case, we propose querying schemes in which each query involves only log n items, where n is the total number of items to be labeled. Furthermore, we consider classification of two correlated labeling systems and provide two-stage querying schemes with an almost optimal number of queries, each involving a constant number of labels.
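The label-recovery step behind XOR queries can be illustrated with generic Gaussian elimination over GF(2). This sketch is for intuition only; it is not the paper's near-optimal sparse querying scheme:

```python
def gf2_solve(queries, answers, n):
    """Recover n binary labels from parity (XOR) queries by Gaussian
    elimination over GF(2). queries: bitmasks over the n items;
    answers: the XOR of each query's labels. Returns the label list,
    or None if the queries do not determine the labels."""
    pivots = {}  # pivot bit position -> (row mask, row rhs)
    for mask, rhs in zip(queries, answers):
        for p in sorted(pivots, reverse=True):  # reduce by existing pivots
            if (mask >> p) & 1:
                pm, pr = pivots[p]
                mask ^= pm
                rhs ^= pr
        if mask:
            pivots[mask.bit_length() - 1] = (mask, rhs)
    if len(pivots) < n:
        return None  # underdetermined system
    labels = [0] * n
    for p in sorted(pivots):  # back-substitute from low bit positions up
        mask, rhs = pivots[p]
        val = rhs
        for j in range(p):
            if (mask >> j) & 1:
                val ^= labels[j]
        labels[p] = val
    return labels
```

Each query here touches only a few items, mirroring the "constant number of labels per query" constraint; the paper's schemes additionally control the total number of queries and tolerate unresponsive workers.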

Producing high-quality labeled data is a challenge in any supervised learning problem, and in many cases human involvement is necessary to ensure label quality. However, human annotations are not flawless, especially for challenging problems. In non-trivial problems, high disagreement among annotators results in noisy labels, which affect the performance of any machine learning model. In this work, we consider three noise reduction strategies to improve label quality in the Article-Comment Alignment Problem (ACAP), where the main task is to classify article-comment pairs according to their relevancy level. The first strategy utilizes annotators' background knowledge during the label aggregation step. The second utilizes user disagreement during the training process. In the third and final strategy, we ask annotators to perform corrections and relabel the examples with noisy labels. We deploy these strategies and compare them to a resampling strategy for addressing class imbalance, another common supervised learning challenge. These alternatives were evaluated on ACAP, a multi-class text-pair classification problem with highly imbalanced data, where one of the classes represents at most 15% of the dataset's entire population. Our results provide evidence that the considered strategies can reduce disagreement between annotators. However, the data quality improvement is insufficient to enhance classification accuracy on the article-comment alignment problem, which exhibits high class imbalance. Model performance on the same problem is enhanced by addressing the imbalance issue with weighted-loss-based class distribution resampling. We show that allowing the model to pay more attention to the minority class during training, in the presence of noisy examples, improves test accuracy by 3%.
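A class-weighted loss of the kind compared above can be sketched with inverse-frequency weighting; the function names are illustrative and the exact weighting used in the study may differ:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights proportional to inverse class frequency,
    so minority classes contribute more to the loss."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy with each example scaled by its class's weight.
    probs: (n_examples, n_classes) predicted probabilities."""
    p = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(class_weights[labels] * -np.log(p)))
```

With a 3:1 class split, the minority class receives a weight of 2.0 against 2/3 for the majority class, so misclassifying minority examples dominates the loss during training.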

In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce the cost of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing K-class classification using the predictions of smaller classifiers, each trained on a subset of [K], and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers, which recovers the well-known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multi-class problems.
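The one-vs.-one special case mentioned above aggregates pairwise votes: each classifier is trained on a 2-class subset and votes for one of its two classes. A minimal sketch; the classifier interface is an assumption for illustration, not from the paper:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, classes):
    """pairwise_classifiers maps a class pair (a, b) to a callable
    that returns either a or b for input x; the class collecting
    the most pairwise votes wins."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_classifiers[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]
```

For K classes this uses K(K-1)/2 classifiers, each trained on a 2-element subset of [K]; the set-cover scheme in the paper generalizes to other subset configurations with fewer classifiers.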

We consider the problem of computing the best-fitting ReLU with respect to square loss on a training set whose examples are drawn from a spherical Gaussian distribution (the labels can be arbitrary). Let opt < 1 be the population loss of the best-fitting ReLU. We prove:
1. Finding a ReLU with square loss opt + ε is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply, unconditionally, that gradient descent cannot converge to the global minimum in polynomial time.
2. There exists an efficient approximation algorithm for finding the best-fitting ReLU that achieves error O(opt^{2/3}). The algorithm uses a novel reduction to noisy halfspace learning with respect to 0/1 loss.
Prior work due to Soltanolkotabi [Sol17] showed that gradient descent can find the best-fitting ReLU with respect to Gaussian marginals if the training set is exactly labeled by a ReLU.
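The realizable case cited at the end, where gradient descent succeeds when labels come exactly from a ReLU on Gaussian inputs, is easy to reproduce numerically. A hedged numpy sketch of plain gradient descent on the square loss (not the paper's approximation algorithm, and hyperparameters are illustrative):

```python
import numpy as np

def fit_relu(X, y, lr=0.1, steps=2000, seed=0):
    """Gradient descent on the mean square loss of f(x) = max(w.x, 0)."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.normal(size=X.shape[1])  # small random init
    for _ in range(steps):
        z = X @ w
        residual = np.maximum(z, 0.0) - y          # relu(w.x) - y
        # subgradient of the square loss: active only where z > 0
        grad = (2.0 / len(y)) * (X.T @ (residual * (z > 0)))
        w -= lr * grad
    return w
```

When the labels are arbitrary rather than exactly realizable, the hardness result above says no efficient method, gradient descent included, can guarantee loss opt + ε.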