Coding for crowdsourced classification with XOR queries
Crowdsourcing has been widely adopted to perform large projects suitable for human participation, in which tasks are usually distributed to workers. Many such projects involve classifying or labeling certain collections of items through semi-supervised clustering, in which queries on small subsets of the items are assigned to workers in the crowd. The answers are collected by a taskmaster, and the goal is to fully recover the labels. This problem can be modeled as a sparsely encoded source coding problem, where each query answer, regarded as a code bit, is the XOR of a small number of labels, which serve as the source information bits. While the problem of designing compression/source coding schemes that achieve Shannon's optimal compression rate is very well studied, few works have considered sparsely encoded schemes. In this paper we leverage the connections between this problem and well-studied codes with sparse representations for the channel coding problem to provide querying schemes with an almost optimal number of queries, each of which involves only a constant number of labels. We also extend this scenario to the case where some workers can be unresponsive. For this case, we propose querying schemes in which each query involves only log n items, where n is the total number of items to be labeled. Furthermore, we consider classification with two correlated labeling systems and provide two-stage querying schemes with an almost optimal number of queries, each involving a constant number of labels.
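As a rough illustration of the query model described above, the sketch below generates XOR queries of constant degree over binary labels and recovers the labels by Gaussian elimination over GF(2). The degree, the number of queries, and the random (LDPC-like) design are illustrative assumptions; the paper's constructions and decoders are more refined, and a particular random draw may not be uniquely decodable.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_xor_queries(n, m, degree=3):
    """Build a sparse query matrix: each of the m queries is the XOR
    (parity) of `degree` randomly chosen labels, like a check node in a
    sparse-graph code. (Degree and randomness are illustrative choices,
    not the paper's specific construction.)"""
    A = np.zeros((m, n), dtype=np.uint8)
    for i in range(m):
        A[i, rng.choice(n, size=degree, replace=False)] = 1
    return A

def answer_queries(A, labels):
    """Worker answers: parity of the queried labels (mod-2 inner product)."""
    return (A @ labels) % 2

def recover_labels(A, answers):
    """Recover labels by Gaussian elimination over GF(2).
    Returns None if the system is not uniquely solvable."""
    A = A.copy() % 2
    b = answers.copy() % 2
    m, n = A.shape
    row = 0
    pivots = []
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            return None                      # rank deficient: cannot recover
        A[[row, piv]], b[[row, piv]] = A[[piv, row]], b[[piv, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    x = np.zeros(n, dtype=np.uint8)
    x[pivots] = b[:len(pivots)]
    return x

# Toy run: n items, a modest multiple of n constant-degree XOR queries.
n, m = 50, 100
labels = rng.integers(0, 2, size=n).astype(np.uint8)
A = make_xor_queries(n, m, degree=3)
x = recover_labels(A, answer_queries(A, labels))
if x is None:
    print("this random design was not uniquely decodable; draw more queries")
else:
    print("recovered all labels:", np.array_equal(x, labels))
```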
 Award ID(s):
 1717299
 NSF-PAR ID:
 10123113
 Date Published:
 Journal Name:
 IEEE Information Theory Workshop
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or of two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework in service of performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; second, a novel query scheme dubbed bimodal query is used to query labels of samples from both classes; and last, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments performed on several datasets demonstrate that the proposed test controls the Type I error and has decreased Type II error relative to uniform querying and certainty-based querying. Source code for our algorithms and experimental results is available at https://github.com/wayne0908/LabelEfficientTwoSample.
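To make the second stage concrete, here is a minimal sketch of one plausible reading of the bimodal query rule: spend half the label budget on the samples the stage-1 classifier scores as most class-0-like and half on those scored most class-1-like, so that queried labels come from both classes before the FR test is run. The exact selection rule (a top-k split of the budget) is an assumption made for illustration, not necessarily the scheme in the cited work.

```python
import numpy as np

def bimodal_query(posteriors, budget):
    """Select indices to label under a bimodal query scheme (sketch):
    half the budget goes to the samples the classifier is most confident
    belong to class 1, half to those most confident for class 0, so that
    both classes are represented among the queried labels.
    (Illustrative selection rule, not the cited paper's exact procedure.)"""
    order = np.argsort(posteriors)           # ascending estimated P(y=1 | x)
    k = budget // 2
    low = order[:k]                           # most class-0-like samples
    high = order[-(budget - k):]              # most class-1-like samples
    return np.concatenate([low, high])

# Toy usage: posteriors would come from the stage-1 classifier.
rng = np.random.default_rng(1)
posteriors = rng.uniform(size=200)
queried = bimodal_query(posteriors, budget=20)
print(sorted(queried.tolist()))
```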

We study the classic set cover problem from the perspective of sublinear algorithms. Given access to a collection of m sets over n elements in the query model, we show that sublinear algorithms derived from existing techniques have almost tight query complexities. On one hand, we first show an adaptation of the streaming algorithm presented in [17] to the sublinear query model, which returns an α-approximate cover using Õ(m(n/k)^(1/(α–1)) + nk) queries to the input, where k denotes the size of a minimum set cover. We then complement this upper bound by proving that for lower values of k, the required number of queries is , even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require Ω(nk) queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter k. On the other hand, we show that this bound is not optimal for larger values of the parameter k, as there exists a (1 + ε)-approximation algorithm with Õ(mn/(kε^2)) queries. We show that this bound is essentially tight for sufficiently small constant ε, by establishing a lower bound of  query complexity. Our lower-bound results follow by carefully designing two distributions of instances that are hard to distinguish. In particular, our first lower bound involves a probabilistic construction of a certain set system with a minimum set cover of size αk, with the key property that a small number of “almost uniformly distributed” modifications can reduce the minimum set cover size down to k. Thus, these modifications are not detectable unless a large number of queries are asked. We believe that our probabilistic construction technique might find applications to lower bounds for other combinatorial optimization problems.
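For intuition about the query-model setting, the following sketch wraps a set system behind a counting oracle and verifies coverage by reading sets element by element; in the worst case this reads essentially the whole input, which is the flavor of the Ω(nk)-type verification cost mentioned above. The oracle interface (size_of/elt_of) and the naive scan are assumptions made for illustration, not the access model or algorithms of the cited work.

```python
class SetSystemOracle:
    """Query-model access to a set system (sketch). elt_of(s, j) returns the
    j-th element of set s; every such call is counted as one query.
    (Hypothetical interface, for illustration only.)"""
    def __init__(self, sets):
        self._sets = [list(s) for s in sets]
        self.queries = 0

    def size_of(self, s):
        return len(self._sets[s])

    def elt_of(self, s, j):
        self.queries += 1
        return self._sets[s][j]

def covers_everything(oracle, num_sets, universe_size):
    """Check whether the collection covers all elements by reading sets
    element by element through the oracle; verification alone can be costly
    because, in the worst case, almost every element must be read."""
    seen = set()
    for s in range(num_sets):
        for j in range(oracle.size_of(s)):
            seen.add(oracle.elt_of(s, j))
            if len(seen) == universe_size:
                return True
    return False

# Toy usage: a small set system over the universe {0, ..., 5}.
sets = [{0, 1, 2}, {2, 3}, {4, 5}, {1, 5}]
oracle = SetSystemOracle(sets)
print(covers_everything(oracle, len(sets), universe_size=6), oracle.queries)
```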

In recent years crowdsourcing has become the method of choice for gathering labeled training data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring labeled data separately from the process of learning a classifier from the gathered data. This can give rise to computational and statistical challenges. For example, in most cases there are no known computationally efficient learning algorithms that are robust to the high level of noise that exists in crowdsourced data, and efforts to eliminate noise through voting often require a large number of queries per example. In this paper, we show how, by interleaving the process of labeling and learning, we can attain computational efficiency with much less overhead in the labeling cost. In particular, we consider the realizable setting where there exists a true target function in F and consider a pool of labelers. When a noticeable fraction of the labelers are perfect, and the rest behave arbitrarily, we show that any F that can be efficiently learned in the traditional realizable PAC model can be learned in a computationally efficient manner by querying the crowd, despite high amounts of noise in the responses. Moreover, we show that this can be done while each labeler only labels a constant number of examples and the number of labels requested per example, on average, is a constant. When no perfect labelers exist, a related task is to find a set of labelers which are good but not perfect. We show that we can identify all good labelers when at least the majority of labelers are good.
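A minimal sketch of the "identify the good labelers" step, under the assumption that a majority of labelers are good: use the per-example majority vote as a proxy for the ground truth and keep the labelers whose agreement with it is high. The agreement threshold and the majority-vote proxy are illustrative choices, not the procedure analyzed in the cited work.

```python
import numpy as np

rng = np.random.default_rng(2)

def identify_good_labelers(label_matrix, agreement_threshold=0.8):
    """Sketch: take the majority vote per example as a proxy for the truth,
    then keep labelers whose agreement with the majority vote is high.
    (Threshold and proxy are illustrative assumptions.)"""
    # label_matrix[i, j] = binary label given by labeler j on example i
    majority = (label_matrix.mean(axis=1) >= 0.5).astype(int)
    agreement = (label_matrix == majority[:, None]).mean(axis=0)
    return np.flatnonzero(agreement >= agreement_threshold)

# Toy usage: 7 labelers, 4 of them accurate, on 100 binary examples.
truth = rng.integers(0, 2, size=100)
labels = np.column_stack(
    [np.where(rng.random(100) < p, truth, 1 - truth)   # correct with prob p
     for p in (0.95, 0.95, 0.9, 0.9, 0.55, 0.5, 0.5)]
)
print(identify_good_labelers(labels))
```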

Crowdsourcing is an effective and efficient paradigm for obtaining labels for an unlabeled corpus by employing crowd workers. This work considers the budget allocation problem in a generalized setting on a graph of instances to be labeled, where edges encode instance dependencies. Specifically, given a graph and a labeling budget, we propose an optimal policy to allocate the budget among the instances to maximize the overall labeling accuracy. We formulate the problem as a Bayesian Markov Decision Process (MDP), where we define our task as an optimization problem that maximizes the overall label accuracy under budget constraints. Then, we propose a novel stage-wise reward function that considers the effect of worker labels on the whole graph at each time step. This reward function is utilized to find an optimal policy for the optimization problem. Theoretically, we show that our proposed policies are consistent when the budget is infinite. We conduct extensive experiments on five real-world graph datasets and demonstrate the effectiveness of the proposed policies in achieving a higher label accuracy under budget constraints.
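As a toy stand-in for the budget-allocation policy, the sketch below keeps a Beta posterior per node and greedily spends each label on the node whose expected posterior-variance reduction, weighted by its degree, is largest. The degree weighting is a crude proxy for a stage-wise, whole-graph reward, and the Beta-Bernoulli model and greedy rule are assumptions made for illustration; they are not the Bayesian MDP policy of the cited work.

```python
import numpy as np

def allocate_budget(adjacency, budget, prior=(1.0, 1.0)):
    """Greedy budget allocation on a graph of instances (sketch).
    Each node keeps a Beta posterior over its binary label probability; at
    every step we query one worker label for the node whose expected drop
    in posterior variance, weighted by its degree, is largest.
    (Illustrative heuristic, not the cited paper's MDP policy.)"""
    rng = np.random.default_rng(3)
    n = adjacency.shape[0]
    alpha = np.full(n, prior[0])
    beta = np.full(n, prior[1])
    degree = adjacency.sum(axis=1)
    spent = np.zeros(n, dtype=int)

    def variance(a, b):
        return a * b / ((a + b) ** 2 * (a + b + 1))

    for _ in range(budget):
        cur = variance(alpha, beta)
        p = alpha / (alpha + beta)
        # Expected variance after one more label, averaged over both outcomes.
        nxt = p * variance(alpha + 1, beta) + (1 - p) * variance(alpha, beta + 1)
        reward = (cur - nxt) * (1 + degree)       # degree-weighted reduction
        i = int(np.argmax(reward))
        spent[i] += 1
        # Simulate a worker label for node i and update its posterior.
        if rng.random() < p[i]:
            alpha[i] += 1
        else:
            beta[i] += 1
    return spent

# Toy usage: a 5-node path graph and a budget of 12 worker labels.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1
print(allocate_budget(A, budget=12))
```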