Title: Coding for crowdsourced classification with XOR queries
Crowdsourcing has been widely adopted for large projects suited to human participation, in which tasks are usually distributed to workers. Many such projects involve classifying/labeling a collection of items via semi-supervised clustering, in which queries on small subsets of the items are assigned to workers in the crowd. The answers are collected by a taskmaster, and the goal is to fully recover the labels. This problem can be modeled as a sparsely encoded source coding problem, where each query answer, regarded as a code bit, is the XOR of a small number of labels, which serve as source information bits. While the design of compression/source coding schemes achieving Shannon's optimal compression rate is very well studied, few works have considered sparsely encoded schemes. In this paper we leverage the connections between this problem and well-studied codes with sparse representations for the channel coding problem to provide querying schemes with an almost optimal number of queries, each of which involves only a constant number of labels. We also extend this scenario to the case where some workers can be unresponsive; for this case, we propose querying schemes where each query involves only log n items, where n is the total number of items to be labeled. Furthermore, we consider classification under two correlated labeling systems and provide two-stage querying schemes with an almost optimal number of queries, each involving a constant number of labels.
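To make the query model concrete, the following is a minimal sketch (ours, not the paper's construction) of XOR querying as a linear system over GF(2): each answer is the parity of a constant-size subset of the unknown binary labels, and the taskmaster decodes by Gaussian elimination. The random query design and the 2n query budget are illustrative assumptions; the paper's point is that carefully designed sparse codes bring the budget down to an almost optimal number of queries.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_queries(n, m, d, rng):
    """m x n binary query matrix: row i selects the d items whose labels
    worker i XORs together."""
    A = np.zeros((m, n), dtype=np.uint8)
    for i in range(m):
        A[i, rng.choice(n, size=d, replace=False)] = 1
    return A

def gf2_solve(A, b):
    """Solve A x = b over GF(2) by Gauss-Jordan elimination.
    Returns the unique solution if A has full column rank, else None."""
    A, b = A.copy(), b.copy()
    m, n = A.shape
    row, pivots = 0, []
    for col in range(n):
        pivot = next((r for r in range(row, m) if A[r, col]), None)
        if pivot is None:
            return None  # this item is not pinned down by the queries
        A[[row, pivot]], b[[row, pivot]] = A[[pivot, row]], b[[pivot, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    x = np.zeros(n, dtype=np.uint8)
    x[pivots] = b[:len(pivots)]
    return x

n, d = 100, 3            # n items; each query XORs d = O(1) labels
m = 2 * n                # illustrative budget; good codes get close to n
labels = rng.integers(0, 2, size=n, dtype=np.uint8)
A = make_queries(n, m, d, rng)
answers = (A @ labels) % 2          # noiseless worker answers
decoded = gf2_solve(A, answers)
if decoded is None:
    print("random queries were rank-deficient; draw again")
else:
    print("exact recovery:", np.array_equal(decoded, labels))
```

Because the query degree d is a constant, no worker ever has to examine more than d items, regardless of n.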
Award ID(s):
1717299
NSF-PAR ID:
10123113
Author(s) / Creator(s):
 
Date Published:
Journal Name:
IEEE Information Theory Workshop
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, crowdsourcing has become the method of choice for gathering labeled training data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring labeled data separately from the process of learning a classifier from the gathered data. This can give rise to computational and statistical challenges. For example, in most cases there are no known computationally efficient learning algorithms that are robust to the high level of noise in crowdsourced data, and efforts to eliminate noise through voting often require a large number of queries per example. In this paper, we show how, by interleaving the processes of labeling and learning, we can attain computational efficiency with much less overhead in the labeling cost. In particular, we consider the realizable setting where there exists a true target function in F, and we consider a pool of labelers. When a noticeable fraction of the labelers are perfect and the rest behave arbitrarily, we show that any F that can be efficiently learned in the traditional realizable PAC model can be learned in a computationally efficient manner by querying the crowd, despite high amounts of noise in the responses. Moreover, we show that this can be done while each labeler only labels a constant number of examples and the number of labels requested per example, on average, is a constant. When no perfect labelers exist, a related task is to find a set of labelers which are good but not perfect. We show that we can identify all good labelers when a majority of the labelers are good. (See the illustrative sketch below.)
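A minimal sketch of the interleaving idea, under assumptions of our own rather than the paper's algorithm: a 1-D threshold class, a committee of size 9 for redundant labeling, and escalate-on-disagreement vetting, so that once a rough hypothesis is learned, most examples cost a single label.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy realizable setting: 1-D points in [0, 1], true target is a threshold.
def f_true(x):
    return 1 if x >= 0.5 else 0

# Labeler pool: a noticeable fraction are perfect, the rest answer randomly.
n_labelers, good_frac = 50, 0.7
is_good = rng.random(n_labelers) < good_frac

def query(i, x):
    return f_true(x) if is_good[i] else int(rng.integers(0, 2))

def majority(x, k):
    ids = rng.choice(n_labelers, size=k, replace=False)
    return int(sum(query(i, x) for i in ids) * 2 >= k)

X = rng.random(2000)

# Interleave labeling and learning: fit a rough hypothesis from a few
# redundantly labeled points, then let it vet single cheap labels.
seed_idx = rng.choice(len(X), size=30, replace=False)
seed = sorted((X[i], majority(X[i], 9)) for i in seed_idx)
theta = next((x for x, lab in seed if lab == 1), 1.0)  # crude threshold fit
queries_used = 30 * 9

labels = []
for x in X:
    guess = query(int(rng.integers(0, n_labelers)), x)  # one label per point
    queries_used += 1
    if guess != (1 if x >= theta else 0):  # disagrees with hypothesis: escalate
        guess = majority(x, 9)
        queries_used += 9
    labels.append(guess)

err = np.mean([lab != f_true(x) for x, lab in zip(X, labels)])
print(f"error: {err:.3f}, avg labels per example: {queries_used / len(X):.2f}")
```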
  2. Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or of two different distributions (the alternative hypothesis). We consider a new setting for this problem in which sample features are easily measured whereas sample labels are unknown and costly to obtain. Accordingly, we devise a three-stage framework for performing an effective two-sample test with only a small number of sample label queries: first, a classifier is trained on a uniformly labeled subset of samples to model the posterior probabilities of the labels; second, a novel query scheme dubbed bimodal query is used to query the labels of samples from both classes; and third, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Theoretical analysis and extensive experiments on several datasets demonstrate that the proposed test controls the Type I error and achieves a lower Type II error than uniform querying and certainty-based querying. Source code for the algorithms and experimental results is available at https://github.com/wayne0908/Label-Efficient-Two-Sample. (See the illustrative sketch below.)
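The three stages compress into a short sketch, with illustrative assumptions of ours (Gaussian toy data, a hand-rolled logistic-regression posterior, and no permutation step to calibrate the test, so only the raw FR statistic is printed; the authors' actual implementation is at the URL above).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)

def fr_cross_edges(Z, groups):
    """Friedman-Rafsky statistic: edges of the Euclidean MST of the pooled
    sample that join points from different groups (few edges => reject null)."""
    mst = minimum_spanning_tree(squareform(pdist(Z))).tocoo()
    return int(np.sum(groups[mst.row] != groups[mst.col]))

# Toy data: features are cheap; the labels mark the two samples under test.
n = 600
X = np.concatenate([rng.normal(0.0, 1, (n // 2, 2)),
                    rng.normal(0.8, 1, (n // 2, 2))])
y = np.concatenate([np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)])

# Stage 1: uniformly label a few samples, fit a posterior model
# (logistic regression via plain gradient descent).
idx0 = rng.choice(n, size=60, replace=False)
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X[idx0] @ w + b)))
    g = p - y[idx0]
    w -= 0.1 * (X[idx0].T @ g) / len(idx0)
    b -= 0.1 * g.mean()

# Stage 2: bimodal querying -- spend the label budget on the points whose
# estimated posterior is closest to 0 and closest to 1 (one batch per mode).
post = 1 / (1 + np.exp(-(X @ w + b)))
budget = 100
order = np.argsort(post)
queried = np.concatenate([order[:budget // 2], order[-budget // 2:]])

# Stage 3: FR test restricted to the queried points.
stat = fr_cross_edges(X[queried], y[queried])
print(f"FR cross-match edges on {budget} queried points: {stat}")
```

Under the alternative, the two queried batches separate in feature space, so the MST has few cross-group edges and the statistic is small.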
  3. Ta-Shma, Amnon (Ed.)
In a recent work, Gryaznov, Pudlák and Talebanfard (CCC '22) introduced a linear variant of read-once branching programs, with motivations from circuit and proof complexity. Such a read-once linear branching program is a branching program in which each node is allowed to make 𝔽₂-linear queries, and which is read-once in the sense that the queries on each path are linearly independent. As their main result, they constructed an explicit function with average-case complexity 2^{n/3-o(n)} against a slightly restricted model, which they call strongly read-once linear branching programs. The main tool in their lower bound is a new type of extractor, called directional affine extractors, that they introduced. Our main result is an explicit function with 2^{n-o(n)} average-case complexity against the strongly read-once linear branching program model, which is almost optimal. This result is based on a new connection from this problem to sumset extractors, a randomness extractor model introduced by Chattopadhyay and Li (STOC '16) as a generalization of many other well-studied models, including two-source extractors, affine extractors and small-space extractors. With this new connection, our lower bound follows naturally from a recent construction of sumset extractors by Chattopadhyay and Liao (STOC '22). In addition, we show that directional affine extractors imply sumset extractors in a restricted setting. We observe that such restricted sumset sources are enough to derive lower bounds, and we obtain an arguably more modular proof of the lower bound of Gryaznov, Pudlák and Talebanfard. We also initiate a study of pseudorandomness against linear branching programs. Our main result here is a hitting set generator construction against regular linear branching programs with constant width, which we derive from a connection to Kakeya sets over finite fields. (A toy illustration of the read-once condition follows.)
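As a toy illustration of the model definition only (the example vectors are ours, not from the paper): the read-once condition requires the 𝔽₂-linear queries made along any path to be linearly independent, which is just a rank computation over GF(2).

```python
def gf2_rank(vectors):
    """Rank over GF(2) of a list of 0/1 query vectors, via a bitwise
    echelon basis (basis[i] stores a vector whose leading bit is i)."""
    n = len(vectors[0])
    basis = [0] * n
    rank = 0
    for v in vectors:
        x = int("".join(map(str, v)), 2)
        for i in reversed(range(n)):
            if not (x >> i) & 1:
                continue
            if basis[i]:
                x ^= basis[i]   # eliminate the leading bit, keep reducing
            else:
                basis[i] = x    # a new independent query
                rank += 1
                break
    return rank

# Queries made along two candidate paths of a linear branching program
# (each query is the F2 inner product <q, x> of a mask q with the input x).
path_ok  = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
path_bad = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]  # 3rd = XOR of first two

for path in (path_ok, path_bad):
    print("valid read-once path:", gf2_rank(path) == len(path))
```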
  4. We study the classic set cover problem from the perspective of sub-linear algorithms. Given access to a collection of m sets over n elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities. On one hand, we first show an adaptation of the streaming algorithm presented in [17] to the sub-linear query model that returns an α-approximate cover using Õ(m(n/k)^{1/(α−1)} + nk) queries to the input, where k denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of k, the required number of queries is , even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require Ω(nk) queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter k. On the other hand, we show that this bound is not optimal for larger values of the parameter k, as there exists a (1 + ε)-approximation algorithm with Õ(mn/(kε²)) queries. We show that this bound is essentially tight for sufficiently small constant ε, by establishing a lower bound of query complexity. Our lower-bound results follow by carefully designing two distributions of instances that are hard to distinguish. In particular, our first lower bound involves a probabilistic construction of a certain set system with a minimum set cover of size αk, with the key property that a small number of "almost uniformly distributed" modifications can reduce the minimum set cover size down to k. Thus, these modifications are not detectable unless a large number of queries are asked. We believe that our probabilistic construction technique might find applications to lower bounds for other combinatorial optimization problems. (See the query-model sketch below.)
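A sketch of the query model, with one loud assumption: the abstract does not spell out the access model, so we use an EltOf/SetOf-style oracle common in sub-linear set cover work. The baseline below simply reads every set in full and runs greedy, costing up to m·n queries; that is the trivial bound the paper's algorithms improve on, not the adapted streaming algorithm from [17].

```python
import random

class SetSystemOracle:
    """Query access to a set system: elt_of(s, j) returns the j-th element
    of set s; set_of(e, j) returns the j-th set containing element e.
    Each call counts as one query (set sizes are assumed known for free)."""
    def __init__(self, sets, n):
        self.sets = [sorted(S) for S in sets]
        self.incidence = [[] for _ in range(n)]
        for s, S in enumerate(self.sets):
            for e in S:
                self.incidence[e].append(s)
        self.queries = 0
    def elt_of(self, s, j):
        self.queries += 1
        return self.sets[s][j]
    def set_of(self, e, j):
        self.queries += 1
        return self.incidence[e][j]
    def set_size(self, s):
        return len(self.sets[s])

def naive_greedy(oracle, m, n):
    """Baseline greedy cover that reads every set in full: Theta(sum |S|),
    i.e. up to m*n queries -- the cost sub-linear algorithms avoid."""
    full = [{oracle.elt_of(s, j) for j in range(oracle.set_size(s))}
            for s in range(m)]
    uncovered, cover = set(range(n)), []
    while uncovered:
        s = max(range(m), key=lambda t: len(full[t] & uncovered))
        cover.append(s)
        uncovered -= full[s]
    return cover

random.seed(3)
n, m = 40, 12
sets = [set(random.sample(range(n), 8)) for _ in range(m - 1)]
sets.append(set(range(n)) - set().union(*sets) | {0})  # ensure coverage
oracle = SetSystemOracle(sets, n)
cover = naive_greedy(oracle, m, n)
print(f"cover size: {len(cover)}, queries used: {oracle.queries}")
```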