

Title: Dynamically monitoring crowd-worker's reliability with interval-valued labels

Crowdsourcing has rapidly become a computing paradigm in machine learning and artificial intelligence. In crowdsourcing, multiple labels are collected from crowd-workers on an instance, usually through the Internet. These labels are then aggregated into a single label intended to match the ground truth of the instance. Due to its open nature, human workers in crowdsourcing usually come with various levels of knowledge and socio-economic backgrounds. Effectively handling such human factors has been a focus in the study and applications of crowdsourcing. For example, Bi et al studied the impacts of worker's dedication, expertise, judgment, and task difficulty (Bi et al 2014). Qiu et al offered methods for selecting workers based on behavior prediction (Qiu et al 2016). Barbosa and Chen suggested rehumanizing crowdsourcing to deal with human biases (Barbosa 2019). Checco et al studied adversarial attacks on crowdsourcing for quality control (Checco et al 2020). Many more related works are available in the literature. In contrast to commonly used binary-valued labels, interval-valued labels (IVLs) have been introduced very recently (Hu et al 2021). Applying statistical and probabilistic properties of interval-valued datasets, Spurling et al quantitatively defined worker's reliability in four measures: correctness, confidence, stability, and predictability (Spurling et al 2021). Calculating these measures, except correctness, does not require the ground truth of each instance but only the worker’s IVLs. Applying these quantified reliability measures, people have significantly improved the overall quality of crowdsourcing (Spurling et al 2022). However, in real-world applications, the reliability of a worker may vary over time rather than remain constant. It is necessary to monitor worker’s reliability dynamically. Because a worker j labels instances sequentially, we treat j’s IVLs as an interval-valued time series in our approach.
Assuming j’s reliability relies on the IVLs within a time window only, we calculate j’s reliability measures with the IVLs in the current time window. Moving the time window forward with our proposed practical strategies, we can monitor j’s reliability dynamically. Furthermore, the four reliability measures derived from IVLs are time varying too. With regression analysis, we can separate each reliability measure into an explainable trend and possible errors. To validate our approaches, we use four real-world benchmark datasets in our computational experiments. The main findings are as follows. The reliability-weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP) schemes consistently outperform the base schemes, with much higher accuracy, precision, recall, and F1-score. The base schemes are majority voting (MV), interval majority voting (IMV), and preferred matching probability (PMP). Through monitoring worker’s reliability, our computational experiments have successfully identified possible attackers. By removing the identified attackers, we have ensured the quality. We have also examined the impact of window size selection. It is necessary to monitor worker’s reliability dynamically, and our computational results provide evidence for the potential success of our approaches. This work is partially supported by the US National Science Foundation through the grant award NSF/OIA-1946391.
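The windowing and trend-separation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give formulas for the four reliability measures or the window-moving strategies, so `confidence_proxy` (one minus the mean interval width, for IVLs assumed to lie in [0, 1]), the fixed `step`, and the function names `sliding_measure` and `linear_trend` are all hypothetical stand-ins. The trend is fitted by ordinary least squares over the window index, which is only one possible form of regression analysis.

```python
from statistics import mean


def confidence_proxy(ivls):
    """Hypothetical stand-in measure: narrower intervals give a higher score.

    Assumes each IVL is a (lower, upper) pair inside [0, 1]. This is NOT the
    confidence definition of Spurling et al; it only makes the sketch runnable.
    """
    return 1.0 - mean(hi - lo for lo, hi in ivls)


def sliding_measure(ivls, window, step=1, measure=confidence_proxy):
    """Evaluate `measure` on each time window of one worker's IVL sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(ivls) - window + 1, step)
    return [measure(ivls[s:s + window]) for s in starts]


def linear_trend(values):
    """Least-squares line over the window index t = 0, 1, ..., n - 1.

    Returns (slope, intercept, residuals): the fitted line plays the role of
    the explainable trend, and the residuals the role of the errors.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two windows to fit a trend")
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    residuals = [v - (intercept + slope * t) for t, v in enumerate(values)]
    return slope, intercept, residuals


# Illustrative data (made up): a worker whose intervals widen over time.
ivls = [(0.4, 0.6), (0.4, 0.6), (0.3, 0.7), (0.3, 0.7), (0.2, 0.8), (0.2, 0.8)]
series = sliding_measure(ivls, window=2, step=2)
slope, intercept, residuals = linear_trend(series)
print(series, slope, intercept)
```

A negative slope in this toy setting means the proxy score declines from window to window. Any rule for flagging a worker as a possible attacker from such a trend would need thresholds that the abstract does not specify.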

 
Award ID(s):
1946391
NSF-PAR ID:
10497403
Author(s) / Creator(s):
;
Publisher / Repository:
AHFE International
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With inputs from human crowds, usually through the Internet, crowdsourcing has become a promising methodology in AI and machine learning for applications that require human knowledge. Researchers have recently proposed interval-valued labels (IVLs), instead of commonly used binary-valued ones, to manage uncertainty in crowdsourcing [19]. However, that work has not yet taken the crowd worker’s reliability into consideration. Crowd workers usually come with various social and economic backgrounds, and have different levels of reliability. To further improve the overall quality of crowdsourcing with IVLs, this work presents practical methods that quantitatively estimate worker’s reliability in terms of his/her correctness, confidence, stability, and predictability from his/her IVLs. With worker’s reliability, this paper proposes two learning schemes: weighted interval majority voting (WIMV) and weighted preferred matching probability (WPMP). Computational experiments on sample datasets demonstrate that both WIMV and WPMP can significantly improve learning results in terms of higher precision, accuracy, and F1-score than other methods. 
  2. Law, Edith ; Vaughan, Jennifer W (Ed.)
    In this paper, we analyze PAC learnability from labels produced by crowdsourcing. In our setting, unlabeled examples are drawn from a distribution and labels are crowdsourced from workers who operate under classification noise, each with their own noise parameter. We develop an end-to-end crowdsourced PAC learning algorithm that takes unlabeled data points as input and outputs a trained classifier. Our three-step algorithm incorporates majority voting, pure-exploration bandits, and noisy-PAC learning. We prove several guarantees on the number of tasks labeled by workers for PAC learning in this setting and show that our algorithm improves upon the baseline by reducing the total number of tasks given to workers. We demonstrate the robustness of our algorithm by exploring its application to additional realistic crowdsourcing settings.
  3. Similar content has tremendous utility in classroom and online learning environments. For example, similar content can be used to combat cheating, track students’ learning over time, and model students’ latent knowledge. These different use cases for similar content all rely on different notions of similarity, which makes it difficult to determine contents’ similarities. Crowdsourcing is an effective way to identify similar content in a variety of situations by providing workers with guidelines on how to identify similar content for a particular use case. However, crowdsourced opinions are rarely homogeneous and therefore must be aggregated into what is most likely the truth. This work presents the Dynamically Weighted Majority Vote method, a novel algorithm that combines aggregating workers’ crowdsourced opinions with estimating the reliability of each worker. This method was compared to the traditional majority vote method in both a simulation study and an empirical study, in which opinions on seventh-grade mathematics problems’ similarity were crowdsourced from middle school math teachers and college students. In both the simulation and the empirical study, the Dynamically Weighted Majority Vote method outperformed the traditional majority vote method, suggesting that this method should be used instead of majority vote in future crowdsourcing endeavors.
  4. Crowdsourcing has been widely adopted to perform large projects suitable for human participation, in which tasks are usually distributed to workers. Many such projects involve classifying or labeling certain collections of items through semisupervised clustering, in which queries on small subsets of the items are assigned to workers in the crowd. The answers are collected by a taskmaster and the goal is to fully recover the labels. This problem can be modeled as a sparsely encoded source coding problem, where each query answer, regarded as a code bit, is the XOR of a small number of labels, regarded as source information bits. While the problem of designing compression/source coding schemes achieving Shannon’s optimal compression rate is very well-studied, few have considered sparsely encoded schemes. In this paper we leverage the connections between this problem and well-studied codes with sparse representations for the channel coding problem to provide querying schemes with an almost optimal number of queries, each of which involves only a constant number of labels. We also extend this scenario to the case where some workers can be unresponsive. For this case, we propose querying schemes where each query involves only log n items, where n is the total number of items to be labeled. Furthermore, we consider classification of two correlated labeling systems and provide two-stage querying schemes with an almost optimal number of queries, each involving a constant number of labels.