We consider the problem of reconstructing a rank-one matrix from a revealed subset of its entries when some of the revealed entries are corrupted with perturbations that are unknown and can be arbitrarily large. It is not known which revealed entries are corrupted. We propose a new algorithm combining alternating minimization with extreme-value filtering and provide necessary and sufficient conditions to recover the original rank-one matrix. In particular, we show that our proposed algorithm is optimal when the set of revealed entries is given by an Erdős-Rényi random graph.
These results are then applied to the problem of classification from crowdsourced data under the assumption that while the majority of the workers are governed by the standard single-coin Dawid-Skene model (i.e., they output the correct answer with a certain probability), some of the workers can deviate arbitrarily from this model. In particular, the "adversarial" workers could even make decisions designed to make the algorithm output an incorrect answer. Extensive experimental results show that our algorithm for this problem, based on rank-one matrix completion with perturbations, outperforms all other state-of-the-art methods in such an adversarial scenario.
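The approach described above can be sketched in a few lines: alternate least-squares updates of the two rank-one factors, and after each round distrust the revealed entries with the largest residuals. The update rules, trimming fraction, and iteration count below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def robust_rank1_complete(M_obs, mask, iters=50, trim_frac=0.1, rng=None):
    """Alternating minimization with extreme-value filtering (sketch).

    M_obs : observed matrix (corrupted entries included).
    mask  : boolean array marking revealed entries.
    Each round refits the rank-one factors by least squares on the
    currently trusted entries, then distrusts the trim_frac fraction of
    revealed entries with the largest residuals."""
    rng = np.random.default_rng(rng)
    n, m = M_obs.shape
    u = rng.standard_normal(n)
    v = rng.standard_normal(m)
    active = mask.copy()
    for _ in range(iters):
        for i in range(n):  # update u with v fixed (row-wise least squares)
            cols = active[i]
            if cols.any():
                u[i] = M_obs[i, cols] @ v[cols] / (v[cols] @ v[cols])
        for j in range(m):  # update v with u fixed (column-wise least squares)
            rows = active[:, j]
            if rows.any():
                v[j] = M_obs[rows, j] @ u[rows] / (u[rows] @ u[rows])
        resid = np.abs(M_obs - np.outer(u, v)) * mask
        k = int(trim_frac * mask.sum())
        if k > 0:  # distrust the k revealed entries with largest residuals
            thresh = np.partition(resid[mask], -k)[-k]
            active = mask & (resid < thresh)
    return np.outer(u, v)
```

With enough clean entries per row and column, the trimming step eventually isolates the corrupted entries and the remaining alternating updates fit the clean rank-one structure exactly.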
Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers
We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to the observed label correlations between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optimum for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.
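The identifiability condition above (no bipartite connected component in the sampling graph) can be checked with a standard two-coloring pass over each component. The adjacency-dict interface below is an assumed representation for illustration, not taken from the paper.

```python
from collections import deque

def has_bipartite_component(adj):
    """Return True if any connected component of the sampling graph is
    bipartite. Per the recovery condition, skills are identifiable iff
    this returns False. adj: dict node -> set of neighbour nodes."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        bipartite = True
        while queue:  # BFS two-coloring of this component
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    bipartite = False  # odd cycle found
        if bipartite:
            return True
    return False
```

For example, an odd cycle of workers is non-bipartite (skills identifiable), while an even cycle is bipartite (skills not identifiable from those correlations alone).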
NSF-PAR ID: 10382054
Date Published:
Journal Name: Journal of Machine Learning Research
ISSN: 1533-7928
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Crowdsourcing platforms have emerged as popular venues for purchasing human intelligence at low cost for large volumes of tasks. As many low-paid workers are prone to give noisy answers, a common practice is to add redundancy by assigning multiple workers to each task and then simply averaging their answers. However, to fully harness the wisdom of the crowd, one needs to learn the heterogeneous quality of each worker. We resolve this fundamental challenge in crowdsourced regression tasks, i.e., where the answer takes continuous labels and identifying good or bad workers becomes much less trivial than in a classification setting with discrete labels. In particular, we introduce a Bayesian iterative scheme and show that it provably achieves the optimal mean squared error. Our evaluations on synthetic and real-world datasets support our theoretical results and show the superiority of the proposed scheme.
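To see why learning per-worker quality beats plain averaging in the regression setting, consider an EM-style scheme that alternates precision-weighted label estimates with per-worker noise-variance estimates. This is a generic baseline under an assumed Gaussian worker-noise model, not the paper's Bayesian scheme.

```python
import numpy as np

def iterative_regression_aggregate(Y, mask, iters=20):
    """EM-style aggregation of crowdsourced continuous labels (sketch).

    Y[i, j] : worker i's answer on task j; mask marks assigned pairs
    (every task is assumed to have at least one worker).
    Alternates precision-weighted label estimates with per-worker noise
    variance estimates under an assumed Gaussian worker model."""
    var = np.ones(Y.shape[0])            # initial per-worker noise variances
    for _ in range(iters):
        w = mask / var[:, None]          # per-entry precision weights
        est = (w * Y).sum(axis=0) / w.sum(axis=0)  # weighted label estimates
        resid2 = ((Y - est) ** 2) * mask
        var = resid2.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
        var = np.maximum(var, 1e-8)      # guard against degenerate zero variance
    return est, var
```

A known caveat of such naive alternation is that it can over-commit to a single low-residual worker; the abstract's point is precisely that a properly designed Bayesian iteration avoids this and attains the optimal mean squared error.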

Distributed learning platforms for processing large-scale datasets are becoming increasingly prevalent. In typical distributed implementations, a centralized master node breaks the dataset into smaller batches for parallel processing across distributed workers to achieve speedup and efficiency. Several computational tasks are of a sequential nature and involve multiple passes over the data. At each iteration over the data, it is common practice to randomly reshuffle the data at the master node, assigning different batches for each worker to process. This random reshuffling operation comes at the cost of extra communication overhead, since at each shuffle, new data points need to be delivered to the distributed workers. In this paper, we focus on characterizing the information-theoretically optimal communication overhead for the distributed data shuffling problem. We propose a novel coded data delivery scheme for the case of no excess storage, where every worker can only store the assigned data batches under processing. Our scheme exploits a new type of coding opportunity and is applicable to any arbitrary shuffle, and for any number of workers. We also present information-theoretic lower bounds on the minimum communication overhead for data shuffling, and show that the proposed scheme matches this lower bound for the worst-case communication overhead.
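The coding opportunity behind such delivery schemes is easiest to see in a two-worker toy case (a classic index-coding example, not the paper's general scheme): when each worker already caches the batch the other one needs, a single XOR broadcast replaces two unicast transmissions.

```python
import numpy as np

def xor_coded_delivery(batch_a, batch_b):
    """Worker 1 caches batch A but needs B; worker 2 caches B but needs A.
    Instead of unicasting both batches (two transmissions), the master
    broadcasts one XOR-coded message; each worker decodes it against its
    own cached batch. Toy illustration of the coding opportunity."""
    coded = np.bitwise_xor(batch_a, batch_b)        # single broadcast message
    decoded_at_w1 = np.bitwise_xor(coded, batch_a)  # worker 1 recovers B
    decoded_at_w2 = np.bitwise_xor(coded, batch_b)  # worker 2 recovers A
    return decoded_at_w1, decoded_at_w2
```

Here one coded transmission serves both workers, halving the communication overhead for this shuffle; the paper's scheme generalizes this idea to arbitrary shuffles and any number of workers.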

Low-rank matrix recovery problems involving high-dimensional and heterogeneous data appear in applications throughout statistics and machine learning. The contribution of this paper is to establish the fundamental limits of recovery for a broad class of these problems. In particular, we study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and underlying factors. These results are based on a novel reduction from the low-rank matrix tensor product model (with homogeneous noise) to a rank-one model with heteroskedastic noise. As an application of our main result, we show that recently proposed methods based on applying principal component analysis (PCA) to weighted combinations of the data are optimal in some settings but suboptimal in others. We also provide numerical results comparing our asymptotic formulas with the performance of methods based on weighted PCA, gradient descent, and approximate message passing.
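A minimal version of the weighted-PCA idea mentioned above: combine several noisy views of the same rank-one matrix with inverse-variance weights, then keep the leading singular pair. This is an illustrative baseline of the kind analyzed in the paper, assuming known noise variances; it is not the paper's asymptotically optimal estimator.

```python
import numpy as np

def weighted_pca_rank1(obs, noise_vars):
    """Inverse-variance-weighted PCA estimate of a rank-one matrix.

    obs        : list of noisy observations of the same rank-one matrix.
    noise_vars : per-observation noise variances (assumed known here).
    Averages the observations with inverse-variance weights, then keeps
    the leading singular pair as the rank-one estimate."""
    w = np.array([1.0 / v for v in noise_vars])
    w = w / w.sum()                                   # normalized weights
    combined = sum(wi * Yi for wi, Yi in zip(w, obs))  # weighted combination
    U, s, Vt = np.linalg.svd(combined)
    return s[0] * np.outer(U[:, 0], Vt[0])            # best rank-one fit
```

The paper's point is precisely that such fixed weighting schemes match the fundamental limit in some noise regimes but fall short in others.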

We consider sparse matrix estimation where the goal is to estimate an n-by-n matrix from noisy observations of a small subset of its entries. We analyze the estimation error of the popularly used collaborative filtering algorithm for the sparse regime. Specifically, we propose a novel iterative variant of the algorithm, adapted to handle the setting of sparse observations. We establish that as long as the number of entries observed at random is a logarithmic factor larger than linear in n, the estimation error with respect to the entrywise max norm decays to zero as n goes to infinity, assuming the underlying matrix of interest has constant rank r. Our result is robust to model misspecification in that if the underlying matrix is approximately rank r, then the estimation error decays to the approximation error with respect to the [Formula: see text]-norm. In the process, we establish the algorithm's ability to handle arbitrary bounded noise in the observations.
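The core collaborative-filtering step can be sketched as follows: score row-row similarity from co-observed entries, then fill each missing entry by a similarity-weighted average over the rows that observed it. The mean-product similarity and clipping rule below are illustrative assumptions, a crude stand-in for the distance estimates of the paper's iterative variant.

```python
import numpy as np

def collaborative_filter(M_obs, mask):
    """Fill missing entries by averaging over positively similar rows.

    M_obs : noisy, partially observed matrix; mask marks observed entries.
    Similarity between two rows is the mean product of their co-observed
    entries -- a crude proxy for the distances used by iterative
    collaborative filtering."""
    n, m = M_obs.shape
    sim = np.zeros((n, n))
    for i in range(n):  # row-row similarity from overlapping columns
        for k in range(n):
            common = mask[i] & mask[k]
            if common.any():
                sim[i, k] = (M_obs[i, common] * M_obs[k, common]).mean()
    est = M_obs.copy()
    for i in range(n):
        for j in range(m):
            if mask[i, j]:
                continue             # keep observed entries as-is
            rows = mask[:, j]        # rows that observed column j
            w = np.clip(sim[i, rows], 0.0, None)  # keep positively similar rows
            if w.sum() > 0:
                est[i, j] = w @ M_obs[rows, j] / w.sum()
    return est
```

In the sparse regime the paper targets, the key difficulty is that two rows rarely share co-observed columns, which is what the paper's iterative variant of this distance computation is designed to overcome.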