Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
Free, publicly-accessible full text available December 1, 2022
Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent of features. Practically, annotations errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES^2 on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. As ofmore »
It is important to collect credible training samples $(x,y)$ for building data-intensive learning systems (e.g., a deep learning system). Asking people to report complex distribution $p(x)$, though theoretically viable, is challenging in practice. This is primarily due to the cognitive loads required for human agents to form the report of this highly complicated information. While classical elicitation mechanisms apply to eliciting a complex and generative (and continuous) distribution $p(x)$, we are interested in eliciting samples $x_i \sim p(x)$ from agents directly. We coin the above problem sample elicitation. This paper introduces a deep learning aided method to incentivize credible sample contributions from self-interested and rational agents. We show that with an accurate estimation of a certain $f$-divergence function we can achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantees via studying the variational forms of the $f$-divergence function. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples. Experiments on synthetic data, MNIST, and CIFAR-10 datasets demonstrate that our mechanism elicits truthful samples. Our implementation is available at https://github.com/weijiaheng/Credible-sample-elicitation.git.
Knowing whether a published research result can be replicated is important. Carrying out direct replication of published research incurs a high cost. There are efforts tried to use machine learning aided methods to predict scientific claims’ replicability. However, existing machine learning aided approaches use only hand-extracted statistics features such as p-value, sample size, etc. without utilizing research papers’ text information and train only on a very small size of annotated data without making the most use of a large number of unlabeled articles. Therefore, it is desirable to develop effective machine learning aided automatic methods which can automatically extract text information as features so that we can benefit from Natural Language Processing techniques. Besides, we aim for an approach that benefits from both labeled and the large number of unlabeled data. In this paper, we propose two weakly supervised learning approaches that use automatically extracted text information of research papers to improve the prediction accuracy of research replication using both labeled and unlabeled datasets. Our experiments over real-world datasets show that our approaches obtain much better prediction performance compared to the supervised models utilizing only statistic features and a small size of labeled dataset. Further, we are able to achievemore »