 Award ID(s):
 2008074
 Publication Date:
 NSFPAR ID:
 10281844
 Journal Name:
 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
 Sponsoring Org:
 National Science Foundation
More Like this

Learning from label proportions (LLP) is a weakly supervised setting for classification in whichunlabeled training instances are grouped into bags, and each bag is annotated with the proportion ofeach class occurring in that bag. Prior work on LLP has yet to establish a consistent learning procedure,nor does there exist a theoretically justified, general purpose training criterion. In this work we addressthese two issues by posing LLP in terms of mutual contamination models (MCMs), which have recentlybeen applied successfully to study various other weak supervision settings. In the process, we establishseveral novel technical results for MCMs, including unbiased losses and generalization error bounds undernoniid sampling plans. We also point out the limitations ofa common experimental setting for LLP,and propose a new one based on our MCM framework.

Abstract Background Natural language processing (NLP) tasks in the health domain often deal with limited amount of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to a task at hand. Recently, pretrained largescale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have been proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at such a large scale that is unlikely to obtain in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.
Method We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse andmore »
Results On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bagofwords features.
Conclusion As long as the amount of training documents is not extremely few, finetuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bagofwords features shall be considered.

Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)The Neuronix highperformance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from lowend consumer grade devices such as the Nvidia GTX 970 to higherend devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely computeintensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floatingpoint calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should producemore »

Temporal networks serve as abstractions of many realworld dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been unseen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the timeconsuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neuralnetwork model CAWN to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAWN is evaluated to predictmore »

Structured nonconvex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are wellunderstood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not wellstudied in the nonconvex setting. In order to address this shortcoming, in this work, we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm—a widely used algorithm in practice. Specifically, based on the relationship between SGD and Markov Chains [1], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the nonconvex and nonsmooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, the above two results could be leveraged to construct confidence intervals for nonconvex problems that are trained using the SGD algorithm.