Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Deep neural models have recently gained popularity for text classification due to their expressive power and minimal need for feature engineering. However, applying deep neural networks to hierarchical text classification remains challenging, because they rely heavily on large amounts of training data and cannot easily determine the appropriate level for each document in the hierarchy. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data; it requires only easy-to-provide weak supervision signals such as a few class-related documents or keywords. Our method leverages such weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During the training process, our model features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.
Weakly-Supervised Neural Text Classification
Deep neural networks are gaining popularity for the
classic text classification task, due to their strong expressive power
and reduced need for feature engineering. Despite this appeal,
neural text classification models suffer from the lack of
training data in many real-world applications. Although many semi-supervised
and weakly-supervised text classification models exist,
they cannot be easily applied to deep neural models and
support only limited supervision types. In this paper, we propose a
weakly-supervised method that addresses the lack of training data
in neural text classification. Our method consists of two modules:
(1) a pseudo-document generator that leverages seed information
to generate pseudo-labeled documents for model pre-training, and
(2) a self-training module that bootstraps on real unlabeled data for
model refinement. Our method has the flexibility to handle different
types of weak supervision and can be easily integrated into existing
deep neural models for text classification. We have performed extensive
experiments on three real-world datasets from different domains.
The results demonstrate that our proposed method achieves
encouraging performance without requiring excessive training data
and outperforms baseline methods significantly.
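
The self-training module described above bootstraps on unlabeled data, but the abstract does not state the exact update rule. The following is a minimal sketch of one common bootstrapping scheme, assuming the classifier outputs strictly positive class probabilities: predictions are sharpened into soft pseudo-labels, and training stops once few predicted labels change. The function names (`sharpen_targets`, `argmax_labels`, `changed_fraction`) and the sharpening formula are illustrative assumptions, not taken from the paper.

```python
def sharpen_targets(probs):
    """Turn current class probabilities into sharper soft pseudo-labels.

    Assumed rule (not from the source): square each probability, divide by
    the soft frequency of its class, then renormalize each row to sum to 1.
    `probs` is a list of rows; each row holds positive class probabilities.
    """
    n_classes = len(probs[0])
    # Soft class frequency: total probability mass assigned to each class.
    class_freq = [sum(row[j] for row in probs) for j in range(n_classes)]
    targets = []
    for row in probs:
        weights = [row[j] ** 2 / class_freq[j] for j in range(n_classes)]
        total = sum(weights)
        targets.append([w / total for w in weights])
    return targets


def argmax_labels(probs):
    """Hard label (index of the largest probability) for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def changed_fraction(old_labels, new_labels):
    """Fraction of documents whose hard label changed between iterations."""
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    return changed / len(old_labels)


if __name__ == "__main__":
    # Hypothetical predictions for two unlabeled documents and two classes.
    probs = [[0.6, 0.4], [0.3, 0.7]]
    targets = sharpen_targets(probs)
    print(targets)
    print(changed_fraction(argmax_labels(probs), argmax_labels(targets)))
```

In a full loop, the model would be retrained toward `targets`, predictions recomputed, and iteration stopped when `changed_fraction` falls below a small, user-chosen threshold.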
- Publication Date:
- NSF-PAR ID:
- 10079169
- Journal Name:
- Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018
- Volume:
- 2018
- Issue:
- 1
- Page Range or eLocation-ID:
- 983 to 992
- Sponsoring Org:
- National Science Foundation
More Like this
-
Weakly supervised text classification methods typically train a deep neural classifier based on pseudo-labels. The quality of pseudo-labels is crucial to final performance, but they are inevitably noisy due to their heuristic nature, so selecting the correct ones has substantial potential to boost performance. One straightforward solution is to select samples based on the softmax probability scores in the neural classifier corresponding to their pseudo-labels. However, we show through our experiments that such solutions are ineffective and unstable due to the erroneously high-confidence predictions from poorly calibrated models. Recent studies on the memorization effects of deep neural models suggest that these models first memorize training samples with clean labels and then those with noisy labels. Inspired by this observation, we propose a novel pseudo-label selection method, LOPS, that takes the learning order of samples into consideration. We hypothesize that the learning order reflects the probability of wrong annotation in terms of ranking, and therefore propose to select the samples that are learnt earlier. LOPS can be viewed as a strong performance-boost plug-in to most existing weakly-supervised text classification methods, as confirmed in extensive experiments on four real-world datasets.
-
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing classifiers leverage textual information in each document. However, in many domains, documents are accompanied by various types of metadata (e.g., authors, venue, and year of a research paper). These metadata and their combinations may serve as strong category indicators in addition to textual contents. In this paper, we explore the potential of using metadata to help weakly supervised text classification. To be specific, we model the relationships between documents and metadata via a heterogeneous information network. To effectively capture higher-order structures in the network, we use motifs to describe metadata combinations. We propose a novel framework, named MotifClass, which (1) selects category-indicative motif instances, (2) retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, and (3) trains a text classifier using the pseudo training data. Extensive experiments on real-world datasets demonstrate the superior performance of MotifClass to existing weakly supervised text classification approaches. Further analysis shows the benefit of considering higher-order metadata information in our framework.
-
Supervised deep learning methods have achieved state-of-the-art performance on the task of named entity recognition (NER). However, such methods suffer from high cost and low efficiency in training data annotation, leading to highly specialized NER models that cannot be easily adapted to new domains. Recently, distant supervision has been applied to replace human annotation, thanks to the fast development of domain-specific knowledge bases. However, the generated noisy labels pose significant challenges in learning effective neural models with distant supervision. We propose PATNER, a distantly supervised NER model that effectively deals with noisy distant supervision from domain-specific dictionaries. PATNER does not require human-annotated training data but only relies on unlabeled data and incomplete domain-specific dictionaries for distant supervision. It incorporates the distant labeling uncertainty into the neural model training to enhance distant supervision. We go beyond the traditional sequence labeling framework and propose a more effective fuzzy neural model using the tie-or-break tagging scheme for the NER task. Extensive experiments on three benchmark datasets in two domains demonstrate the power of PATNER. Case studies on two additional real-world datasets demonstrate that PATNER improves the distant NER performance in both entity boundary detection and entity type recognition. The results show a great…
-
Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus’ heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases – a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce…