Title: Open Set Text Classification using Convolutional Neural Networks
In a closed world setting, classifiers are trained on examples from a number of classes and tested with unseen examples belonging to the same set of classes. However, in most real-world scenarios, a trained classifier is likely to come across novel examples that do not belong to any of the known classes. Such examples should ideally be categorized as belonging to an unknown class. The goal of an open set classifier is to anticipate and be ready to handle test examples of classes unseen during training. The classifier should be able to declare that a test example belongs to a class it does not know, and possibly, incorporate it into its knowledge as an example of a new class it has encountered. There is some published research in open world image classification, but open set text classification remains mostly unexplored. In this paper, we investigate the suitability of Convolutional Neural Networks (CNNs) for open set text classification. We find that CNNs are good feature extractors and hence perform better than existing state-of-the-art open set classifiers in smaller domains, although their open set classification abilities in general still need to be investigated.
Journal Name:
International Conference on Natural Language Processing, 2017
Sponsoring Org:
National Science Foundation
More Like this
  1. Classic supervised learning makes the closed-world assumption that the classes seen in testing must have appeared in training. However, this assumption is often violated in real-world applications. For example, in a social media site, new topics emerge constantly and in e-commerce, new categories of products appear daily. A model that cannot detect new/unseen topics or products is hard to function well in such open environments. A desirable model working in such environments must be able to (1) reject examples from unseen classes (not appeared in training) and (2) incrementally learn the new/unseen classes to expand the existing model. This is called open-world learning (OWL). This paper proposes a new OWL method based on meta-learning. The key novelty is that the model maintains only a dynamic set of seen classes that allows new classes to be added or deleted with no need for model re-training. Each class is represented by a small set of training examples. In testing, the meta-classifier only uses the examples of the maintained seen classes (including the newly added classes) on-the-fly for classification and rejection. Experimental results with e-commerce product classification show that the proposed method is highly effective.
  2. In traditional graph learning tasks, such as node classification, learning is carried out in a closed-world setting where the number of classes and their training samples are provided to help train models, and the learning goal is to correctly classify unlabeled nodes into classes already known. In reality, due to limited labeling capability and dynamic evolving of networks, some nodes in the networks may not belong to any existing/seen classes, and therefore cannot be correctly classified by closed-world learning algorithms. In this paper, we propose a new open-world graph learning paradigm, where the learning goal is to not only classify nodes belonging to seen classes into correct groups, but also classify nodes not belonging to existing classes to an unseen class. The essential challenge of the open-world graph learning is that (1) unseen class has no labeled samples, and may exist in an arbitrary form different from existing seen classes; and (2) both graph feature learning and prediction should differentiate whether a node may belong to an existing/seen class or an unseen class. To tackle the challenges, we propose an uncertain node representation learning approach, using constrained variational graph autoencoder networks, where the label loss and class uncertainty loss constraints are used to ensure that the node representation learning are sensitive to unseen class. As a result, node embedding features are denoted by distributions, instead of deterministic feature vectors. By using a sampling process to generate multiple versions of feature vectors, we are able to test the certainty of a node belonging to seen classes, and automatically determine a threshold to reject nodes not belonging to seen classes as unseen class nodes. Experiments on real-world networks demonstrate the algorithm performance, comparing to baselines. Case studies and ablation analysis also show the rationale of our design for open-world graph learning.
  3. A fundamental limitation of applying semi-supervised learning in real-world settings is the assumption that unlabeled test data contains only classes previously encountered in the labeled training data. However, this assumption rarely holds for data in-the-wild, where instances belonging to novel classes may appear at testing time. Here, we introduce a novel open-world semi-supervised learning setting that formalizes the notion that novel classes may appear in the unlabeled test data. In this novel setting, the goal is to solve the class distribution mismatch between labeled and unlabeled data, where at the test time every input instance either needs to be classified into one of the existing classes or a new unseen class needs to be initialized. To tackle this challenging problem, we propose ORCA, an end-to-end deep learning approach that introduces uncertainty adaptive margin mechanism to circumvent the bias towards seen classes caused by learning discriminative features for seen classes faster than for the novel classes. In this way, ORCA reduces the gap between intra-class variance of seen with respect to novel classes. Experiments on image classification datasets and a single-cell annotation dataset demonstrate that ORCA consistently outperforms alternative baselines, achieving 25% improvement on seen and 96% improvement on novel classes of the ImageNet dataset.
  4. Hierarchical multi-label text classification (HMTC) aims to tag each document with a set of classes from a class hierarchy. Most existing HMTC methods train classifiers using massive human-labeled documents, which are often too costly to obtain in real-world applications. In this paper, we explore to conduct HMTC based on only class surface names as supervision signals. We observe that to perform HMTC, human experts typically first pinpoint a few most essential classes for the document as its “core classes”, and then check core classes’ ancestor classes to ensure the coverage. To mimic human experts, we propose a novel HMTC framework, named TaxoClass. Specifically, TaxoClass (1) calculates document-class similarities using a textual entailment model, (2) identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and (3) generalizes the classifier via multi-label self-training. Our experiments on two challenging datasets show TaxoClass can achieve around 0.71 Example-F1 using only class names, outperforming the best previous method by 25%.
  5. We present improvements over our previous approach to automatic winter hydrometeor classification by means of convolutional neural networks (CNNs), using more data and improved training techniques to achieve higher accuracy on a more complicated dataset than we had previously demonstrated. As an advancement of our previous proof-of-concept study, this work demonstrates broader usefulness of deep CNNs by using a substantially larger and more diverse dataset, which we make publicly available, from many more snow events. We describe the collection, processing, and sorting of this dataset of over 25,000 high-quality multiple-angle snowflake camera (MASC) image chips split nearly evenly between five geometric classes: aggregate, columnar crystal, planar crystal, graupel, and small particle. Raw images were collected over 32 snowfall events between November 2014 and May 2016 near Greeley, Colorado and were processed with an automated cropping and normalization algorithm to yield 224x224 pixel images containing possible hydrometeors. From the bulk set of over 8,400,000 extracted images, a smaller dataset of 14,793 images was sorted by image quality and recognizability (Q&R) using manual inspection. A presorting network trained on the Q&R dataset was applied to all 8,400,000+ images to automatically collect a subset of 283,351 good snowflake images. Roughly 5,000 representative examples were then collected from this subset manually for each of the five geometric classes. With a higher emphasis on in-class variety than our previous work, the final dataset yields trained networks that better capture the imperfect cases and diverse forms that occur within the broad categories studied to achieve an accuracy of 96.2% on a vastly more challenging dataset.