Mining Text Outliers in Document Directories

Fouche, Edouard; Meng, Yu; Guo, Fang; Zhuang, Honglei; Bohm, Klemens; Han, Jiawei

doi:10.1109/ICDM50108.2020.00024

Collections of documents are commonly organized into human-generated, domain-specific directory structures, such as email or document folders. Documents may, however, be filed in the wrong folder for many reasons, which makes them outliers with respect to the folder they end up in. Two kinds of errors can occur: (O) Out-of-distribution: the document does not belong to any existing folder in the directory; and (M) Misclassification: the document belongs to another folder. We address this specific combination of issues: we mine text outliers from massive document directories, considering both error types. We propose a new proximity-based algorithm, which we dub kj-Nearest Neighbors (kj-NN). Our algorithm detects text outliers by exploiting semantic similarities and introduces a self-supervision mechanism that estimates the relevance of the original labels. Our approach is efficient and robust to large proportions of outliers. kj-NN also promotes the interpretability of the results by proposing alternative label names and by finding the most similar documents for each outlier. Our real-world experiments demonstrate that our approach outperforms the competitors by a large margin.
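To make the two error types concrete, the following is a minimal sketch of a plain k-nearest-neighbor label-agreement check. It is not the kj-NN algorithm from the paper: it has no self-supervision step, and the toy vectors stand in for whatever document embeddings one would actually use. The function name `flag_outliers`, the parameters `k` and `min_sim`, and the example labels are all assumptions made for illustration. A document is tagged out-of-distribution when its neighbors are dissimilar on average, and misclassified when its neighbors mostly carry a different label.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_outliers(vectors, labels, k=3, min_sim=0.5):
    """Hypothetical helper: tag each document as ok, misclassified, or out_of_distribution.

    Returns one (tag, suggested_label) pair per document.
    """
    results = []
    for i, vec in enumerate(vectors):
        # Similarity to every other document, keeping the k most similar.
        neighbors = sorted(
            ((cosine(vec, other), labels[j])
             for j, other in enumerate(vectors) if j != i),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        mean_sim = sum(sim for sim, _ in neighbors) / len(neighbors)
        majority = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        if mean_sim < min_sim:
            # (O) nothing in the directory is close to this document.
            results.append(("out_of_distribution", None))
        elif majority != labels[i]:
            # (M) the closest documents mostly sit in another folder.
            results.append(("misclassified", majority))
        else:
            results.append(("ok", labels[i]))
    return results


# Toy embeddings (assumed): two clusters plus one document far from both.
vectors = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],   # "sports" cluster
    [0.95, 0.05, 0.0],                                    # filed under "politics" by mistake
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.2, 0.8, 0.0],   # "politics" cluster
    [0.0, 0.0, 1.0],                                      # unrelated to either folder
]
labels = ["sports", "sports", "sports", "politics",
          "politics", "politics", "politics", "sports"]

results = flag_outliers(vectors, labels, k=3, min_sim=0.5)
for label, (tag, suggestion) in zip(labels, results):
    print(label, tag, suggestion)
```

The suggested label returned for a misclassified document loosely mirrors the idea of proposing an alternative label, though only as a majority vote among neighbors in this sketch.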

Award ID(s): 1956151 1741317 1704532

PAR ID: 10279820

Author(s) / Creator(s): Fouche, Edouard; Meng, Yu; Guo, Fang; Zhuang, Honglei; Bohm, Klemens; Han, Jiawei

Date Published: 2020-11-01

Journal Name: ICDM'20: IEEE 2020 Int. Conf. on Data Mining, Nov. 2020

Volume: 2020

Issue: 1

Page Range / eLocation ID: 152 to 161

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/ICDM50108.2020.00024
