CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset

Jian Wu, Bharath Kandimalla

Citation Details

We report preliminary work on cleansing and classifying a scholarly big dataset of more than 10 million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX against reference datasets, including DBLP, Web of Science, and Medline, yielding 4.2M unique matches whose metadata can then be cleansed. We also investigate traditional machine learning and neural network methods for classifying abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.

Award ID(s): 1823288

PAR ID: 10101539

Author(s) / Creator(s): Jian Wu, Bharath Kandimalla

Date Published: 2018-12-01

Journal Name: IEEE International Conference on Big Data

ISSN: 2639-1589

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
The DOI is not currently available.
