On the use of arXiv as a dataset

Clement, Colin B.; Bierbaum, Matthew; O'Keeffe, Kevin; Alemi, Alexander A.

Citation Details

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv’s publicly available data. We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.

Award ID(s): 1719490

PAR ID: 10178627

Author(s) / Creator(s): Clement, Colin B.; Bierbaum, Matthew; O'Keeffe, Kevin; Alemi, Alexander A.

Date Published: 2019-01-01

Journal Name: On the use of the arXiv as a dataset

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.