<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>On the use of arXiv as a dataset</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10178627</idno>
					<idno type="doi"></idno>
					<title level='j'>On the use of the arXiv as a dataset</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Colin B. Clement</author><author>Matthew Bierbaum</author><author>Kevin O'Keeffe</author><author>Alexander A. Alemi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal, by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles. We present some baseline classification results, and motivate application of more exciting generative graph models.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Real-world datasets are typically multimodal (comprising images, text, time series, etc.) and have complex relational structures well captured by a graph. Recently, advances have been made on models which act on graphs, allowing the rich features and relational structures of real-world data to be utilized <ref type="bibr">(Hamilton et al., 2017b;</ref><ref type="bibr">a;</ref><ref type="bibr">Battaglia et al., 2018;</ref><ref type="bibr">Goyal &amp; Ferrara, 2018;</ref><ref type="bibr">Nickel et al., 2016)</ref>. Many of these advances have been facilitated by the availability of large, benchmark datasets: for example, the ImageNet <ref type="bibr">(Russakovsky et al., 2015)</ref> dataset has been widely used as a community standard for image classification. We believe the arXiv can provide a similarly useful benchmark for large scale, multimodal, relational modelling.</p><p>The arXiv <ref type="foot">1</ref> is the de facto online manuscript pre-print service for Computer Science, Mathematics, Physics, and many interdisciplinary communities. Since 1991 the arXiv has offered a place for researchers to reliably share their work as it undergoes the process of peer-review, and for many researchers it is their primary source of literature. With over 1.5 million articles, a large multigraph dataset can be built, including full-text articles, article metadata, and internal co-citations.</p><p>The arXiv has been used many times as a dataset. <ref type="bibr">Liben-Nowell &amp; Kleinberg (2007)</ref> used the topology of the arXiv co-authorship graph to study link prediction. <ref type="bibr">Dempsey et al. (2019)</ref> used the authorship graph to test a hierarchically structured network model. <ref type="bibr">Lopuszynski &amp; Bolikowski (2013)</ref> used the category labels of arXiv documents to train and assess an automatic text labelling system. <ref type="bibr">Dai et al. 
(2015)</ref> used a subset of the full text available on the arXiv to study the utility of "paragraph vectors" for capturing document similarity. <ref type="bibr">Alemi &amp; Ginsparg (2015)</ref> used the fulltext to evaluate a method for unsupervised text segmentation. <ref type="bibr">Eger et al. (2019)</ref> and <ref type="bibr">Liu et al. (2018)</ref> built models to predict future research topic trends in machine learning and physics respectively. The arXiv also formed the basis of the popular 2003 KDD Cup <ref type="bibr">(Gehrke et al., 2003)</ref>, in which researchers competed for the prize of best algorithm for citation prediction, download estimation, and data cleaning<ref type="foot">foot_1</ref> . All these works used different subsets of arXiv's data, limiting their potential impact, as future researchers will be unable to directly compare their work to these existing results. The goal of this paper is to improve this situation by providing an open-source pipeline to standardize, simplify, and normalize access to the arXiv's public data, providing a benchmark to facilitate the development of models for multi-modal, relational data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DATASET</head><p>We built a freely available, open-source pipeline<ref type="foot">foot_2</ref> for collecting arXiv metadata from the Open Archive Initiative <ref type="bibr">(Lagoze &amp; Van de Sompel, 2001)</ref>, and bulk PDF downloading from the arXiv<ref type="foot">foot_3</ref> . Further, this pipeline converts the raw PDFs to plaintext, builds the intra-arXiv co-citation network by searching the full-text for arXiv ids, and cleans and normalizes author strings.</p></div>
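<div xmlns="http://www.tei-c.org/ns/1.0"><p>The co-citation step above amounts to scanning each article's plain text for valid arXiv ids. A minimal sketch of that matching (the function name and exact pattern are illustrative, not the pipeline's actual implementation; the two identifier forms, delineated by the year 2007, follow the arXiv identifier documentation):</p><p><![CDATA[
```python
import re

# Two identifier schemes, delineated by the year 2007:
#   old form: archive(.SUBJECT)?/YYMMNNN, e.g. hep-th/9901001
#   new form: YYMM.NNNN or YYMM.NNNNN,    e.g. 0704.0001
ARXIV_ID = re.compile(r"\b(?:[a-z-]+(?:\.[A-Z]{2})?/\d{7}|\d{4}\.\d{4,5})\b")

def find_arxiv_ids(text):
    """Return the unique arXiv ids found in a plain-text article body."""
    return sorted(set(ARXIV_ID.findall(text)))
```
]]></p><p>Running a matcher of this kind over each extracted document, and drawing an edge from the article to every id it mentions, yields the intra-arXiv co-citation network.</p></div>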
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">METADATA</head><p>Through its participation in the Open Archives Initiative,<ref type="foot">foot_4</ref> the arXiv makes all article metadata<ref type="foot">foot_5</ref> available, with updates made shortly after new articles are published<ref type="foot">foot_6</ref> . We provide code for utilizing these public APIs to download a full set of current arXiv metadata. As of 2019-03-01, metadata for 1,506,500 articles was available. For verification and ease of use, we provide a snapshot of the metadata (less abstracts) as of the date we accessed it. An example listing is shown in Figure <ref type="figure">1</ref>. Each article includes an arXiv id (e.g. 0704.0001)<ref type="foot">foot_7</ref> used to identify the article, the publicly visible name of the submitter, a list of authors, title, abstract, versions and category listings, as well as optional doi, journal-ref and report-no fields. Of particular note is the first category listed, the primary category, of which there are 171 at this time. Note that the authors field is a single string of names, potentially joined with commas or 'and's; we provide a suggested normalization script that splits these strings into lists of individual author names. The optional doi, journal-ref and report-no fields are not validated, but they can be used to find intersections between the arXiv dataset and other scientific literature datasets. Population counts for the optional fields are shown in Table <ref type="table">1</ref>.</p></div>
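<div xmlns="http://www.tei-c.org/ns/1.0"><p>As noted above, the authors field arrives as one undelimited string. A hypothetical, minimal version of the splitting step (the function name is ours, and real author strings have edge cases, such as 'Jr.' suffixes and parenthesized affiliations, that need more care than this):</p><p><![CDATA[
```python
import re

def split_authors(author_string):
    """Split a raw arXiv authors string into individual names.

    Simplified sketch: splits only on commas and the word 'and'.
    """
    parts = re.split(r",|\band\b", author_string)
    return [part.strip() for part in parts if part.strip()]
```
]]></p><p>For example, splitting "Colin B. Clement, Matthew Bierbaum and Kevin O'Keeffe" yields three separate names; the word boundaries prevent names like "Alexander" from being split on their internal "and".</p></div>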
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">FULL TEXT</head><p>One advantage the arXiv has over other graph datasets is that it provides a very rich attribute at each id node: the full raw text and figures of a research article. To extract the raw text from PDFs, we provide a pipeline with two parts. First, a helper script downloads the full set of PDFs available through the arXiv's bulk download service<ref type="foot">foot_8</ref> . Since arXiv hosts these data in requester-pays AWS S3 buckets, this constitutes &#8764; 1.1 TB and &#8764; $100 to fully download. For posterity, we have provided MD5 hashes of the PDFs at the state of the frozen metagraph extraction. Raw TeX source is also available for the subset of articles that provide it. Second, we provide a standard PDF-to-text converter, powered by pdftotext<ref type="foot">foot_14</ref>, to convert the PDFs to plaintext.</p><p>Using this pipeline, it is currently possible to extract a corpus of 1.37 million raw text documents. 
Figure <ref type="figure">2</ref> shows an example of the text extracted from a PDF. Though the extracted text isn't perfectly clean, we believe it will still be useful for many tasks, and hope future contributions to our repository will provide better data-cleaning procedures.</p><p>The extracted raw-text dataset is &#8764; 64 GB in size, totaling &#8764; 11 billion words. This is an order of magnitude larger than the common billion word corpus <ref type="bibr">(Chelba et al., 2013)</ref>, making the arXiv raw text a competitive alternative to other full-text datasets. Moreover, the technical nature of the arXiv distinguishes it from other full-text datasets. For example, the TeX source available on the arXiv presents an opportunity to study mathematical formulae in bulk, as is done in the NTCIR-11 Task: Math-2 <ref type="bibr">(Aizawa et al., 2014)</ref>.</p></div>
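<div xmlns="http://www.tei-c.org/ns/1.0"><p>The MD5 hashes mentioned above let a user verify a local bulk download against the frozen snapshot. A minimal sketch of computing such a hash with only the standard library (the helper name is ours; reading is chunked so multi-gigabyte files need not fit in memory):</p><p><![CDATA[
```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```
]]></p><p>Comparing such digests against the published hashes flags any PDF that was corrupted or changed after the metagraph was frozen.</p></div>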
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">CO-CITATIONS</head><p>While the arXiv does not currently publicly provide an API to access co-citations, our pipeline allows a simple but large co-citation network to be extracted. We extracted this network by searching the text of each article for valid arXiv ids, thereby finding which nodes should be linked to a given node in the co-citation network. We provide a compressed binary of the resulting network at the repository<ref type="foot">foot_15</ref> , so that researchers can study it directly, and avoid the difficulty of constructing it themselves. Table <ref type="table">2</ref> summarizes the size and statistical structure of our co-citation network, compared with other popular citation networks. <ref type="bibr">&#352;ubelj et al. (2014)</ref> also studied data from the arXiv, but as indicated in the bottom row of Table <ref type="table">2</ref>, that work used only the 34,546 articles from the 2003 KDD Cup challenge.</p><p>Table <ref type="table">2</ref> reports standard statistics for the co-citation network. Our arXiv co-citation network contains O(10 6 ) nodes, an order of magnitude larger than the O(10 5 ) nodes in the other citation networks. The best-fit power law exponents for the degree distributions, &#945; in and &#945; out , are also consistent with those of the existing citation networks <ref type="bibr">&#352;ubelj et al. (2014)</ref>, as is the average degree k . 62% of the nodes are contained in the largest weakly connected component, while 31% of the nodes are fully isolated, meaning their in-degree k in and out-degree k out are both zero. Recall that our arXiv co-citation network only contains publications which have been posted on the arXiv; a paper whose citations all point to works published elsewhere, and not on the arXiv, will have k out = 0 in this network, which explains the large number of isolated nodes. 
Table <ref type="table">2</ref>: Graph statistics for popular citation networks. All but the data for this work (first row) were taken from Tables <ref type="table">1</ref> and 2 in <ref type="bibr">&#352;ubelj et al. (2014)</ref>. k is the average degree, and &#945; in and &#945; out are the best-fit power law exponents for the degree distribution. WCC refers to the largest weakly connected component, computed using the python package networkx. The power law exponents &#945; in , &#945; out were found using the python module powerlaw, which, when fitting, discards all data below an automatically computed threshold x min ; these thresholds for k in and k out were x min = 73 and x min = 59 respectively. Beyond constructing and analyzing a co-citation network, the arXiv dataset can be used for many tasks, such as relationally powered classification, author attribution, segmentation, clustering, structured prediction, language modeling, link prediction, and automatic summary generation. 
As a basic demonstration, in Table <ref type="table">3</ref> we show some baseline category classification results. These were obtained by training logistic regression on 1.2 million arXiv articles to predict in which category (e.g. cs.LG, stat.ML) a given article resides. See the supplemental information for a detailed explanation of the experimental setup. Titles and abstracts were represented by vectors from a pre-trained instance<ref type="foot">foot_16</ref> of the Universal Sentence Encoder of <ref type="bibr">Cer et al. (2018)</ref>. We see that including more aspects of each document (titles, abstracts, full text) and exposing their relations via co-citation leads to better predictive power. This is only scratching the surface of possible tasks and models applied to this rich dataset.</p></div>
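<div xmlns="http://www.tei-c.org/ns/1.0"><p>The connectivity statistics quoted in Section 2.3 (fraction of nodes in the largest weakly connected component, fraction of fully isolated nodes) were computed with networkx, but the quantities themselves are simple. A self-contained sketch (function name ours), treating directed edges as undirected for the component search:</p><p><![CDATA[
```python
from collections import defaultdict

def graph_stats(nodes, edges):
    """Return (fraction of nodes in the largest weakly connected
    component, fraction of fully isolated nodes) of a directed graph."""
    undirected = defaultdict(set)
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        # traverse the undirected version of the graph from this node
        component, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for neighbor in undirected[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    frontier.append(neighbor)
        seen |= component
        largest = max(largest, len(component))
    # isolated: no undirected neighbors, i.e. k_in and k_out both zero
    isolated = sum(1 for n in nodes if not undirected[n])
    return largest / len(nodes), isolated / len(nodes)
```
]]></p><p>On a toy graph with nodes a&#8211;e and edges (a,b), (b,c), the fractions are 0.6 and 0.4; on the full co-citation network they are the 62% and 31% reported above.</p></div>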
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">CONCLUSION</head><p>As research moves increasingly towards structured relational modelling <ref type="bibr">(Hamilton et al., 2017b;</ref><ref type="bibr">a;</ref><ref type="bibr">Battaglia et al., 2018)</ref>, there is a growing need for large-scale, relational datasets with rich annotations. With its authorship, categories, abstracts, co-citations, and full text, the arXiv presents an exciting opportunity to promote progress in relational modelling. We have provided an open-source repository of tools that make it easy to download and standardize the data available from the arXiv. Our preliminary classification baselines support the claim that each mode of the arXiv's feature set allows for greatly improved category inference. More sophisticated models that include relational inductive biases, encoding the graph structures of the arXiv, will improve these results. Further, this new benchmark dataset will allow more rapid progress in tasks such as link prediction, automatic summary generation, text segmentation, and time-varying topic modeling of scientific disciplines.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://arxiv.org</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>The data for those challenges are available at http://www.cs.cornell.edu/projects/kddcup/datasets.html</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://github.com/mattbierbaum/arxiv-public-datasets/releases/tag/v0.2.0</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://arxiv.org/help/bulk_data</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>http://www.openarchives.org/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>https://arxiv.org/help/prep</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>Further details available at https://arxiv.org/help/oa</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>There are two forms of valid arXiv IDs, delineated by the year 2007, described in https://arxiv.org/help/arxiv_identifier.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8"><p>https://arxiv.org/help/bulk_data</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_14"><p>Version 0.61.1, available on most Debian systems from the apt package poppler-utils</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_15"><p>As part of one of the tagged releases: https://github.com/mattbierbaum/arxiv-public-datasets/releases</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_16"><p>From https://tfhub.dev/google/universal-sentence-encoder/2</p></note>
		</body>
		</text>
</TEI>
