Title: On the use of arXiv as a dataset
The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles. We present some baseline classification results, and motivate application of more exciting generative graph models.
Award ID(s):
1719490
PAR ID:
10178627
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
On the use of the arXiv as a dataset
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Research in affective science includes over one hundred thousand articles, the vast majority of which have been published in only the past two decades. The size and rapid growth of this field have led to unique challenges for the twenty-first-century scientist, including how to develop both breadth and depth of scholarship, curb siloing and promote integrative and interdisciplinary frameworks, and represent and monitor the field in its entirety. Here, we help address these issues by compactly mapping out this enormous field using citation network analysis (CNA). We generated a citation matrix of over 100,000 publications and over 1 million citations since the seminal works on emotion by Charles Darwin (1872) and William James (1884). Using graph theory metrics and content analysis of titles and abstracts, we identified and characterized the contents of 69 research communities, their most influential articles, and their interconnectedness with each other. We further identified potential “missed connections” between communities that share similar content but do not have strong citation-based connections. In doing so, we establish the first low-dimensional representation, or field-wide map, of a substantial portion of the affective sciences literature. This panoramic view of the field provides affective and non-affective scientists alike with the means to rapidly survey dozens of major research communities and topics in the field, guide scholarship development, and identify gaps and connections for developing an integrative science.
  2. Searching for relevant literature is a fundamental part of academic research. The search for relevant literature is becoming a more difficult and time-consuming task as millions of articles are published each year. As a solution, recommendation systems for academic papers attempt to help researchers find relevant papers quickly. This paper focuses on graph-based recommendation systems for academic papers using citation networks. This type of paper recommendation system leverages a graph of papers linked by citations to create a list of relevant papers. In this study, we explore recommendation systems for academic papers using citation networks incorporating citation relations. We define citation relation based on the number of times the origin paper cites the reference paper, and use this citation relation to measure the strength of the relation between the papers. We created a weighted network using citation relation as citation weight on edges. We evaluate our proposed method on a real-world publication data set, and conduct an extensive comparison with three state-of-the-art baseline methods. Our results show that citation network-based recommendation systems using citation weights perform better than the current methods. 
  3. Systematic reviews (SRs) are a crucial component of evidence-based clinical practice. Unfortunately, SRs are labor-intensive and unscalable with the exponential growth in literature. Automating evidence synthesis using machine learning models has been proposed but solely focuses on the text and ignores additional features like citation information. Recent work demonstrated that citation embeddings can outperform the text itself, suggesting that better network representation may expedite SRs. Yet, how to utilize the rich information in heterogeneous information networks (HIN) for network embeddings is understudied. Existing HIN models fail to produce a high-quality embedding compared to simply running state-of-the-art homogeneous network models. To address existing HIN model limitations, we propose SR-CoMbEr, a community-based multi-view graph convolutional network for learning better embeddings for evidence synthesis. Our model automatically discovers article communities to learn robust embeddings that simultaneously encapsulate the rich semantics in HINs. We demonstrate the effectiveness of our model to automate 15 SRs. 
  4. Benjamin, Paaßen; Carrie, Demmans Epp (Ed.)
    This paper was written with the help of ChatGPT. Recent advancements in the development and deployment of large generative language models to power generative AI tools, including OpenAI's ChatGPT, have led to their broad usage across virtually all fields of study. While the tools have been trained to generate human-like dialogue in response to questions or prompts, they are similarly used to compose larger, more complex artifacts, including social media posts, essays, and even research articles. Although this abstract has been written entirely by a human without any input, consultation, or revision from a generative language model, it would likely be difficult to discern any difference as a reader. In light of this, there is growing debate and concern regarding using these models to aid the writing process, particularly concerning publication. Aside from some notable risks, including the unintentional generation of false information, citation of non-existing research articles, or plagiarism by generating text that is sampled from another source without proper citation, there are additional questions pertaining to the originality of ideas expressed in a work that has been partially written or revised by a generative language model. We present this paper both as a case study into the usage of generative models to aid in the writing of academic research articles and as an example of how transparency and open science practices may help in addressing several issues that have been raised in other contexts and communities. While this paper neither attempts to promote nor contest the use of these language models in any writing task, it is the goal of this work to provide insight and potential guidance into the ethical and effective usage of these models within this domain.
  5. A conspiracy theory (CT) suggests covert groups or powerful individuals secretly manipulate events. Not knowing about existing conspiracy theories could make one more likely to believe them, so this work aims to compile a list of CTs shaped as a tree that is as comprehensive as possible. We began with a manually curated ‘tree’ of CTs from academic papers and Wikipedia. Next, we examined 1769 CT-related articles from four fact-checking websites, focusing on their core content, and used a technique called Keyphrase Extraction to label the documents. This process yielded 769 identified conspiracies, each assigned a label and a family name. The second goal of this project was to detect whether an article is a conspiracy theory, so we built a binary classifier with our labeled dataset. This model uses a transformer-based machine learning technique and is pre-trained on a large corpus called RoBERTa, resulting in an F1 score of 87%. This model helps to identify potential conspiracy theories in new articles. We used a combination of clustering (HDBSCAN) and a dimension reduction technique (UMAP) to assign a label from the tree to these new articles detected as conspiracy theories. We then labeled these groups accordingly to help us match them to the tree. These can lead us to detect new conspiracy theories and expand the tree using computational methods. We successfully generated a tree of conspiracy theories and built a pipeline to detect and categorize conspiracy theories within any text corpora. This pipeline gives us valuable insights through any databases formatted as text.