Title: Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks
Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 scores of individual subject categories ranging from 0.50 to 0.95. The results show the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
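The abstract reports both a micro-F1 score and a range of per-category F1 scores. As a rough illustration of how the two differ, the sketch below computes them for a toy set of predictions. It assumes each paper receives exactly one category, which the abstract does not state; the function names, category labels, and sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_category_and_micro_f1(gold, predicted):
    """Return ({category: F1}, micro-F1) for single-label predictions.

    Assumption: one gold label and one predicted label per paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted category gains a false positive
            fn[g] += 1  # the gold category gains a false negative
    categories = sorted(set(gold) | set(predicted))
    per_category = {c: f1(tp[c], fp[c], fn[c]) for c in categories}
    # Micro-averaging pools the counts over all categories before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_category, micro


# Hypothetical toy data, for illustration only.
gold = ["physics", "physics", "physics", "computer science", "biology"]
predicted = ["physics", "physics", "computer science", "computer science", "physics"]
per_category, micro = per_category_and_micro_f1(gold, predicted)
```

Because micro-averaging pools counts across categories, frequent categories dominate the aggregate score, while a rare category can score poorly on its own without moving it much. This is one reason an aggregate figure and a wide per-category range can coexist when samples are imbalanced.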
Journal Name: Frontiers in Research Metrics and Analytics
Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences. 
  2. Sharing, reuse, and synthesis of knowledge are central to the research process. These core functions are in theory served by the system of monographs, abstracts, and papers in journals and proceedings, with citation indices and search databases that comprise the core of our formal scholarly communication infrastructure; yet, converging lines of empirical and anecdotal evidence suggest that this system does not adequately act as infrastructure for synthesis. Emerging developments in new institutions for science, along with new technical infrastructures and tooling for decentralized knowledge work, offer new opportunities to prototype new technical infrastructures on top of a different installed base than the publish-or-perish, neoliberal academy. This workshop aims to integrate these developments and communities with CSCW’s deep roots in knowledge infrastructures and collaborative and distributed sensemaking, with new developments in science institutions and tooling, to stimulate and accelerate progress towards prototyping new scholarly communication infrastructures that are actually optimized for sharing, reusing, and synthesizing knowledge.
  3. The number of published manufacturing science digital articles available from scientific journals and the broader web has increased exponentially every year since the 1990s. For a novice engineer or an experienced researcher, assimilating all of this knowledge requires significant synthesis of the existing knowledge space contained within published material to find answers to basic and complex queries. Algorithmic approaches through machine learning, and specifically Natural Language Processing (NLP), for a domain-specific area such as manufacturing are lacking. One of the significant challenges in analyzing manufacturing vocabulary is the lack of a named entity recognition (NER) model that enables algorithms to classify the manufacturing corpus of words under various manufacturing semantic categories. This work presents a supervised machine learning approach to categorize unstructured text from 500K+ manufacturing science related scientific abstracts and label it under various manufacturing topic categories. A neural network model using a bidirectional long short-term memory plus a conditional random field (BiLSTM+CRF) is trained to extract information from manufacturing science abstracts. Our classifier achieves an overall accuracy (F1-score) of 88%, which is quite near to the state-of-the-art performance. Two use case examples are presented that demonstrate the value of the developed NER model as a Technical Language Processing (TLP) workflow on manufacturing science documents. The long-term goal is to extract valuable knowledge regarding the connections and relationships between key manufacturing concepts/entities available within millions of manufacturing documents into a structured labeled-property graph data structure that allows for programmatic query and retrieval.
  4. Academic productivity is realized through resources obtained from professional networks in which scientists are embedded. Using a national survey of academic faculty in Science, Technology, Engineering, and Mathematics (STEM) fields across multiple institution types, we examine how the structure of professional networks affects scholarly productivity and how those effects may differ by race, ethnicity, and gender. We find that network size masks important differences in composition. Using negative binomial regression, we find that both the size and composition of professional networks affect scientific productivity, but bigger is not always better. We find that instrumental networks increase scholarly productivity, while advice networks reduce it. There are important interactive effects that are masked by modeling only direct effects. We find that white men are especially advantaged by instrumental networks, and women are especially advantaged by advice networks. 
  5. Communication of scientific findings is fundamental to scholarly discourse. In this article, we show that academic review articles, a quintessential form of interpretive scholarly output, perform curatorial work that substantially transforms the research communities they aim to summarize. Using a corpus of millions of journal articles, we analyze the consequences of review articles for the publications they cite, focusing on citation and co-citation as indicators of scholarly attention. Our analysis shows that, on the one hand, papers cited by formal review articles generally experience a dramatic loss in future citations. Typically, the review gets cited instead of the specific articles mentioned in the review. On the other hand, reviews curate, synthesize, and simplify the literature concerning a research topic. Most reviews identify distinct clusters of work and highlight exemplary bridges that integrate the topic as a whole. These bridging works, in addition to the review, become a shorthand characterization of the topic going forward and receive disproportionate attention. In this manner, formal reviews perform creative destruction so as to render increasingly expansive and redundant bodies of knowledge distinct and comprehensible.