Title: Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks
Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 scores for individual subject categories ranging from 0.50 to 0.95. The results show the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
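The abstract reports both a micro-averaged F1 and per-category F1 scores. As a rough illustration of how the two differ, the sketch below computes both from lists of true and predicted labels. It is not the authors' evaluation code: the function names and the toy labels are hypothetical, and it assumes each paper carries exactly one subject category, which the abstract does not state.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the category never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_category_and_micro_f1(y_true, y_pred):
    """Return ({category: F1}, micro-F1) for single-label predictions.

    Assumption: one true and one predicted category per paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    per_cat = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Micro-averaging pools the counts over all categories before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_cat, micro


# Hypothetical toy labels, for illustration only.
y_true = ["physics", "physics", "cs", "cs", "biology"]
y_pred = ["physics", "cs", "cs", "cs", "physics"]
per_cat, micro = per_category_and_micro_f1(y_true, y_pred)
print(per_cat)
print(micro)
```

Under the single-label assumption, micro-F1 equals overall accuracy, so frequent categories dominate it. Per-category scores can spread widely around the micro-average, which is why reporting their range alongside the pooled score is informative when the samples are imbalanced.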
Award ID(s):
1823288
PAR ID:
10272152
Journal Name:
Frontiers in Research Metrics and Analytics
Volume:
5
ISSN:
2504-0537
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences. 
  2. Global and team science approaches are on the rise, as is attention to the network underpinnings of gender disparities in scientific collaboration. Many network studies of men’s and women’s collaboration rely on bounded case studies of single disciplines and/or single countries and limited measures related to the collaborative process. We deploy network analysis on the scholarly database Scopus to gain insight into gender inequity across regions and subject areas and to better understand contextual underpinnings of stagnancy. Using a dataset of over 1.2 million authors and 144 million collaborative relationships, we capture international and unbounded co-authorship networks that include intra- and inter-disciplinary co-authorship ties across time (2009–2013). We describe how gender informs structural features and status differences in network relationships, focusing on men and women authors in 16 region-subject pairs. We pay particular attention to how connected authors are (first- and second-order degree centrality), attributes of authors’ collaborative relationships (including the “quality” and other characteristics of these ties), tendencies towards gender homophily (proportion of same-gender ties), and the nature of men’s and women’s interdisciplinary and international reach. Men have more advantageous first-order connections, yet second-order collaborative profiles look more similar. Men and women exhibit homophilous attachment to authors of the same gender, consistent over time. There is notable variation in the level of gender disparity within subjects across countries. We discuss this variation in the context of global trends in men’s and women’s scientific participation and cultural- and country-level influences on the organization and production of science. 
  3. The number of published manufacturing science digital articles available from scientific journals and the broader web has increased exponentially every year since the 1990s. For a novice engineer or an experienced researcher to assimilate all of this knowledge requires significant synthesis of the existing knowledge space contained within published material in order to find answers to basic and complex queries. Algorithmic approaches through machine learning, and specifically Natural Language Processing (NLP), in a domain-specific area such as manufacturing are lacking. One of the significant challenges in analyzing manufacturing vocabulary is the lack of a named entity recognition (NER) model that enables algorithms to classify the manufacturing corpus of words under various manufacturing semantic categories. This work presents a supervised machine learning approach to categorize unstructured text from 500K+ manufacturing science related scientific abstracts and label it under various manufacturing topic categories. A neural network model using a bidirectional long short-term memory plus a conditional random field (BiLSTM+CRF) is trained to extract information from manufacturing science abstracts. Our classifier achieves an overall accuracy (F1 score) of 88%, which is close to state-of-the-art performance. Two use case examples are presented that demonstrate the value of the developed NER model as a Technical Language Processing (TLP) workflow on manufacturing science documents. The long-term goal is to extract valuable knowledge regarding the connections and relationships between key manufacturing concepts/entities available within millions of manufacturing documents into a structured labeled-property graph data structure that allows for programmatic query and retrieval.
  4. Sharing, reuse, and synthesis of knowledge is central to the research process. These core functions are in theory served by the system of monographs, abstracts, and papers in journals and proceedings, with citation indices and search databases that comprise the core of our formal scholarly communication infrastructure; yet, converging lines of empirical and anecdotal evidence suggest that this system does not adequately act as infrastructure for synthesis. Emerging developments in new institutions for science, along with new technical infrastructures and tooling for decentralized knowledge work, offer new opportunities to prototype new technical infrastructures on top of a different installed base than the publish or perish, neoliberal academy. This workshop aims to integrate these developments and communities with CSCW’s deep roots in knowledge infrastructures and collaborative and distributed sensemaking, with new developments in science institutions and tooling, to stimulate and accelerate progress towards prototyping new scholarly communication infrastructures that are actually optimized for sharing, reusing, and synthesizing knowledge. 
  5. Academic productivity is realized through resources obtained from professional networks in which scientists are embedded. Using a national survey of academic faculty in Science, Technology, Engineering, and Mathematics (STEM) fields across multiple institution types, we examine how the structure of professional networks affects scholarly productivity and how those effects may differ by race, ethnicity, and gender. We find that network size masks important differences in composition. Using negative binomial regression, we find that both the size and composition of professional networks affect scientific productivity, but bigger is not always better. We find that instrumental networks increase scholarly productivity, while advice networks reduce it. There are important interactive effects that are masked by modeling only direct effects. We find that white men are especially advantaged by instrumental networks, and women are especially advantaged by advice networks. 