A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps: first, identify lexical and semantic changes using contextual embeddings and word frequencies; second, aggregate information about these changes into per-document influence scores by estimating a high-dimensional Hawkes process with a low-rank parameter matrix. We show that this measure of linguistic influence is predictive of future citations: the estimate of linguistic influence from the two years after a paper’s publication is correlated with and predictive of its citation count in the following three years. This is demonstrated using an online evaluation with incremental temporal training/test splits, in comparison with a strong baseline that includes predictors for initial citation counts, topics, and lexical features.
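The second step above aggregates lexical change events into per-document influence scores by fitting a high-dimensional Hawkes process whose excitation matrix is constrained to low rank. A minimal sketch of that idea, assuming an exponential kernel; all names here (`U`, `V`, `beta`, `influence_score`) are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical sketch: a multivariate Hawkes process over D documents whose
# D x D excitation matrix A is constrained to low rank, A = U @ V.T.
rng = np.random.default_rng(0)
D, K = 5, 2              # D documents, rank-K factorization
U = rng.random((D, K))   # source (influencing) factors
V = rng.random((D, K))   # target (influenced) factors
A = U @ V.T              # A[i, j]: how strongly document i excites events in j
beta = 1.0               # exponential kernel decay rate

def intensity(j, t, events):
    """Conditional intensity of document j at time t.

    events: list of (time, doc) pairs with time < t.
    lambda_j(t) = mu_j + sum over past events (s, i) of
                  A[i, j] * beta * exp(-beta * (t - s))
    """
    mu = 0.1  # baseline event rate (illustrative constant)
    return mu + sum(A[i, j] * beta * np.exp(-beta * (t - s))
                    for s, i in events if s < t)

def influence_score(i):
    """Aggregate outgoing excitation of document i over all targets."""
    return A[i].sum()
```

The low-rank constraint keeps the parameter count at 2DK rather than D^2, which is what makes estimation feasible when the document collection is large.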
HINTS: Citation Time Series Prediction for New Publications via Dynamic Heterogeneous Information Network Embedding
Accurate prediction of scientific impact is important for scientists, academic recommender systems, and granting organizations alike. Existing approaches rely on many years of leading citation values to predict a scientific paper’s citations (a proxy for impact), even though most papers make their largest contributions in the first few years after they are published. In this paper, we tackle a new problem: predicting a new paper’s citation time series from the date of publication (i.e., without leading values). We propose HINTS, a novel end-to-end deep learning framework that converts citation signals from dynamic heterogeneous information networks (DHIN) into citation time series. HINTS imputes pseudo-leading values for a paper in the years before it is published from DHIN embeddings, and then transforms these embeddings into the parameters of a formal model that can predict citation counts immediately after publication. Empirical analysis on two real-world datasets from Computer Science and Physics shows that HINTS is competitive with baseline citation prediction models. While we focus on citations, our approach generalizes to other “cold start” time series prediction tasks where relational data is available and accurate prediction in early timestamps is crucial.
- PAR ID: 10259969
- Journal Name: WWW '21: Proceedings of the Web Conference
- Sponsoring Org: National Science Foundation
More Like this
Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints were published within 1 year of upload on a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict if preprints will be published after 1 year and if the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated if such an assessment is suited to identify methodological problems in preprints.
Citations have long been used to characterize the state of a scientific field and to identify influential works. However, writers use citations for different purposes, and this varied purpose influences uptake by future scholars. Unfortunately, our understanding of how scholars use and frame citations has been limited to small-scale manual citation analysis of individual papers. We perform the largest behavioral study of citations to date, analyzing how scientific works frame their contributions through different types of citations and how this framing affects the field as a whole. We introduce a new dataset of nearly 2,000 citations annotated for their function, and use it to develop a state-of-the-art classifier and label the papers of an entire field: Natural Language Processing. We then show how differences in framing affect scientific uptake and reveal the evolution of the publication venues and the field as a whole. We demonstrate that authors are sensitive to discourse structure and publication venue when citing, and that how a paper frames its work through citations is predictive of the citation count it will receive. Finally, we use changes in citation framing to show that the field of NLP is undergoing a significant increase in consensus.
Bolboacă, Sorana D (Ed.)

Background and aim: Citations in academia have long been regarded as a fundamental means of acknowledging the contribution of past work and promoting scientific advancement. The aim of this paper was to investigate the impact that misconduct allegations made against scholars have on the citations of their work, comparing allegations of sexual misconduct (unrelated to the research merit) and allegations of scientific misconduct (directly related to the research merit).

Methods: We collected citation data from the Web of Science (WoS) in 2021, encompassing 31,941 publications from 172 accused and control scholars across 18 disciplines. We also conducted two studies: one on non-academics (N = 231) and one on academics (N = 240).

Results: The WoS data shows that scholars accused of sexual misconduct incur a significant citation decrease in the three years after the accusations become public, while we do not detect a significant citation decrease for scholars accused of scientific misconduct. The study involving non-academics suggests that individuals are more averse to sexual than to scientific misconduct. Finally, contrary to the WoS data findings, a sample of academics indicates they are more likely to cite scholars accused of sexual misconduct than those accused of scientific misconduct.

Conclusions: In the first three years after accusations became public, scholars accused of sexual misconduct incur a larger citation penalty than scholars accused of scientific misconduct. However, when asked to predict their citing behavior, scholars indicated the reverse pattern, suggesting they might mis-predict their behavior or be reluctant to disclose their preferences.
Biologists increasingly rely on computer code to collect and analyze their data, reinforcing the importance of published code for transparency, reproducibility, training, and as a basis for further work. Here, we conduct a literature review estimating temporal trends in code sharing in ecology and evolution publications since 2010, and test for an influence of code sharing on citation rate. We find that code is rarely published (only 6% of papers), with little improvement over time. We also found there may be incentives to publish code: publications that share code have tended to be low-impact initially, but accumulate citations faster, compensating for this deficit. Studies that additionally meet other Open Science criteria (open-access publication or data sharing) have still higher citation rates, with publications meeting all three criteria (code sharing, data sharing, and open-access publication) tending to have the most citations and the highest rate of citation accumulation.