Title: Forecasting the publication and citation outcomes of COVID-19 preprints
Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints were published within 1 year of upload on a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict whether preprints will be published after 1 year and whether the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated whether such an assessment is suited to identifying methodological problems in preprints.
Award ID(s):
2007951
PAR ID:
10391556
Journal Name:
Royal Society Open Science
Volume:
9
Issue:
9
ISSN:
2054-5703
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Journals play a critical role in the scientific process because they evaluate the quality of incoming papers and offer an organizing filter for search. However, the role of journals has been called into question because new preprint archives and academic search engines make it easier to find articles independent of the journals that publish them. Research on this issue is complicated by the deeply confounded relationship between article quality and journal reputation. We present an innovative proxy for individual article quality that is divorced from the journal's reputation or impact factor: the number of citations to preprints posted on arXiv.org. Using this measure to study three subfields of physics that were early adopters of arXiv, we show that prior estimates of the effect of journal reputation on an individual article's impact (measured by citations) are likely inflated. While we find that higher‐quality preprints in these subfields are now less likely to be published in journals compared to prior years, we find little systematic evidence that the role of journal reputation on article performance has declined.
  2. Information about individual publications associated with grants funded by NSF to support SES research from 2000-2015 (see "SES grants, 2000-2015"). For grants with ten or fewer publications, we included information about all available publications in this dataset. For grants with more than ten publications, we randomly selected ten to include in this dataset. CSV file with 13 columns and names in header row: "Grant ID" is the ID from the Dimensions platform (string); "Grant Number" is the NSF Award number (integer); "Publication Title" is the title of the paper (text); "Publication Year" is the year in which the paper was published (year); "Authors" is a list or abbreviated list of the authors of the paper (text); "Journal" is the name of the scientific journal or outlet in which the paper is published (text); "Interdis Rubric 1" is a metric representing the dataset authors' assessment of the level of interdisciplinarity represented by the paper (integer: “1” indicates social and natural science interdisciplinarity, where both social and environmental conditions are measured or explored and/or author affiliations include departments across these disciplines; “2” indicates general interdisciplinarity between two or more different fields (which may both be within natural or social science); and “3” indicates single-disciplinarity); "Citations" is the count of citations the paper had received as of the date listed in "date for cite count", as reported in Google Scholar (integer); "date for cite count" is the date on which the citation count for the paper was obtained (ddBBByy); "Abstract" is the text of the abstract of the paper, where available (text); "Notes" are any notes added by the authors of the dataset (text).
  3. Abstract

    Ecosystems around the globe are experiencing changes in both the magnitude and fluctuations of environmental conditions due to land use and climate change. In response, ecologists are increasingly using near‐term, iterative ecological forecasts to predict how ecosystems will change in the future. To date, many near‐term, iterative forecasting systems have been developed using high temporal frequency (minute to hourly resolution) data streams for assimilation. However, this approach may be cost‐prohibitive or impossible for forecasting ecological variables that lack high‐frequency sensors or have high data latency (i.e., a delay before data are available for modeling after collection). To explore the effects of data assimilation frequency on forecast skill, we developed water temperature forecasts for a eutrophic drinking water reservoir and conducted data assimilation experiments by selectively withholding observations to examine the effect of data availability on forecast accuracy. We used in situ sensors, manually collected data, and a calibrated water quality ecosystem model driven by forecasted weather data to generate future water temperature forecasts using Forecasting Lake and Reservoir Ecosystems (FLARE), an open source water quality forecasting system. We tested the effect of daily, weekly, fortnightly, and monthly data assimilation on the skill of 1‐ to 35‐day‐ahead water temperature forecasts. We found that forecast skill varied depending on the season, forecast horizon, depth, and data assimilation frequency, but overall forecast performance was high, with a mean 1‐day‐ahead forecast root mean square error (RMSE) of 0.81°C, mean 7‐day RMSE of 1.15°C, and mean 35‐day RMSE of 1.94°C. Aggregated across the year, daily data assimilation yielded the most skillful forecasts at 1‐ to 7‐day‐ahead horizons, but weekly data assimilation resulted in the most skillful forecasts at 8‐ to 35‐day‐ahead horizons. Within a year, forecasts with weekly data assimilation consistently outperformed forecasts with daily data assimilation after the 8‐day forecast horizon during mixed spring/autumn periods and 5‐ to 14‐day‐ahead horizons during the summer‐stratified period, depending on depth. Our results suggest that lower frequency data (i.e., weekly) may be adequate for developing accurate forecasts in some applications, further enabling the development of forecasts broadly across ecosystems and ecological variables without high‐frequency sensor data.
  4. Accurate prediction of scientific impact is important for scientists, academic recommender systems, and granting organizations alike. Existing approaches rely on many years of leading citation values to predict a scientific paper’s citations (a proxy for impact), even though most papers make their largest contributions in the first few years after they are published. In this paper, we tackle a new problem: predicting a new paper’s citation time series from the date of publication (i.e., without leading values). We propose HINTS, a novel end-to-end deep learning framework that converts citation signals from dynamic heterogeneous information networks (DHIN) into citation time series. HINTS imputes pseudo-leading values for a paper in the years before it is published from DHIN embeddings, and then transforms these embeddings into the parameters of a formal model that can predict citation counts immediately after publication. Empirical analysis on two real-world datasets from Computer Science and Physics shows that HINTS is competitive with baseline citation prediction models. While we focus on citations, our approach generalizes to other “cold start” time series prediction tasks where relational data is available and accurate prediction in early timestamps is crucial.
  5. Abstract

    Biologists increasingly rely on computer code to collect and analyze their data, reinforcing the importance of published code for transparency, reproducibility, training, and a basis for further work. Here, we conduct a literature review estimating temporal trends in code sharing in ecology and evolution publications since 2010, and test for an influence of code sharing on citation rate. We find that code is rarely published (only 6% of papers), with little improvement over time. We also found there may be incentives to publish code: Publications that share code have tended to be low‐impact initially, but accumulate citations faster, compensating for this deficit. Studies that additionally meet other Open Science criteria, open‐access publication, or data sharing, have still higher citation rates, with publications meeting all three criteria (code sharing, data sharing, and open access publication) tending to have the most citations and highest rate of citation accumulation.

     