Title: How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles
Abstract: Objective: This study reviews novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles, providing a quantitative analysis that answers questions about dataset contents, accessibility and citations. Methods: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets. Results: We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned in 12 324 COVID-19 full-text articles in PubMed Central. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns of COVID-19 datasets varied depending on specific datasets. Conclusion: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.
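The link-extraction step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the regular expression, the function names `extract_links` and `summarize`, and the in-memory `articles` mapping are all assumptions made for illustration, and the sketch collects every URL rather than only dataset links (the study identified datasets by manual review).

```python
import re
from collections import Counter

# Assumed pattern: a URL runs until whitespace, angle brackets, or brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_links(text):
    """Return the set of distinct URLs found in one article's full text."""
    # Strip sentence punctuation that often trails a URL in running prose.
    return {match.rstrip(".,;:") for match in URL_PATTERN.findall(text)}


def summarize(articles):
    """Compute the share of articles with a link and per-link article counts.

    `articles` is a hypothetical mapping of article ID to full text.
    """
    mentions = Counter()
    with_link = 0
    for text in articles.values():
        links = extract_links(text)
        if links:
            with_link += 1
        mentions.update(links)  # each article counts a given link once
    share = with_link / len(articles) if articles else 0.0
    return share, mentions


# Toy input; the URLs below are placeholders, not datasets from the study.
articles = {
    "a": "Data are at https://github.com/example/covid-data.",
    "b": "This article mentions no links.",
    "c": "See https://github.com/example/covid-data and https://example.org/cases.csv for details.",
}
share, mentions = summarize(articles)
```

Applied to a real corpus, `share` corresponds to the proportion of articles providing at least one link, `len(mentions)` to the number of unique links, and each count in `mentions` to the number of articles citing that link.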
Award ID(s):
1937136
NSF-PAR ID:
10292533
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
22
Issue:
2
ISSN:
1467-5463
Page Range / eLocation ID:
800 to 811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Covid-19 has been an unprecedented challenge that disruptively reshaped societies and brought a massive amount of novel knowledge to the scientific community. However, as this knowledge flood has surged, researchers have been disadvantaged by not having access to a platform that can quickly synthesize rapidly emerging information and link the expertise it contains to established knowledge foundations. Aiming to fill this gap, in this paper we propose a research framework that can assist scientists in identifying, retrieving, and understanding Covid-19 knowledge from the ocean of scholarly articles. Incorporating Principal Component Decomposition (PCD), a knowledge model based on text analytics, and hierarchical topic tree analysis, the proposed framework profiles the research landscape, retrieves topic-specific knowledge and visualizes knowledge structures. Addressing 127,971 Covid-19 research papers from PubMed, our PCD topic analysis identifies 35 research hotspots, along with their correlations and trends. The hierarchical topic tree analysis further segments the knowledge landscape of the whole dataset into clinical and public health branches at a macro level. To supplement this analysis, we also built a knowledge model from research papers on vaccinations and fetched 92,286 pre-Covid publications as the established knowledge foundation for reference. The hierarchical topic tree analysis results on the retrieved papers show multiple relevant biomedical disciplines and four future research topics: monoclonal antibody treatments, vaccinations in diabetic patients, vaccine immunity effectiveness and durability, and vaccination-related allergic sensitization. 
  2. The relationship between physical activity and mental health, especially depression, is one of the most studied topics in the field of exercise science and kinesiology. Although there is strong consensus that regular physical activity improves mental health and reduces depressive symptoms, some debate the mechanisms involved in this relationship as well as the limitations and definitions used in such studies. Meta-analyses and systematic reviews continue to examine the strength of the association between physical activity and depressive symptoms for the purpose of improving exercise prescription as treatment or combined treatment for depression. This dataset covers 27 review articles (either systematic review, meta-analysis, or both) and 365 primary study articles addressing the relationship between physical activity and depressive symptoms. Primary study articles are manually extracted from the review articles. We used a custom-made workflow (Fu, Yuanxi. (2022). Scopus author info tool (1.0.1) [Python]. https://github.com/infoqualitylab/Scopus_author_info_collection) that uses the Scopus API and manual work to extract and disambiguate authorship information for the 392 reports. The author information file (author_list.csv) is the product of this workflow and can be used to compute the co-author network of the 392 articles. This dataset can be used to construct the inclusion network and the co-author network of the 27 review articles and 365 primary study articles. A primary study article is "included" in a review article if it is considered in the review article's evidence synthesis. Each included primary study article is cited in the review article, but not all references cited in a review article are included in the evidence synthesis or primary study articles. The inclusion network is a bipartite network with two types of nodes: one type represents review articles, and the other represents primary study articles. 
In an inclusion network, if a review article includes a primary study article, there is a directed edge from the review article node to the primary study article node. The attribute file (article_list.csv) includes attributes of the 392 articles, and the edge list file (inclusion_net_edges.csv) contains the edge list of the inclusion network. Collectively, this dataset reflects the evidence production and use patterns within the exercise science and kinesiology scientific community, investigating the relationship between physical activity and depressive symptoms. FILE FORMATS 1. article_list.csv - Unicode CSV 2. author_list.csv - Unicode CSV 3. Chinese_author_name_reference.csv - Unicode CSV 4. inclusion_net_edges.csv - Unicode CSV 5. review_article_details.csv - Unicode CSV 6. supplementary_reference_list.pdf - PDF 7. README.txt - text file UPDATES IN THIS VERSION COMPARED TO V2 (Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi (2022): The Inclusion Network of 27 Review Articles Published between 2013-2018 Investigating the Relationship Between Physical Activity and Depressive Symptoms. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4614455_V2) - We updated file article_list.csv to fill in a missing value: row 389, article_id 440. We filled in the "date" column as 2016-08, which was missing before. 
  3. The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology compares two supervised learning classification approaches to feature engineering and data preprocessing, using five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis in R and tidyverse. The dataset consists of 1000 portable document format files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and the PubMed Central databases of non-COVID-19 papers, and is used for binary classification evaluated by precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach, which labels rows of sentences based on regular expressions, significantly improved the performance of imbalanced sampling techniques compared with another approach that automatically labels the sentences based on how the documents are organized into positive and negative classes; the difference was verified with a t-test on the performance metrics across iterations. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing. 
  4. Scientists who perform major survival surgery on laboratory animals face a dual welfare and methodological challenge: how to choose surgical anesthetics and post-operative analgesics that will best control animal suffering, knowing that both pain and the drugs that manage pain can affect research outcomes. Scientists who publish full descriptions of animal procedures allow critical and systematic reviews of data, demonstrate their adherence to animal welfare norms, and guide other scientists on how to conduct their own studies in the field. We investigated what information on animal pain management a reasonably diligent scientist might find in planning for a successful experiment. To explore how scientists in a range of fields describe their management of this ethical and methodological concern, we scored 400 scientific articles that included major animal survival surgeries as part of their experimental methods, for the completeness of information on anesthesia and analgesia. The 400 articles (250 accepted for publication pre-2011, and 150 in 2014–15, along with 174 articles they reference) included thoracotomies, craniotomies, gonadectomies, organ transplants, peripheral nerve injuries, spinal laminectomies and orthopedic procedures in dogs, primates, swine, mice, rats and other rodents. We scored articles for Publication Completeness (PC), which was any mention of use of anesthetics or analgesics; Analgesia Use (AU), which was any use of post-surgical analgesics; and Analgesia Completeness (AC), a composite score comprising intra-operative analgesia, extended post-surgical analgesia, and use of multimodal analgesia. 338 of 400 articles were PC. 98 of these 338 were AU, with some mention of analgesia, while 240 of 338 mentioned anesthesia only but not postsurgical analgesia. Journals’ caliber, as measured by their 2013 Impact Factor, had no effect on PC or AU. 
We found no effect of whether a journal instructs authors to consult the ARRIVE publishing guidelines published in 2010 on PC or AC for the 150 mouse and rat articles in our 2014–15 dataset. None of the 302 articles that were silent about analgesic use included an explicit statement that analgesics were withheld, or a discussion of how pain management or untreated pain might affect results. We conclude that current scientific literature cannot be trusted to present full detail on use of animal anesthetics and analgesics. We report that publication guidelines focus more on other potential sources of bias in experimental results, under-appreciate the potential for pain and pain drugs to skew data, and thus mostly treat pain management as solely an animal welfare concern, in the jurisdiction of animal care and use committees. At the same time, animal welfare regulations do not include guidance on publishing animal data, even though publication is an integral part of the cycle of research and can affect the welfare of animals in studies building on published work, leaving it to journals and authors to voluntarily decide what details of animal use to publish. We suggest that journals, scientists and animal welfare regulators should revise current guidelines and regulations, on treatment of pain and on transparent reporting of treatment of pain, to improve this dual welfare and data-quality deficiency. Citation: Carbone L, Austin J (2016) Pain and Laboratory Animals: Publication Practices for Better Data Reproducibility and Better Animal Welfare. PLoS ONE 11(5): e0155001. doi:10.1371/journal.pone.0155001 
  5. Many undergraduate students encounter struggle as they navigate academic, financial, and social contexts of higher education. The transition to emergency online instruction during the Spring of 2020 due to the COVID-19 pandemic exacerbated these struggles. To assess college students’ struggles during the transition to online learning in undergraduate biology courses, we surveyed a diverse collection of students (n = 238) at an R2 research institution in the Southeastern United States. Students were asked if they encountered struggles and whether they were able to overcome them. Based on how students responded, they were asked to elaborate on (1) how they persevered without struggle, (2) how they were able to overcome their struggles, or (3) what barriers they encountered that did not allow them to overcome their struggles. Each open-ended response was thematically coded to address salient patterns in students’ ability to either persevere or overcome their struggle. We found that during the transition to remote learning, 67% of students experienced struggle. The most reported struggles included shifts in class format, effective study habits, time management, and increased external commitments. Approximately 83% of those struggling students were able to overcome their struggle, most often citing their instructor’s support and resources offered during the transition as reasons for their success. Students also cited changes in study habits, and increased confidence or belief that they could excel within the course, as ways in which they overcame their struggles. Overall, we found no link between struggles in the classroom and any demographic variables we measured, which included race/ethnicity, gender expression, first-generation college students, transfer student status, and commuter student status. 
Our results highlight the critical role that instructors play in supporting student learning during these uncertain times by promoting student self-efficacy and positive-growth mindset, providing students with the resources they need to succeed, and creating a supportive and transparent learning environment. 
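The inclusion network described in item 2 above (a bipartite network with a directed edge from each review article to every primary study it includes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `review_id` and `study_id` are hypothetical, since the actual header of inclusion_net_edges.csv is not given here, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def load_inclusion_network(csv_file, source_col="review_id", target_col="study_id"):
    """Read an edge list and return both directions of the bipartite network.

    The column names are assumptions for illustration; adjust them to match
    the real header of the edge list file.
    """
    includes = defaultdict(set)     # review article -> primary studies it includes
    included_by = defaultdict(set)  # primary study -> review articles including it
    for row in csv.DictReader(csv_file):
        includes[row[source_col]].add(row[target_col])
        included_by[row[target_col]].add(row[source_col])
    return includes, included_by


# Invented sample edges standing in for the real edge list file.
sample = io.StringIO("review_id,study_id\nR1,S1\nR1,S2\nR2,S2\n")
includes, included_by = load_inclusion_network(sample)

# Primary studies that more than one review draws on as evidence.
shared = sorted(study for study, reviews in included_by.items() if len(reviews) > 1)
```

Keeping both mappings makes it cheap to ask either question of the network: which studies a given review synthesizes, and how many reviews rely on a given study.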