Title: PGB: A PubMed Graph Benchmark for Heterogeneous Network Representation Learning
The PubMed Graph Benchmark (PGB) aggregates the metadata associated with biomedical articles from PubMed into a unified source. The benchmark contains metadata including title, abstract, authors, in/out citations, MeSH terms, MeSH hierarchy, venue, publication type, and chemicals. A minimal loading sketch follows the record details below.
Award ID(s):
2145411
PAR ID:
10419380
Author(s) / Creator(s):
Publisher / Repository:
Zenodo
Date Published:
Edition / Version:
1.0
Subject(s) / Keyword(s):
Biomedical literature; Heterogeneous Network; Bibliographic Network; MeSH Tree
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
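Given the fields listed above, a record-level view of PGB might look like the following minimal sketch. The file name and every key ("pmid", "out_citations", "mesh_terms") are assumptions for illustration only; consult the Zenodo record for the benchmark's actual layout.

```python
import json

# Hypothetical sketch: the field keys and file name below are assumptions,
# not PGB's documented schema.
def load_pgb(path):
    """Yield one article record per line from a JSON-lines dump."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Build edge lists for two relation types of the heterogeneous graph.
citation_edges, mesh_edges = [], []
for article in load_pgb("pgb.jsonl"):                 # assumed file name
    pmid = article["pmid"]                            # assumed key
    for cited in article.get("out_citations", []):    # assumed key
        citation_edges.append((pmid, cited))
    for term in article.get("mesh_terms", []):        # assumed key
        mesh_edges.append((pmid, term))
```

From edge lists like these, a heterogeneous graph library can assemble the article, MeSH, venue, and author node types the benchmark describes.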
More Like this
1. Background: We performed a systematic review that identified at least 9,000 scientific papers on PubMed that include immunofluorescent images of cells from the central nervous system (CNS). These CNS papers contain tens of thousands of immunofluorescent neural images supporting the findings of over 50,000 associated researchers. While many existing reviews discuss different aspects of immunofluorescent microscopy, such as image acquisition and staining protocols, few papers discuss immunofluorescent imaging from an image-processing perspective. We analyzed the literature to determine the image processing methods that were commonly published alongside the associated CNS cell, microscopy technique, and animal model, and we highlight gaps in image processing documentation and reporting in the CNS research field. Methods: We completed a comprehensive search of PubMed publications using Medical Subject Headings (MeSH) terms and other general search terms for CNS cells and common fluorescent microscopy techniques. Publications were found on PubMed using a combination of column description terms and row description terms. We manually tagged the comma-separated values (CSV) metadata of each publication with the following categories: animal or cell model, quantified features, threshold techniques, segmentation techniques, and image processing software. Results: Of the almost 9,000 immunofluorescent imaging papers identified in our search, only 856 explicitly include image processing information. Moreover, hundreds of the 856 papers are missing the thresholding, segmentation, and morphological feature details necessary for explainable, unbiased, and reproducible results. In our assessment of the literature, we visualized current image processing practices, compiled the image processing options from the top twelve software programs, and designed a road map to enhance image processing. We determined that thresholding and segmentation methods were often left out of publications, underreported, or underutilized for quantifying CNS cell research. Discussion: Less than 10% of papers with immunofluorescent images include image processing in their methods. A few authors are implementing advanced image analysis methods to quantify over 40 different CNS cell features, which can provide quantitative insights that will advance CNS research. However, our review shows that image analysis will remain limited in rigor and reproducibility without more rigorous and detailed reporting of image processing methods. Conclusion: Image processing is a critical part of CNS research that must be improved to increase scientific insight, explainability, reproducibility, and rigor.
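As a rough sketch of the tagging workflow described above (not the authors' actual pipeline), the following adds the five tag categories as empty columns to a search-results CSV and summarizes one of them after manual review; all file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sketch of the manual-tagging step; file and column names are
# illustrative, not the authors' actual schema.
papers = pd.read_csv("pubmed_search_results.csv")   # assumed search export

TAG_COLUMNS = ["animal_or_cell_model", "quantified_features",
               "threshold_techniques", "segmentation_techniques",
               "image_processing_software"]
for col in TAG_COLUMNS:
    papers[col] = ""        # each value is filled in by hand during review

# After tagging, summarize reporting gaps, e.g. how many papers document
# a thresholding technique at all.
reported = papers["threshold_techniques"].str.strip().ne("").sum()
print(f"{reported} of {len(papers)} papers document a thresholding method")
```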
2.
Objective: This study aims to review novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles, providing quantitative analysis to answer questions about dataset contents, accessibility, and citations. Methods: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets. Results: We found that 28.5% of the 12,324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned across these articles. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets, and CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns of COVID-19 datasets varied depending on the specific dataset. Conclusion: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.
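The abstract does not specify how the dataset URL links were pulled from the full text; a minimal extraction step, under the assumption that links appear verbatim in the article text, might look like this.

```python
import re

# Minimal link-extraction sketch; the authors' exact pipeline and pattern are
# not specified in the abstract.
URL_RE = re.compile(r"https?://[^\s<>\"\)]+")

def extract_dataset_links(full_text):
    """Return deduplicated URL candidates found in an article's full text."""
    seen, links = set(), []
    for url in URL_RE.findall(full_text):
        url = url.rstrip(".,;")   # trim sentence punctuation stuck to the URL
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

print(extract_dataset_links("Data at https://github.com/owner/covid-data."))
```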
3. Automatically extracted metadata from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset; the challenge is to match records between corpora with high precision. The existing solution, based on information retrieval and string similarity over titles, works well only if the titles are clean. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find matching candidates from the reference data, which has been indexed by Elasticsearch. The core components use supervised methods that combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citations achieves high accuracy, significantly outperforming the baseline method on the same test dataset. We apply this system to match the CiteSeerX database against Web of Science, PubMed, and DBLP. The method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
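A compressed sketch of the two-stage idea, blocking with BM25 and then scoring candidates on features from several metadata fields, is shown below. It uses the rank_bm25 package as a stand-in for the Elasticsearch index, and a hand-weighted score with an assumed threshold in place of the paper's trained supervised model.

```python
from difflib import SequenceMatcher

from rank_bm25 import BM25Okapi   # pip install rank-bm25; stands in for Elasticsearch

# Toy reference corpus; real reference records come from Web of Science,
# PubMed, or DBLP and carry many more metadata fields.
reference = [
    {"title": "deep learning for citation matching", "year": 2019},
    {"title": "entity resolution in bibliographic data", "year": 2020},
]
bm25 = BM25Okapi([r["title"].split() for r in reference])

def best_match(noisy, top_k=10, threshold=0.7):   # assumed threshold
    """BM25 blocking, then a hand-weighted feature score over candidates."""
    scores = bm25.get_scores(noisy["title"].lower().split())
    ranked = sorted(range(len(reference)), key=scores.__getitem__, reverse=True)
    for i in ranked[:top_k]:
        cand = reference[i]
        title_sim = SequenceMatcher(None, noisy["title"].lower(),
                                    cand["title"]).ratio()
        year_match = 1.0 if noisy.get("year") == cand["year"] else 0.0
        if 0.8 * title_sim + 0.2 * year_match > threshold:
            return cand
    return None

print(best_match({"title": "Deep learning for citaton matching", "year": 2019}))
```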
4. With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of 843,269 articles from the PubMed open access subset. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need to develop automatic keyword extraction methods for biomedical literature.
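A hypothetical loader for such a dataset is sketched below; the JSON-lines layout, file name, and key names are assumptions, so check the Zenodo record for PubMedAKE's actual format.

```python
import json

# Hypothetical loader; the JSON-lines layout, file name, and keys ("title",
# "abstract", "keywords") are assumptions to be checked against Zenodo.
def load_pubmedake(path):
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            # input text for an extractor, plus the gold author-assigned keywords
            yield doc["title"] + " " + doc["abstract"], doc["keywords"]

for text, gold_keywords in load_pubmedake("pubmedake_train.jsonl"):  # assumed name
    print(gold_keywords[:3])
    break
```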
5. We propose a new static high-performance mesh data structure for triangle surface meshes on the GPU. Our data structure is carefully designed for parallel execution while capturing mesh locality and confining data access, as much as possible, within the GPU's fast shared memory. We achieve this by subdividing the mesh into patches and representing these patches compactly using a matrix-based representation. Our patching technique is decorated with ribbons, thin mesh strips around patches that eliminate the need to communicate between different computation thread blocks, resulting in consistently high throughput. We call our data structure RXMesh: Ribbon-matriX Mesh. We hide the complexity of our data structure behind a flexible but powerful programming model that helps deliver high performance by inducing load balance even on highly irregular input meshes. We show the efficacy of our programming model on common geometry processing applications: mesh smoothing and filtering, geodesic distance, and vertex normal computation. For evaluation, we benchmark our data structure against well-optimized GPU and (single- and multi-core) CPU data structures and show significant speedups.
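RXMesh itself is a CUDA data structure, and its actual patching (with ribbons and the matrix encoding) is more involved, but the core idea of grouping adjacent triangles into bounded-size patches can be sketched in a few lines; the BFS strategy and patch size here are illustrative, not the paper's algorithm.

```python
from collections import defaultdict, deque

# Conceptual sketch only: greedy BFS grouping of adjacent triangles into
# patches of bounded size. Not the RXMesh algorithm.
def patch_faces(faces, patch_size=256):
    """Partition triangle faces (vertex-index triples) into small patches."""
    edge_to_faces = defaultdict(list)           # face adjacency via shared edges
    for fi, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_faces[tuple(sorted(e))].append(fi)

    assigned, patches = {}, []
    for seed in range(len(faces)):
        if seed in assigned:
            continue
        patch, queue = [], deque([seed])
        while queue and len(patch) < patch_size:
            fi = queue.popleft()
            if fi in assigned:
                continue
            assigned[fi] = len(patches)
            patch.append(fi)
            a, b, c = faces[fi]
            for e in ((a, b), (b, c), (c, a)):  # enqueue edge-adjacent faces
                queue.extend(edge_to_faces[tuple(sorted(e))])
        patches.append(patch)
    return patches

tet = [(0, 1, 2), (0, 1, 3), (1, 2, 3), (0, 2, 3)]   # a tetrahedron
print(patch_faces(tet, patch_size=2))                # two patches of two faces
```

Keeping each patch small enough to fit in a thread block's shared memory is what lets per-patch computation stay on-chip; the ribbons described above then remove the cross-block communication this naive partition would otherwise require at patch boundaries.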