Search for: All records

Award ID contains: 1838200


  1. With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of 843,269 articles from the PubMed open access subset database. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.
  2. Systematic reviews (SRs) are a crucial component of evidence-based clinical practice. Unfortunately, SRs are labor-intensive and do not scale with the exponential growth in literature. Automating evidence synthesis using machine learning models has been proposed, but existing approaches focus solely on the text and ignore additional features like citation information. Recent work demonstrated that citation embeddings can outperform the text itself, suggesting that better network representation may expedite SRs. Yet, how to utilize the rich information in heterogeneous information networks (HINs) for network embeddings is understudied. Existing HIN models fail to produce high-quality embeddings compared to simply running state-of-the-art homogeneous network models. To address existing HIN model limitations, we propose SR-CoMbEr, a community-based multi-view graph convolutional network for learning better embeddings for evidence synthesis. Our model automatically discovers article communities to learn robust embeddings that simultaneously encapsulate the rich semantics in HINs. We demonstrate the effectiveness of our model in automating 15 SRs.
  3. This is the train-test-validation dataset for the PubMed open-access article keyphrase extraction task. The small_* file contains all the articles that have 5 to 25 extractive keyphrases (keyphrases of the article that appear inside its abstract).

  4. Electronic Health Record (EHR) modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and their causal relations towards downstream clinical predictions. To address such limitations, we propose a novel framework, CACHE, to provide effective and insightful clinical predictions based on hypergraph representation learning and counterfactual and factual reasoning techniques. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate a preferred capability of CACHE in generating clinically meaningful interpretations towards the correct predictions.
  5. Federated Learning (FL) is a promising framework for multiple clients to learn a joint model without directly sharing the data. In addition to high utility of the joint model, rigorous privacy protection of the data and communication efficiency are important design goals. Many existing efforts achieve rigorous privacy by ensuring differential privacy for intermediate model parameters; however, they assume a uniform privacy parameter for all the clients. In practice, different clients may have different privacy requirements due to varying policies or preferences. In this paper, we focus on explicitly modeling and leveraging the heterogeneous privacy requirements of different clients and study how to optimize utility for the joint model while minimizing communication cost. As differentially private perturbations affect the model utility, a natural idea is to make better use of information submitted by the clients with higher privacy budgets (referred to as "public" clients, and the opposite as "private" clients). The challenge is how to use such information without biasing the joint model. We propose Projected Federated Averaging (PFA), which extracts the top singular subspace of the model updates submitted by "public" clients and utilizes them to project the model updates of "private" clients before aggregating them. We then propose communication-efficient PFA+, which allows "private" clients to upload projected model updates instead of original ones. Our experiments verify the utility boost of both algorithms compared to the baseline methods, whereby PFA+ achieves over 99% uplink communication reduction for "private" clients.
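
The projection step described in item 5 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `projected_federated_average`, the subspace dimension `k`, the use of an uncentered SVD, and the unweighted mean used for aggregation are all assumptions made for illustration, and the differentially private noise that clients would add is omitted.

```python
import numpy as np


def projected_federated_average(public_updates, private_updates, k):
    """Illustrative sketch only (hypothetical helper, not the paper's code).

    public_updates:  array of shape (n_public, d), one flattened update per row
    private_updates: array of shape (n_private, d)
    k:               assumed subspace dimension, k <= min(n_public, d)
    """
    public_updates = np.asarray(public_updates, dtype=float)
    private_updates = np.asarray(private_updates, dtype=float)

    # The top-k right singular vectors of the stacked "public" updates
    # span their top singular subspace (rows of `basis` are orthonormal).
    _, _, vt = np.linalg.svd(public_updates, full_matrices=False)
    basis = vt[:k]                          # shape (k, d)

    # Coordinates of each "private" update inside that subspace.
    coords = private_updates @ basis.T      # shape (n_private, k)

    # Map the coordinates back to the full parameter space.
    projected = coords @ basis              # shape (n_private, d)

    # Assumption: a plain unweighted mean over all clients.
    aggregate = np.vstack([public_updates, projected]).mean(axis=0)
    return aggregate, basis, coords


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pub = rng.normal(size=(4, 10))    # 4 "public" clients, 10 parameters
    priv = rng.normal(size=(3, 10))   # 3 "private" clients
    agg, basis, coords = projected_federated_average(pub, priv, k=2)
    print(agg.shape, basis.shape, coords.shape)
```

In this sketch, `coords` holds only `k` numbers per "private" client instead of `d`. That is one plausible reading of how uploading projected updates, as the abstract describes for PFA+, could reduce uplink communication.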