

Search for: All records

Award ID contains: 2124789

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract Background: While most health-care providers now use electronic health records (EHRs) to document clinical care, many still treat them as digital versions of paper records. As a result, documentation often remains unstructured, with free-text entries in progress notes; most clinical documentation (40%-80%) is free text. This limits the potential for secondary use and analysis, as machine-learning and data-analysis algorithms are more effective with structured data. Objective: This study aims to use advanced artificial intelligence (AI) and natural language processing (NLP) techniques to improve diagnostic information extraction from clinical notes in a periodontal use case. By automating this process, the study seeks to reduce missing data in dental records and minimize the need for extensive manual annotation, a long-standing barrier to widespread NLP deployment in dental data extraction. Materials and Methods: This research uses large language models (LLMs), specifically Generative Pretrained Transformer 4, to generate synthetic medical notes for fine-tuning a RoBERTa model. The model was trained to better interpret and process dental language, with particular attention to periodontal diagnoses. Model performance was evaluated by manually reviewing 360 clinical notes randomly selected from each participating site's dataset. Results: The results demonstrated highly accurate extraction of periodontal diagnosis data, with sites 1 and 2 achieving weighted average scores of 0.97-0.98. This performance held for all dimensions of periodontal diagnosis: stage, grade, and extent. Discussion: Synthetic data effectively reduced manual annotation needs while preserving model quality. Generalizability across institutions suggests viability for broader adoption, though future work is needed to improve contextual understanding. Conclusion: The study highlights the potential transformative impact of AI and NLP on health-care research; scaling our method could enhance clinical data reuse.
    Free, publicly-accessible full text available May 2, 2026
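The structured output that the fine-tuned RoBERTa model in item 1 targets (periodontal stage, grade, and extent pulled from free text) can be illustrated with a minimal rule-based baseline. This regex sketch is purely hypothetical and only shows the shape of the extraction task; it is not the study's model or schema.

```python
import re

# Illustrative patterns for the three diagnosis dimensions named in the
# abstract: stage (I-IV), grade (A-C), and extent (localized/generalized).
STAGE = re.compile(r"\bstage\s+(I{1,3}V?|IV)\b", re.IGNORECASE)
GRADE = re.compile(r"\bgrade\s+([ABC])\b", re.IGNORECASE)
EXTENT = re.compile(r"\b(generalized|localized)\b", re.IGNORECASE)

def extract_periodontal_dx(note: str) -> dict:
    """Return stage/grade/extent fields, or None when a field is absent."""
    stage = STAGE.search(note)
    grade = GRADE.search(note)
    extent = EXTENT.search(note)
    return {
        "stage": stage.group(1).upper() if stage else None,
        "grade": grade.group(1).upper() if grade else None,
        "extent": extent.group(1).lower() if extent else None,
    }

note = "Dx: generalized periodontitis, Stage III, Grade B; scaling planned."
print(extract_periodontal_dx(note))
# → {'stage': 'III', 'grade': 'B', 'extent': 'generalized'}
```

A rule-based extractor like this typically misses negations and unusual phrasing, which is exactly the contextual understanding a fine-tuned language model is meant to supply.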
  2. Abstract Background: Securing adequate data privacy is critical for the productive utilization of data. De-identification, which masks or replaces specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. Yet few studies have investigated how de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility. Methods: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated under various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset, and we examined the association between data privacy and utility to determine whether a viable tradeoff between the two can be identified. Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and input from data users. This approach could help find a suitable compromise between data privacy and utility.
    Free, publicly-accessible full text available December 1, 2025
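The privacy-utility tension in item 2 can be sketched with a toy de-identification step: generalize quasi-identifiers (age bins, truncated postal codes), then measure the smallest equivalence class (k) and the fraction of records that would need suppression to reach a target k. The field names, bin sizes, and threshold here are illustrative assumptions, not the study's ARX configuration.

```python
from collections import Counter

def generalize(record, zip_digits=3, age_bin=10):
    # Coarsen the two quasi-identifiers: truncate ZIP, bin age by decade.
    age_lo = (record["age"] // age_bin) * age_bin
    return (record["zip"][:zip_digits], f"{age_lo}-{age_lo + age_bin - 1}")

def k_and_suppression(records, k_target=2, **kw):
    # k = size of the smallest equivalence class after generalization;
    # records in classes smaller than k_target would be suppressed.
    classes = Counter(generalize(r, **kw) for r in records)
    k = min(classes.values())
    suppressed = sum(n for n in classes.values() if n < k_target)
    return k, suppressed / len(records)

cohort = [
    {"zip": "06510", "age": 34}, {"zip": "06511", "age": 37},
    {"zip": "06510", "age": 35}, {"zip": "10016", "age": 62},
]
k, rate = k_and_suppression(cohort)
print(k, rate)
# → 1 0.25  (one isolated record would be suppressed to reach k=2)
```

Suppression is exactly the utility loss the abstract reports: every suppressed record or fully masked predictor shrinks what the downstream length-of-stay model can learn from.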
  3. Nikolski, Macha (Ed.)
    Abstract Motivation: Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, their substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. Results: This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods on two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions across a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and its promise for large-scale collaborative GWAS. Availability and implementation: The source code and data are available at https://github.com/amioamo/TDS.
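The collaborative logistic regression in item 3 can be sketched with the common federated pattern: each site computes the gradient of the logistic loss on its own genotypes and only gradients leave the site. This is a minimal sketch of that general idea; the paper's actual two-step protocol and any secure aggregation are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X, y):
    # Gradient of the average logistic loss on one site's data.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def federated_fit(sites, dim, lr=0.5, steps=200):
    w = np.zeros(dim)
    for _ in range(steps):
        grads = [local_gradient(w, X, y) for X, y in sites]
        w -= lr * np.mean(grads, axis=0)   # server averages site gradients
    return w

rng = np.random.default_rng(0)

def make_site(n):
    # Toy site: one SNP column (0/1/2 minor-allele counts) plus intercept.
    g = rng.integers(0, 3, n)
    p = sigmoid(-1.0 + 0.8 * g)            # true positive effect of the SNP
    y = rng.random(n) < p
    return np.column_stack([np.ones(n), g]), y.astype(float)

w = federated_fit([make_site(500), make_site(500)], dim=2)
print(w)   # fitted [intercept, SNP effect]
```

With enough samples the aggregated fit recovers the sign of the simulated SNP effect, which is what the per-marker association test needs.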
  4. Abstract The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality-control process is population stratification: identifying genetic differences among individuals due to subpopulations. A common method for grouping individuals' genomes by ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework that uses PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our client-server scheme, the server first trains a global PCA model on a publicly available genomic dataset containing individuals from multiple populations. Each collaborator (client) then uses the global PCA model to reduce the dimensionality of its local data. After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (their local PCA outputs) about their research datasets to the server, which aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.
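The client-side step in item 4 can be sketched in a few lines: fit PCA on public reference genotypes via SVD, project the local data with those components, and add Laplace noise before sharing. The clipping bound and noise scale below are simplifying assumptions for illustration, not the paper's exact LDP mechanism.

```python
import numpy as np

def fit_global_pca(public_X, k=2):
    # Server side: center the public reference data and take top-k
    # right singular vectors as the global principal components.
    mean = public_X.mean(axis=0)
    _, _, vt = np.linalg.svd(public_X - mean, full_matrices=False)
    return mean, vt[:k]

def ldp_project(local_X, mean, components, eps=1.0, clip=3.0):
    # Client side: project local genotypes with the global model,
    # clip coordinates to bound sensitivity, then add Laplace noise.
    z = (local_X - mean) @ components.T
    z = np.clip(z, -clip, clip)
    scale = 2 * clip / eps          # assumed per-coordinate sensitivity 2*clip
    return z + np.random.default_rng(1).laplace(0.0, scale, z.shape)

rng = np.random.default_rng(0)
public = rng.integers(0, 3, (200, 50)).astype(float)   # reference genotypes
local = rng.integers(0, 3, (20, 50)).astype(float)     # one collaborator
mean, comps = fit_global_pca(public, k=2)
noisy = ldp_project(local, mean, comps)
print(noisy.shape)   # → (20, 2)
```

Only the noisy two-dimensional projections leave the client, which is the metadata the server aligns across collaborators.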
  5. Free, publicly-accessible full text available July 9, 2026
  6. Free, publicly-accessible full text available January 1, 2026
  7. Background: The progressive cognitive decline that is an integral component of Alzheimer’s disease (AD) unfolds in tandem with the natural aging process. Neuroimaging features have demonstrated the capacity to distinguish cognitive decline stemming from typical brain aging from that due to AD between chronological time points. Objective: To disentangle the normal aging effect from AD-related accelerated cognitive decline and unravel its genetic components using a neuroimaging-based deep learning approach. Methods: We developed a deep-learning framework based on a dual-loss Siamese ResNet network to extract fine-grained information from the longitudinal structural magnetic resonance imaging (MRI) data of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. We then conducted genome-wide association studies (GWAS) and post-GWAS analyses to reveal the genetic basis of AD-related accelerated cognitive decline. Results: We used our model to process data from 1,313 individuals, training it on 414 cognitively normal people and predicting cognitive assessment for all participants. In our GWAS of accelerated cognitive decline, we identified two genome-wide significant loci: the APOE locus (19q13.32) and rs144614292 (11p15.1). Variant rs144614292 (G > T) has not been reported in previous AD GWAS. It lies within an intronic region of NELL1, which is expressed in neurons and plays a role in controlling cell growth and differentiation. Cell-type-specific enrichment analysis and functional enrichment of GWAS signals highlighted microglia and immune-response pathways. Conclusions: Our deep learning model effectively extracted relevant neuroimaging features and predicted individual cognitive decline. We reported a novel variant (rs144614292) within the NELL1 gene.
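The dual-loss idea in item 7 (a Siamese network seeing two longitudinal MRI scans of one subject) can be sketched loss-side in numpy. The two terms below, a regression loss on the predicted cognitive score change and a term encouraging embedding distance to track elapsed time, are illustrative assumptions; the abstract does not specify the paper's exact losses.

```python
import numpy as np

def dual_loss(emb_t0, emb_t1, pred_change, true_change, delta_t, alpha=0.5):
    # Term 1: squared error on predicted cognitive-score change.
    regression = np.mean((pred_change - true_change) ** 2)
    # Term 2: distance between the two scans' embeddings should scale
    # with the time elapsed between them (a proxy for progression).
    dist = np.linalg.norm(emb_t1 - emb_t0, axis=1)
    progression = np.mean((dist - delta_t) ** 2)
    return regression + alpha * progression

rng = np.random.default_rng(0)
e0, e1 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = dual_loss(e0, e1, rng.normal(size=8), rng.normal(size=8),
                 delta_t=np.full(8, 2.0))
print(loss >= 0)   # → True
```

Combining a task loss with a pairwise embedding loss is the standard way a Siamese architecture separates subject-specific trajectory information from cross-sectional differences.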
  8. Kinship relationship estimation plays a significant role in today's genomic studies. Since genetic data are mostly stored and protected in separate silos, estimating kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.
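The incremental idea in item 8 can be sketched with a toy scheme: reveal markers in small batches, update a simple identity-by-state (IBS) kinship score after each batch, and stop early once the estimate stabilizes, so fewer markers are exposed. The IBS score and the stopping rule here are illustrative stand-ins for INK's informative scores, not the framework itself.

```python
import numpy as np

def ibs_score(a, b):
    # Crude similarity for genotypes coded 0/1/2: 1 = identical, 0 = maximal difference.
    return 1.0 - np.mean(np.abs(a - b)) / 2.0

def incremental_kinship(g1, g2, batch=50, tol=0.01):
    exposed, prev = 0, None
    for start in range(0, len(g1), batch):
        exposed = min(start + batch, len(g1))
        est = ibs_score(g1[:exposed], g2[:exposed])
        if prev is not None and abs(est - prev) < tol:
            break                       # estimate stabilized: stop exposing markers
        prev = est
    return est, exposed

rng = np.random.default_rng(0)
parent = rng.integers(0, 3, 1000)
# Simulated relative: shares ~70% of marker values with `parent`.
child = np.where(rng.random(1000) < 0.7, parent, rng.integers(0, 3, 1000))
est, used = incremental_kinship(parent, child)
print(round(float(est), 2), used)
```

The privacy-accuracy dial is the tolerance: a looser `tol` stops after fewer batches (fewer markers exposed) at the cost of a noisier kinship estimate.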
  9. The process of matching patients with suitable clinical trials is essential for advancing medical research and providing optimal care. However, current approaches face challenges such as data standardization, ethical considerations, and a lack of interoperability between Electronic Health Records (EHRs) and clinical trial criteria. In this paper, we explore the potential of large language models (LLMs) to address these challenges by leveraging their advanced natural language generation capabilities to improve compatibility between EHRs and clinical trial descriptions. We propose an innovative privacy-aware data augmentation approach for LLM-based patient-trial matching (LLM-PTM), which balances the benefits of LLMs while ensuring the security and confidentiality of sensitive patient data. Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%. Additionally, we present case studies to further illustrate the effectiveness of our approach and provide a deeper understanding of its underlying principles. 
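One privacy-aware step in the spirit of item 9 can be sketched as masking direct identifiers in an EHR snippet before it is sent to an LLM for trial-criterion matching. The patterns and placeholder tags below are illustrative; LLM-PTM's actual augmentation pipeline is more involved than a regex pass.

```python
import re

# Illustrative identifier patterns: SSN, US-style date, honorific + surname.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"), "[NAME]"),
]

def mask_identifiers(text: str) -> str:
    """Replace direct identifiers with placeholder tags before LLM calls."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

snippet = ("Mr. Doe, seen 03/14/2023, SSN 123-45-6789, has stage II CKD "
           "and is interested in trial enrollment.")
print(mask_identifiers(snippet))
# → [NAME], seen [DATE], SSN [SSN], has stage II CKD and is interested in trial enrollment.
```

Masking preserves the clinical content the matching model needs (diagnoses, stages) while keeping patient identifiers out of the prompt.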