

Search for: All records

Award ID contains: 2124789

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract Background: While most health-care providers now use electronic health records (EHRs) to document clinical care, many still treat them as digital versions of paper records. As a result, documentation often remains unstructured, with free-text entries in progress notes; most clinical documentation (40%-80%) is free text. This limits the potential for secondary use and analysis, as machine-learning and data-analysis algorithms are more effective with structured data. Objective: This study aims to use advanced artificial intelligence (AI) and natural language processing (NLP) techniques to improve diagnostic information extraction from clinical notes in a periodontal use case. By automating this process, the study seeks to reduce missing data in dental records and minimize the need for extensive manual annotation, a long-standing barrier to widespread NLP deployment in dental data extraction. Materials and Methods: This research uses large language models (LLMs), specifically Generative Pretrained Transformer 4, to generate synthetic medical notes for fine-tuning a RoBERTa model. The model was trained to better interpret and process dental language, with particular attention to periodontal diagnoses. Model performance was evaluated by manually reviewing 360 clinical notes randomly selected from each participating site's dataset. Results: The results demonstrated highly accurate extraction of periodontal diagnosis data, with sites 1 and 2 achieving weighted average scores of 0.97-0.98. This performance held for all dimensions of periodontal diagnosis: stage, grade, and extent. Discussion: Synthetic data effectively reduced manual annotation needs while preserving model quality. Generalizability across institutions suggests viability for broader adoption, though future work is needed to improve contextual understanding. Conclusion: The study highlights the potential transformative impact of AI and NLP on health-care research; scaling our method could enhance clinical data reuse.
    Free, publicly-accessible full text available May 2, 2026
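The structured output that the fine-tuned RoBERTa model in item 1 targets (periodontal stage, grade, and extent pulled from free text) can be illustrated with a minimal rule-based baseline. This regex sketch is purely hypothetical and only shows the shape of the extraction task; it is not the study's model or schema.

```python
import re

# Illustrative patterns for the three diagnosis dimensions named in the
# abstract: stage (I-IV), grade (A-C), and extent (localized/generalized).
STAGE = re.compile(r"\bstage\s+(I{1,3}V?|IV)\b", re.IGNORECASE)
GRADE = re.compile(r"\bgrade\s+([ABC])\b", re.IGNORECASE)
EXTENT = re.compile(r"\b(generalized|localized)\b", re.IGNORECASE)

def extract_periodontal_dx(note: str) -> dict:
    """Return stage/grade/extent fields, or None when a field is absent."""
    stage = STAGE.search(note)
    grade = GRADE.search(note)
    extent = EXTENT.search(note)
    return {
        "stage": stage.group(1).upper() if stage else None,
        "grade": grade.group(1).upper() if grade else None,
        "extent": extent.group(1).lower() if extent else None,
    }

note = "Dx: generalized periodontitis, Stage III, Grade B; scaling planned."
print(extract_periodontal_dx(note))
# → {'stage': 'III', 'grade': 'B', 'extent': 'generalized'}
```

A rule-based extractor like this typically misses negations and unusual phrasing, which is exactly the contextual understanding a fine-tuned language model is meant to supply.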
  2. Abstract Background: Securing adequate data privacy is critical for the productive utilization of data. De-identification, which masks or replaces specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. Yet few studies have investigated how de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility. Methods: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated under various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset, and we examined the association between data privacy and utility to determine whether a viable tradeoff between the two can be identified. Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and input from data users. This approach could help find a suitable compromise between data privacy and utility.
    Free, publicly-accessible full text available December 1, 2025
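The privacy-utility tension in item 2 can be sketched with a toy de-identification step: generalize quasi-identifiers (age bins, truncated postal codes), then measure the smallest equivalence class (k) and the fraction of records that would need suppression to reach a target k. The field names, bin sizes, and threshold here are illustrative assumptions, not the study's ARX configuration.

```python
from collections import Counter

def generalize(record, zip_digits=3, age_bin=10):
    # Coarsen the two quasi-identifiers: truncate ZIP, bin age by decade.
    age_lo = (record["age"] // age_bin) * age_bin
    return (record["zip"][:zip_digits], f"{age_lo}-{age_lo + age_bin - 1}")

def k_and_suppression(records, k_target=2, **kw):
    # k = size of the smallest equivalence class after generalization;
    # records in classes smaller than k_target would be suppressed.
    classes = Counter(generalize(r, **kw) for r in records)
    k = min(classes.values())
    suppressed = sum(n for n in classes.values() if n < k_target)
    return k, suppressed / len(records)

cohort = [
    {"zip": "06510", "age": 34}, {"zip": "06511", "age": 37},
    {"zip": "06510", "age": 35}, {"zip": "10016", "age": 62},
]
k, rate = k_and_suppression(cohort)
print(k, rate)
# → 1 0.25  (one isolated record would be suppressed to reach k=2)
```

Suppression is exactly the utility loss the abstract reports: every suppressed record or fully masked predictor shrinks what the downstream length-of-stay model can learn from.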
  3. Nikolski, Macha (Ed.)
    Abstract Motivation: Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, their substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. Results: This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods on two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions across a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and its promise for large-scale collaborative GWAS. Availability and implementation: The source code and data are available at https://github.com/amioamo/TDS.
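The collaborative logistic regression in item 3 can be sketched with the common federated pattern: each site computes the gradient of the logistic loss on its own genotypes and only gradients leave the site. This is a minimal sketch of that general idea; the paper's actual two-step protocol and any secure aggregation are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X, y):
    # Gradient of the average logistic loss on one site's data.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def federated_fit(sites, dim, lr=0.5, steps=200):
    w = np.zeros(dim)
    for _ in range(steps):
        grads = [local_gradient(w, X, y) for X, y in sites]
        w -= lr * np.mean(grads, axis=0)   # server averages site gradients
    return w

rng = np.random.default_rng(0)

def make_site(n):
    # Toy site: one SNP column (0/1/2 minor-allele counts) plus intercept.
    g = rng.integers(0, 3, n)
    p = sigmoid(-1.0 + 0.8 * g)            # true positive effect of the SNP
    y = rng.random(n) < p
    return np.column_stack([np.ones(n), g]), y.astype(float)

w = federated_fit([make_site(500), make_site(500)], dim=2)
print(w)   # fitted [intercept, SNP effect]
```

With enough samples the aggregated fit recovers the sign of the simulated SNP effect, which is what the per-marker association test needs.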
  4. Abstract The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality-control process is population stratification: identifying genetic differences among individuals due to subpopulations. A common method for grouping individuals' genomes by ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework that uses PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our client-server scheme, the server first trains a global PCA model on a publicly available genomic dataset containing individuals from multiple populations. Each collaborator (client) then uses the global PCA model to reduce the dimensionality of its local data. After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (their local PCA outputs) about their research datasets to the server, which aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.
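The client-side step in item 4 can be sketched in a few lines: fit PCA on public reference genotypes via SVD, project the local data with those components, and add Laplace noise before sharing. The clipping bound and noise scale below are simplifying assumptions for illustration, not the paper's exact LDP mechanism.

```python
import numpy as np

def fit_global_pca(public_X, k=2):
    # Server side: center the public reference data and take top-k
    # right singular vectors as the global principal components.
    mean = public_X.mean(axis=0)
    _, _, vt = np.linalg.svd(public_X - mean, full_matrices=False)
    return mean, vt[:k]

def ldp_project(local_X, mean, components, eps=1.0, clip=3.0):
    # Client side: project local genotypes with the global model,
    # clip coordinates to bound sensitivity, then add Laplace noise.
    z = (local_X - mean) @ components.T
    z = np.clip(z, -clip, clip)
    scale = 2 * clip / eps          # assumed per-coordinate sensitivity 2*clip
    return z + np.random.default_rng(1).laplace(0.0, scale, z.shape)

rng = np.random.default_rng(0)
public = rng.integers(0, 3, (200, 50)).astype(float)   # reference genotypes
local = rng.integers(0, 3, (20, 50)).astype(float)     # one collaborator
mean, comps = fit_global_pca(public, k=2)
noisy = ldp_project(local, mean, comps)
print(noisy.shape)   # → (20, 2)
```

Only the noisy two-dimensional projections leave the client, which is the metadata the server aligns across collaborators.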
  5. Free, publicly-accessible full text available July 9, 2026
  6. Free, publicly-accessible full text available January 1, 2026
  7. Background: The progressive cognitive decline that is an integral component of Alzheimer’s disease (AD) unfolds in tandem with the natural aging process. Neuroimaging features have demonstrated the capacity to distinguish cognitive decline stemming from typical brain aging from that due to AD between chronological time points. Objective: To disentangle the normal aging effect from AD-related accelerated cognitive decline and unravel its genetic components using a neuroimaging-based deep learning approach. Methods: We developed a deep-learning framework based on a dual-loss Siamese ResNet network to extract fine-grained information from the longitudinal structural magnetic resonance imaging (MRI) data of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. We then conducted genome-wide association studies (GWAS) and post-GWAS analyses to reveal the genetic basis of AD-related accelerated cognitive decline. Results: We used our model to process data from 1,313 individuals, training it on 414 cognitively normal people and predicting cognitive assessment for all participants. In our GWAS of accelerated cognitive decline, we identified two genome-wide significant loci: the APOE locus (19q13.32) and rs144614292 (11p15.1). Variant rs144614292 (G > T) has not been reported in previous AD GWAS. It lies within an intronic region of NELL1, which is expressed in neurons and plays a role in controlling cell growth and differentiation. Cell-type-specific enrichment analysis and functional enrichment of GWAS signals highlighted microglia and immune-response pathways. Conclusions: Our deep learning model effectively extracted relevant neuroimaging features and predicted individual cognitive decline. We reported a novel variant (rs144614292) within the NELL1 gene.
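The dual-loss idea in item 7 (a Siamese network seeing two longitudinal MRI scans of one subject) can be sketched loss-side in numpy. The two terms below, a regression loss on the predicted cognitive score change and a term encouraging embedding distance to track elapsed time, are illustrative assumptions; the abstract does not specify the paper's exact losses.

```python
import numpy as np

def dual_loss(emb_t0, emb_t1, pred_change, true_change, delta_t, alpha=0.5):
    # Term 1: squared error on predicted cognitive-score change.
    regression = np.mean((pred_change - true_change) ** 2)
    # Term 2: distance between the two scans' embeddings should scale
    # with the time elapsed between them (a proxy for progression).
    dist = np.linalg.norm(emb_t1 - emb_t0, axis=1)
    progression = np.mean((dist - delta_t) ** 2)
    return regression + alpha * progression

rng = np.random.default_rng(0)
e0, e1 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = dual_loss(e0, e1, rng.normal(size=8), rng.normal(size=8),
                 delta_t=np.full(8, 2.0))
print(loss >= 0)   # → True
```

Combining a task loss with a pairwise embedding loss is the standard way a Siamese architecture separates subject-specific trajectory information from cross-sectional differences.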
  8. Kinship relationship estimation plays a significant role in today's genomic studies. Since genetic data are mostly stored and protected in separate silos, estimating kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.
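The incremental idea in item 8 can be sketched with a toy scheme: reveal markers in small batches, update a simple identity-by-state (IBS) kinship score after each batch, and stop early once the estimate stabilizes, so fewer markers are exposed. The IBS score and the stopping rule here are illustrative stand-ins for INK's informative scores, not the framework itself.

```python
import numpy as np

def ibs_score(a, b):
    # Crude similarity for genotypes coded 0/1/2: 1 = identical, 0 = maximal difference.
    return 1.0 - np.mean(np.abs(a - b)) / 2.0

def incremental_kinship(g1, g2, batch=50, tol=0.01):
    exposed, prev = 0, None
    for start in range(0, len(g1), batch):
        exposed = min(start + batch, len(g1))
        est = ibs_score(g1[:exposed], g2[:exposed])
        if prev is not None and abs(est - prev) < tol:
            break                       # estimate stabilized: stop exposing markers
        prev = est
    return est, exposed

rng = np.random.default_rng(0)
parent = rng.integers(0, 3, 1000)
# Simulated relative: shares ~70% of marker values with `parent`.
child = np.where(rng.random(1000) < 0.7, parent, rng.integers(0, 3, 1000))
est, used = incremental_kinship(parent, child)
print(round(float(est), 2), used)
```

The privacy-accuracy dial is the tolerance: a looser `tol` stops after fewer batches (fewer markers exposed) at the cost of a noisier kinship estimate.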
  9. The process of matching patients with suitable clinical trials is essential for advancing medical research and providing optimal care. However, current approaches face challenges such as data standardization, ethical considerations, and a lack of interoperability between Electronic Health Records (EHRs) and clinical trial criteria. In this paper, we explore the potential of large language models (LLMs) to address these challenges by leveraging their advanced natural language generation capabilities to improve compatibility between EHRs and clinical trial descriptions. We propose an innovative privacy-aware data augmentation approach for LLM-based patient-trial matching (LLM-PTM), which balances the benefits of LLMs while ensuring the security and confidentiality of sensitive patient data. Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%. Additionally, we present case studies to further illustrate the effectiveness of our approach and provide a deeper understanding of its underlying principles. 
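One privacy-aware step in the spirit of item 9 can be sketched as masking direct identifiers in an EHR snippet before it is sent to an LLM for trial-criterion matching. The patterns and placeholder tags below are illustrative; LLM-PTM's actual augmentation pipeline is more involved than a regex pass.

```python
import re

# Illustrative identifier patterns: SSN, US-style date, honorific + surname.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"), "[NAME]"),
]

def mask_identifiers(text: str) -> str:
    """Replace direct identifiers with placeholder tags before LLM calls."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

snippet = ("Mr. Doe, seen 03/14/2023, SSN 123-45-6789, has stage II CKD "
           "and is interested in trial enrollment.")
print(mask_identifiers(snippet))
# → [NAME], seen [DATE], SSN [SSN], has stage II CKD and is interested in trial enrollment.
```

Masking preserves the clinical content the matching model needs (diagnoses, stages) while keeping patient identifiers out of the prompt.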