This content will become publicly available on March 1, 2026

Title: Privacy-preserving framework for genomic computations via multi-key homomorphic encryption
Abstract
Motivation: The affordability of genome sequencing and the widespread availability of genomic data have opened up new medical possibilities. Nevertheless, they also raise significant concerns regarding privacy due to the sensitive information they encompass. These privacy implications act as barriers to medical research and data availability. Researchers have proposed privacy-preserving techniques to address this, with cryptography-based methods showing the most promise. However, existing cryptography-based designs lack (i) interoperability, (ii) scalability, (iii) a high degree of privacy (i.e. compromise one to have the other), or (iv) multiparty analyses support (as most existing schemes process genomic information of each party individually). Overcoming these limitations is essential to unlocking the full potential of genomic data while ensuring privacy and data utility. Further research and development are needed to advance privacy-preserving techniques in genomics, focusing on achieving interoperability and scalability, preserving data utility, and enabling secure multiparty computation.
Results: This study aims to overcome the limitations of current cryptography-based techniques by employing a multi-key homomorphic encryption scheme. By utilizing this scheme, we have developed a comprehensive protocol capable of conducting diverse genomic analyses. Our protocol facilitates interoperability among individual genome processing and enables multiparty tests, analyses of genomic databases, and operations involving multiple databases. Consequently, our approach represents an innovative advancement in secure genomic data processing, offering enhanced protection and privacy measures.
Availability and implementation: All associated code and documentation are available at https://github.com/farahpoor/smkhe.
Award ID(s):
2141622
PAR ID:
10588032
Author(s) / Creator(s):
; ; ;
Editor(s):
Martelli, Pier Luigi
Publisher / Repository:
Oxford
Date Published:
Journal Name:
Bioinformatics
Volume:
41
Issue:
3
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract
     Motivation: Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint (distort the steganographic marks, i.e. the embedded fingerprint bit-string) by launching effective correlation attacks, which leverage the intrinsic correlations among genomic data (e.g. Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks.
     Results: Via experiments using a real-world genomic database, we first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits by causing a small utility loss (e.g. database accuracy and consistency of SNP–phenotype associations measured via P-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP–phenotype associations, the attacker can only distort about 30% of the fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques preserve the utility of the shared genomic databases, e.g. the mitigation techniques only lead to around 3% loss in accuracy.
     Availability and implementation: https://github.com/xiutianxi/robust-genomic-fp-github.
  2. Nikolski, Macha (Ed.)
    Abstract
    Motivation: Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal secure guarantees, the substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS.
    Results: This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods using two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions using a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and the promise for its application for large-scale collaborative GWAS.
    Availability and implementation: The source code and data are available at https://github.com/amioamo/TDS.
  3. DNA edit distance (ED) measures the minimum number of single nucleotide insertions, substitutions, or deletions required to convert a DNA sequence into another. ED has broad applications in healthcare such as sequence alignment, genome assembly, functional annotation, and drug discovery. Privacy-preserving computation is essential in this context to protect sensitive genomic data. Nonetheless, the existing secure DNA edit distance solutions lack efficiency when handling large data sequences or resort to approximations and fail to accurately compute the metric. In this work, we introduce ScureED, a protocol that tackles these limitations, resulting in a significant performance enhancement of approximately 2-24 times compared to existing methods. Our protocol computes a secure ED between two genomes, each comprising 1,000 letters, in just a few seconds. The underlying technique of our protocol is a novel approach that transforms the established approximate matching technique (i.e., the Ukkonen algorithm) into exact matching, exploiting the inherent similarity in human DNA to achieve cost-effectiveness. Furthermore, we introduce various optimizations tailored for secure computation in scenarios with a limited input domain, such as DNA sequences composed solely of the four nucleotide letters. 
  4. Preserving differential privacy has been well studied under the centralized setting. However, it is very challenging to preserve differential privacy under the multiparty setting, especially for the vertically partitioned case. In this work, we propose a new framework for differential privacy preserving multiparty learning in the vertically partitioned setting. Our core idea is based on the functional mechanism, which achieves differential privacy of the released model by adding noise to the objective function. We show that the server can simply dissect the objective function into single-party and cross-party sub-functions, and allocate computation and perturbation of their polynomial coefficients to local parties. Our method needs only one round of noise addition and secure aggregation. The released model in our framework achieves the same utility as applying the functional mechanism in the centralized setting. Evaluation on real-world and synthetic datasets for linear and logistic regressions shows the effectiveness of our proposed method.
  5. Alanio, Alexandre (Ed.)
    ABSTRACT: Modern taxonomic classification is often based on phylogenetic analyses of a few molecular markers, although single-gene studies are still common. Here, we leverage genome-scale molecular phylogenetics (phylogenomics) of species and populations to reconstruct evolutionary relationships in a dense data set of 710 fungal genomes from the biomedically and technologically important genus Aspergillus. To do so, we generated a novel set of 1,362 high-quality molecular markers specific for Aspergillus and provided profile Hidden Markov Models for each, facilitating their use by others. Examining the resulting phylogeny helped resolve ongoing taxonomic controversies, identified new ones, and revealed extensive strain misidentification (7.59% of strains were previously misidentified), underscoring the importance of population-level sampling in species classification. These findings were corroborated using the current standard, taxonomically informative loci. These findings suggest that phylogenomics of species and populations can facilitate accurate taxonomic classifications and reconstructions of the Tree of Life.
    IMPORTANCE: Identification of fungal species relies on the use of molecular markers. Advances in genomic technologies have made it possible to sequence the genome of any fungal strain, making it possible to use genomic data for the accurate assignment of strains to fungal species (and for the discovery of new ones). We examined the usefulness and current limitations of genomic data using a large data set of 710 publicly available genomes from multiple strains and species of the biomedically, agriculturally, and industrially important genus Aspergillus. Our evolutionary genomic analyses revealed that nearly 8% of publicly available Aspergillus genomes are misidentified. Our work highlights the usefulness of genomic data for fungal systematic biology and suggests that systematic genome sequencing of multiple strains, including reference strains (e.g., type strains), of fungal species will be required to reduce misidentification errors in public databases.
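Item 3 in the list above defines DNA edit distance as the minimum number of single-nucleotide insertions, substitutions, or deletions needed to turn one sequence into another, and mentions an Ukkonen-style approximate matching technique. As a rough illustration of that metric only, here is a minimal plaintext sketch in Python. It uses the classic banded dynamic program with a doubling threshold; it involves no cryptography and is not the protocol described in that abstract. The function names `banded_edit_distance` and `edit_distance` are hypothetical and chosen for this example.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance between a and b if it is at most k, else None.

    Only cells with |i - j| <= k are computed (a band around the diagonal),
    which is the usual threshold cutoff for edit distance. Plaintext only.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length gap alone already exceeds the threshold
    inf = k + 1  # any value above k means "outside the threshold"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            curr[0] = i  # deleting the first i letters of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
                inf,                 # cap so out-of-band values stay at inf
            )
        prev = curr
    return prev[m] if prev[m] <= k else None


def edit_distance(a: str, b: str) -> int:
    """Exact edit distance, found by doubling the threshold until it suffices."""
    k = 1
    while True:
        d = banded_edit_distance(a, b, k)
        if d is not None:
            return d
        k *= 2


if __name__ == "__main__":
    print(edit_distance("ACGT", "ACGT"))
    print(edit_distance("kitten", "sitting"))
```

When the two sequences are similar, a small threshold already succeeds, so only a narrow band of the table is ever filled in. This is one plausible reading of why high similarity between sequences makes the computation cheaper, offered here as an assumption rather than a description of the cited work.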