skip to main content

Title: Privacy preserving identification of population stratification for collaborative genomic research

The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality control process is population stratification: identifying the presence of genetic difference in individuals due to subpopulations. One of the common methods used to group genomes of individuals based on ancestry is principal component analysis (PCA). In this article, we propose a privacy-preserving framework which utilizes PCA to assign individuals to populations across multiple collaborators as part of the population stratification step. In our proposed client-server-based scheme, we initially let the server train a global PCA model on a publicly available genomic dataset which contains individuals from multiple populations. The global PCA model is later used to reduce the dimensionality of the local data by each collaborator (client). After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (in the form of their local PCA outputs) about their research datasets to the server, which then aligns the local PCA results to identify the genetic differences among collaborators’ datasets. Our results on real genomic data show that the proposed framework can perform population stratification analysis with high accuracy while preserving the privacy of the research participants.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Page Range / eLocation ID:
p. i168-i176
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods were developed to estimate kinship, they suffer from high computational requirements and often make an untenable assumption of homogeneous population ancestry of the samples. Moreover, genetic privacy is generally overlooked in the usage of kinship estimation methods. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise while estimating and reporting sensitive population-level statistics such as inbreeding coefficients for the concerns around marginalization and stigmatization.


    Here, we present SIGFRIED, which makes use of existing reference panels with a projection-based approach that simplifies kinship estimation in the admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives for computing relatedness between samples in two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at


    Analysis of relatedness is fundamentally important for identifying relatives, in association studies, and for estimation of population-level estimates of inbreeding. As the awareness of individual and group genomic privacy is growing, privacy-preserving methods for the estimation of relatedness are needed. Presented methods alleviate the ethical and privacy concerns in the analysis of relatedness in admixed, historically isolated and underrepresented populations.

    Short Abstract

    Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome wide association studies and for estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements, and occasionally do not consider individuals from admixed ancestries. Furthermore, the ethical concerns around using genetic data and calculating relatedness are not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.

    more » « less
  2. This paper studies a distributed optimization problem in the federated learning (FL) framework under differential privacy constraints, whereby a set of clients having local samples are connected to an untrusted server, who wants to learn a global model while preserving the privacy of clients’ local datasets. We propose a new client sampling called self-sampling that reflects the random availability of clients in the learning process in FL. We analyze the differential privacy of the SGD with client self-sampling by composing amplification by sub-sampling along with amplification by shuffling. Furthermore, we analyze the convergence of the proposed SGD algorithm showing that we can get a reasonable learning performance while preserving the privacy of clients’ data even with client self-sampling. 
    more » « less
  3. Federated Learning enables a population of clients, working with a trusted server, to collaboratively learn a shared machine learning model while keeping each client's data within its own local systems. This reduces the risk of exposing sensitive data, but it is still possible to reverse engineer information about a client's private data set from communicated model parameters. Most federated learning systems therefore use differential privacy to introduce noise to the parameters. This adds uncertainty to any attempt to reveal private client data, but also reduces the accuracy of the shared model, limiting the useful scale of privacy-preserving noise. A system can further reduce the coordinating server's ability to recover private client information, without additional accuracy loss, by also including secure multiparty computation. An approach combining both techniques is especially relevant to financial firms as it allows new possibilities for collaborative learning without exposing sensitive client data. This could produce more accurate models for important tasks like optimal trade execution, credit origination, or fraud detection. The key contributions of this paper are: We present a privacy-preserving federated learning protocol to a non-specialist audience, demonstrate it using logistic regression on a real-world credit card fraud data set, and evaluate it using an open-source simulation platform which we have adapted for the development of federated learning systems. 
    more » « less
  4. Abstract

    Intraspecific polymorphism in birds, especially plumage colour polymorphism, and the mechanisms that control it are an area of active research in evolutionary biology. The black‐headed bulbul (Brachypodius atriceps) is a polymorphic species with two distinct morphs, yellow and grey. This species inhabits the mainland and virtually all continental islands of Southeast Asia where yellow morphs predominate, but on two islands in the Sunda region, Bawean and Maratua, grey morphs are common or exclusive. Here, we generated a high‐quality reference genome of a yellow individual and resequenced genomes of multiple individuals of both yellow and grey morphs to study the genetic basis of coloration and population history of the species. Using PCA and STRUCTURE analysis, we found the Maratua Island population (which is exclusively grey) to be distinct from all otherB.atricepspopulations, having been isolated c. 1.9 million years ago (Ma). In contrast, Bawean grey individuals (a subset of yellow and grey individuals on that island) are embedded within an almost panmictic Sundaic clade of yellow birds. UsingFSTanddxyto compare variable genomic segments between Maratua and yellow individuals, we located peaks of divergence and identified candidate loci involved in the colour polymorphism. Tests of selection among coding‐proteins in highFSTregions, however, did not indicate selection on the candidate genes. Overall, we report on some loci that are potentially responsible for the grey/yellow polymorphism in a species that otherwise shows little genetic diversification across most of its range.

    more » « less
  5. Tensor factorization has been demonstrated as an efficient approach for computational phenotyping, where massive electronic health records (EHRs) are converted to concise and meaningful clinical concepts. While distributing the tensor factorization tasks to local sites can avoid direct data sharing, it still requires the exchange of intermediary results which could reveal sensitive patient information. Therefore, the challenge is how to jointly decompose the tensor under rigorous and principled privacy constraints, while still support the model's interpretability. We propose DPFact, a privacy-preserving collaborative tensor factorization method for computational phenotyping using EHR. It embeds advanced privacy-preserving mechanisms with collaborative learning. Hospitals can keep their EHR database private but also collaboratively learn meaningful clinical concepts by sharing differentially private intermediary results. Moreover, DPFact solves the heterogeneous patient population using a structured sparsity term. In our framework, each hospital decomposes its local tensors and sends the updated intermediary results with output perturbation every several iterations to a semi-trusted server which generates the phenotypes. The evaluation on both real-world and synthetic datasets demonstrated that under strict privacy constraints, our method is more accurate and communication-efficient than state-of-the-art baseline methods. 
    more » « less