Title: Efficient Federated Kinship Relationship Identification.
Kinship relationship estimation plays a significant role in today's genome studies. Since genetic data are mostly stored and protected in different silos, retrieving the desirable kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.
Award ID(s):
2124789
PAR ID:
10448808
Journal Name:
AMIA Jt Summits Transl Sci Proc 2023
Page Range / eLocation ID:
534–543
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kinship plays a fundamental role in the evolution of social systems and is considered a key driver of group living. To understand the role of kinship in the formation and maintenance of social bonds, accurate measures of genetic relatedness are critical. Genotype‐by‐sequencing technologies are rapidly advancing the accuracy and precision of genetic relatedness estimates for wild populations. The ability to assign kinship from genetic data varies depending on a species’ or population's mating system and pattern of dispersal, and empirical data from longitudinal studies are crucial to validate these methods. We use data from a long‐term behavioural study of a polygynandrous, bisexually philopatric marine mammal to measure accuracy and precision of parentage and genetic relatedness estimation against a known partial pedigree. We show that with moderate but obtainable sample sizes of approximately 4,235 SNPs and 272 individuals, highly accurate parentage assignments and genetic relatedness coefficients can be obtained. Additionally, we subsample our data to quantify how data availability affects relatedness estimation and kinship assignment. Lastly, we conduct a social network analysis to investigate the extent to which accuracy and precision of relatedness estimation improve statistical power to detect an effect of relatedness on social structure. Our results provide practical guidance for minimum sample sizes and sequencing depth for future studies, as well as thresholds for post hoc interpretation of previous analyses.
  2. Understanding the relationship between genotypes and phenotypes is crucial for advancing personalized medicine. Expression quantitative trait loci (eQTL) mapping plays a significant role by correlating genetic variants to gene expression levels. Despite the progress made by large-scale projects, eQTL mapping still faces challenges in statistical power and privacy concerns. Multi-site studies can increase sample sizes but are hindered by privacy issues. We present privateQTL, a novel framework leveraging secure multi-party computation for secure and federated eQTL mapping. When tested in a real-world scenario with data from different studies, privateQTL outperformed meta-analysis by accurately correcting for covariates and batch effect and retaining higher accuracy and precision for both eGene-eVariant mapping and effect size estimation. In addition, privateQTL is modular and scalable, making it adaptable for other molecular phenotypes and large-scale studies. Our results indicate that privateQTL is a practical solution for privacy-preserving collaborative eQTL mapping. 
  3. As genomic research continues to advance, sharing of genomic data and research outcomes has become increasingly important for fostering collaboration and accelerating scientific discovery. However, such data sharing must be balanced with the need to protect the privacy of individuals whose genetic information is being utilized. This paper presents a bidirectional framework for evaluating privacy risks associated with data shared (both in terms of summary statistics and research datasets) in genomic research papers, particularly focusing on re-identification risks such as membership inference attacks (MIA). The framework consists of a structured workflow that begins with a questionnaire designed to capture researchers’ (authors’) self-reported data sharing practices and privacy protection measures. Responses are used to calculate the risk of re-identification for their study (paper) when compared with the National Institutes of Health (NIH) genomic data sharing policy. Any gaps in compliance help us to identify potential vulnerabilities and encourage the researchers to enhance their privacy measures before submitting their research for publication. The paper also demonstrates the application of this framework, using published genomic research as case study scenarios to emphasize the importance of implementing bidirectional frameworks to support trustworthy open science and genomic data sharing practices.
  4. Traditionally, an item-level differential privacy framework has been studied for applications in distributed learning. However, when a client has multiple data samples, and might want to also hide its potential participation, a more appropriate notion is that of user-level privacy [1]. In this paper, we develop a distributed private optimization framework that studies the trade-off between user-level local differential privacy guarantees and performance. This is enabled by a novel distributed user-level private mean estimation algorithm using distributed private heavy-hitter estimation. We use this result to develop the privacy-performance trade-off for distributed optimization.
  5. A Bayesian method is proposed for variable selection in high-dimensional matrix autoregressive models which reflects and exploits the original matrix structure of data to (a) reduce dimensionality and (b) foster interpretability of multidimensional relationship structures. A compact form of the model is derived which facilitates the estimation procedure and two computational methods for the estimation are proposed: a Markov chain Monte Carlo algorithm and a scalable Bayesian EM algorithm. Being based on the spike-and-slab framework for fast posterior mode identification, the latter enables Bayesian data analysis of matrix-valued time series at large scales. The theoretical properties, comparative performance, and computational efficiency of the proposed model are investigated through simulated examples and an application to a panel of country economic indicators.