
Title: Privacy-aware estimation of relatedness in admixed populations
Abstract Background

Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods have been developed to estimate kinship, they suffer from high computational requirements and often make the untenable assumption that the samples share a homogeneous population ancestry. Moreover, genetic privacy is generally overlooked when kinship estimation methods are used. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise when estimating and reporting sensitive population-level statistics such as inbreeding coefficients, because of the risk of marginalization and stigmatization.


Here, we present SIGFRIED, which uses existing reference panels and a projection-based approach to simplify kinship estimation in admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives to compute relatedness between samples at two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at


Analysis of relatedness is fundamentally important for identifying relatives, in association studies, and for population-level estimation of inbreeding. As awareness of individual and group genomic privacy grows, privacy-preserving methods for the estimation of relatedness are needed. The presented methods alleviate ethical and privacy concerns in the analysis of relatedness in admixed, historically isolated and underrepresented populations.

Short Abstract

Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome-wide association studies, and estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements and occasionally do not account for individuals with admixed ancestries. Furthermore, the ethical concerns around using genetic data and calculating relatedness are not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.
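
As a rough illustration of what a projection-based kinship computation can look like, the Python sketch below is offered under stated assumptions and is not SIGFRIED's implementation. It assumes that a reference panel has already been summarized as per-variant mean allele frequencies (`ref_mean`) and a set of orthonormal axes (`ref_axes`). These names, the clipping threshold `eps`, and the correlation-style estimator are all choices made for this example. Each genotype (0/1/2 alternate-allele counts) is projected onto the axes to obtain individual-specific allele frequencies, which are then used to center and scale the genotype products.

```python
from math import sqrt


def project_allele_freqs(genotype, ref_mean, ref_axes, eps=0.01):
    """Estimate individual-specific allele frequencies by projection.

    genotype : list of 0/1/2 alternate-allele counts for one individual
    ref_mean : per-variant mean allele frequency in the reference panel
    ref_axes : list of orthonormal axes (hypothetical, precomputed from
               the reference panel, e.g. by PCA), one list per axis
    eps      : illustrative clipping bound keeping frequencies in (0, 1)
    """
    # Center the individual's allele dosage on the reference mean.
    centered = [g / 2.0 - m for g, m in zip(genotype, ref_mean)]
    freqs = list(ref_mean)
    for axis in ref_axes:
        # Coordinate of this individual along the reference axis.
        score = sum(a * c for a, c in zip(axis, centered))
        # Add the axis contribution back in allele-frequency space.
        freqs = [f + score * a for f, a in zip(freqs, axis)]
    # Clip so that p * (1 - p) never vanishes in the kinship denominator.
    return [min(max(f, eps), 1.0 - eps) for f in freqs]


def kinship(geno_i, geno_j, freqs_i, freqs_j):
    """Correlation-style kinship using individual-specific frequencies.

    This is an assumed estimator chosen for illustration, not one taken
    from the abstract above.
    """
    num = 0.0
    den = 0.0
    for gi, gj, pi, pj in zip(geno_i, geno_j, freqs_i, freqs_j):
        num += (gi - 2.0 * pi) * (gj - 2.0 * pj)
        den += sqrt(pi * (1.0 - pi) * pj * (1.0 - pj))
    return num / (4.0 * den)
```

In use, `project_allele_freqs` would be called once per individual and `kinship` once per pair. Because each individual is projected independently, the per-individual step needs no access to other samples' genotypes. A secure variant of the pairwise step is beyond the scope of this sketch.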

Publication Date:
Journal Name:
Briefings in Bioinformatics
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    An increasing number of phylogenomic studies have documented a clear “footprint” of postspeciation introgression among closely related species. Nonetheless, systematic genome-wide studies of factors that determine the likelihood of introgression remain rare. Here, we propose an a priori hypothesis-testing framework that uses introgression statistics, including a new metric of estimated introgression, Dp, to evaluate general patterns of introgression prevalence and direction across multiple closely related species. We demonstrate this approach using whole genome sequences from 32 lineages in 11 wild tomato species to assess the effect of three factors on introgression (genetic relatedness, geographical proximity, and mating system differences) based on multiple trios within the “ABBA–BABA” test. Our analyses suggest each factor affects the prevalence of introgression, although our power to detect these is limited by the number of comparisons currently available. We find that of 14 species pairs with geographically “proximate” versus “distant” population comparisons, 13 showed evidence of introgression; in 10 of these cases, this was more prevalent between geographically closer populations. We also find modest evidence that introgression declines with increasing genetic divergence between lineages, is more prevalent between lineages that share the same mating system, and, when it does occur between mating systems, tends to involve gene flow from more inbreeding to more outbreeding lineages. Although our analysis indicates that recent postspeciation introgression is frequent in this group (detected in 15 of 17 tested trios), estimated levels of genetic exchange are modest (0.2–2.5% of the genome), so the relative importance of hybridization in shaping the evolutionary trajectories of these species could be limited. Regardless, similar clade-wide analyses of genomic introgression would be valuable for disentangling the major ecological, reproductive, and historical determinants of postspeciation gene flow, and for assessing the relative contribution of introgression as a source of genetic variation.

  2. Abstract Objective

    Online patient portals become important during disruptions to in-person health care, like when cases of coronavirus disease 2019 (COVID-19) and other respiratory viruses rise, yet underlying structural inequalities associated with race, socio-economic status, and other socio-demographic characteristics may affect their use. We analyzed a population-based survey to identify disparities within the United States in access to online portals during the early period of COVID-19 in 2020.

    Materials and Methods

    The National Cancer Institute fielded the 2020 Health Information National Trends Survey from February to June 2020. We conducted multivariable analysis to identify socio-demographic characteristics of US patients who were offered and accessed online portals, and reasons for nonuse.


    Less than half of insured adult patients reported accessing an online portal in the prior 12 months, and this was less common among patients who are male, are Hispanic, have less than a college degree, have Medicaid insurance, have no regular provider, or have no internet. Reasons for nonuse include: wanting to speak directly to a provider, not having an online record, concerns about privacy, and discomfort with technology.


    Despite the rapid expansion of digital health technologies due to COVID-19, we found persistent socio-demographic disparities in access to patient portals. Ensuring that digital health tools are secure, private, and trustworthy would address some patient concerns that are barriers to portal access.


    Expanding the use of online portals requires explicitly addressing fundamental inequities to prevent exacerbating existing disparities, particularly during surges in cases of COVID-19 and other respiratory viruses that tax health care resources.

  3. Abstract Motivation

    Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited.


    We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding an artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data.

    Availability and implementation

    The NExUS source code is freely available for download at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  4. Abstract Background

    A considerable amount of data of various types has been collected during the COVID-19 pandemic, and its analysis and understanding have been indispensable for curbing the spread of the disease. As the pandemic moves to an endemic state, the data collected during the pandemic will continue to be rich sources for further studying and understanding the impacts of the pandemic on various aspects of our society. On the other hand, naïve release and sharing of the information can be associated with serious privacy concerns.


    We use three common but distinct data types collected during the pandemic (case surveillance tabular data, case location data, and contact tracing networks) to illustrate the publication and sharing of granular information and individual-level pandemic data in a privacy-preserving manner. We leverage and build upon the concept of differential privacy to generate and release privacy-preserving data for each data type. We investigate the inferential utility of privacy-preserving information through simulation studies at different levels of privacy guarantees and demonstrate the approaches in real-life data. All the approaches employed in the study are straightforward to apply.


    The empirical studies in all three data cases suggest that privacy-preserving results based on the differentially privately sanitized data can be similar to the original results at a reasonably small privacy loss ($$\epsilon \approx 1$$). Statistical inferences based on sanitized data using the multiple synthesis technique also appear valid, with nominal coverage of 95% confidence intervals when there is no noticeable bias in point estimation. When $$\epsilon < 1$$ and the sample size is not large enough, some privacy-preserving results are subject to bias, partially due to the bounding applied to sanitized data as a post-processing step to satisfy practical data constraints.


    Our study generates statistical evidence on the practical feasibility of sharing pandemic data with privacy guarantees and on how to balance the statistical utility of released information during this process.

  5. Abstract Motivation

    The polygenic risk score (PRS) has been widely exploited for genetic risk prediction due to its accuracy and conceptual simplicity. We introduce a unified Bayesian regression framework, NeuPred, for PRS construction, which accommodates varying genetic architectures and improves overall prediction accuracy for complex diseases by allowing for a wide class of prior choices. To take full advantage of the framework, we propose a summary-statistics-based cross-validation strategy to automatically select suitable chromosome-level priors. This strategy reveals striking variability in the prior preferred by each chromosome for the same complex disease, and further significantly improves the prediction accuracy.


    Simulation studies and real data applications with seven disease datasets from the Wellcome Trust Case Control Consortium cohort and eight groups of large-scale genome-wide association studies demonstrate that NeuPred achieves substantial and consistent improvements in terms of predictive $$r^2$$ over existing methods. In addition, NeuPred has similar or advantageous computational efficiency compared with the state-of-the-art Bayesian methods.

    Availability and implementation

    The R package implementing NeuPred is available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.