An increasing number of phylogenomic studies have documented a clear “footprint” of postspeciation introgression among closely related species. Nonetheless, systematic genome-wide studies of factors that determine the likelihood of introgression remain rare. Here, we propose an a priori hypothesis-testing framework that uses introgression statistics, including a new metric of estimated introgression, Dp, to evaluate general patterns of introgression prevalence and direction across multiple closely related species. We demonstrate this approach using whole-genome sequences from 32 lineages in 11 wild tomato species to assess the effect of three factors on introgression (genetic relatedness, geographical proximity, and mating system differences) based on multiple trios within the “ABBA–BABA” test. Our analyses suggest each factor affects the prevalence of introgression, although our power to detect these effects is limited by the number of comparisons currently available. We find that of 14 species pairs with geographically “proximate” versus “distant” population comparisons, 13 showed evidence of introgression; in 10 of these cases, introgression was more prevalent between geographically closer populations. We also find modest evidence that introgression declines with increasing genetic divergence between lineages, is more prevalent between lineages that share the same mating system, and, when it does occur between mating systems, tends to involve gene flow from more inbreeding …
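As a rough illustration of the “ABBA–BABA” test named in this abstract, the following minimal sketch counts the two discordant site patterns for one trio plus an outgroup and computes the classical Patterson's D statistic. It does not implement the Dp metric, whose definition is not given in this record. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
def count_patterns(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites across four aligned sequences.

    The outgroup allele is treated as ancestral ("A"); any other allele
    at that site is treated as derived ("B").
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if a == o and b == c and b != o:
            abba += 1  # P1 ancestral, P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P2 ancestral, P1 and P3 share the derived allele
    return abba, baba


def patterson_d(abba, baba):
    """Classical D = (ABBA - BABA) / (ABBA + BABA)."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total


# Toy alignment (hypothetical data): three ABBA sites and one BABA site.
p1 = "ACATAC"
p2 = "GTGTAG"
p3 = "GTATCG"
out = "ACGTAC"

abba, baba = count_patterns(p1, p2, p3, out)
d_stat = patterson_d(abba, baba)
```

A positive D indicates an excess of derived alleles shared between P2 and P3, and a negative D an excess shared between P1 and P3. In practice significance is assessed with a resampling procedure such as a block jackknife, which this sketch omits.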
Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods have been developed to estimate kinship, they suffer from high computational requirements and often make an untenable assumption that the samples share a homogeneous population ancestry. Moreover, genetic privacy is generally overlooked when kinship estimation methods are used. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise when estimating and reporting sensitive population-level statistics such as inbreeding coefficients, because of the risk of marginalization and stigmatization.
Here, we present SIGFRIED, which uses existing reference panels with a projection-based approach that simplifies kinship estimation in admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives for computing relatedness between samples at two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at https://doi.org/10.5281/zenodo.7053352.
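To make the idea of kinship estimation in admixed samples more concrete, here is a minimal plaintext sketch of a moment-based kinship estimator that uses individual-specific allele frequencies instead of a single population frequency. It is an assumption made for illustration, not SIGFRIED's published estimator: the function name `kinship` is hypothetical, the per-individual frequencies are assumed to come from an upstream step such as projection onto a reference panel, and no encryption is shown.

```python
from math import sqrt


def kinship(g_i, g_j, p_i, p_j):
    """Moment estimator of kinship between individuals i and j.

    g_i, g_j: genotype dosages (0, 1, or 2) per variant.
    p_i, p_j: individual-specific allele frequencies per variant,
              assumed to be estimated elsewhere (hypothetical input).
    """
    num = 0.0
    den = 0.0
    for gi, gj, pi, pj in zip(g_i, g_j, p_i, p_j):
        # Centre each dosage on its expected value 2p for that individual.
        num += (gi - 2.0 * pi) * (gj - 2.0 * pj)
        # Normalise by the geometric mean of the per-individual variances.
        den += sqrt(pi * (1.0 - pi) * pj * (1.0 - pj))
    if den == 0.0:
        raise ValueError("all variants are monomorphic; kinship is undefined")
    return num / (4.0 * den)


# Toy example (hypothetical data) with frequency 0.5 at every variant.
freqs = [0.5, 0.5, 0.5, 0.5]
homozygous = [2, 0, 2, 0]
heterozygous = [1, 1, 1, 1]

self_kinship = kinship(homozygous, homozygous, freqs, freqs)
unrelated_like = kinship(homozygous, heterozygous, freqs, freqs)
```

When both individuals share the same frequency vector, this reduces to the familiar homogeneous-population estimator. Allowing the frequencies to differ per individual is what lets an estimator of this kind avoid the homogeneous-ancestry assumption.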
Analysis of relatedness is fundamentally important for identifying relatives, for association studies, and for estimating population-level inbreeding. As the awareness of …
Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome-wide association studies, and estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements and often do not account for individuals of admixed ancestry. Furthermore, the ethical concerns around using genetic data and calculating relatedness are generally not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.
- Publication Date:
- NSF-PAR ID:
- Journal Name: Briefings in Bioinformatics
- Publisher: Oxford University Press
- Sponsoring Org: National Science Foundation
More Like this
Online patient portals become important during disruptions to in-person health care, like when cases of coronavirus disease 2019 (COVID-19) and other respiratory viruses rise, yet underlying structural inequalities associated with race, socio-economic status, and other socio-demographic characteristics may affect their use. We analyzed a population-based survey to identify disparities within the United States in access to online portals during the early period of COVID-19 in 2020.
Materials and Methods
The National Cancer Institute fielded the 2020 Health Information National Trends Survey from February to June 2020. We conducted multivariable analysis to identify socio-demographic characteristics of US patients who were offered and accessed online portals, and their reasons for nonuse.
Fewer than half of insured adult patients reported accessing an online portal in the prior 12 months, and access was less common among patients who are male, are Hispanic, have less than a college degree, have Medicaid insurance, have no regular provider, or have no internet. Reasons for nonuse included wanting to speak directly to a provider, not having an online record, concerns about privacy, and discomfort with technology.
Despite the rapid expansion of digital health technologies due to COVID-19, we found persistent socio-demographic disparities in access to patient portals. Ensuring …
Expanding the use of online portals requires explicitly addressing fundamental inequities to prevent exacerbating existing disparities, particularly during surges in cases of COVID-19 and other respiratory viruses that tax health care resources.
Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited.
We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding an artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data.
Availability and implementation
The NExUS source code is freely available for download at https://github.com/priyamdas2/NExUS.
Supplementary data are available at Bioinformatics online.
Some examples of privacy-preserving sharing of COVID-19 pandemic data with statistical utility evaluation
A considerable amount of data of various types has been collected during the COVID-19 pandemic, and its analysis and interpretation have been indispensable for curbing the spread of the disease. As the pandemic moves to an endemic state, the data collected during the pandemic will continue to be a rich source for studying and understanding the impacts of the pandemic on various aspects of our society. On the other hand, naïve release and sharing of the information can raise serious privacy concerns.
We use three common but distinct data types collected during the pandemic (case surveillance tabular data, case location data, and contact tracing networks) to illustrate the publication and sharing of granular information and individual-level pandemic data in a privacy-preserving manner. We leverage and build upon the concept of differential privacy to generate and release privacy-preserving data for each data type. We investigate the inferential utility of privacy-preserving information through simulation studies at different levels of privacy guarantees and demonstrate the approaches in real-life data. All the approaches employed in the study are straightforward to apply.
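The abstract does not specify which differentially private mechanisms are used for each data type. As a generic illustration only, the sketch below applies the standard Laplace mechanism to a table of case counts, assuming that adding or removing one person changes exactly one count by one (sensitivity 1). The function names, the region labels, and the counts are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize_counts(counts, epsilon, rng=None):
    """Release counts with epsilon-differential privacy (Laplace mechanism).

    Assumes each individual contributes to exactly one cell, so the L1
    sensitivity of the whole table is 1 and the noise scale is 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical surveillance table: cases per region.
cases = {"region_a": 120, "region_b": 45, "region_c": 8}

# Smaller epsilon gives a stronger privacy guarantee and noisier counts.
strong_privacy = privatize_counts(cases, epsilon=0.1, rng=random.Random(1))
weak_privacy = privatize_counts(cases, epsilon=5.0, rng=random.Random(1))
```

Comparing the two releases shows the trade-off the abstract refers to: the expected absolute error per cell is 1/epsilon, so small cells such as `region_c` lose most of their inferential utility at strict privacy levels.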
The empirical studies in all three data cases suggest that privacy-preserving results based on the differentially privately sanitized data …
Our study generates statistical evidence on the practical feasibility of sharing pandemic data with privacy guarantees and on how to balance the statistical utility of released information during this process.
The polygenic risk score (PRS) has been widely used for genetic risk prediction due to its accuracy and conceptual simplicity. We introduce a unified Bayesian regression framework, NeuPred, for PRS construction, which accommodates varying genetic architectures and improves overall prediction accuracy for complex diseases by allowing for a wide class of prior choices. To take full advantage of the framework, we propose a summary-statistics-based cross-validation strategy to automatically select suitable chromosome-level priors. This strategy reveals striking variability in the preferred prior across chromosomes for the same complex disease, and further significantly improves the prediction accuracy.
Simulation studies and real data applications with seven disease datasets from the Wellcome Trust Case Control Consortium cohort and eight groups of large-scale genome-wide association studies demonstrate that NeuPred achieves substantial and consistent improvements in terms of predictive r² over existing methods. In addition, NeuPred has similar or advantageous computational efficiency compared with the state-of-the-art Bayesian methods.
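For readers unfamiliar with the quantities involved, the sketch below shows how a PRS is formed once per-variant effect weights are available, and how predictive r² can be computed as the squared Pearson correlation between scores and observed phenotypes. It does not reproduce NeuPred's Bayesian estimation of the weights or its prior selection. The function names and the toy numbers are hypothetical.

```python
def polygenic_scores(dosages, weights):
    """Score each individual as the weighted sum of allele dosages.

    dosages: one row per individual, one dosage (0, 1, or 2) per variant.
    weights: one effect-size estimate per variant (assumed given).
    """
    return [sum(w * d for w, d in zip(weights, row)) for row in dosages]


def predictive_r2(scores, phenotypes):
    """Squared Pearson correlation between scores and phenotypes."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(phenotypes) / n
    s_sy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, phenotypes))
    s_ss = sum((s - mean_s) ** 2 for s in scores)
    s_yy = sum((y - mean_y) ** 2 for y in phenotypes)
    if s_ss == 0 or s_yy == 0:
        raise ValueError("scores or phenotypes have zero variance")
    return (s_sy * s_sy) / (s_ss * s_yy)


# Toy data (hypothetical): three individuals, two variants.
dosages = [[0, 1], [1, 1], [2, 0]]
weights = [0.5, -0.2]
phenotypes = [0.1, 0.4, 0.9]

scores = polygenic_scores(dosages, weights)
r2 = predictive_r2(scores, phenotypes)
```

In a real evaluation the weights would be estimated on training data and r² computed on held-out individuals, so that the reported accuracy is not inflated by overfitting.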
Availability and implementation
The R package implementing NeuPred is available at https://github.com/shuangsong0110/NeuPred.
Supplementary data are available at Bioinformatics online.