Title: What can one chromosome tell us about human biogeographical ancestry?
We study the problem of predicting human biogeographical ancestry using genomic data. While predicting continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated subpopulations (e.g., from the same continent) is still a difficult challenge. In particular, we focus on the case where the analysis is constrained to using single nucleotide polymorphisms (SNPs) from just one chromosome. We thus propose methods to construct such ancestry-informative SNP panels, and assess the performance of SNP panels drawn from just one chromosome for both continental-level and subpopulation-level ancestry prediction. We include results that demonstrate the performance of the proposed methods, including a comparison with other recently published related methods.
Award ID(s):
1650474 1066197
NSF-PAR ID:
10053534
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proc. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Page Range / eLocation ID:
188 to 193
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background Large medical centers in urban areas, like Los Angeles, care for a diverse patient population and offer the potential to study the interplay between genetic ancestry and social determinants of health. Here, we explore the implications of genetic ancestry within the University of California, Los Angeles (UCLA) ATLAS Community Health Initiative, an ancestrally diverse biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients (N = 36,736). Methods We quantify the extensive continental and subcontinental genetic diversity within the ATLAS data through principal component analysis, identity-by-descent, and genetic admixture. We assess the relationship between genetically inferred ancestry (GIA) and >1500 EHR-derived phenotypes (phecodes). Finally, we demonstrate the utility of genetic data linked with EHR to perform ancestry-specific and multi-ancestry genome and phenome-wide scans across a broad set of disease phenotypes. Results We identify 5 continental-scale GIA clusters including European American (EA), African American (AA), Hispanic Latino American (HL), South Asian American (SAA) and East Asian American (EAA) individuals and 7 subcontinental GIA clusters within the EAA GIA corresponding to Chinese American, Vietnamese American, and Japanese American individuals. Although we broadly find that self-identified race/ethnicity (SIRE) is highly correlated with GIA, we still observe marked differences between the two, emphasizing that the populations defined by these two criteria are not analogous. We find a total of 259 significant associations between continental GIA and phecodes even after accounting for individuals’ SIRE, demonstrating that for some phenotypes, GIA provides information not already captured by SIRE. GWAS identifies significant associations for liver disease in the 22q13.31 locus across the HL and EAA GIA groups (HL p-value = 2.32×10^-16, EAA p-value = 6.73×10^-11). A subsequent PheWAS at the top SNP reveals significant associations with neurologic and neoplastic phenotypes specifically within the HL GIA group. Conclusions Overall, our results explore the interplay between SIRE and GIA within a disease context and underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping linked with EHR-based phenotyping.
  2. Abstract Background

    Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods were developed to estimate kinship, they suffer from high computational requirements and often make an untenable assumption of homogeneous population ancestry of the samples. Moreover, genetic privacy is generally overlooked in the usage of kinship estimation methods. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise when estimating and reporting sensitive population-level statistics such as inbreeding coefficients, given the risk of marginalization and stigmatization.

    Results

    Here, we present SIGFRIED, which makes use of existing reference panels with a projection-based approach that simplifies kinship estimation in the admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives for computing relatedness between samples in two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at https://doi.org/10.5281/zenodo.7053352.

    Conclusions

    Analysis of relatedness is fundamentally important for identifying relatives, for association studies, and for population-level estimates of inbreeding. As the awareness of individual and group genomic privacy is growing, privacy-preserving methods for the estimation of relatedness are needed. The presented methods alleviate the ethical and privacy concerns in the analysis of relatedness in admixed, historically isolated and underrepresented populations.

    Short Abstract

    Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome wide association studies and for estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements, and occasionally do not consider individuals from admixed ancestries. Furthermore, the ethical concerns around using genetic data and calculating relatedness are not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.

     
  3. Background While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party. Objective In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference. Methods Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority. Results We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model. Conclusions Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security. 
  4. Abstract Objectives

    Crown and root traits, like those in the Arizona State University Dental Anthropology System (ASUDAS), are seemingly useful as genetic proxies. However, recent studies report mixed results concerning their heritability, and ability to assess variation to the level of genomic data. The aim is to test further if such traits can approximate genetic relatedness, among continental and global samples.

    Materials and Methods

    First, for 12 African populations, Mantel correlations were calculated between mean measure of divergence (MMD) distances from up to 36 ASUDAS traits, and FST distances from >350,000 single nucleotide polymorphisms (SNPs) among matched dental and genetic samples. Second, among 32 global samples, MMD and FST distances were again compared. Correlations were also calculated between them and inter‐sample geographic distances to further evaluate correspondence.

    Results

    A close ASUDAS/SNP association, based on MMD and FST correlations, is evident, with r_m values between .72 globally and .84 in Africa. The same is true concerning their association with geographic distances, from .68 for a 36‐trait African MMD to .77 for FST globally; one exception is FST and African geographic distances, r_m = 0.49. Partial MMD/FST correlations controlling for geographic distances are strong for Africa (.78) and moderate globally (.4).

    Discussion

    Relative to prior studies, MMD/FST correlations imply greater dental and genetic correspondence; for studies allowing direct comparison, the present correlations are markedly stronger. The implication is that ASUDAS traits are reliable proxies for genetic data. This is a positive conclusion, meaning they can be used with or instead of genomic markers when the latter are unavailable.

     
  5. Abstract

    Hybridization facilitates recombination between divergent genetic lineages and can be shaped by both neutral and selective processes. Upon hybridization, loci with no net fitness effects introgress randomly from parental species into the genomes of hybrid individuals. Conversely, alleles from one parental species at some loci may provide a selective advantage to hybrids, resulting in patterns of introgression that do not conform to random expectations. We investigated genomic patterns of differential introgression in natural hybrids of two species of Caribbean anoles, Anolis pulchellus and A. krugi in Puerto Rico. Hybrids exhibit A. pulchellus phenotypes but possess A. krugi mitochondrial DNA, originated from multiple, independent hybridization events, and appear to have replaced pure A. pulchellus across a large area in western Puerto Rico. Combining genome‐wide SNP datasets with bioinformatic methods to identify signals of differential introgression in hybrids, we demonstrate that the genomes of hybrids are dominated by pulchellus‐derived alleles and show only 10%–20% A. krugi ancestry. The majority of A. krugi loci in hybrids exhibit a signal of non‐random differential introgression and include loci linked to genes involved in development and immune function. Three of these genes (delta like canonical notch ligand 1, jagged1 and notch receptor 1) affect cell differentiation and growth and interact with mitochondrial function. Our results suggest that differential non‐random introgression for a subset of loci may be driven by selection favouring the inheritance of compatible mitochondrial and nuclear‐encoded genes in hybrids.
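
Note on item 3 above: that abstract builds its similarity metric from four set operations on SNP panels (union, intersection, set difference, symmetric difference). The sketch below shows only the plaintext arithmetic behind the Jaccard similarity. It is not the authors' secure protocol, and it involves no encryption or multi-party computation. The function name and the SNP identifiers are made up for illustration.

```python
def jaccard_similarity(snps_a, snps_b):
    """Plaintext Jaccard similarity |A ∩ B| / |A ∪ B| of two SNP sets."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    if not union:
        # Assumed convention for two empty panels; the abstract does not specify one.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical SNP identifiers, not real rsIDs.
panel_a = {"snp_1", "snp_2", "snp_3", "snp_4"}
panel_b = {"snp_3", "snp_4", "snp_5"}

# The four set operations named in the abstract.
union = panel_a | panel_b                 # SNPs in either panel
intersection = panel_a & panel_b          # SNPs in both panels
difference = panel_a - panel_b            # SNPs only in panel_a
symmetric_difference = panel_a ^ panel_b  # SNPs in exactly one panel

similarity = jaccard_similarity(panel_a, panel_b)
print(similarity)
```

Here the two panels share 2 of 5 distinct SNPs, so the similarity is 2/5. A value near 1 would suggest the same individual, and intermediate values may indicate relatedness, which is the querying use case the abstract describes.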

     