skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: A two‐phase Bayesian methodology for the analysis of binary phenotypes in genome‐wide association studies
Abstract

Recent advances in sequencing and genotyping technologies are contributing to a data revolution in genome‐wide association studies that is characterized by the challenging largepsmallnproblem in statistics. That is, given these advances, many such studies now consider evaluating an extremely large number of genetic markers (p) genotyped on a small number of subjects (n). Given the dimension of the data, a joint analysis of the markers is often fraught with many challenges, while a marginal analysis is not sufficient. To overcome these obstacles, herein, we propose a Bayesian two‐phase methodology that can be used to jointly relate genetic markers to binary traits while controlling for confounding. The first phase of our approach makes use of a marginal scan to identify a reduced set of candidate markers that are then evaluated jointly via a hierarchical model in the second phase. Final marker selection is accomplished through identifying a sparse estimator via a novel and computationally efficient maximum a posteriori estimation technique. We evaluate the performance of the proposed approach through extensive numerical studies, and consider a genome‐wide application involving colorectal cancer.

 
more » « less
Award ID(s):
1826715
NSF-PAR ID:
10458722
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Biometrical Journal
Volume:
62
Issue:
1
ISSN:
0323-3847
Page Range / eLocation ID:
p. 191-201
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple‐testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large‐scale consortia and develop open access to enable sharing of genome‐wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two‐stage testing procedure, namedphylogenY‐based effect‐size tests for interactions using first 2 moments(YETI2), to detect genetic interactions through bothpooled marginal effects, in terms of averaging site‐specific marginal effects, andheterogeneity in marginal effectsacross sites, using a meta‐analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlyingheterogeneity in marginal effectsto prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.

     
    more » « less
  2. Abstract

    Understanding the genetics of biological diversification across micro‐ and macro‐evolutionary time scales is a vibrant field of research for molecular ecologists as rapid advances in sequencing technologies promise to overcome former limitations. In palms, an emblematic, economically and ecologically important plant family with high diversity in the tropics, studies of diversification at the population and species levels are still hampered by a lack of genomic markers suitable for the genotyping of large numbers of recently diverged taxa. To fill this gap, we used a whole genome sequencing approach to develop target sequencing for molecular markers in 4,184 genome regions, including 4,051 genes and 133 non‐genic putatively neutral regions. These markers were chosen to cover a wide range of evolutionary rates allowing future studies at the family, genus, species and population levels. Special emphasis was given to the avoidance of copy number variation during marker selection. In addition, a set of 149 well‐known sequence regions previously used as phylogenetic markers by the palm biological research community were included in the target regions, to open the possibility to combine and jointly analyse already available data sets with genomic data to be produced with this new toolkit. The bait set was effective for species belonging to all three palm sub‐families tested (Arecoideae, Ceroxyloideae and Coryphoideae), with high mapping rates, specificity and efficiency. The number of high‐quality single nucleotide polymorphisms (SNPs) detected at both the sub‐family and population levels facilitates efficient analyses of genomic diversity across micro‐ and macro‐evolutionary time scales.

     
    more » « less
  3. Abstract

    The study of ecological speciation is inherently linked to the study of selection. Methods for estimating phenotypic selection within a generation based on associations between trait values and fitness (e.g. survival) of individuals are established. These methods attempt to disentangle selection acting directly on a trait from indirect selection caused by correlations with other traits via multivariate statistical approaches (i.e. inference of selection gradients). The estimation of selection on genotypic or genomic variation could also benefit from disentangling direct and indirect selection on genetic loci. However, achieving this goal is difficult with genomic data because the number of potentially correlated genetic loci (p) is very large relative to the number of individuals sampled (n). In other words, the number of model parameters exceeds the number of observations (p ≫ n). We present simulations examining the utility of whole‐genome regression approaches (i.e. Bayesian sparse linear mixed models) for quantifying direct selection in cases wherep ≫ n. Such models have been used for genome‐wide association mapping and are common in artificial breeding. Our results show they hold promise for studies of natural selection in the wild and thus of ecological speciation. But we also demonstrate important limitations to the approach and discuss study designs required for more robust inferences.

     
    more » « less
  4. Abstract

    High‐throughput DNA sequencing facilitates the analysis of large portions of the genome in nonmodel organisms, ensuring high accuracy of population genetic parameters. However, empirical studies evaluating the appropriate sample size for these kinds of studies are still scarce. In this study, we use double‐digest restriction‐associated DNA sequencing (ddRADseq) to recover thousands of single nucleotide polymorphisms (SNPs) for two physically isolated populations ofAmphirrhox longifolia(Violaceae), a nonmodel plant species for which no reference genome is available. We used resampling techniques to construct simulated populations with a random subset of individuals and SNPs to determine how many individuals and biallelic markers should be sampled for accurate estimates of intra‐ and interpopulation genetic diversity. We identified 3646 and 4900 polymorphic SNPs for the two populations ofA. longifolia, respectively. Our simulations show that, overall, a sample size greater than eight individuals has little impact on estimates of genetic diversity withinA. longifoliapopulations, when 1000 SNPs or higher are used. Our results also show that even at a very small sample size (i.e. two individuals), accurate estimates ofFSTcan be obtained with a large number of SNPs (≥1500). These results highlight the potential of high‐throughput genomic sequencing approaches to address questions related to evolutionary biology in nonmodel organisms. Furthermore, our findings also provide insights into the optimization of sampling strategies in the era of population genomics.

     
    more » « less
  5. Springer, Mark (Ed.)
    Abstract Despite the increasing feasibility of sequencing whole genomes from diverse taxa, a persistent problem in phylogenomics is the selection of appropriate genetic markers or loci for a given taxonomic group or research question. In this review, we aim to streamline the decision-making process when selecting specific markers to use in phylogenomic studies by introducing commonly used types of genomic markers, their evolutionary characteristics, and their associated uses in phylogenomics. Specifically, we review the utilities of ultraconserved elements (including flanking regions), anchored hybrid enrichment loci, conserved nonexonic elements, untranslated regions, introns, exons, mitochondrial DNA, single nucleotide polymorphisms, and anonymous regions (nonspecific regions that are evenly or randomly distributed across the genome). These various genomic elements and regions differ in their substitution rates, likelihood of neutrality or of being strongly linked to loci under selection, and mode of inheritance, each of which are important considerations in phylogenomic reconstruction. These features may give each type of marker important advantages and disadvantages depending on the biological question, number of taxa sampled, evolutionary timescale, cost effectiveness, and analytical methods used. We provide a concise outline as a resource to efficiently consider key aspects of each type of genetic marker. There are many factors to consider when designing phylogenomic studies, and this review may serve as a primer when weighing options between multiple potential phylogenomic markers. 
    more » « less