Title: AraPheno and the AraGWAS Catalog 2020: a major database update including RNA-Seq and knockout mutation data for Arabidopsis thaliana
Abstract: Genome-wide association studies (GWAS) are integral to studying genotype-phenotype relationships and gaining a deeper understanding of the genetic architecture underlying trait variation. A plethora of genetic associations between distinct loci and various traits have been successfully discovered and published for the model plant Arabidopsis thaliana. This success and the free availability of full genomes and phenotypic data for more than 1,000 different natural inbred lines led to the development of several data repositories. AraPheno (https://arapheno.1001genomes.org) serves as a central repository of population-scale phenotypes in A. thaliana, while the AraGWAS Catalog (https://aragwas.1001genomes.org) provides a publicly available, manually curated and standardized collection of marker-trait associations for all available phenotypes from AraPheno. In this major update, we introduce the next generation of both platforms, including new data, features and tools. We included novel results on associations between knockout mutations and all AraPheno traits. Furthermore, AraPheno has been extended to display RNA-Seq data for hundreds of accessions, providing expression information for over 28 000 genes for these accessions. All data, including the imputed genotype matrix used for GWAS, are easily downloadable via the respective databases.
Award ID(s): 1701918
PAR ID: 10127270
Journal Name: Nucleic Acids Research
ISSN: 0305-1048
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract: Traits that have lost function sometimes persist through evolutionary time. Persistence may occur if there is not enough standing genetic variation for the trait to allow a response to selection, if selection against the trait is weak relative to drift, or if the trait has a residual function. To determine the evolutionary processes shaping whether nonfunctional traits are retained or lost, we investigated short stamens in 16 populations of Arabidopsis thaliana along an elevational cline in northeast Spain. A. thaliana is highly self-pollinating and prior work suggests short stamens do not contribute to self-pollination. We found a cline in short stamen number from retention of short stamens in high elevation populations to incomplete loss in low elevation populations. We did not find evidence that limited genetic variation constrains short stamen loss at high elevations, nor evidence for divergent selection on short stamens between high and low elevations. Finally, we identified loci associated with short stamens in northeast Spain that are different from loci associated with variation in short stamens across latitudes from a previous study. Overall, we did not identify the evolutionary mechanisms contributing to an elevational cline in short stamen number, so further research is clearly warranted. This Dryad dataset includes the GWAS output results. See the GitHub repository for phenotypic data and SRA for genotypic data.

    Technical info:

    # Evaluating the roles of drift and selection in trait loss along an elevational gradient

    Dataset DOI: [10.5061/dryad.8sf7m0d0z](10.5061/dryad.8sf7m0d0z)

    ## Description of the data and file structure

    These files are the relatedness matrices and GWAS output files for a GWAS on short stamen number in *A. thaliana* from an elevation gradient across the Pyrenees. The associated paper is "Evaluating the Roles of Drift and Selection in Trait Loss along an Elevational Gradient" by Buysse et al. The code used to generate the files can be found on GitHub: [https://github.com/sfbuysse/A_thaliana_StamenLoss_2025](https://github.com/sfbuysse/A_thaliana_StamenLoss_2025). The input data is SNP information for 61 genotypes from 16 native populations of *A. thaliana*.

    ### Files and variables

    #### File: RelatednessMatrices.zip

    **Description:** **RelatednessMatrices.zip** contains centered relatedness matrices made with GEMMA v0.98.4. Relatedness matrices are *.cXX.txt, and the *.log.txt files show the code and run log information.

    * allSNPs.PlinkFiltering_Asin, allSNPs.PlinkFiltering_Binary, allSNPs.PlinkFiltering_raw: identical relatedness matrices made using all SNPs in the dataset after filtering with Plink. Names were changed to match the phenotype files to run the GWAS.
    * allSNPs.PlinkFiltering_raw_subset: centered relatedness matrix made with all SNPs after Plink filtering but only the individuals with some short stamen loss (mean short stamen number < 2).
    * NoCent.PlinkFiltering_Asin, NoCent.PlinkFiltering_Binary, NoCent.PlinkFiltering_raw: identical relatedness matrices made after excluding the centromere region and filtering with Plink. Names were changed to match the phenotype files to run the GWAS.
    * NoCent.PlinkFiltering_raw_subset: centered relatedness matrix made after excluding the centromere and Plink filtering but only the individuals with some short stamen loss (mean short stamen number < 2).

    #### File: GWAS.zip

    **Description:** **GWAS.zip** contains GWAS output files. The GWAS output files are *.assoc.txt and the code information is *.log.txt. GWAS were run in GEMMA v0.98.4.

    Within each .assoc.txt file the columns are as follows:

    * chr = chromosome
    * rs = SNP ID (chromosome:base pair position)
    * ps = base pair position
    * n_miss = number of genotypes missing genetic information at that SNP
    * allele1 = minor allele
    * allele2 = major allele
    * af = minor allele frequency
    * beta = effect size
    * se = standard error for beta
    * log_lH1 = log likelihood of the alternative hypothesis that beta does not equal 0 (H0 is that beta = 0)
    * l_remle = restricted maximum likelihood estimate for lambda
    * l_mle = maximum likelihood estimate for lambda
    * p_wald = p value from the Wald test
    * p_lrt = p value from the likelihood ratio test
    * p_score = p value from the score test

    The GWAS output sets are:

    * allSNPs.PlinkFiltering_Asin.c: includes all SNPs after filtering with Plink. Phenotypes were arcsine transformed before GWAS. Centered relatedness matrix used.
    * allSNPs.PlinkFiltering_Binary.c: includes all SNPs after filtering with Plink. Phenotypes were transformed to a binary trait before GWAS (no short stamen loss = 0, any short stamen loss = 1). Centered relatedness matrix used.
    * allSNPs.PlinkFiltering_raw.c: includes all SNPs after filtering with Plink. Phenotypes were not transformed before GWAS. Centered relatedness matrix used.
    * allSNPs.PlinkFiltering_raw_subset.c: includes all SNPs after filtering with Plink. Phenotypes were not transformed before GWAS, but the individuals used were subset down to only those that had some short stamen loss (mean short stamen number < 2). Centered relatedness matrix used.
    * NoCent.PlinkFiltering_Asin.c: centromere excluded. Plink filtering as before. Arcsine transformed phenotypes. Centered relatedness matrix.
    * NoCent.PlinkFiltering_Binary.c: centromere excluded. Plink filtering as before. Phenotypes converted to a binary trait. Centered relatedness matrix.
    * NoCent.PlinkFiltering_raw.c: centromere excluded. Plink filtering as before. Phenotypes not transformed. Centered relatedness matrix.
    * NoCent.PlinkFiltering_raw_subset.c: centromere excluded. Plink filtering as before. Individuals subset to only those that had some short stamen loss. Centered relatedness matrix.

    ## Code/software

    We used GEMMA v0.98.4 to create the files.

    ## Access information

    Other publicly accessible locations of the data:

    * [https://github.com/sfbuysse/A_thaliana_StamenLoss_2025](https://github.com/sfbuysse/A_thaliana_StamenLoss_2025): scripts and information for creation of input files and use of output files after generation.
    * Genotypic data used is submitted to NCBI SRA as accession PRJNA1246133.
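As a rough illustration of how the .assoc.txt column layout described above might be used, the following Python sketch reads an association table and lists the SNPs whose Wald p value falls below a cutoff. It assumes the file is tab-delimited with a header row that uses the column names listed above. The function name `top_hits`, the 1e-5 cutoff, and the three sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io


def top_hits(assoc_file, threshold=1e-5):
    """Return (chr, ps, p_wald) tuples with p_wald < threshold, smallest p first.

    assoc_file is any open text file or file-like object. The tab delimiter
    and the header row are assumptions about the .assoc.txt layout.
    """
    reader = csv.DictReader(assoc_file, delimiter="\t")
    hits = []
    for row in reader:
        p_wald = float(row["p_wald"])
        if p_wald < threshold:
            hits.append((row["chr"], int(row["ps"]), p_wald))
    # Sort so that the strongest association comes first.
    return sorted(hits, key=lambda hit: hit[2])


# Made-up rows for illustration only; these are not values from the dataset.
example = io.StringIO(
    "chr\trs\tps\tp_wald\n"
    "1\t1:1000\t1000\t2e-6\n"
    "2\t2:2500\t2500\t0.3\n"
    "4\t4:5000\t5000\t3e-8\n"
)
example_hits = top_hits(example)
for chrom, pos, p in example_hits:
    print(chrom, pos, p)
```

With a real download, the same function could be called on an open handle to one of the *.assoc.txt files. Reading columns by header name rather than by position keeps the sketch independent of the exact column order.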
  2. Tomato (Solanum lycopersicum L.) is a widely used model plant species for dissecting the genomic bases of complex traits, providing an optimal platform for modern “-omics” studies and genome-guided breeding. Genome-wide association studies (GWAS) have become a preferred approach for screening large diverse populations and many traits. Here, we present GWAS analysis of a collection of 115 landraces and 11 vintage and modern cultivars. A total of 26 conventional descriptors, 40 traits obtained by digital phenotyping, the fruit content of six carotenoids recorded at the early ripening (breaker) and red-ripe stages, and 21 climate-related variables were analyzed in the context of genetic diversity monitored in the 126 accessions. The data obtained from thorough phenotyping and the SNP diversity revealed by sequencing of ripe fruit transcripts of 120 of the tomato accessions were jointly analyzed to determine which genomic regions are implicated in the expressed phenotypic variation. This study reveals that the use of fruit RNA-Seq SNP diversity is effective not only for identification of genomic regions that underlie variation in fruit traits, but also of variation related to additional plant traits and adaptive responses to climate variation. These results allowed validation of our approach because different marker-trait associations mapped to chromosomal regions where other candidate genes for the same traits were previously reported. In addition, previously uncharacterized chromosomal regions were targeted as potentially involved in the expression of variable phenotypes, thus demonstrating that our tomato collection is a precious reservoir of diversity and an excellent tool for gene discovery.
  3. Mapping the genetic basis of complex traits is critical to uncovering the biological mechanisms that underlie disease and other phenotypes. Genome-wide association studies (GWAS) in humans and quantitative trait locus (QTL) mapping in model organisms can now explain much of the observed heritability in many traits, allowing us to predict phenotype from genotype. However, constraints on power due to statistical confounders in large GWAS and smaller sample sizes in QTL studies still limit our ability to resolve numerous small-effect variants, map them to causal genes, identify pleiotropic effects across multiple traits, and infer non-additive interactions between loci (epistasis). Here, we introduce barcoded bulk quantitative trait locus (BB-QTL) mapping, which allows us to construct, genotype, and phenotype 100,000 offspring of a budding yeast cross, two orders of magnitude larger than the previous state of the art. We use this panel to map the genetic basis of eighteen complex traits, finding that the genetic architecture of these traits involves hundreds of small-effect loci densely spaced throughout the genome, many with widespread pleiotropic effects across multiple traits. Epistasis plays a central role, with thousands of interactions that provide insight into genetic networks. By dramatically increasing sample size, BB-QTL mapping demonstrates the potential of natural variants in high-powered QTL studies to reveal the highly polygenic, pleiotropic, and epistatic architecture of complex traits. 
  4. Abstract: Genome-wide association studies (GWAS) can identify genetic variants responsible for naturally occurring and quantitative phenotypic variation. Association studies therefore provide a powerful complement to approaches that rely on de novo mutations for characterizing gene function. Although bacteria should be amenable to GWAS, few GWAS have been conducted on bacteria, and the extent to which nonindependence among genomic variants (e.g., linkage disequilibrium [LD]) and the genetic architecture of phenotypic traits will affect GWAS performance is unclear. We apply association analyses to identify candidate genes underlying variation in 20 biochemical, growth, and symbiotic phenotypes among 153 strains of Ensifer meliloti. For 11 traits, we find genotype-phenotype associations that are stronger than expected by chance, with the candidates in relatively small linkage groups, indicating that LD does not preclude resolving association candidates to relatively small genomic regions. The significant candidates show an enrichment for nucleotide polymorphisms (SNPs) over gene presence-absence variation (PAV), and for five traits, candidates are enriched in large linkage groups, a possible signature of epistasis. Many of the variants most strongly associated with symbiosis phenotypes were in genes previously identified as being involved in nitrogen fixation or nodulation. For other traits, apparently strong associations were not stronger than the range of associations detected in permuted data. In sum, our data show that GWAS in bacteria may be a powerful tool for characterizing genetic architecture and identifying genes responsible for phenotypic variation. However, careful evaluation of candidates is necessary to avoid false signals of association.

    Importance: Genome-wide association analyses are a powerful approach for identifying gene function. These analyses are becoming commonplace in studies of humans, domesticated animals, and crop plants but have rarely been conducted in bacteria. We applied association analyses to 20 traits measured in Ensifer meliloti, an agriculturally and ecologically important bacterium because it fixes nitrogen when in symbiosis with leguminous plants. We identified candidate alleles and gene presence-absence variants underlying variation in symbiosis traits, antibiotic resistance, and use of various carbon sources; some of these candidates are in genes previously known to affect these traits whereas others were in genes that have not been well characterized. Our results point to the potential power of association analyses in bacteria, but also to the need to carefully evaluate the potential for false associations.
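The comparison against permuted data mentioned in this abstract can be illustrated with a generic permutation test. The sketch below is not the authors' procedure: it assumes a single biallelic variant coded 0/1, uses the absolute difference in group means as the association statistic, and the function names, the 1,000 permutations, and the toy data are all hypothetical.

```python
import random


def abs_mean_diff(genotypes, phenotypes):
    """Absolute difference in mean phenotype between carriers (1) and non-carriers (0)."""
    carriers = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    others = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Share of phenotype shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs_mean_diff(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling phenotypes breaks any real genotype-phenotype link.
        rng.shuffle(shuffled)
        if abs_mean_diff(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Toy data (made up): carriers have clearly higher phenotype values.
toy_genotypes = [0] * 10 + [1] * 10
toy_phenotypes = list(range(1, 11)) + list(range(51, 61))
p_strong = permutation_p_value(toy_genotypes, toy_phenotypes)
print(p_strong)
```

An association whose statistic sits inside the range produced by the shuffles would get a large p value under this scheme, which is the sense in which an apparently strong signal can fail to exceed what permuted data produce.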
  5. Abstract: Genome-wide association studies (GWAS) typically use linear or logistic regression models to identify associations between phenotypes (traits) and genotypes (genetic variants) of interest. However, the use of regression with the additive assumption has potential limitations. First, the normality assumption of residuals is rarely satisfied in practice, and deviation from normality increases the Type-I error rate. Second, building a model based on such an assumption ignores genetic structures such as dominant, recessive, and protective-risk cases. Ignoring genetic variants may result in spurious conclusions about the associations between a variant and a trait. We propose an assumption-free model built upon data-consistent inversion (DCI), which is a recently developed measure-theoretic framework utilized for uncertainty quantification. This proposed DCI-derived model builds a nonparametric distribution on model inputs that propagates to the distribution of observed data without the required normality assumption of residuals in the regression model. This characteristic enables the proposed DCI-derived model to cover all genetic variants without emphasizing the additivity of the classic-GWAS model. Simulations and a replication GWAS with data from the COPDGene study demonstrate the ability of this model to control the Type-I error rate at least as well as the classic-GWAS (additive linear model) approach while having similar or greater power to discover variants in different genetic modes of transmission.
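To make the additive, dominant, and recessive cases named in this abstract concrete, the sketch below recodes minor-allele counts under each model and compares how well a simple linear fit explains a toy phenotype. It illustrates only the classic regression setting the abstract criticizes, not the DCI-derived model. The function names and the six-individual dataset are hypothetical.

```python
def encode(minor_allele_counts, mode):
    """Recode genotypes given as minor-allele counts (0, 1, 2) under a genetic model."""
    if mode == "additive":
        return [float(g) for g in minor_allele_counts]
    if mode == "dominant":
        # One copy of the minor allele is enough to show the effect.
        return [1.0 if g >= 1 else 0.0 for g in minor_allele_counts]
    if mode == "recessive":
        # Two copies of the minor allele are needed to show the effect.
        return [1.0 if g == 2 else 0.0 for g in minor_allele_counts]
    raise ValueError(f"unknown mode: {mode}")


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Toy data (made up): the phenotype shifts only in individuals with two minor alleles.
genotypes = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0]

fit = {
    mode: r_squared(encode(genotypes, mode), phenotype)
    for mode in ("additive", "dominant", "recessive")
}
for mode, value in fit.items():
    print(mode, round(value, 3))
```

In this toy case the recessive coding explains all of the phenotypic variance while the additive coding explains only part of it, which is the kind of mismatch an additive-only scan can run into.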