skip to main content


Title: The lingering effects of Neanderthal introgression on human complex traits
The genetic variants introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute an unexpected extent to complex human traits. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed variants. We developed rigorous methods to assess the contribution of introgressed Neanderthal variants to heritable trait variation and applied these methods to analyze 235,592 introgressed Neanderthal variants and 96 distinct phenotypes measured in about 300,000 unrelated white British individuals in the UK Biobank. Introgressed Neanderthal variants make a significant contribution to trait variation (explaining 0.12% of trait variation on average). However, the contribution of introgressed variants tends to be significantly depleted relative to modern human variants matched for allele frequency and linkage disequilibrium (about 59% depletion on average), consistent with purifying selection on introgressed variants. Different from previous studies (McArthur et al., 2021), we find no evidence for elevated heritability across the phenotypes examined. We identified 348 independent significant associations of introgressed Neanderthal variants with 64 phenotypes. Previous work (Skov et al., 2020) has suggested that a majority of such associations are likely driven by statistical association with nearby modern human variants that are the true causal variants. Applying a customized fine-mapping led us to identify 112 regions across 47 phenotypes containing 4303 unique genetic variants where introgressed variants are highly likely to have a phenotypic effect. Examination of these variants reveals their substantial impact on genes that are important for the immune system, development, and metabolism.  more » « less
Award ID(s):
1943497
NSF-PAR ID:
10423504
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
eLife
Volume:
12
ISSN:
2050-084X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Introduction

    Autoimmune disorders (ADs) are a group of about 80 disorders that occur when self-attacking autoantibodies are produced due to failure in the self-tolerance mechanisms. ADs are polygenic disorders and associations with genes both in the human leukocyte antigen (HLA) region and outside of it have been described. Previous studies have shown that they are highly comorbid with shared genetic risk factors, while epidemiological studies revealed associations between various lifestyle and health-related phenotypes and ADs.

    Methods

    Here, for the first time, we performed a comparative polygenic risk score (PRS) - Phenome Wide Association Study (PheWAS) for 11 different ADs (Juvenile Idiopathic Arthritis, Primary Sclerosing Cholangitis, Celiac Disease, Multiple Sclerosis, Rheumatoid Arthritis, Psoriasis, Myasthenia Gravis, Type 1 Diabetes, Systemic Lupus Erythematosus, Vitiligo Late Onset, Vitiligo Early Onset) and 3,254 phenotypes available in the UK Biobank that include a wide range of socio-demographic, lifestyle and health-related outcomes. Additionally, we investigated the genetic relationships of the studied ADs, calculating their genetic correlation and conducting cross-disorder GWAS meta-analyses for the observed AD clusters.

    Results

    In total, we identified 508 phenotypes significantly associated with at least one AD PRS. 272 phenotypes were significantly associated after excluding variants in the HLA region from the PRS estimation. Through genetic correlation and genetic factor analyses, we identified four genetic factors that run across studied ADs. Cross-trait meta-analyses within each factor revealed pleiotropic genome-wide significant loci.

    Discussion

    Overall, our study confirms the association of different factors with genetic susceptibility for ADs and reveals novel observations that need to be further explored.

     
    more » « less
  2. Falush, Daniel (Ed.)
    Abstract Although some variation introgressed from Neanderthals has undergone selective sweeps, little is known about its functional significance. We used a Massively Parallel Reporter Assay (MPRA) to assay 5,353 high-frequency introgressed variants for their ability to modulate the gene expression within 170 bp of endogenous sequence. We identified 2,548 variants in active putative cis-regulatory elements (CREs) and 292 expression-modulating variants (emVars). These emVars are predicted to alter the binding motifs of important immune transcription factors, are enriched for associations with neutrophil and white blood cell count, and are associated with the expression of genes that function in innate immune pathways including inflammatory response and antiviral defense. We combined the MPRA data with other data sets to identify strong candidates to be driver variants of positive selection including an emVar that may contribute to protection against severe COVID-19 response. We endogenously deleted two CREs containing expression-modulation variants linked to immune function, rs11624425 and rs80317430, identifying their primary genic targets as ELMSAN1, and PAN2 and STAT2, respectively, three genes differentially expressed during influenza infection. Overall, we present the first database of experimentally identified expression-modulating Neanderthal-introgressed alleles contributing to potential immune response in modern humans. 
    more » « less
  3. Abstract

    Classical genetic studies have identified many cases of pleiotropy where mutations in individual genes alter many different phenotypes. Quantitative genetic studies of natural genetic variants frequently examine one or a few traits, limiting their potential to identify pleiotropic effects of natural genetic variants. Widely adopted community association panels have been employed by plant genetics communities to study the genetic basis of naturally occurring phenotypic variation in a wide range of traits. High-density genetic marker data—18M markers—from 2 partially overlapping maize association panels comprising 1,014 unique genotypes grown in field trials across at least 7 US states and scored for 162 distinct trait data sets enabled the identification of of 2,154 suggestive marker-trait associations and 697 confident associations in the maize genome using a resampling-based genome-wide association strategy. The precision of individual marker-trait associations was estimated to be 3 genes based on a reference set of genes with known phenotypes. Examples were observed of both genetic loci associated with variation in diverse traits (e.g., above-ground and below-ground traits), as well as individual loci associated with the same or similar traits across diverse environments. Many significant signals are located near genes whose functions were previously entirely unknown or estimated purely via functional data on homologs. This study demonstrates the potential of mining community association panel data using new higher-density genetic marker sets combined with resampling-based genome-wide association tests to develop testable hypotheses about gene functions, identify potential pleiotropic effects of natural genetic variants, and study genotype-by-environment interaction.

     
    more » « less
  4. Obeid, Iyad ; Selesnick, Ivan ; Picone, Joseph (Ed.)
    The Temple University Hospital Seizure Detection Corpus (TUSZ) [1] has been in distribution since April 2017. It is a subset of the TUH EEG Corpus (TUEG) [2] and the most frequently requested corpus from our 3,000+ subscribers. It was recently featured as the challenge task in the Neureka 2020 Epilepsy Challenge [3]. A summary of the development of the corpus is shown below in Table 1. The TUSZ Corpus is a fully annotated corpus, which means every seizure event that occurs within its files has been annotated. The data is selected from TUEG using a screening process that identifies files most likely to contain seizures [1]. Approximately 7% of the TUEG data contains a seizure event, so it is important we triage TUEG for high yield data. One hour of EEG data requires approximately one hour of human labor to complete annotation using the pipeline described below, so it is important from a financial standpoint that we accurately triage data. A summary of the labels being used to annotate the data is shown in Table 2. Certain standards are put into place to optimize the annotation process while not sacrificing consistency. Due to the nature of EEG recordings, some records start off with a segment of calibration. This portion of the EEG is instantly recognizable and transitions from what resembles lead artifact to a flat line on all the channels. For the sake of seizure annotation, the calibration is ignored, and no time is wasted on it. During the identification of seizure events, a hard “3 second rule” is used to determine whether two events should be combined into a single larger event. This greatly reduces the time that it takes to annotate a file with multiple events occurring in succession. In addition to the required minimum 3 second gap between seizures, part of our standard dictates that no seizure less than 3 seconds be annotated. Although there is no universally accepted definition for how long a seizure must be, we find that it is difficult to discern with confidence between burst suppression or other morphologically similar impressions when the event is only a couple seconds long. This is due to several reasons, the most notable being the lack of evolution which is oftentimes crucial for the determination of a seizure. After the EEG files have been triaged, a team of annotators at NEDC is provided with the files to begin data annotation. An example of an annotation is shown in Figure 1. A summary of the workflow for our annotation process is shown in Figure 2. Several passes are performed over the data to ensure the annotations are accurate. Each file undergoes three passes to ensure that no seizures were missed or misidentified. The first pass of TUSZ involves identifying which files contain seizures and annotating them using our annotation tool. The time it takes to fully annotate a file can vary drastically depending on the specific characteristics of each file; however, on average a file containing multiple seizures takes 7 minutes to fully annotate. This includes the time that it takes to read the patient report as well as traverse through the entire file. Once an event has been identified, the start and stop time for the seizure is stored in our annotation tool. This is done on a channel by channel basis resulting in an accurate representation of the seizure spreading across different parts of the brain. Files that do not contain any seizures take approximately 3 minutes to complete. Even though there is no annotation being made, the file is still carefully examined to make sure that nothing was overlooked. In addition to solely scrolling through a file from start to finish, a file is often examined through different lenses. Depending on the situation, low pass filters are used, as well as increasing the amplitude of certain channels. These techniques are never used in isolation and are meant to further increase our confidence that nothing was missed. Once each file in a given set has been looked at once, the annotators start the review process. The reviewer checks a file and comments any changes that they recommend. This takes about 3 minutes per seizure containing file, which is significantly less time than the first pass. After each file has been commented on, the third pass commences. This step takes about 5 minutes per seizure file and requires the reviewer to accept or reject the changes that the second reviewer suggested. Since tangible changes are made to the annotation using the annotation tool, this step takes a bit longer than the previous one. Assuming 18% of the files contain seizures, a set of 1,000 files takes roughly 127 work hours to annotate. Before an annotator contributes to the data interpretation pipeline, they are trained for several weeks on previous datasets. A new annotator is able to be trained using data that resembles what they would see under normal circumstances. An additional benefit of using released data to train is that it serves as a means of constantly checking our work. If a trainee stumbles across an event that was not previously annotated, it is promptly added, and the data release is updated. It takes about three months to train an annotator to a point where their annotations can be trusted. Even though we carefully screen potential annotators during the hiring process, only about 25% of the annotators we hire survive more than one year doing this work. To ensure that the annotators are consistent in their annotations, the team conducts an interrater agreement evaluation periodically to ensure that there is a consensus within the team. The annotation standards are discussed in Ochal et al. [4]. An extended discussion of interrater agreement can be found in Shah et al. [5]. The most recent release of TUSZ, v1.5.2, represents our efforts to review the quality of the annotations for two upcoming challenges we hosted: an internal deep learning challenge at IBM [6] and the Neureka 2020 Epilepsy Challenge [3]. One of the biggest changes that was made to the annotations was the imposition of a stricter standard for determining the start and stop time of a seizure. Although evolution is still included in the annotations, the start times were altered to start when the spike-wave pattern becomes distinct as opposed to merely when the signal starts to shift from background. This cuts down on background that was mislabeled as a seizure. For seizure end times, all post ictal slowing that was included was removed. The recent release of v1.5.2 did not include any additional data files. Two EEG files had been added because, originally, they were corrupted in v1.5.1 but were able to be retrieved and added for the latest release. The progression from v1.5.0 to v1.5.1 and later to v1.5.2, included the re-annotation of all of the EEG files in order to develop a confident dataset regarding seizure identification. Starting with v1.4.0, we have also developed a blind evaluation set that is withheld for use in competitions. The annotation team is currently working on the next release for TUSZ, v1.6.0, which is expected to occur in August 2020. It will include new data from 2016 to mid-2019. This release will contain 2,296 files from 2016 as well as several thousand files representing the remaining data through mid-2019. In addition to files that were obtained with our standard triaging process, a part of this release consists of EEG files that do not have associated patient reports. Since actual seizure events are in short supply, we are mining a large chunk of data for which we have EEG recordings but no reports. Some of this data contains interesting seizure events collected during long-term EEG sessions or data collected from patients with a history of frequent seizures. It is being mined to increase the number of files in the corpus that have at least one seizure event. We expect v1.6.0 to be released before IEEE SPMB 2020. The TUAR Corpus is an open-source database that is currently available for use by any registered member of our consortium. To register and receive access, please follow the instructions provided at this web page: https://www.isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml. The data is located here: https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg_artifact/v2.0.0/. 
    more » « less
  5. Abstract

    The reuse of old genetic variation can promote rapid diversification in evolutionary radiations, but in most cases, the historical events underlying this divergence are not known. For example, ancient hybridization can generate new combinations of alleles that sort into descendant lineages, potentially providing the raw material to initiate divergence. In the Mimulus aurantiacus species complex, there is evidence for widespread gene flow among members of this radiation. In addition, allelic variation in the MaMyb2 gene is responsible for differences in flower color between the closely related ecotypes of subspecies puniceus, contributing to reproductive isolation by pollinators. Previous work suggested that MaMyb2 was introgressed into the red-flowered ecotype of puniceus. However, additional taxa within the radiation have independently evolved red flowers from their yellow-flowered ancestors, raising the possibility that this introgression had a more ancient origin. In this study, we used repeated tests of admixture from whole-genome sequence data across this diverse radiation to demonstrate that there has been both ancient and recurrent hybridization in this group. However, most of the signal of this ancient introgression has been removed due to selection, suggesting that widespread barriers to gene flow are in place between taxa. Yet, a roughly 30 kb region that contains the MaMyb2 gene is currently shared only among the red-flowered taxa. Patterns of admixture, sequence divergence, and extended haplotype homozygosity across this region confirm a history of ancient hybridization, where functional variants have been preserved due to positive selection in red-flowered taxa but lost in their yellow-flowered counterparts. The results of this study reveal that selection against gene flow can reduce genomic signatures of ancient hybridization, but that historical introgression can provide essential genetic variation that facilitates the repeated evolution of phenotypic traits between lineages.

     
    more » « less