skip to main content

Title: Detection of haplotype-dependent allele-specific DNA methylation in WGBS data
Abstract In heterozygous genomes, allele-specific measurements can reveal biologically significant differences in DNA methylation between homologous alleles associated with local changes in genetic sequence. Current approaches for detecting such events from whole-genome bisulfite sequencing (WGBS) data perform statistically independent marginal analysis at individual cytosine-phosphate-guanine (CpG) sites, thus ignoring correlations in the methylation state, or carry-out a joint statistical analysis of methylation patterns at four CpG sites producing unreliable statistical evidence. Here, we employ the one-dimensional Ising model of statistical physics and develop a method for detecting allele-specific methylation (ASM) events within segments of DNA containing clusters of linked single-nucleotide polymorphisms (SNPs), called haplotypes. Comparisons with existing approaches using simulated and real WGBS data show that our method provides an improved fit to data, especially when considering large haplotypes. Importantly, the method employs robust hypothesis testing for detecting statistically significant imbalances in mean methylation level and methylation entropy, as well as for identifying haplotypes for which the genetic variant carries significant information about the methylation state. As such, our ASM analysis approach can potentially lead to biological discoveries with important implications for the genetics of complex human diseases.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Nature Communications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat– Alu ; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx. 
    more » « less
  2. INTRODUCTION To faithfully distribute genetic material to daughter cells during cell division, spindle fibers must couple to DNA by means of a structure called the kinetochore, which assembles at each chromosome’s centromere. Human centromeres are located within large arrays of tandemly repeated DNA sequences known as alpha satellite (αSat), which often span millions of base pairs on each chromosome. Arrays of αSat are frequently surrounded by other types of tandem satellite repeats, which have poorly understood functions, along with nonrepetitive sequences, including transcribed genes. Previous genome sequencing efforts have been unable to generate complete assemblies of satellite-rich regions because of their scale and repetitive nature, limiting the ability to study their organization, variation, and function. RATIONALE Pericentromeric and centromeric (peri/centromeric) satellite DNA sequences have remained almost entirely missing from the assembled human reference genome for the past 20 years. Using a complete, telomere-to-telomere (T2T) assembly of a human genome, we developed and deployed tailored computational approaches to reveal the organization and evolutionary patterns of these satellite arrays at both large and small length scales. We also performed experiments to map precisely which αSat repeats interact with kinetochore proteins. Last, we compared peri/centromeric regions among multiple individuals to understand how these sequences vary across diverse genetic backgrounds. RESULTS Satellite repeats constitute 6.2% of the T2T-CHM13 genome assembly, with αSat representing the single largest component (2.8% of the genome). By studying the sequence relationships of αSat repeats in detail across each centromere, we found genome-wide evidence that human centromeres evolve through “layered expansions.” Specifically, distinct repetitive variants arise within each centromeric region and expand through mechanisms that resemble successive tandem duplications, whereas older flanking sequences shrink and diverge over time. We also revealed that the most recently expanded repeats within each αSat array are more likely to interact with the inner kinetochore protein Centromere Protein A (CENP-A), which coincides with regions of reduced CpG methylation. This suggests a strong relationship between local satellite repeat expansion, kinetochore positioning, and DNA hypomethylation. Furthermore, we uncovered large and unexpected structural rearrangements that affect multiple satellite repeat types, including active centromeric αSat arrays. Last, by comparing sequence information from nearly 1600 individuals’ X chromosomes, we observed that individuals with recent African ancestry possess the greatest genetic diversity in the region surrounding the centromere, which sometimes contains a predominantly African αSat sequence variant. CONCLUSION The genetic and epigenetic properties of centromeres are closely interwoven through evolution. These findings raise important questions about the specific molecular mechanisms responsible for the relationship between inner kinetochore proteins, DNA hypomethylation, and layered αSat expansions. Even more questions remain about the function and evolution of non-αSat repeats. To begin answering these questions, we have produced a comprehensive encyclopedia of peri/centromeric sequences in a human genome, and we demonstrated how these regions can be studied with modern genomic tools. Our work also illuminates the rich genetic variation hidden within these formerly missing regions of the genome, which may contribute to health and disease. This unexplored variation underlines the need for more T2T human genome assemblies from genetically diverse individuals. Gapless assemblies illuminate centromere evolution. ( Top ) The organization of peri/centromeric satellite repeats. ( Bottom left ) A schematic portraying (i) evidence for centromere evolution through layered expansions and (ii) the localization of inner-kinetochore proteins in the youngest, most recently expanded repeats, which coincide with a region of DNA hypomethylation. ( Bottom right ) An illustration of the global distribution of chrX centromere haplotypes, showing increased diversity in populations with recent African ancestry. 
    more » « less
  3. Low socioeconomic status (SES) and living in a disadvantaged neighborhood are associated with poor cardiovascular health. Multiple lines of evidence have linked DNA methylation to both cardiovascular risk factors and social disadvantage indicators. However, limited research has investigated the role of DNA methylation in mediating the associations of individual- and neighborhood-level disadvantage with multiple cardiovascular risk factors in large, multi-ethnic, population-based cohorts. We examined whether disadvantage at the individual level (childhood and adult SES) and neighborhood level (summary neighborhood SES as assessed by Census data and social environment as assessed by perceptions of aesthetic quality, safety, and social cohesion) were associated with 11 cardiovascular risk factors including measures of obesity, diabetes, lipids, and hypertension in 1,154 participants from the Multi-Ethnic Study of Atherosclerosis (MESA). For significant associations, we conducted epigenome-wide mediation analysis to identify methylation sites mediating the relationship between individual/neighborhood disadvantage and cardiovascular risk factors using the JT-Comp method that assesses sparse mediation effects under a composite null hypothesis. In models adjusting for age, sex, race/ethnicity, smoking, medication use, and genetic principal components of ancestry, epigenetic mediation was detected for the associations of adult SES with body mass index (BMI), insulin, and high-density lipoprotein cholesterol (HDL-C), as well as for the association between neighborhood socioeconomic disadvantage and HDL-C at FDRq< 0.05. The 410 CpG mediators identified for the SES-BMI association were enriched for CpGs associated with gene expression (expression quantitative trait methylation loci, or eQTMs), and corresponding genes were enriched in antigen processing and presentation pathways. For cardiovascular risk factors other than BMI, most of the epigenetic mediators lost significance after controlling for BMI. However, 43 methylation sites showed evidence of mediating the neighborhood socioeconomic disadvantage and HDL-C association after BMI adjustment. The identified mediators were enriched for eQTMs, and corresponding genes were enriched in inflammatory and apoptotic pathways. Our findings support the hypothesis that DNA methylation acts as a mediator between individual- and neighborhood-level disadvantage and cardiovascular risk factors, and shed light on the potential underlying epigenetic pathways. Future studies are needed to fully elucidate the biological mechanisms that link social disadvantage to poor cardiovascular health.

    more » « less
  4. Early experiences can have enduring impacts on brain and behavior, but the strength of these effects can be influenced by genetic variation. In principle, polymorphic CpGs (polyCpGs) may contribute to gene‐by‐environment interactions (G × E) by altering DNA methylation. In this study, we investigate the influence of polyCpGs on the development of vasopressin receptor 1a abundance in the retrosplenial cortex (RSC‐V1aR) of prairie voles (Microtus ochrogaster). Two alternative alleles (‘HI’/‘LO’) predict RSCavpr1aexpression, V1aR abundance and sexual fidelity in adulthood; these alleles differ in the frequency of CpG sites and in methylation at a putative intron enhancer. We hypothesized that the elevated CpG abundance in the LO allele would make homozygous LO/LO voles more sensitive to developmental perturbations. We found that genotype differences in RSC‐V1aR abundance emerged early in ontogeny and were accompanied by differences in methylation of the putative enhancer. As predicted, postnatal treatment with an oxytocin receptor antagonist (OTA) reduced RSC‐V1aR abundance in LO/LO adults but not their HI/HI siblings. Similarly, methylation inhibition by zebularine increased RSC‐V1aR in LO/LO adults, but not in HI/HI siblings. These data show a gene‐by‐environment interaction in RSC‐V1aR. Surprisingly, however, neither OTA nor zebularine altered adult methylation of the intronic enhancer, suggesting that differences in sensitivity could not be explained by CpG density at the enhancer alone. Methylated DNA immunoprecipiation‐sequencing showed additional differentially methylated regions between HI/HI and LO/LO voles. Future research should examine the role of these regions and other regulatory elements in the ontogeny of RSC‐V1aR and its developmentally induced changes.

    more » « less
  5. Background: Though the development of targeted cancer drugs continues to accelerate, doctors still lack reliable methods for predicting patient response to standard-of-care therapies for most cancers. DNA methylation has been implicated in tumor drug response and is a promising source of predictive biomarkers of drug efficacy, yet the relationship between drug efficacy and DNA methylation remains largely unexplored. Method: In this analysis, we performed log-rank survival analyses on patients grouped by cancer and drug exposure to find CpG sites where binary methylation status is associated with differential survival in patients treated with a specific drug but not in patients with the same cancer who were not exposed to that drug. We also clustered these drug-specific CpG sites based on co-methylation among patients to identify broader methylation patterns that may be related to drug efficacy, which we investigated for transcription factor binding site enrichment using gene set enrichment analysis. Results: We identified CpG sites that were drug-specific predictors of survival in 38 cancer-drug patient groups across 15 cancers and 20 drugs. These included 11 CpG sites with similar drug-specific survival effects in multiple cancers. We also identified 76 clusters of CpG sites with stronger associations with patient drug response, many of which contained CpG sites in gene promoters containing transcription factor binding sites. Conclusion: These findings are promising biomarkers of drug response for a variety of drugs and contribute to our understanding of drug-methylation interactions in cancer. Investigation and validation of these results could lead to the development of targeted co-therapies aimed at manipulating methylation in order to improve efficacy of commonly used therapies and could improve patient survival and quality of life by furthering the effort toward drug response prediction. 
    more » « less