Title: Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks
Core Ideas: Cross‐species models of chromatin state from sequence are comparable or superior to within‐species models. Model performance is highest on accessible regions open in many tissues. Transcription factor motifs can be ranked by importance to each species and chromatin state.
Journal Name:
The Plant Genome
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Wittkopp, Patricia (Ed.)
    Abstract In Drosophila melanogaster and D. simulans head tissue, 60% of orthologous genes show evidence of sex-biased expression in at least one species. Of these, ∼39% (2,192) are conserved in direction. We hypothesize enrichment of open chromatin in the sex where we see expression bias and closed chromatin in the opposite sex. Male-biased orthologs are significantly enriched for H3K4me3 marks in males of both species (∼89% of male-biased orthologs vs. ∼76% of unbiased orthologs). Similarly, female-biased orthologs are significantly enriched for H3K4me3 marks in females of both species (∼90% of female-biased orthologs vs. ∼73% of unbiased orthologs). The sex-bias ratio in female-biased orthologs was similar in magnitude between the two species, regardless of the closed chromatin (H3K27me2me3) marks in males. However, in male-biased orthologs, the presence of H3K27me2me3 in both species significantly reduced the correlation between D. melanogaster sex-bias ratio and the D. simulans sex-bias ratio. Male-biased orthologs are enriched for evidence of positive selection in the D. melanogaster group. There are more male-biased genes than female-biased genes in both species. For orthologs with gains/losses of sex-bias between the two species, there is an excess of male-bias compared to female-bias, but there is no consistent pattern in the relationship between H3K4me3 or H3K27me2me3 chromatin marks and expression. These data suggest chromatin state is a component of the maintenance of sex-biased expression and divergence of sex-bias between species is reflected in the complexity of the chromatin status. 
  2. Changes in developmental gene regulatory networks (dGRNs) underlie much of the diversity of life, but the evolutionary mechanisms that operate on interactions with these networks remain poorly understood. Closely related species with extreme phenotypic divergence provide a valuable window into the genetic and molecular basis for changes in dGRNs and their relationship to adaptive changes in organismal traits. Here we analyze genomes, epigenomes, and transcriptomes during early development in two sea urchin species in the genus Heliocidaris that exhibit highly divergent life histories and in an outgroup species. Signatures of positive selection and changes in chromatin status within putative gene regulatory elements are both enriched on the branch leading to the derived life history, and particularly so near core dGRN genes; in contrast, positive selection within protein-coding regions has at most a modest enrichment in branch and function. Single-cell transcriptomes reveal a dramatic delay in cell fate specification in the derived state, which also has far fewer open chromatin regions, especially near dGRN genes with conserved roles in cell fate specification. Experimentally perturbing the function of three key transcription factors reveals profound evolutionary changes in the earliest events that pattern the embryo, disrupting regulatory interactions previously conserved for ~225 million years. Together, these results demonstrate that natural selection can rapidly reshape developmental gene expression on a broad scale when selective regimes abruptly change and that even highly conserved dGRNs and patterning mechanisms in the early embryo remain evolvable under appropriate ecological circumstances.
  3. Although genome-sequence assemblies are available for a growing number of plant species, gene-expression responses to stimuli have been cataloged for only a subset of these species. Many genes show altered transcription patterns in response to abiotic stresses. However, orthologous genes in related species often exhibit different responses to a given stress. Accordingly, data on the regulation of gene expression in one species are not reliable predictors of orthologous gene responses in a related species. Here, we trained a supervised classification model to identify genes that transcriptionally respond to cold stress. A model trained with only features calculated directly from genome assemblies exhibited only modest decreases in performance relative to models trained by using genomic, chromatin, and evolution/diversity features. Models trained with data from one species successfully predicted which genes would respond to cold stress in other related species. Cross-species predictions remained accurate when training was performed in cold-sensitive species and predictions were performed in cold-tolerant species and vice versa. Models trained with data on gene expression in multiple species provided at least equivalent performance to models trained and tested in a single species and outperformed single-species models in cross-species prediction. These results suggest that classifiers trained on stress data from well-studied species may suffice for predicting gene-expression patterns in related, less-studied species with sequenced genomes. 
  4. Abstract Motivation

    Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one can often not explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called totally interpretable sequence-to-function model (tiSFM). tiSFM improves upon the performance of standard multilayer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multilayer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.

    Results


    We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM’s model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition.

    Availability and implementation

    The source code, including scripts for the analysis of key findings, can be found at, implemented in Python.

  5. Abstract Motivation

    High throughput chromosome conformation capture (Hi-C) contact matrices are used to predict 3D chromatin structures in eukaryotic cells. High-resolution Hi-C data are less available than low-resolution Hi-C data due to sequencing costs but provide greater insight into the intricate details of 3D chromatin structures such as enhancer–promoter interactions and sub-domains. To provide a cost-effective solution to high-resolution Hi-C data collection, deep learning models are used to predict high-resolution Hi-C matrices from existing low-resolution matrices across multiple cell types.

    Results


    Here, we present two Cascading Residual Networks called HiCARN-1 and HiCARN-2, a convolutional neural network and a generative adversarial network, that use a novel framework of cascading connections throughout the network for Hi-C contact matrix prediction from low-resolution data. Shown by image evaluation and Hi-C reproducibility metrics, both HiCARN models, overall, outperform state-of-the-art Hi-C resolution enhancement algorithms in predictive accuracy for both human and mouse 1/16, 1/32, 1/64 and 1/100 downsampled high-resolution Hi-C data. Also, validation by extracting topologically associating domains, chromosome 3D structure and chromatin loop predictions from the enhanced data shows that HiCARN can proficiently reconstruct biologically significant regions.

    Availability and implementation

    HiCARN can be accessed and utilized as an open-sourced software at: and is also available as a containerized application that can be run on any platform.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
