Title: Prediction of histone post-translational modifications using deep learning
Abstract

Motivation

Histone post-translational modifications (PTMs) are involved in a variety of essential regulatory processes in the cell, including transcription control. Recent studies have shown that histone PTMs can be accurately predicted from transcription factor binding or DNase hypersensitivity data. Similarly, it has been shown that one can predict PTMs from the underlying DNA primary sequence.

Results

In this study, we introduce a deep learning architecture called DeepPTM for predicting histone PTMs from transcription factor binding data and the primary DNA sequence. Extensive experimental results show that our deep learning model achieves higher prediction accuracy than the model proposed in Benveniste et al. (PNAS 2014) and DeepHistone (BMC Genomics 2019). The competitive advantage of our framework lies in the synergistic use of deep learning combined with an effective pre-processing step. Our classification framework has also enabled the discovery that a small subset of transcription factors (which are histone-PTM-specific and cell-type-specific) can provide almost the same prediction accuracy as can be obtained using all the transcription factor data.

Availability and implementation

https://github.com/dDipankar/DeepPTM.

Supplementary information

Supplementary data are available at Bioinformatics online.
Award ID(s):
1814359
NSF-PAR ID:
10249203
Editor(s):
Cowen, Lenore
Journal Name:
Bioinformatics
Volume:
36
Issue:
24
ISSN:
1367-4803
Page Range / eLocation ID:
5610 to 5617
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Human immunodeficiency virus type 1 (HIV-1) genome integration is closely related to clinical latency and viral rebound. In addition to human DNA sequences that directly interact with the integration machinery, the selection of HIV integration sites has also been shown to depend on the heterogeneous genomic context around a large region, which greatly hinders the prediction and mechanistic studies of HIV integration.

    Results

    We have developed an attention-based deep learning framework, named DeepHINT, to simultaneously provide accurate prediction of HIV integration sites and mechanistic explanations of the detected sites. Extensive tests on a high-density HIV integration site dataset showed that DeepHINT can outperform conventional modeling strategies by automatically learning the genomic context of HIV integration from primary DNA sequence alone or together with epigenetic information. Systematic analyses on diverse known factors of HIV integration further validated the biological relevance of the prediction results. More importantly, in-depth analyses of the attention values output by DeepHINT revealed intriguing mechanistic implications in the selection of HIV integration sites, including potential roles of several DNA-binding proteins. These results established DeepHINT as an effective and explainable deep learning framework for the prediction and mechanistic study of HIV integration.

    Availability and implementation

    DeepHINT is available as an open-source software and can be downloaded from https://github.com/nonnerdling/DeepHINT.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Introduction

    Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers.

    Methods

    Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes.

    Results and discussion

    Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements, of which enhancers are a subset, and that the associated data are appropriate for training models for detecting regulatory activity from sequence alone; that STARR-seq data are best for training enhancer-specific sequence models; and that H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
  3. Abstract Motivation

    An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem.

    Results

    Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.

    Availability and implementation

    Our program is freely available at https://github.com/ramzan1990/sequence2vec.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract Motivation

    Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution.

    Results

    We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings.

    Availability and implementation

    NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. Komeili, Arash (Ed.)
    ABSTRACT Histone proteins are found across diverse lineages of Archaea, many of which package DNA and form chromatin. However, previous research has led to the hypothesis that the histone-like proteins of high-salt-adapted archaea, or halophiles, function differently. The sole histone protein encoded by the model halophilic species Halobacterium salinarum, HpyA, is nonessential and expressed at levels too low to enable genome-wide DNA packaging. Instead, HpyA mediates the transcriptional response to salt stress. Here we compare the features of genome-wide binding of HpyA to those of HstA, the sole histone of another model halophile, Haloferax volcanii. hstA, like hpyA, is a nonessential gene. To better understand HpyA and HstA functions, protein-DNA binding data (chromatin immunoprecipitation sequencing [ChIP-seq]) of these halophilic histones are compared to publicly available ChIP-seq data from DNA binding proteins across all domains of life, including transcription factors (TFs), nucleoid-associated proteins (NAPs), and histones. These analyses demonstrate that HpyA and HstA bind the genome infrequently in discrete regions, which is similar to TFs but unlike NAPs, which bind a much larger genomic fraction. However, unlike TFs that typically bind in intergenic regions, HpyA and HstA binding sites are located in both coding and intergenic regions. The genome-wide dinucleotide periodicity known to facilitate histone binding was undetectable in the genomes of both species. Instead, TF-like and histone-like binding sequence preferences were detected for HstA and HpyA, respectively. Taken together, these data suggest that halophilic archaeal histones are unlikely to facilitate genome-wide chromatin formation and that their function defies categorization as a TF, NAP, or histone.

    IMPORTANCE Most cells in eukaryotic species—from yeast to humans—possess histone proteins that pack and unpack DNA in response to environmental cues. These essential proteins regulate genes necessary for important cellular processes, including development and stress protection. Although the histone fold domain originated in the domain of life Archaea, the function of archaeal histone-like proteins is not well understood relative to those of eukaryotes. We recently discovered that, unlike histones of eukaryotes, histones in hypersaline-adapted archaeal species do not package DNA and can act as transcription factors (TFs) to regulate stress response gene expression. However, the function of histones across species of hypersaline-adapted archaea still remains unclear. Here, we compare hypersaline histone function to a variety of DNA binding proteins across the tree of life, revealing histone-like behavior in some respects and specific transcriptional regulatory function in others.