

Title: Improved detection of epigenomic marks with mixed‐effects hidden Markov models
Abstract

Chromatin immunoprecipitation followed by next‐generation sequencing (ChIP‐seq) is a technique to detect genomic regions containing protein‐DNA interactions, such as transcription factor binding sites or regions containing histone modifications. One goal of the analysis of ChIP‐seq experiments is to identify genomic loci enriched for sequencing reads pertaining to DNA bound to the factor of interest. The accurate identification of such regions aids in the understanding of epigenomic marks and gene regulatory mechanisms. Given the reduction of massively parallel sequencing costs, methods to detect consensus regions of enrichment across multiple samples are of interest. Here, we present a statistical model to detect broad consensus regions of enrichment from ChIP‐seq technical or biological replicates through a class of zero‐inflated mixed‐effects hidden Markov models. We show that the proposed model outperforms existing methods for consensus peak calling in common epigenomic marks by accounting for the excess zeros and sample‐specific biases. We apply our method to data from the Encyclopedia of DNA Elements and Roadmap Epigenomics projects, as well as to data from an extensive simulation study.

 
PAR ID:
10126584
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
75
Issue:
4
ISSN:
0006-341X
Format(s):
Medium: X
Size(s):
p. 1401-1413
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is widely used to identify factor binding to genomic DNA and chromatin modifications. ChIP-seq data analysis is affected by genomic regions that generate ultra-high artifactual signals. To remove these signals from ChIP-seq data, the Encyclopedia of DNA Elements (ENCODE) project developed blacklists, comprehensive sets of regions defined by low mappability and ultra-high signals, for human, mouse (Mus musculus), nematode (Caenorhabditis elegans), and fruit fly (Drosophila melanogaster). However, blacklists are not currently available for many model and nonmodel species. Here, we describe an alternative approach for removing false-positive peaks called greenscreen. Greenscreen is easy to implement, requires few input samples, and uses analysis tools frequently employed for ChIP-seq. Greenscreen removes artifactual signals as effectively as blacklists in Arabidopsis thaliana and human ChIP-seq datasets while covering less of the genome, and it dramatically improves ChIP-seq peak calling and downstream analyses. Greenscreen filtering reveals true factor binding overlap and occupancy changes in different genetic backgrounds or tissues. Because it is effective with as few as two inputs, greenscreen is readily adaptable for use in any species or genome build. Although developed for ChIP-seq, greenscreen also identifies artifactual signals from other genomic datasets including Cleavage Under Targets and Release Using Nuclease. We present an improved ChIP-seq pipeline incorporating greenscreen that detects more true peaks than other methods.
  2. Abstract Background

    Epigenomic profiling assays such as ChIP-seq have been widely used to map the genome-wide enrichment profiles of chromatin-associated proteins and posttranslational histone modifications. Sequencing depth is a key parameter in experimental design and quality control. However, due to variable sequencing depth requirements across experimental conditions, it can be challenging to determine optimal sequencing depth, particularly for projects involving multiple targets or cell types.

    Results

    We developed the peaksat R package to provide target read depth estimates for epigenomic experiments based on the analysis of peak saturation curves. We applied peaksat to establish the distinctive read depth requirements for ChIP-seq studies of histone modifications in different cell lines. Using peaksat, we were able to estimate the target read depth required per library to obtain high-quality peak calls for downstream analysis. In addition, peaksat was applied to other sequence-enrichment methods including CUT&RUN and ATAC-seq.

    Conclusion

    peaksat addresses a need for researchers to make informed decisions about whether their sequencing data has been generated to an adequate depth and has subsequently yielded sufficient meaningful peaks, and failing that, how many more reads would be required per library. peaksat is applicable to other sequence-based methods that include calling peaks in their analysis.

     
  3. Abstract

    Detecting changes in the activity of a transcription factor (TF) in response to a perturbation provides insights into the underlying cellular process. Transcription Factor Enrichment Analysis (TFEA) is a robust and reliable computational method that detects positional motif enrichment associated with changes in transcription observed in response to a perturbation. TFEA detects positional motif enrichment within a list of ranked regions of interest (ROIs), typically sites of RNA polymerase initiation inferred from regulatory data such as nascent transcription. Therefore, we also introduce muMerge, a statistically principled method of generating a consensus list of ROIs from multiple replicates and conditions. TFEA is broadly applicable to data that informs on transcriptional regulation including nascent transcription (e.g., PRO-Seq), CAGE, histone ChIP-Seq, and accessibility data (e.g., ATAC-Seq). TFEA not only identifies the key regulators responding to a perturbation, but also temporally unravels regulatory networks with time series data. Consequently, TFEA serves as a hypothesis-generating tool that provides an easy, rigorous, and cost-effective means to broadly assess TF activity, yielding new biological insights.

     
  4. Abstract

    Long single-molecule sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using PacBio CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction treated and M.SssI-methyltransferase treated DNA of one human sample using PacBio CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the Curve on 5mCpG detection at single-molecule resolution. At the genome-wide site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and nanopore sequencing using only 10× reads. Furthermore, we develop a Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation using CCS reads, and then sequence a Chinese family trio to validate it. ccsmeth and ccsmethphase can be robust and accurate tools for detecting DNA 5-methylcytosines.

     
  5. Abstract

    Genomic regions that control traits of interest can be rapidly identified using BSA-Seq, a technology in which next-generation sequencing is applied to bulked segregant analysis (BSA). We recently developed the significant structural variant method for BSA-Seq data analysis that exhibits higher detection power than standard BSA-Seq analysis methods. Our original algorithm was developed to analyze BSA-Seq data in which genome sequences of one parent served as the reference sequences in genotype calling and, thus, required the availability of high-quality assembled parental genome sequences. Here, we modified the original script to effectively detect the genomic region–trait associations using only bulk genome sequences. We analyzed two public BSA-Seq datasets using our modified method and the standard allele frequency and G-statistic methods with and without the aid of the parental genome sequences. Our results demonstrate that the genomic region(s) associated with the trait of interest could be reliably identified via the significant structural variant method without using the parental genome sequences.

     