skip to main content


Title: Greenscreen: A simple method to remove artifactual signals and enrich for true peaks in genomic datasets including ChIP-seq data
Abstract Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is widely used to identify factor binding to genomic DNA and chromatin modifications. ChIP-seq data analysis is affected by genomic regions that generate ultra-high artifactual signals. To remove these signals from ChIP-seq data, the Encyclopedia of DNA Elements (ENCODE) project developed comprehensive sets of regions defined by low mappability and ultra-high signals called blacklists for human, mouse (Mus musculus), nematode (Caenorhabditis elegans), and fruit fly (Drosophila melanogaster). However, blacklists are not currently available for many model and nonmodel species. Here, we describe an alternative approach for removing false-positive peaks called greenscreen. Greenscreen is easy to implement, requires few input samples, and uses analysis tools frequently employed for ChIP-seq. Greenscreen removes artifactual signals as effectively as blacklists in Arabidopsis thaliana and human ChIP-seq dataset while covering less of the genome and dramatically improves ChIP-seq peak calling and downstream analyses. Greenscreen filtering reveals true factor binding overlap and occupancy changes in different genetic backgrounds or tissues. Because it is effective with as few as two inputs, greenscreen is readily adaptable for use in any species or genome build. Although developed for ChIP-seq, greenscreen also identifies artifactual signals from other genomic datasets including Cleavage Under Targets and Release Using Nuclease. We present an improved ChIP-seq pipeline incorporating greenscreen that detects more true peaks than other methods.  more » « less
Award ID(s):
1905062 1953279
NSF-PAR ID:
10404457
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
The Plant Cell
Volume:
34
Issue:
12
ISSN:
1040-4651
Page Range / eLocation ID:
4795 to 4815
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells. scFAN is pretrained on genome-wide bulk assay for transposase-accessible chromatin sequencing (ATAC-seq), DNA sequence, and chromatin immunoprecipitation sequencing (ChIP-seq) data and uses single-cell ATAC-seq to predict TF binding in individual cells. We demonstrate the efficacy of scFAN by both studying sequence motifs enriched within predicted binding peaks and using predicted TFs for discovering cell types. We develop a new metric “TF activity score” to characterize each cell and show that activity scores can reliably capture cell identities. scFAN allows us to discover and study cellular identities and heterogeneity based on chromatin accessibility profiles. 
    more » « less
  2. Abstract

    Chromatin immunoprecipitation followed by next‐generation sequencing (ChIP‐seq) is a technique to detect genomic regions containing protein‐DNA interaction, such as transcription factor binding sites or regions containing histone modifications. One goal of the analysis of ChIP‐seq experiments is to identify genomic loci enriched for sequencing reads pertaining to DNA bound to the factor of interest. The accurate identification of such regions aids in the understanding of epigenomic marks and gene regulatory mechanisms. Given the reduction of massively parallel sequencing costs, methods to detect consensus regions of enrichment across multiple samples are of interest. Here, we present a statistical model to detect broad consensus regions of enrichment from ChIP‐seq technical or biological replicates through a class of zero‐inflated mixed‐effects hidden Markov models. We show that the proposed model outperforms existing methods for consensus peak calling in common epigenomic marks by accounting for the excess zeros and sample‐specific biases. We apply our method to data from the Encyclopedia of DNA Elements and Roadmap Epigenomics projects and also from an extensive simulation study.

     
    more » « less
  3. Komeili, Arash (Ed.)
    ABSTRACT Histone proteins are found across diverse lineages of Archaea , many of which package DNA and form chromatin. However, previous research has led to the hypothesis that the histone-like proteins of high-salt-adapted archaea, or halophiles, function differently. The sole histone protein encoded by the model halophilic species Halobacterium salinarum , HpyA, is nonessential and expressed at levels too low to enable genome-wide DNA packaging. Instead, HpyA mediates the transcriptional response to salt stress. Here we compare the features of genome-wide binding of HpyA to those of HstA, the sole histone of another model halophile, Haloferax volcanii . hstA , like hpyA , is a nonessential gene. To better understand HpyA and HstA functions, protein-DNA binding data (chromatin immunoprecipitation sequencing [ChIP-seq]) of these halophilic histones are compared to publicly available ChIP-seq data from DNA binding proteins across all domains of life, including transcription factors (TFs), nucleoid-associated proteins (NAPs), and histones. These analyses demonstrate that HpyA and HstA bind the genome infrequently in discrete regions, which is similar to TFs but unlike NAPs, which bind a much larger genomic fraction. However, unlike TFs that typically bind in intergenic regions, HpyA and HstA binding sites are located in both coding and intergenic regions. The genome-wide dinucleotide periodicity known to facilitate histone binding was undetectable in the genomes of both species. Instead, TF-like and histone-like binding sequence preferences were detected for HstA and HpyA, respectively. Taken together, these data suggest that halophilic archaeal histones are unlikely to facilitate genome-wide chromatin formation and that their function defies categorization as a TF, NAP, or histone. IMPORTANCE Most cells in eukaryotic species—from yeast to humans—possess histone proteins that pack and unpack DNA in response to environmental cues. These essential proteins regulate genes necessary for important cellular processes, including development and stress protection. Although the histone fold domain originated in the domain of life Archaea , the function of archaeal histone-like proteins is not well understood relative to those of eukaryotes. We recently discovered that, unlike histones of eukaryotes, histones in hypersaline-adapted archaeal species do not package DNA and can act as transcription factors (TFs) to regulate stress response gene expression. However, the function of histones across species of hypersaline-adapted archaea still remains unclear. Here, we compare hypersaline histone function to a variety of DNA binding proteins across the tree of life, revealing histone-like behavior in some respects and specific transcriptional regulatory function in others. 
    more » « less
  4. Abstract

    Genome-wide profiling of chromatin accessibility by DNase-seq or ATAC-seq has been widely used to identify regulatory DNA elements and transcription factor binding sites. However, enzymatic DNA cleavage exhibits intrinsic sequence biases that confound chromatin accessibility profiling data analysis. Existing computational tools are limited in their ability to account for such intrinsic biases and not designed for analyzing single-cell data. Here, we present Simplex Encoded Linear Model for Accessible Chromatin (SELMA), a computational method for systematic estimation of intrinsic cleavage biases from genomic chromatin accessibility profiling data. We demonstrate that SELMA yields accurate and robust bias estimation from both bulk and single-cell DNase-seq and ATAC-seq data. SELMA can utilize internal mitochondrial DNA data to improve bias estimation. We show that transcription factor binding inference from DNase footprints can be improved by incorporating estimated biases using SELMA. Furthermore, we show strong effects of intrinsic biases in single-cell ATAC-seq data, and develop the first single-cell ATAC-seq intrinsic bias correction model to improve cell clustering. SELMA can enhance the performance of existing bioinformatics tools and improve the analysis of both bulk and single-cell chromatin accessibility sequencing data.

     
    more » « less
  5. Abstract Background

    The genetic information contained in the genome of an organism is organized in genes and regulatory elements that control gene expression. The genomes of multiple plants species have already been sequenced and the gene repertory have been annotated, however,cis-regulatory elements remain less characterized, limiting our understanding of genome functionality. These elements act as open platforms for recruiting both positive- and negative-acting transcription factors, and as such, chromatin accessibility is an important signature for their identification.

    Results

    In this work we developed a transgenic INTACT [isolation of nuclei tagged in specific cell types] system in tetraploid wheat for nuclei purifications. Then, we combined the INTACT system together with the assay for transposase-accessible chromatin with sequencing [ATAC-seq] to identify open chromatin regions in wheat root tip samples. Our ATAC-seq results showed a large enrichment of open chromatin regions in intergenic and promoter regions, which is expected for regulatory elements and that is similar to ATAC-seq results obtained in other plant species. In addition, root ATAC-seq peaks showed a significant overlap with a previously published ATAC-seq data from wheat leaf protoplast, indicating a high reproducibility between the two experiments and a large overlap between open chromatin regions in root and leaf tissues. Importantly, we observed overlap between ATAC-seq peaks andcis-regulatory elements that have been functionally validated in wheat, and a good correlation between normalized accessibility and gene expression levels.

    Conclusions

    We have developed and validated an INTACT system in tetraploid wheat that allows rapid and high-quality nuclei purification from root tips. Those nuclei were successfully used to performed ATAC-seq experiments that revealed open chromatin regions in the wheat genome that will be useful to identify cis-regulatory elements. The INTACT system presented here will facilitate the development of ATAC-seq datasets in other tissues, growth stages, and under different growing conditions to generate a more complete landscape of the accessible DNA regions in the wheat genome.

     
    more » « less