skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: peaksat: an R package for ChIP-seq peak saturation analysis
Abstract BackgroundEpigenomic profiling assays such as ChIP-seq have been widely used to map the genome-wide enrichment profiles of chromatin-associated proteins and posttranslational histone modifications. Sequencing depth is a key parameter in experimental design and quality control. However, due to variable sequencing depth requirements across experimental conditions, it can be challenging to determine optimal sequencing depth, particularly for projects involving multiple targets or cell types. ResultsWe developed thepeaksatR package to provide target read depth estimates for epigenomic experiments based on the analysis of peak saturation curves. We appliedpeaksatto establish the distinctive read depth requirements for ChIP-seq studies of histone modifications in different cell lines. Usingpeaksat,we were able to estimate the target read depth required per library to obtain high-quality peak calls for downstream analysis. In addition,peaksatwas applied to other sequence-enrichment methods including CUT&RUN and ATAC-seq. Conclusionpeaksataddresses a need for researchers to make informed decisions about whether their sequencing data has been generated to an adequate depth and subsequently sufficient meaningful peaks, and failing that, how many more reads would be required per library.peaksatis applicable to other sequence-based methods that include calling peaks in their analysis.  more » « less
Award ID(s):
1826689
PAR ID:
10392914
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
BMC Genomics
Volume:
24
Issue:
1
ISSN:
1471-2164
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The ability to profile transcriptomes and characterize global gene expression changes has been greatly enabled by the development of RNA sequencing technologies (RNA-seq). However, the process of generating sequencing-compatible cDNA libraries from RNA samples can be time-consuming and expensive, especially for bacterial mRNAs which lack poly(A)-tails that are often used to streamline this process for eukaryotic samples. Compared to the increasing throughput and decreasing cost of sequencing, library preparation has had limited advances. Here, we describe bacterial-multiplexed-seq (BaM-seq), an approach that enables simple barcoding of many bacterial RNA samples that decreases the time and cost of library preparation. We also present targeted-bacterial-multiplexed-seq (TBaM-seq) that allows for differential expression analysis of specific gene panels with over 100-fold enrichment in read coverage. In addition, we introduce the concept of transcriptome redistribution based on TBaM-seq that dramatically reduces the required sequencing depth while still allowing for quantification of both highly and lowly abundant transcripts. These methods accurately measure gene expression changes with high technical reproducibility and agreement with gold standard, lower throughput approaches. Together, use of these library preparation protocols allows for fast, affordable generation of sequencing libraries. 
    more » « less
  2. Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. However, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq. Most regulatory genomics analysis pipelines discard “multimapped” reads that align equally well to multiple genomic locations. Because multimapped reads arise predominantly from repeats, current analysis pipelines fail to detect a substantial portion of regulatory events that occur in repetitive regions. To address this shortcoming, we developed Allo, a new approach to allocate multimapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multimapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multimapping read assignment. Allo also provides read-level output in the form of a corrected alignment file, making it compatible with existing regulatory genomics analysis pipelines and downstream peak-finders. In a demonstration application on CTCF ChIP-seq data, we show that Allo results in the discovery of thousands of new CTCF peaks. Many of these peaks contain the expected cognate motif and/or serve as TAD boundaries. We additionally apply Allo to a diverse collection of ENCODE ChIP-seq data sets, resulting in multiple previously unidentified interactions between transcription factors and repetitive element families. Finally, we show that Allo may be particularly beneficial in identifying ChIP-seq peaks at centromeres, near segmentally duplicated genes, and in younger TEs, enabling new regulatory analyses in these regions. 
    more » « less
  3. ABSTRACT Metagenomics is a powerful tool for characterising viruses, with broad applications across diverse disciplines, from understanding the ecology and evolutionary history of viruses to identifying causative agents of emerging outbreaks with unknown aetiology. Additionally, metagenomic data contains valuable information about the amount of virus present within samples. However, we have yet to leverage metagenomics to assess viral load, which is a key epidemiological parameter. To effectively use sequencing outputs to inform transmission, we need to understand the relationship between read depth and viral load across a diverse set of viruses. Here, using target enrichment sequencing, we investigated the detection and recovery of virus genomes by spiking known concentrations of DNA and RNA viruses into wild rodent faecal samples. In total, 15 experimental replicates were sequenced with target enrichment sequencing and compared to shotgun sequencing of the same background samples. Target enriched sequencing recovered all spike-in viruses at every concentration (102, 103, and 105± 1 log genome copies) and showed a log-linear relationship between spike-in concentration and mean read depth. Background viruses (includingKobuvirusandCardiovirus) were recovered consistently across all biological and technical replicates, but genome coverage was variable between virus genera and likely reflected the composition of target enrichment probe panel. Overall, our study highlights the strengths and weaknesses of using commercially available panels to quantify and characterise wildlife viromes, and underscores the importance of probe panel design for accurately interpreting coverage and read depth. To advance the use of metagenomics for understanding virus transmission, further research will be needed to elucidate how sequencing strategy (e.g. library depth, pooling), virome composition, and probe design influence viral read counts and genome coverage. 
    more » « less
  4. Introduction Various sequencing based approaches are used to identify and characterize the activities of cis -regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis -regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods Here, machine learning models are employed to evaluate the accuracy with which cis -regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis -regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis -regulatory element prediction. 
    more » « less
  5. Abstract The development of next-generation sequencing (NGS) enabled a shift from array-based genotyping to directly sequencing genomic libraries for high-throughput genotyping. Even though whole-genome sequencing was initially too costly for routine analysis in large populations such as breeding or genetic studies, continued advancements in genome sequencing and bioinformatics have provided the opportunity to capitalize on whole-genome information. As new sequencing platforms can routinely provide high-quality sequencing data for sufficient genome coverage to genotype various breeding populations, a limitation comes in the time and cost of library construction when multiplexing a large number of samples. Here we describe a high-throughput whole-genome skim-sequencing (skim-seq) approach that can be utilized for a broad range of genotyping and genomic characterization. Using optimized low-volume Illumina Nextera chemistry, we developed a skim-seq method and combined up to 960 samples in one multiplex library using dual index barcoding. With the dual-index barcoding, the number of samples for multiplexing can be adjusted depending on the amount of data required, and could be extended to 3,072 samples or more. Panels of doubled haploid wheat lines (Triticum aestivum, CDC Stanley x CDC Landmark), wheat-barley (T.aestivumxHordeum vulgare) and wheat-wheatgrass (Triticum durum x Thinopyrum intermedium) introgression lines as well as known monosomic wheat stocks were genotyped using the skim-seq approach. Bioinformatics pipelines were developed for various applications where sequencing coverage ranged from 1 × down to 0.01 × per sample. Using reference genomes, we detected chromosome dosage, identified aneuploidy, and karyotyped introgression lines from the skim-seq data. Leveraging the recent advancements in genome sequencing, skim-seq provides an effective and low-cost tool for routine genotyping and genetic analysis, which can track and identify introgressions and genomic regions of interest in genetics research and applied breeding programs. 
    more » « less