Short linear motifs are peptide patterns 3 to 11 amino acids long that play important regulatory roles in modulating protein activities. Although they are abundant in proteins, they are often difficult to discover experimentally because they bind their partners with low affinity and interact only transiently. Moreover, available computational methods cannot predict short linear motifs effectively because the motifs are short and degenerate. Here we developed a novel approach, FlexSLiM, for reliable discovery of short linear motifs in protein sequences. By testing on simulated data and benchmark experimental data, we demonstrated that FlexSLiM identifies short linear motifs more effectively than existing methods. We provide a general tool that will advance the understanding of short linear motifs and facilitate research on protein targeting signals, protein post-translational modifications, and many other topics.
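To illustrate why short, degenerate patterns are hard to predict, here is a minimal Python sketch that scans a sequence for one well-known motif core, PxxP (proline, any two residues, proline), found in many SH3-domain binding sites. This is not FlexSLiM's algorithm, only a plain regular-expression scan. The helper name `find_motif` and the example sequence are hypothetical and chosen for illustration.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for every occurrence, including overlapping ones.

    Hypothetical helper for illustration; not part of FlexSLiM.
    """
    # Wrapping the pattern in a lookahead with a capture group lets
    # re.finditer report matches that overlap each other.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# PxxP written as a regular expression: "." stands for any residue.
PXXP = "P..P"

# Made-up 15-residue sequence, used only to show the scan.
hits = find_motif("MAPLPPKPSAPTPPR", PXXP)

for start, text in hits:
    print(start, text)
```

On this made-up sequence the scan reports four overlapping hits in only 15 residues. A pattern with two fixed positions matches very often by chance, so a simple pattern scan cannot by itself separate functional motifs from background.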
Graph-Based Motif Discovery in Mimotope Profiles of Serum Antibody Repertoire
The phage display technique has many applications, such as epitope mapping, organ targeting, therapeutic antibody engineering, and vaccine design. One area of particular importance is the detection of cancers in early stages, where the discovery of binding motifs and epitopes is critical. While several techniques exist to characterize phages, Next Generation Sequencing (NGS) stands out for its ability to provide detailed insights into antibody binding sites on antigens. However, identifying regulatory motifs in NGS data poses significant challenges. Existing methods often scale poorly to large datasets, rely on prior knowledge of the number of motifs, and exhibit low accuracy. In this paper, we present a novel approach for identifying regulatory motifs in NGS data. Our method leverages results from graph theory to overcome the limitations of existing techniques.
- PAR ID: 10490255
- Publisher / Repository: Springer
- Date Published:
- Journal Name: Lecture notes in bioinformatics
- ISSN: 2366-6323
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Polen, Tino (Ed.) ABSTRACT: Regulation of gene expression is a vital component of cellular biology. Transcription factor proteins often bind regulatory DNA sequences upstream of transcription start sites to facilitate the activation or repression of RNA polymerase. Research laboratories have devoted many projects to understanding the transcription regulatory networks for transcription factors, as these regulated genes provide critical insight into the biology of the host organism. Various in vivo and in vitro assays have been developed to elucidate transcription regulatory networks. Several assays, including SELEX-seq and ChIP-seq, capture DNA-bound transcription factors to determine the preferred DNA-binding sequences, which can then be mapped to the host organism’s genome to identify candidate regulatory genes. In this protocol, we describe an alternative in vitro, iterative selection approach to ascertaining DNA-binding sequences of a transcription factor of interest using restriction endonuclease, protection, selection, and amplification (REPSA). Contrary to traditional antibody-based capture methods, REPSA selects for transcription factor-bound DNA sequences by challenging binding reactions with a type IIS restriction endonuclease. Cleavage-resistant DNA species are amplified by PCR and then used as inputs for the next round of REPSA. This process is repeated until a protected DNA species is observed by gel electrophoresis, which is an indication of a successful REPSA experiment. Subsequent high-throughput sequencing of REPSA-selected DNAs accompanied by motif discovery and scanning analyses can be used for determining transcription factor consensus binding sequences and potential regulated genes, providing critical first steps in determining organisms’ transcription regulatory networks.
IMPORTANCE Transcription regulatory proteins are an essential class of proteins that help maintain cellular homeostasis by adapting the transcriptome based on environmental cues. Dysregulation of transcription factors can lead to diseases such as cancer, and many eukaryotic and prokaryotic transcription factors have become enticing therapeutic targets. Additionally, in many understudied organisms, the transcription regulatory networks for uncharacterized transcription factors remain unknown. As such, the need for experimental techniques to establish transcription regulatory networks is paramount. Here, we describe a step-by-step protocol for REPSA, an inexpensive, iterative selection technique to identify transcription factor-binding sequences without the need for antibody-based capture methods.
-
Abstract Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without negative transfer effects often seen in existing multi-task learning and transfer learning methods.
-
Abstract Motivation: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. Results: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementation: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary information: Supplementary materials are available at Bioinformatics online.
-
Abstract Gene regulation in eukaryotes is partly shaped by the 3D organization of chromatin within the cell nucleus. Distal interactions between cis-regulatory elements and their target genes are widespread, and many causal loci underlying heritable agricultural traits have been mapped to distal non-coding elements. The biology underlying chromatin loop formation in plants is poorly understood. Dissecting the sequence features that mediate distal interactions is an important step toward identifying putative molecular mechanisms. Here, we trained GenomicLinks, a deep learning model, to identify DNA sequence features predictive of 3D chromatin interactions in maize. We found that the presence of binding motifs of specific transcription factor classes, especially bHLH, is predictive of chromatin interaction specificities. Using an in silico mutagenesis approach we show the removal of these motifs from loop anchors leads to reduced interaction probabilities. We were able to validate these predictions with single-cell co-accessibility data from different maize genotypes that harbor natural substitutions in these TF binding motifs. GenomicLinks is currently implemented as an open-source web tool, which should facilitate its wider use in the plant research community.
