This content will become publicly available on July 24, 2025
Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20–100 binding sites into dense nucleotide arrays of 50–300 base pairs in 0.05–10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.
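To make the reduction described in the abstract more concrete, the following is a minimal sketch of the underlying idea, not the authors' implementation. It treats each binding site as a node, uses the number of bases added when one site follows another as an integer distance, and searches for the ordering that fits the most sites within a length budget. The function names (`overlap`, `step_cost`, `pack`), the toy sites, and the brute-force search are all assumptions made for illustration; the paper uses an integer linear programming solver, which this sketch replaces with exhaustive enumeration that is only practical for a handful of sites. Reverse complements and sites fully contained inside other sites are ignored here.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def step_cost(a: str, b: str) -> int:
    """Integer 'distance' from site a to site b: bases added by appending b after a."""
    return len(b) - overlap(a, b)


def pack(sites, max_len):
    """Brute-force stand-in for the ILP: fit as many sites as possible in max_len bases.

    Ties on site count are broken in favor of the shorter sequence.
    Returns (sequence, number_of_sites_packed).
    """
    best_seq, best_count = "", 0
    for r in range(1, len(sites) + 1):
        for order in permutations(sites, r):
            seq = order[0]
            for prev, nxt in zip(order, order[1:]):
                # Append only the part of nxt not already covered by prev's suffix.
                seq += nxt[overlap(prev, nxt):]
            if len(seq) <= max_len and (r, -len(seq)) > (best_count, -len(best_seq)):
                best_seq, best_count = seq, r
    return best_seq, best_count


# Hypothetical toy binding sites, chosen only so that overlaps exist.
sites = ["ATGCA", "GCATT", "TTACG"]
print(step_cost("ATGCA", "GCATT"))  # 2 bases added, since "GCA" is shared
print(pack(sites, 10))              # ('ATGCATTACG', 3)
print(pack(sites, 8))               # ('ATGCATT', 2)
```

The length budget plays the role of the travel budget in an Orienteering Problem, and the count of packed sites plays the role of collected reward. An ILP formulation would encode the same costs in its constraints while avoiding the factorial enumeration used above.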
- Award ID(s):
- 2324909
- NSF-PAR ID:
- 10526795
- Editor(s):
- Klumpp, Stefan
- Publisher / Repository:
- PLOS Computational Biology
- Date Published:
- Journal Name:
- PLOS Computational Biology
- Volume:
- 20
- Issue:
- 7
- ISSN:
- 1553-7358
- Page Range / eLocation ID:
- e1012276
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- How homeodomain proteins gain sufficient specificity to control different cell fates has been a long-standing problem in developmental biology. The conserved Gsx homeodomain proteins regulate specific aspects of neural development in animals from flies to mammals, and yet they belong to a large transcription factor family that binds nearly identical DNA sequences in vitro. Here, we show that the mouse and fly Gsx factors unexpectedly gain DNA binding specificity by forming cooperative homodimers on precisely spaced and oriented DNA sites. High-resolution genomic binding assays revealed that Gsx2 binds both monomer and homodimer sites in the developing mouse ventral telencephalon. Importantly, reporter assays showed that Gsx2 mediates opposing outcomes in a DNA binding site-dependent manner: Monomer Gsx2 binding represses transcription, whereas homodimer binding stimulates gene expression. In Drosophila, the Gsx homolog, Ind, similarly represses or stimulates transcription in a site-dependent manner via an autoregulatory enhancer containing a combination of monomer and homodimer sites. Integrating these findings, we test a model showing how the homodimer to monomer site ratio and the Gsx protein levels define gene up-regulation versus down-regulation. Altogether, these data serve as a new paradigm for how cooperative homeodomain transcription factor binding can increase target specificity and alter regulatory outcomes.
- John Pham, Ph.D. Editor-in-Chief (Ed.) The target DNA specificity of the CRISPR-associated genome editor nuclease Cas9 is determined by complementarity to a 20-nucleotide segment in its guide RNA. However, Cas9 can bind and cleave partially complementary off-target sequences, which raises safety concerns for its use in clinical applications. Here we report crystallographic structures of Cas9 bound to bona fide off-target substrates, revealing that off-target binding is enabled by a range of non-canonical base-pairing interactions and preservation of base stacking within the guide–off-target heteroduplex. Off-target sites containing single-nucleotide deletions relative to the guide RNA are accommodated by base skipping or multiple non-canonical base pairs rather than RNA bulge formation. Additionally, PAM-distal mismatches result in duplex unpairing and induce a conformational change of the Cas9 REC lobe that perturbs its conformational activation. Together, these insights provide a structural rationale for the off-target activity of Cas9 and contribute to the improved rational design of guide RNAs and off-target prediction algorithms.
- CRISPR-associated transposases (CASTs) direct DNA integration downstream of target sites using the RNA-guided DNA binding activity of nuclease-deficient CRISPR-Cas systems. Transposition relies on several key protein-protein and protein-DNA interactions, but little is known about the explicit sequence requirements governing efficient transposon DNA integration activity. Here, we exploit pooled library screening and high-throughput sequencing to reveal novel sequence determinants during transposition by the Type I-F Vibrio cholerae CAST system (VchCAST). On the donor DNA, large transposon end libraries revealed binding site nucleotide preferences for the TnsB transposase, as well as an additional conserved region that encoded a consensus binding site for integration host factor (IHF). Remarkably, we found that VchCAST requires IHF for efficient transposition, thus revealing a novel cellular factor involved in CRISPR-associated transpososome assembly. On the target DNA, we uncovered preferred sequence motifs at the integration site that explained previously observed heterogeneity with single-base pair resolution. Finally, we exploited our library data to design modified transposon variants that enable in-frame protein tagging. Collectively, our results provide new clues about the assembly and architecture of the paired-end complex formed between TnsB and the transposon DNA, and inform the design of custom payload sequences for genome engineering applications with CAST systems.
- Cooperative DNA-binding by transcription factor (TF) proteins is critical for eukaryotic gene regulation. In the human genome, many regulatory regions contain TF-binding sites in close proximity to each other, which can facilitate cooperative interactions. However, binding site proximity does not necessarily imply cooperative binding, as TFs can also bind independently to each of their neighboring target sites. Currently, the rules that drive cooperative TF binding are not well understood. In addition, it is oftentimes difficult to infer direct TF–TF cooperativity from existing DNA-binding data. Here, we show that in vitro binding assays using DNA libraries of a few thousand genomic sequences with putative cooperative TF-binding events can be used to develop accurate models of cooperativity and to gain insights into cooperative binding mechanisms. Using factors ETS1 and RUNX1 as our case study, we show that the distance and orientation between ETS1 sites are critical determinants of cooperative ETS1–ETS1 binding, while cooperative ETS1–RUNX1 interactions show more flexibility in distance and orientation and can be accurately predicted based on the affinity and sequence/shape features of the binding sites. The approach described here, combining custom experimental design with machine-learning modeling, can be easily applied to study the cooperative DNA-binding patterns of any TFs.
- Polen, Tino (Ed.) ABSTRACT Regulation of gene expression is a vital component of cellular biology. Transcription factor proteins often bind regulatory DNA sequences upstream of transcription start sites to facilitate the activation or repression of RNA polymerase. Research laboratories have devoted many projects to understanding the transcription regulatory networks for transcription factors, as these regulated genes provide critical insight into the biology of the host organism. Various in vivo and in vitro assays have been developed to elucidate transcription regulatory networks. Several assays, including SELEX-seq and ChIP-seq, capture DNA-bound transcription factors to determine the preferred DNA-binding sequences, which can then be mapped to the host organism’s genome to identify candidate regulatory genes. In this protocol, we describe an alternative in vitro, iterative selection approach to ascertaining DNA-binding sequences of a transcription factor of interest using restriction endonuclease, protection, selection, and amplification (REPSA). Contrary to traditional antibody-based capture methods, REPSA selects for transcription factor-bound DNA sequences by challenging binding reactions with a type IIS restriction endonuclease. Cleavage-resistant DNA species are amplified by PCR and then used as inputs for the next round of REPSA. This process is repeated until a protected DNA species is observed by gel electrophoresis, which is an indication of a successful REPSA experiment. Subsequent high-throughput sequencing of REPSA-selected DNAs accompanied by motif discovery and scanning analyses can be used for determining transcription factor consensus binding sequences and potential regulated genes, providing critical first steps in determining organisms’ transcription regulatory networks.
IMPORTANCE Transcription regulatory proteins are an essential class of proteins that help maintain cellular homeostasis by adapting the transcriptome based on environmental cues. Dysregulation of transcription factors can lead to diseases such as cancer, and many eukaryotic and prokaryotic transcription factors have become enticing therapeutic targets. Additionally, in many understudied organisms, the transcription regulatory networks for uncharacterized transcription factors remain unknown. As such, the need for experimental techniques to establish transcription regulatory networks is paramount. Here, we describe a step-by-step protocol for REPSA, an inexpensive, iterative selection technique to identify transcription factor-binding sequences without the need for antibody-based capture methods.