Title: The fidelity of transcription in human cells
To determine the error rate of transcription in human cells, we analyzed the transcriptome of H1 human embryonic stem cells with a circle-sequencing approach that allows for high-fidelity sequencing of the transcriptome. These experiments identified approximately 100,000 errors distributed over every major RNA species in human cells. Our results indicate that different RNA species display different error rates, suggesting that human cells prioritize the fidelity of some RNAs over others. Cross-referencing the errors that we detected with various genetic and epigenetic features of the human genome revealed that the in vivo error rate in human cells changes along the length of a transcript and is further modified by genetic context, repetitive elements, epigenetic markers, and the speed of transcription. Our experiments further suggest that BRCA1, a DNA repair protein implicated in breast cancer, has a previously unknown role in the suppression of transcription errors. Finally, we analyzed the distribution of transcription errors in multiple tissues of a new mouse model and found that they occur preferentially in neurons, compared to other cell types. These observations lend additional weight to the idea that transcription errors play a key role in the progression of various neurological disorders, including Alzheimer’s disease.
Journal Name:
Proceedings of the National Academy of Sciences
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. 
We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. 
CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat–Alu; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
  2. Agashe, Deepa (Ed.)
    Abstract Because errors at the DNA level power pathogen evolution, a systematic understanding of the rate and molecular spectra of mutations could guide the avoidance and treatment of infectious diseases. We thus accumulated tens of thousands of spontaneous mutations in 768 repeatedly bottlenecked lineages of 18 strains from various geographical sites, temporal spread, and genetic backgrounds. Entailing over ∼1.36 million generations, the resultant data yield an average mutation rate of ∼0.0005 per genome per generation, with a significant within-species variation. This is one of the lowest bacterial mutation rates reported, giving direct support for a high genome stability in this pathogen resulting from high DNA-mismatch-repair efficiency and replication-machinery fidelity. Pathogenicity genes do not exhibit an accelerated mutation rate, and thus, elevated mutation rates may not be the major determinant for the diversification of toxin and secretion systems. Intriguingly, a low error rate at the transcript level is not observed, suggesting distinct fidelity of the replication and transcription machinery. This study urges more attention on the most basic evolutionary processes of even the best-known human pathogens and deepens the understanding of their genome evolution. 
  3. Autism spectrum disorder (ASD) is a neurodevelopmental disorder that impairs patients’ cognitive, social, speech, and communication skills. ASD is highly heterogeneous, with a variety of etiologies and clinical manifestations. The prevalence of ASD has increased steadily in recent years. Presently, the molecular mechanisms underlying ASD occurrence and development remain to be elucidated. Here, we integrated multi-layer genomics data to investigate the transcriptome and pathway dysregulations in ASD development. The RNA sequencing (RNA-seq) expression profiles of induced pluripotent stem cells (iPSCs), neural progenitor cells (NPCs), and neurons from ASD and normal samples were compared in our study. We found that substantially more genes were differentially expressed in the NPCs than in the iPSCs. Consistently, gene set variation analysis revealed that the activity of the known ASD pathways in NPCs and neural cells was significantly different from that in the iPSCs, suggesting that ASD arises at an early stage of neural system development. We further constructed comprehensive brain- and neural-specific regulatory networks by incorporating transcription factor (TF) and gene interactions with long non-coding RNA (lncRNA) and protein interactions. We then overlaid the transcriptomes of different cell types on the regulatory networks to infer the regulatory cascades. The variations of the regulatory cascades between ASD and normal samples uncovered a set of novel disease-associated genes and gene interactions, particularly highlighting the functional roles of ELF3 and the interaction between STAT1 and lncRNA ELF3-AS1 in the disease development. These new findings extend our understanding of ASD and offer putative new therapeutic targets for further studies.
  4. INTRODUCTION Neurons are by far the most diverse of all cell types in animals, to the extent that “cell types” in mammalian brains are still mostly heterogeneous groups, and there is no consensus definition of the term. The Drosophila optic lobes, with approximately 200 well-defined cell types, provide a tractable system with which to address the genetic basis of neuronal type diversity. We previously characterized the distinct developmental gene expression program of each of these types using single-cell RNA sequencing (scRNA-seq), with one-to-one correspondence to the known morphological types. RATIONALE The identity of fly neurons is determined by temporal and spatial patterning mechanisms in stem cell progenitors, but it remained unclear how these cell fate decisions are implemented and maintained in postmitotic neurons. It was proposed in Caenorhabditis elegans that unique combinations of terminal selector transcription factors (TFs) that are continuously expressed in each neuron control nearly all of its type-specific gene expression. This model implies that it should be possible to engineer predictable and complete switches of identity between different neurons just by modifying these sustained TFs. We aimed to test this prediction in the Drosophila visual system. RESULTS Here, we used our developmental scRNA-seq atlases to identify the potential terminal selector genes in all optic lobe neurons. We found unique combinations of, on average, 10 differentially expressed and stably maintained (across all stages of development) TFs in each neuron. Through genetic gain- and loss-of-function experiments in postmitotic neurons, we showed that modifications of these selector codes are sufficient to induce predictable switches of identity between various cell types. Combinations of terminal selectors jointly control both developmental (e.g., morphology) and functional (e.g., neurotransmitters and their receptors) features of neurons.
The closely related Transmedullary 1 (Tm1), Tm2, Tm4, and Tm6 neurons (see the figure) share a similar code of terminal selectors, but can be distinguished from each other by three TFs that are continuously and specifically expressed in one of these cell types: Drgx in Tm1, Pdm3 in Tm2, and SoxN in Tm6. We showed that the removal of each of these selectors in these cell types reprograms them to the default Tm4 fate. We validated these conversions using both morphological features and molecular markers. In addition, we performed scRNA-seq to show that ectopic expression of pdm3 in Tm4 and Tm6 neurons converts them to neurons with transcriptomes that are nearly indistinguishable from that of wild-type Tm2 neurons. We also show that Drgx expression in Tm1 neurons is regulated by Klumpfuss, a TF expressed in stem cells that instructs this fate in progenitors, establishing a link between the regulatory programs that specify neuronal fates and those that implement them. We identified an intronic enhancer in the Drgx locus whose chromatin is specifically accessible in Tm1 neurons and in which Klu motifs are enriched. Genomic deletion of this region knocked down Drgx expression specifically in Tm1 neurons, leaving it intact in the other cell types that normally express it. We further validated this concept by demonstrating that ectopic expression of Vsx (visual system homeobox) genes in Mi15 neurons not only converts them morphologically to Dm2 neurons, but also leads to the loss of their aminergic identity. Our results suggest that selector combinations can be further sculpted by receptor tyrosine kinase signaling after neurogenesis, providing a potential mechanism for postmitotic plasticity of neuronal fates. Finally, we combined our transcriptomic datasets with previously generated chromatin accessibility datasets to understand the mechanisms that control brain wiring downstream of terminal selectors. 
We built predictive computational models of gene regulatory networks using the Inferelator framework. Experimental validations of these networks revealed how selectors interact with ecdysone-responsive TFs to activate a large and specific repertoire of cell surface proteins and other effectors in each neuron at the onset of synapse formation. We showed that these network models can be used to identify downstream effectors that mediate specific cellular decisions during circuit formation. For instance, reduced levels of cut expression in Tm2 neurons, because of its negative regulation by pdm3, control the synaptic layer targeting of their axons. Knockdown of cut in Tm1 neurons is sufficient to redirect their axons to the Tm2 layer in the lobula neuropil without affecting other morphological features. CONCLUSION Our results support a model in which neuronal type identity is primarily determined by a relatively simple code of continuously expressed terminal selector TFs in each cell type throughout development. Our results provide a unified framework of how specific fates are initiated and maintained in postmitotic neurons and open new avenues to understanding synaptic specificity through gene regulatory networks. The conservation of this regulatory logic in both C. elegans and Drosophila makes it likely that the terminal selector concept will also be useful in understanding and manipulating the neuronal diversity of mammalian brains. Terminal selectors enable predictive cell fate reprogramming. Tm1, Tm2, Tm4, and Tm6 neurons of the Drosophila visual system share a core set of TFs continuously expressed by each cell type (simplified). The default Tm4 fate is overridden by the expression of a single additional terminal selector to generate Tm1 (Drgx), Tm2 (pdm3), or Tm6 (SoxN) fates.
  5. Abstract Background Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. Results We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts—twice that of the best current Arabidopsis transcriptome and including over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. Conclusions AtRTD3 is the most comprehensive Arabidopsis transcriptome currently. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species. 