
Title: Identification and classification of reverse transcriptases in bacterial genomes and metagenomes

Reverse transcriptases (RTs) are found in different systems including group II introns, Diversity Generating Retroelements (DGRs), retrons, CRISPR-Cas systems, and Abortive Infection (Abi) systems in prokaryotes. Different classes of RTs can play different roles, such as template switching and mobility in group II introns, spacer acquisition in CRISPR-Cas systems, mutagenic retrohoming in DGRs, programmed cell suicide in Abi systems, and recently discovered phage defense in retrons. While some classes of RTs have been studied extensively, others remain to be characterized. There is a lack of computational tools for identifying and characterizing various classes of RTs. In this study, we built a tool (called myRT) for identification and classification of prokaryotic RTs. In addition, our tool provides information about the genomic neighborhood of each RT, providing potential functional clues. We applied our tool to predict RTs in all complete and draft bacterial genomes, and created a collection that can be used for exploration of putative RTs and their associated protein domains. Application of myRT to metagenomes showed that gut metagenomes encode proportionally more RTs related to DGRs, outnumbering retron-related RTs, as compared to the collection of reference genomes. MyRT is available both as standalone software ( and through a website (

Publication Date:
Journal Name: Nucleic Acids Research
Page Range or eLocation-ID: p. e29-e29
Publisher: Oxford University Press
Sponsoring Org: National Science Foundation
More Like this
  1. Chia, Nicholas (Ed.)
    ABSTRACT A diversity of clustered regularly interspaced short palindromic repeat (CRISPR)-Cas systems provide adaptive immunity to bacteria and archaea through recording “memories” of past viral infections. Recently, many novel CRISPR-associated proteins have been discovered via computational studies, but those studies relied on biased and incomplete databases of assembled genomes. We avoided these biases and applied a network theory approach to search for novel CRISPR-associated genes by leveraging subtle ecological cooccurrence patterns identified from environmental metagenomes. We validated our method using existing annotations and discovered 32 novel CRISPR-associated gene families. These genes span a range of putative functions, with many potentially regulating the response to infection. IMPORTANCE Every branch on the tree of life, including microbial life, faces the threat of viral pathogens. Over the course of billions of years of coevolution, prokaryotes have evolved a great diversity of strategies to defend against viral infections. One of these is the CRISPR adaptive immune system, which allows microbes to “remember” past infections in order to better fight them in the future. There has been much interest among molecular biologists in CRISPR immunity because this system can be repurposed as a tool for precise genome editing. Recently, a number of comparative genomics approaches have been used to detect novel CRISPR-associated genes in databases of genomes with great success, potentially leading to the development of new genome-editing tools. Here, we developed novel methods to search for these distinct classes of genes directly in environmental samples (“metagenomes”), thus capturing a more complete picture of the natural diversity of CRISPR-associated genes.
  2. Abstract

    There is an increasing interest in the clustered regularly interspaced short palindromic repeats–CRISPR-associated protein (CRISPR-Cas) system to reveal potential virus–host dynamics. The universal and most conserved Cas protein, cas1, is an ideal marker to elucidate CRISPR-Cas ecology. We constructed eight Hidden Markov Models (HMMs) and assembled cas1 directly from metagenomes by a targeted-gene assembler, Xander, to improve detection capacity and resolve the diverse CRISPR-Cas systems. The eight HMMs were first validated by recovering all 17 cas1 subtypes from the simulated metagenome generated from 91 prokaryotic genomes across 11 phyla. We challenged the targeted method with 48 metagenomes from a tallgrass prairie in Central Oklahoma, recovering 3394 cas1. Among those, 88 were near full length, 5 times more than in de novo assemblies from the Oklahoma metagenomes. To validate the host assignment by cas1, the targeted-assembled cas1 was mapped to the de novo assembled contigs. All the phylum assignments of those mapped contigs were assigned independent of CRISPR-Cas genes on the same contigs and were consistent with the host taxonomies predicted by the mapped cas1. We then investigated whether 8 years of soil warming altered cas1 prevalence within the communities. A shift in microbial abundances was observed during the year with the biggest temperature differential (mean 4.16 °C above ambient). cas1 prevalence increased, even in the phyla with decreased microbial abundances, over the next 3 years, suggesting increasing virus–host interactions in response to soil warming. This targeted method provides an alternative means to effectively mine cas1 from metagenomes and uncover the host communities.

  3. Abstract Background

    CRISPR-Cas (clustered regularly interspaced short palindromic repeats—CRISPR-associated proteins) systems are adaptive immune systems commonly found in prokaryotes that provide sequence-specific defense against invading mobile genetic elements (MGEs). The memory of these immunological encounters is stored in CRISPR arrays, where spacer sequences record the identity and history of past invaders. Analyzing such CRISPR arrays provides insights into the dynamics of CRISPR-Cas systems and the adaptation of their host bacteria to rapidly changing environments such as the human gut.


    In this study, we utilized 601 publicly available Bacteroides fragilis genome isolates from 12 healthy individuals, 6 of which include longitudinal observations, and 222 available B. fragilis reference genomes to update the understanding of B. fragilis CRISPR-Cas dynamics and their differential activities. Analysis of longitudinal genomic data showed that some CRISPR array structures remained relatively stable over time whereas others involved radical spacer acquisition during some periods, and diverse CRISPR arrays (associated with multiple isolates) co-existed in the same individuals, with some persisting over time. Furthermore, features of CRISPR adaptation, evolution, and microdynamics were highlighted through an analysis of the host-MGE network, such as modules of multiple MGEs and hosts, reflecting complex interactions between B. fragilis and its invaders mediated through the CRISPR-Cas systems.


    We made all annotated CRISPR-Cas systems, their target MGEs, and their interaction network available as a web resource at We anticipate it will become an important resource for studying B. fragilis, its CRISPR-Cas systems, and its interaction with mobile genetic elements, providing insights into evolutionary dynamics that may shape the species' virulence and lead to its pathogenicity.

  4. ABSTRACT Anti-CRISPR (Acr) loci/operons encode Acr proteins and Acr-associated (Aca) proteins. Forty-five Acr families have been experimentally characterized inhibiting seven subtypes of CRISPR-Cas systems. We have developed a bioinformatics pipeline to identify genomic loci containing Acr homologs and/or Aca homologs by combining three computational approaches: homology, guilt-by-association, and self-targeting spacers. Homology search found thousands of Acr homologs in bacterial and viral genomes, but most are homologous to AcrIIA7 and AcrIIA9. Investigating the gene neighborhood of these Acr homologs revealed that only a small percentage (23.0% in bacteria and 8.2% in viruses) of them have neighboring Aca homologs and thus form Acr-Aca operons. Surprisingly, although a self-targeting spacer is a strong indicator of the presence of Acr genes in a genome, a large percentage of Acr-Aca loci are found in bacterial genomes without self-targeting spacers or even without complete CRISPR-Cas systems. Additionally, for Acr homologs from genomes with self-targeting spacers, homology-based Acr family assignments do not always agree with the self-targeting CRISPR-Cas subtypes. Last, by investigating Acr genomic loci coexisting with self-targeting spacers in the same genomes, five known subtypes (I-C, I-E, I-F, II-A, and II-C) and five new subtypes (I-B, III-A, III-B, IV-A, and V-U4) of Acrs were inferred. Based on these findings, we conclude that the discovery of new anti-CRISPRs should not be restricted to genomes with self-targeting spacers and loci with Acr homologs. The evolutionary arms race of CRISPR-Cas systems and anti-CRISPR systems may have driven the adaptive and rapid gain and loss of these elements in closely related genomes.
    IMPORTANCE As a naturally occurring adaptive immune system, CRISPR-Cas (clustered regularly interspersed short palindromic repeats–CRISPR-associated genes) systems are widely found in bacteria and archaea to defend against viruses. Since 2013, the application of various bacterial CRISPR-Cas systems has become very popular due to their development into targeted and programmable genome engineering tools with the ability to edit almost any genome. As the natural off-switch of CRISPR-Cas systems, anti-CRISPRs have a great potential to serve as regulators of CRISPR-Cas tools and enable safer and more controllable genome editing. This study will help understand the relative usefulness of the three bioinformatics approaches for new Acr discovery, as well as guide the future development of new bioinformatics tools to facilitate anti-CRISPR research. The thousands of Acr homologs and hundreds of new anti-CRISPR loci identified in this study will be a valuable data resource for genome engineers to search for new CRISPR-Cas regulators.
  5. Abstract CRISPR–Cas is an anti-viral mechanism of prokaryotes that has been widely adopted for genome editing. To make CRISPR–Cas genome editing more controllable and safer to use, anti-CRISPR proteins have been recently exploited to prevent excessive/prolonged Cas nuclease cleavage. Anti-CRISPR (Acr) proteins are encoded by (pro)phages/(pro)viruses, and have the ability to inhibit their host's CRISPR–Cas systems. We have built an online database AcrDB ( by scanning ∼19 000 genomes of prokaryotes and viruses with AcrFinder, a recently developed Acr-Aca (Acr-associated regulator) operon prediction program. Proteins in Acr-Aca operons were further processed by two machine learning-based programs (AcRanker and PaCRISPR) to obtain numerical scores/ranks. Compared to other anti-CRISPR databases, AcrDB has the following unique features: (i) It is a genome-scale database with the largest collection of data (39 799 Acr-Aca operons containing Aca or Acr homologs); (ii) It offers a user-friendly web interface with various functions for browsing, graphically viewing, searching, and batch downloading Acr-Aca operons; (iii) It focuses on the genomic context of Acr and Aca candidates instead of individual Acr protein family and (iv) It collects data with three independent programs each having a unique data mining algorithm for cross validation. AcrDB will be a valuable resource to the anti-CRISPR research community.