Title: Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies
Background A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools.

Methods We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: (1) indels creating early stop codons and new start codons; (2) interruption by a selfish genetic element; and (3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees.

Results We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g., T4), the Studiervirinae (e.g., T7), and the Pbunaviruses (e.g., PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny. The Rephine.r pipeline is provided through GitHub (https://www.github.com/coevoeco/Rephine.r) as a single script for automated analysis and with utility functions to assist in building single-copy core genomes and predicting the sources of fragmented genes.
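
The cluster-merging step described in the Methods can be illustrated with a small sketch. This is not the authors' code (Rephine.r is an R pipeline), and it makes several assumptions: gene clusters are given as a mapping from a cluster ID to (genome, gene) pairs, profile-to-profile hits are given as (cluster, cluster, E-value) triples, and the E-value cutoff of 1e-5 is an arbitrary illustrative choice. The function names `merge_clusters` and `single_copy_core` are hypothetical. The sketch joins clusters linked by significant hits with a union-find structure, then reports which clusters contain exactly one gene from every genome.

```python
from collections import defaultdict


def merge_clusters(clusters, hits, evalue_cutoff=1e-5):
    """Merge gene clusters connected by significant hits (union-find).

    clusters: dict of cluster ID -> list of (genome, gene) pairs
    hits: iterable of (cluster_a, cluster_b, evalue) triples
    evalue_cutoff: illustrative threshold, not taken from the paper
    """
    parent = {cid: cid for cid in clusters}

    def find(cid):
        # Walk up to the root, shortening the path as we go.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b, evalue in hits:
        if evalue <= evalue_cutoff and a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller ID as the root so output is deterministic.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = defaultdict(list)
    for cid, genes in clusters.items():
        merged[find(cid)].extend(genes)
    return dict(merged)


def single_copy_core(clusters, genomes):
    """Return IDs of clusters with exactly one gene from every genome."""
    core = []
    for cid, genes in clusters.items():
        counts = defaultdict(int)
        for genome, _gene in genes:
            counts[genome] += 1
        present_everywhere = set(counts) == set(genomes)
        if present_everywhere and all(n == 1 for n in counts.values()):
            core.append(cid)
    return sorted(core)


# Toy data: GC2 and GC3 are distant homologs split into separate families.
clusters = {
    "GC1": [("phageA", "a1"), ("phageB", "b1"), ("phageC", "c1")],
    "GC2": [("phageA", "a2"), ("phageB", "b2")],
    "GC3": [("phageC", "c3")],
    "GC4": [("phageA", "a4")],
}
hits = [("GC2", "GC3", 1e-12), ("GC1", "GC4", 0.5)]  # second hit is not significant
genomes = ["phageA", "phageB", "phageC"]

before = single_copy_core(clusters, genomes)
merged = merge_clusters(clusters, hits)
after = single_copy_core(merged, genomes)
print(before, after)  # ['GC1'] ['GC1', 'GC2']
```

In the toy data, merging GC2 with GC3 adds one cluster to the single-copy core, which mirrors the kind of recovery reported in the Results. Fusing fragmented gene calls would act on the same counts from the other direction, by turning two partial calls from one genome into a single copy.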
Award ID(s):
1661357
PAR ID:
10374308
Journal Name:
PeerJ
Volume:
9
ISSN:
2167-8359
Page Range / eLocation ID:
e11950
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract CrAssphage is the most abundant human-associated virus and the founding member of a large group of bacteriophages, discovered in animal-associated and environmental metagenomes, that infect bacteria of the phylum Bacteroidetes. We analyze 4907 Circular Metagenome Assembled Genomes (cMAGs) of putative viruses from human gut microbiomes and identify nearly 600 genomes of crAss-like phages that account for nearly 87% of the DNA reads mapped to these cMAGs. Phylogenetic analysis of conserved genes demonstrates the monophyly of crAss-like phages, a putative virus order, and of 5 branches, potential families within that order, two of which have not been identified previously. The phage genomes in one of these families are almost twofold larger than the crAssphage genome (145-192 kilobases), with high density of self-splicing introns and inteins. Many crAss-like phages encode suppressor tRNAs that enable read-through of UGA or UAG stop-codons, mostly, in late phage genes. A distinct feature of the crAss-like phages is the recurrent switch of the phage DNA polymerase type between A and B families. Thus, comparative genomic analysis of the expanded assemblage of crAss-like phages reveals aspects of genome architecture and expression as well as phage biology that were not apparent from the previous work on phage genomics.
  2. Abstract Pan-genome analyses of metagenome-assembled genomes (MAGs) may suffer from the known issues with MAGs: fragmentation, incompleteness and contamination. Here, we conducted a critical assessment of pan-genomics of MAGs, by comparing pan-genome analysis results of complete bacterial genomes and simulated MAGs. We found that incompleteness led to significant core gene (CG) loss. The CG loss remained when using different pan-genome analysis tools (Roary, BPGA, Anvi’o) and when using a mixture of MAGs and complete genomes. Contamination had little effect on core genome size (except for Roary, due to its gene clustering issue) but had a major influence on accessory genomes. Importantly, the CG loss was partially alleviated by lowering the CG threshold and using gene prediction algorithms that consider fragmented genes, but to a lesser degree when incompleteness was higher than 5%. The CG loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees. Our main findings were supported by a study of real MAG-isolate genome data. We conclude that lowering the CG threshold and predicting genes in metagenome mode (as Anvi’o does with Prodigal) are necessary in pan-genome analysis of MAGs. Development of new pan-genome analysis tools specifically for MAGs is needed in future studies.
  3. ABSTRACT Psocodea (booklice and parasitic lice) is an order of insects containing species with extensive mitochondrial genome rearrangements, particularly within the suborder Troctomorpha, in which some species possess an extremely fragmented mitochondrial genome with several small minichromosomes. In the remaining suborders of Psocodea, there are groups with the ancestral pancrustacean arrangement, quite extensive rearrangements (e.g. Trogiomorpha), or in which the small number of species analysed to date have rearrangements of only a few protein‐coding genes and/or tRNAs (e.g. Psocomorpha). Despite the apparent high rate of rearrangements in the order as a whole, a small number of complete mitochondrial genomes are available, especially for suborder Psocomorpha, the largest free‐living suborder. To understand the evolution of the gene arrangement of the mitochondrial genome within Psocomorpha and its phylogenetic implications, we assembled and analysed the mitochondrial genomes of 33 species of bark lice belonging to nine families in two infraorders. Within the infraorder Homilopsocidea, four families were analysed, mainly from Lachesillidae (which included 22 species of this family). Within the infraorder Caeciliusetae, seven species representing five families were analysed. Mitochondrial gene rearrangements were identified in seven of the nine families. Some of these rearrangements were unique to a single species, while some contained phylogenetic signal, being shared by related species. These rearrangements typically corresponded to transpositions and inversions of tRNAs, possibly caused by tandem duplication–random loss (TDRL) and/or recombination events. Phylogenetic analyses of mitochondrial gene sequences provided phylogenetic resolution for several branches of the tree, including monophyly of Lachesillinae. The genus Hemicaecilius Enderlein was found to be embedded within the genus Lachesilla Westwood, rendering the latter paraphyletic. Monophyly was also never recovered for Lachesillidae and Elipsocidae as currently defined. However, instability was observed for some higher level relationships within Psocomorpha, including the relationships among the major clades of Lachesillidae.
  4. ABSTRACT Bacteriophages are the most abundant and diverse biological entities on the planet, and new phage genomes are being discovered at a rapid pace. As more phage genomes are published, new methods are needed for placing these genomes in an ecological and evolutionary context. Phages are difficult to study by phylogenetic methods, because they exchange genes regularly, and no single gene is conserved across all phages. Here, we demonstrate how gene-level networks can provide a high-resolution view of phage genetic diversity and offer a novel perspective on virus ecology. We focus our analyses on virus host range and show how network topology corresponds to host relatedness, how to find groups of genes with the strongest host-specific signatures, and how this perspective can complement phage host prediction tools. We discuss extensions of gene network analysis to predicting the emergence of phages on new hosts, as well as applications to features of phage biology beyond host range. IMPORTANCE Bacteriophages (phages) are viruses that infect bacteria, and they are critical drivers of bacterial evolution and community structure. It is generally difficult to study phages by using tree-based methods, because gene exchange is common, and no single gene is shared among all phages. Instead, networks offer a means to compare phages while placing them in a broader ecological and evolutionary context. In this work, we build a network that summarizes gene sharing across phages and test how a key constraint on phage ecology, host range, corresponds to the structure of the network. We find that the network reflects the relatedness among phage hosts, and phages with genes that are closer in the network are likelier to infect similar hosts. This approach can also be used to identify genes that affect host range, and we discuss possible extensions to analyze other aspects of viral ecology. 
  5. Summary White oak (Quercus alba) is an abundant forest tree species across eastern North America that is ecologically, culturally, and economically important. We report the first haplotype‐resolved chromosome‐scale genome assembly of Q. alba and conduct comparative analyses of genome structure and gene content against other published Fagaceae genomes. We investigate the genetic diversity of this widespread species and the phylogenetic relationships among oaks using whole genome data. Despite strongly conserved chromosome synteny and genome size across Quercus, certain gene families have undergone rapid changes in size, including defense genes. Unbiased annotation of resistance (R) genes across oaks revealed that the overall number of R genes is similar across species – as are the chromosomal locations of R gene clusters – but, gene number within clusters is more labile. We found that Q. alba has high genetic diversity, much of which predates its divergence from other oaks and likely impacts divergence time estimations. Our phylogenetic results highlight widespread phylogenetic discordance across the genus. The white oak genome represents a major new resource for studying genome diversity and evolution in Quercus. Additionally, we show that unbiased gene annotation is key to accurately assessing R gene evolution in Quercus.