Title: Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies
Background A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools.

Methods We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: (1) indels creating early stop codons and new start codons; (2) interruption by a selfish genetic element; and (3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees.

Results We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g., T4), the Studiervirinae (e.g., T7), and the Pbunaviruses (e.g., PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny.
The Rephine.r pipeline is provided through GitHub ( ) as a single script for automated analysis and with utility functions to assist in building single-copy core genomes and predicting the sources of fragmented genes.
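
The cluster-merging step described in the Methods, followed by the single-copy core inference, can be illustrated with a minimal sketch. This is not Rephine.r code (the pipeline itself is written in R). It is a Python illustration under stated assumptions: the function names `merge_clusters` and `single_copy_core`, the input layout (cluster name mapped to a list of genome and gene pairs), the hit tuples, and the `max_evalue` cutoff are all hypothetical choices made for the example. Clusters linked by a significant profile hit are joined with a union-find structure, and a cluster counts as single-copy core when every genome contributes exactly one gene.

```python
from collections import defaultdict


def merge_clusters(clusters, hits, max_evalue=1e-5):
    """Merge gene clusters that are linked by significant profile hits.

    clusters: dict mapping cluster name -> list of (genome, gene_id) pairs.
    hits: iterable of (query_cluster, target_cluster, evalue) tuples.
    max_evalue: hypothetical significance cutoff, chosen for illustration only.
    """
    parent = {name: name for name in clusters}

    def find(name):
        # Follow parent links to the representative, halving the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for query, target, evalue in hits:
        if query not in parent or target not in parent:
            continue  # ignore hits that mention unknown clusters
        if evalue <= max_evalue:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                # Keep the alphabetically smaller name as the representative.
                parent[max(root_q, root_t)] = min(root_q, root_t)

    merged = defaultdict(list)
    for name, genes in clusters.items():
        merged[find(name)].extend(genes)
    return dict(merged)


def single_copy_core(clusters, genomes):
    """Return sorted names of clusters with exactly one gene in every genome."""
    core = []
    for name, genes in clusters.items():
        counts = defaultdict(int)
        for genome, _gene_id in genes:
            counts[genome] += 1
        if all(counts[g] == 1 for g in genomes):
            core.append(name)
    return sorted(core)
```

In this toy setting, a gene family that was split into two clusters (one holding two genomes, the other holding the third) fails the single-copy core test until a significant hit merges the two clusters, which mirrors how merging distant homologs can recover additional core genes.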
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A significant amount of literature is available on biocorrosion, which makes manual extraction of crucial information such as genes and proteins a laborious task. Despite the fast growth of biology related corrosion studies, there is a limited number of gene collections relating to the corrosion process (biocorrosion). Text mining offers a potential solution by automatically extracting the essential information from unstructured text. We present a text mining workflow that extracts biocorrosion associated genes/proteins in sulfate-reducing bacteria (SRB) from literature databases (e.g., PubMed and PMC). This semi-automatic workflow is built with the Named Entity Recognition (NER) method and Convolutional Neural Network (CNN) model. With PubMed and PMCID as inputs, the workflow identified 227 genes belonging to several Desulfovibrio species. To validate their functions, Gene Ontology (GO) enrichment and biological network analysis was performed using UniprotKB and STRING-DB, respectively. The GO analysis showed that metal ion binding, sulfur binding, and electron transport were among the principal molecular functions. Furthermore, the biological network analysis generated three interlinked clusters containing genes involved in metal ion binding, cellular respiration, and electron transfer, which suggests the involvement of the extracted gene set in biocorrosion. Finally, the dataset was validated through manual curation, yielding a similar set of genes as our workflow; among these, hysB and hydA, and sat and dsrB were identified as the metal ion binding and sulfur metabolism genes, respectively. The identified genes were mapped with the pangenome of 63 SRB genomes that yielded the distribution of these genes across 63 SRB based on the amino acid sequence similarity and were further categorized as core and accessory gene families. 
SRB’s role in biocorrosion involves the transfer of electrons from the metal surface via a hydrogen medium to the sulfate reduction pathway. Therefore, genes encoding hydrogenases and cytochromes might be participating in removing hydrogen from the metals through electron transfer. Moreover, the production of corrosive sulfide from the sulfur metabolism indirectly contributes to the localized pitting of the metals. After the corroboration of text mining results with SRB biocorrosion mechanisms, we suggest that the text mining framework could be utilized for genes/proteins extraction and significantly reduce the manual curation time. 
  2.
    Abstract CrAssphage is the most abundant human-associated virus and the founding member of a large group of bacteriophages, discovered in animal-associated and environmental metagenomes, that infect bacteria of the phylum Bacteroidetes. We analyze 4907 Circular Metagenome Assembled Genomes (cMAGs) of putative viruses from human gut microbiomes and identify nearly 600 genomes of crAss-like phages that account for nearly 87% of the DNA reads mapped to these cMAGs. Phylogenetic analysis of conserved genes demonstrates the monophyly of crAss-like phages, a putative virus order, and of 5 branches, potential families within that order, two of which have not been identified previously. The phage genomes in one of these families are almost twofold larger than the crAssphage genome (145-192 kilobases), with high density of self-splicing introns and inteins. Many crAss-like phages encode suppressor tRNAs that enable read-through of UGA or UAG stop-codons, mostly in late phage genes. A distinct feature of the crAss-like phages is the recurrent switch of the phage DNA polymerase type between A and B families. Thus, comparative genomic analysis of the expanded assemblage of crAss-like phages reveals aspects of genome architecture and expression as well as phage biology that were not apparent from the previous work on phage genomics.

  3. Psocodea (booklice and parasitic lice) is an order of insects containing species with extensive mitochondrial genome rearrangements, particularly within the suborder Troctomorpha, in which some species possess an extremely fragmented mitochondrial genome with several small minichromosomes. In the remaining suborders of Psocodea, there are groups with the ancestral pancrustacean arrangement, quite extensive rearrangements (e.g. Trogiomorpha), or in which the small number of species analysed to date have rearrangements of only a few protein‐coding genes and/or tRNAs (e.g. Psocomorpha). Despite the apparent high rate of rearrangements in the order as a whole, a small number of complete mitochondrial genomes are available, especially for suborder Psocomorpha, the largest free‐living suborder. To understand the evolution of the gene arrangement of the mitochondrial genome within Psocomorpha and its phylogenetic implications, we assembled and analysed the mitochondrial genomes of 33 species of bark lice belonging to nine families in two infraorders. Within the infraorder Homilopsocidea, four families were analysed, mainly from Lachesillidae (which included 22 species of this family). Within the infraorder Caeciliusetae, seven species representing five families were analysed. Mitochondrial gene rearrangements were identified in seven of the nine families. Some of these rearrangements were unique to a single species, while some contained phylogenetic signal, being shared by related species. These rearrangements typically corresponded to transpositions and inversions of tRNAs, possibly caused by tandem duplication–random loss (TDRL) and/or recombination events. Phylogenetic analyses of mitochondrial gene sequences provided phylogenetic resolution for several branches of the tree, including monophyly of Lachesillinae. The genus Hemicaecilius Enderlein was found to be embedded within the genus Lachesilla Westwood, rendering the latter paraphyletic.
Monophyly was also never recovered for Lachesillidae and Elipsocidae as currently defined. However, instability was observed for some higher level relationships within Psocomorpha, including the relationships among the major clades of Lachesillidae.

  4. Rappe, Michael S. (Ed.)
    ABSTRACT For the abundant marine Alphaproteobacterium Pelagibacter (SAR11), and other bacteria, phages are powerful forces of mortality. However, little is known about the most abundant Pelagiphages in nature, such as the widespread HTVC023P-type, which is currently represented by two cultured phages. Using viral metagenomic data sets and fluorescence-activated cell sorting, we recovered 80 complete, undescribed Podoviridae genomes that form 10 phylogenomically distinct clades (herein, named Clades I to X) related to the HTVC023P-type. These expanded the HTVC023P-type pan-genome by 15-fold and revealed 41 previously unknown auxiliary metabolic genes (AMGs) in this viral lineage. Numerous instances of partner-AMGs (colocated and involved in related functions) were observed, including partners in nucleotide metabolism, DNA hypermodification, and Curli biogenesis. The Type VIII secretion system (T8SS) responsible for Curli biogenesis was identified in nine genomes and expanded the repertoire of T8SS proteins reported thus far in viruses. Additionally, the identified T8SS gene cluster contained an iron-dependent regulator (FecR), as well as a histidine kinase and adenylate cyclase that can be implicated in T8SS function but are not within T8SS operons in bacteria. While T8SS are lacking in known Pelagibacter, they contribute to aggregation and biofilm formation in other bacteria. Phylogenetic reconstructions of partner-AMGs indicate derivation from cellular lineages with a more recent transfer between viral families. For example, homologs of all T8SS genes are present in syntenic regions of distant Myoviridae Pelagiphages, and they appear to have alphaproteobacterial origins with a later transfer between viral families. The results point to an unprecedented multipartner-AMG transfer between marine Myoviridae and Podoviridae.
Together with the expansion of known metabolic functions, our studies provide new prospects for understanding the ecology and evolution of marine phages and their hosts. IMPORTANCE One of the most abundant and diverse marine bacterial groups is Pelagibacter. Phages have roles in shaping Pelagibacter ecology; however, several Pelagiphage lineages are represented by only a few genomes. This paucity of data from even the most widespread lineages has imposed limits on the understanding of the diversity of Pelagiphages and their impacts on hosts. Here, we report 80 complete genomes, assembled directly from environmental data, which are from undescribed Pelagiphages and render new insights into the manipulation of host metabolism during infection. Notably, the viruses have functionally related partner genes that appear to be transferred between distant viruses, including a suite that encodes a secretion system which both brings a new functional capability to the host and is abundant in phages across the ocean. Together, these functions have important implications for phage evolution and for how Pelagiphage infection influences host biology in manners extending beyond canonical viral lysis and mortality.
  5. Ferns are the second largest clade of vascular plants with over 10,000 species, yet the generation of genomic resources for the group has lagged behind other major clades of plants. Transcriptomic data have proven to be a powerful tool to assess phylogenetic relationships, using thousands of markers that are largely conserved across the genome, and without the need to sequence entire genomes. We assembled the largest nuclear phylogenetic dataset for ferns to date, including 2884 single-copy nuclear loci from 247 transcriptomes (242 ferns, five outgroups), and investigated phylogenetic relationships across the fern tree, the placement of whole genome duplications (WGDs), and gene retention patterns following WGDs. We generated a well-supported phylogeny of ferns and identified several regions of the fern phylogeny that demonstrate high levels of gene tree–species tree conflict, which largely correspond to areas of the phylogeny that have been difficult to resolve. Using a combination of approaches, we identified 27 WGDs across the phylogeny, including 18 large-scale events (involving more than one sampled taxon) and nine small-scale events (involving only one sampled taxon). Most inferred WGDs occur within single lineages (e.g., orders, families) rather than on the backbone of the phylogeny, although two inferred events are shared by leptosporangiate ferns (excluding Osmundales) and Polypodiales (excluding Lindsaeineae and Saccolomatineae), clades which correspond to the majority of fern diversity. We further examined how retained duplicates following WGDs compared across independent events and found that functions of retained genes were largely convergent, with processes involved in binding, responses to stimuli, and certain organelles over-represented in paralogs while processes involved in transport, organelles derived from endosymbiotic events, and signaling were under-represented. 
To date, our study is the most comprehensive investigation of the nuclear fern phylogeny, though several avenues for future research remain unexplored. 