
Title: SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes
GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s Refseq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the Refseq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: .
Rzhetsky, Andrey
Journal Name:
PLOS Computational Biology
Sponsoring Org:
National Science Foundation
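The abstract above describes GC skew only in words. As a minimal sketch, the widely used windowed definition, (G − C) / (G + C), can be computed as follows. This is not the SkewIT method or the SkewI metric, whose calculation the abstract does not specify; the function names, the window size, and the toy sequence are all hypothetical choices made for illustration.

```python
def gc_skew_windows(seq, window=1000):
    """Return GC skew, (G - C) / (G + C), for consecutive non-overlapping windows.

    Assumption: this is the generic textbook definition of GC skew,
    not the SkewI metric from the paper. Trailing bases that do not
    fill a whole window are ignored.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 for it.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_skew(skews):
    """Return the running sum of per-window skew values."""
    total = 0.0
    running = []
    for value in skews:
        total += value
        running.append(total)
    return running


if __name__ == "__main__":
    # Hypothetical toy sequence: a G-rich half followed by a C-rich half.
    toy = "G" * 20 + "C" * 20
    per_window = gc_skew_windows(toy, window=10)
    print(per_window)
    print(cumulative_skew(per_window))
```

In a real chromosome the sign of the per-window skew tends to flip near the replication origin and terminus, so the cumulative curve typically shows one extremum at each; a sequence with additional sign changes would be a candidate for closer inspection.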
More Like this
  1. Harwood, Caroline S. (Ed.)
    ABSTRACT The recent leveraging of genome-resolved metagenomics has generated an enormous number of genomes from novel uncultured microbial lineages yet left many clades undescribed. Here, we present a global analysis of genomes belonging to Binatota (UBP10), a globally distributed, yet-uncharacterized bacterial phylum. All orders in Binatota encoded the capacity for aerobic methylotrophy using methanol, methylamine, sulfomethanes, and chloromethanes as the substrates. Methylotrophy in Binatota was characterized by order-specific substrate degradation preferences, as well as extensive metabolic versatility, i.e., the utilization of diverse sets of genes, pathways, and combinations to achieve a specific metabolic goal. The genomes also encoded multiple alkane hydroxylases and monooxygenases, potentially enabling growth on a wide range of alkanes and fatty acids. Pigmentation is inferred from a complete pathway for carotenoids (lycopene, β- and γ-carotenes, xanthins, chlorobactenes, and spheroidenes) production. Further, the majority of genes involved in bacteriochlorophyll a, c, and d biosynthesis were identified, although absence of key genes and failure to identify a photosynthetic reaction center preclude proposing phototrophic capacities. Analysis of 16S rRNA databases showed the preferences of Binatota to terrestrial and freshwater ecosystems, hydrocarbon-rich habitats, and sponges, supporting their potential role in mitigating methanol and methane emissions, breakdown of alkanes, and their association with sponges. Our results expand the lists of methylotrophic, aerobic alkane-degrading, and pigment-producing lineages. We also highlight the consistent encountering of incomplete biosynthetic pathways in microbial genomes, a phenomenon necessitating careful assessment when assigning putative functions based on a set-threshold of pathway completion.
IMPORTANCE A wide range of microbial lineages remain uncultured, yet little is known regarding their metabolic capacities, physiological preferences, and ecological roles in various ecosystems. We conducted a thorough comparative genomic analysis of 108 genomes belonging to the Binatota (UBP10), a globally distributed, yet-uncharacterized bacterial phylum. We present evidence that members of the order Binatota specialize in methylotrophy and identify an extensive repertoire of genes and pathways mediating the oxidation of multiple one-carbon (C1) compounds in Binatota genomes. The occurrence of multiple alkane hydroxylases and monooxygenases in these genomes was also identified, potentially enabling growth on a wide range of alkanes and fatty acids. Pigmentation is inferred from a complete pathway for carotenoids production. We also report on the presence of incomplete chlorophyll biosynthetic pathways in all genomes and propose several evolutionary-grounded scenarios that could explain such a pattern. Assessment of the ecological distribution patterns of the Binatota indicates preference of its members to terrestrial and freshwater ecosystems characterized by high methane and methanol emissions, as well as multiple hydrocarbon-rich habitats and marine sponges.
  2. Abstract
    Excessive phosphorus (P) applications to croplands can contribute to eutrophication of surface waters through surface runoff and subsurface (leaching) losses. We analyzed leaching losses of total dissolved P (TDP) from no-till corn, hybrid poplar (Populus nigra X P. maximowiczii), switchgrass (Panicum virgatum), miscanthus (Miscanthus giganteus), native grasses, and restored prairie, all planted in 2008 on former cropland in Michigan, USA. All crops except corn (13 kg P ha−1 year−1) were grown without P fertilization. Biomass was harvested at the end of each growing season except for poplar. Soil water at 1.2 m depth was sampled weekly to biweekly for TDP determination during March–November 2009–2016 using tension lysimeters. Soil test P (0–25 cm depth) was measured every autumn. Soil water TDP concentrations were usually below levels where eutrophication of surface waters is frequently observed (> 0.02 mg L−1) but often higher than in deep groundwater or nearby streams and lakes. Rates of P leaching, estimated from measured concentrations and modeled drainage, did not differ statistically among cropping systems across years; 7-year cropping system means ranged from 0.035 to 0.072 kg P ha−1 year−1 with large interannual variation. Leached P was positively related to STP, which decreased over the 7 years in all systems. These results indicate that both P-fertilized and unfertilized cropping systems may …
  3. Abstract Background

    To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment.


    EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study.


    EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species.

  4. Abstract

    Reverse transcriptases (RTs) are found in different systems including group II introns, Diversity Generating Retroelements (DGRs), retrons, CRISPR-Cas systems, and Abortive Infection (Abi) systems in prokaryotes. Different classes of RTs can play different roles, such as template switching and mobility in group II introns, spacer acquisition in CRISPR-Cas systems, mutagenic retrohoming in DGRs, programmed cell suicide in Abi systems, and recently discovered phage defense in retrons. While some classes of RTs have been studied extensively, others remain to be characterized. There is a lack of computational tools for identifying and characterizing various classes of RTs. In this study, we built a tool (called myRT) for identification and classification of prokaryotic RTs. In addition, our tool provides information about the genomic neighborhood of each RT, providing potential functional clues. We applied our tool to predict RTs in all complete and draft bacterial genomes, and created a collection that can be used for exploration of putative RTs and their associated protein domains. Application of myRT to metagenomes showed that gut metagenomes encode proportionally more RTs related to DGRs, outnumbering retron-related RTs, as compared to the collection of reference genomes. MyRT is both available as a standalone software ( and also through a website (

  5. Abstract

    The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at