skip to main content

Search for: All records

Creators/Authors contains: "Hou, Shengwei"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes, i.e. viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at various tested sequence lengths ranging from 500 bp to 100 kbps. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass obtained better performance for eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved comparable performance on viral sequence classification with geNomad and VirSorter2 when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.

    more » « less
  2. Bacteria are single-celled organisms that live out their lives at a microscopic scale. We can find bacteria everywhere we look for them, including inside of our own bodies. Bacteria are incredibly diverse and come in many shapes and sizes. They also vary widely in how they live and grow. Some bacteria grow very quickly and others grow slowly. We wanted to measure the growth of many different types of bacteria in the environment. Unfortunately, some species of bacteria are very difficult to grow in the laboratory. To get around this, we designed a method to predict how fast a type of bacteria can grow, just from its DNA. This way, if we have the DNA of a bacterial species, we can measure its growth even if we cannot get it to grow in our laboratory. 
    more » « less
  3. Abstract Motivation

    Phage–host associations play important roles in microbial communities. But in natural communities, as opposed to culture-based lab studies where phages are discovered and characterized metagenomically, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies, we rarely have whole genomes but rather must rely on contigs that are sometimes as short as hundreds of bp long. Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage–host matches based on relatively short contigs, and compare it to previously published VirHostMatcher (VHM) and WIsH.


    On the validation set, ContigNet achieves 72–85% area under the receiver operating characteristic curve (AUROC) scores, compared to the maximum of 68% by VHM or WIsH for contigs of lengths between 200 bps to 50 kbps. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples and achieve 60–70% AUROC scores compared to that of VHM and WIsH of 52%. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts.

    Availability and implementation

    The source code of ContigNet and related datasets can be downloaded from

    more » « less
  4. null (Ed.)
    Maximal growth rate is a basic parameter of microbial lifestyle that varies over several orders of magnitude, with doubling times ranging from a matter of minutes to multiple days. Growth rates are typically measured using laboratory culture experiments. Yet, we lack sufficient understanding of the physiology of most microbes to design appropriate culture conditions for them, severely limiting our ability to assess the global diversity of microbial growth rates. Genomic estimators of maximal growth rate provide a practical solution to survey the distribution of microbial growth potential, regardless of cultivation status. We developed an improved maximal growth rate estimator and predicted maximal growth rates from over 200,000 genomes, metagenome-assembled genomes, and single-cell amplified genomes to survey growth potential across the range of prokaryotic diversity; extensions allow estimates from 16S rRNA sequences alone as well as weighted community estimates from metagenomes. We compared the growth rates of cultivated and uncultivated organisms to illustrate how culture collections are strongly biased toward organisms capable of rapid growth. Finally, we found that organisms naturally group into two growth classes and observed a bias in growth predictions for extremely slow-growing organisms. These observations ultimately led us to suggest evolutionary definitions of oligotrophy and copiotrophy based on the selective regime an organism occupies. We found that these growth classes are associated with distinct selective regimes and genomic functional potentials. 
    more » « less
  5. Abstract

    Growth rates are central to understanding microbial interactions and community dynamics. Metagenomic growth estimators have been developed, specifically codon usage bias (CUB) for maximum growth rates and “peak-to-trough ratio” (PTR) for in situ rates. Both were originally tested with pure cultures, but natural populations are more heterogeneous, especially in individual cell histories pertinent to PTR. To test these methods, we compared predictors with observed growth rates of freshly collected marine prokaryotes in unamended seawater. We prefiltered and diluted samples to remove grazers and greatly reduce virus infection, so net growth approximated gross growth. We sampled over 44 h for abundances and metagenomes, generating 101 metagenome-assembled genomes (MAGs), including Actinobacteria, Verrucomicrobia, SAR406, MGII archaea, etc. We tracked each MAG population by cell-abundance-normalized read recruitment, finding growth rates of 0 to 5.99 per day, the first reported rates for several groups, and used these rates as benchmarks. PTR, calculated by three methods, rarely correlated to growth (r ~−0.26–0.08), except for rapidly growing γ-Proteobacteria (r ~0.63–0.92), while CUB correlated moderately well to observed maximum growth rates (r = 0.57). This suggests that current PTR approaches poorly predict actual growth of most marine bacterial populations, but maximum growth rates can be approximated from genomic characteristics.

    more » « less