Title: Model-based identification of conditionally-essential genes from transposon-insertion sequencing data
The understanding of bacterial gene function has been greatly enhanced by recent advancements in the deep sequencing of microbial genomes. Transposon insertion sequencing methods combine next-generation sequencing techniques with transposon mutagenesis to explore the essentiality of genes under different environmental conditions. We propose a model-based method that uses regularized negative binomial regression to estimate the change in transposon insertions attributable to gene-environment changes in this genetic interaction study without transformations or uniform normalization. An empirical Bayes model for estimating the local false discovery rate combines unique and total count information to test for genes that show a statistically significant change in transposon counts. When applied to RB-TnSeq (randomized barcode transposon sequencing) and Tn-seq (transposon sequencing) libraries made in strains of Caulobacter crescentus, using both total and unique count data, the model was able to identify a set of conditionally beneficial or conditionally detrimental genes for each target condition that shed light on their functions and roles during various stress conditions.
Award ID(s):
1934846
NSF-PAR ID:
10357958
Editor(s):
Segata, Nicola
Date Published:
Journal Name:
PLOS Computational Biology
Volume:
18
Issue:
3
ISSN:
1553-7358
Page Range / eLocation ID:
e1009273
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research. Yet, the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such “problem area” is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows probing of almost the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on survival and growth under specific conditions provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data. Among the variety of options available, it is often difficult to identify which one can provide a reliable and reproducible analysis.

    Results

    Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures, such as determining the optimal tool parameters for the analysis and removal of contamination.

    Conclusions

    Our work provides an assessment of the currently available tools for TIS data analysis. It offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform (https://usegalaxy.org). To lower the entry barriers, we have also developed interactive tutorials explaining details of TIS data analysis procedures at https://bit.ly/gxy-tis.
  2. ABSTRACT Microbial communities are shaped by interactions among their constituent members. Some Gram-negative bacteria employ type VI secretion systems (T6SSs) to inject protein toxins into neighboring cells. These interactions have been theorized to affect the composition of host-associated microbiomes, but the role of T6SSs in the evolution of gut communities is not well understood. We report the discovery of two T6SSs and numerous T6SS-associated Rhs toxins within the gut bacteria of honey bees and bumble bees. We sequenced the genomes of 28 strains of Snodgrassella alvi, a characteristic bee gut microbe, and found tremendous variability in their Rhs toxin complements: altogether, these strains appear to encode hundreds of unique toxins. Some toxins are shared with Gilliamella apicola, a coresident gut symbiont, implicating horizontal gene transfer as a source of toxin diversity in the bee gut. We use data from a transposon mutagenesis screen to identify toxins with antibacterial function in the bee gut and validate the function and specificity of a subset of these toxin and immunity genes in Escherichia coli. Using transcriptome sequencing, we demonstrate that S. alvi T6SSs and associated toxins are upregulated in the gut environment. We find that S. alvi Rhs loci have a conserved architecture, consistent with the C-terminal displacement model of toxin diversification, with Rhs toxins, toxin fragments, and cognate immunity genes that are expressed and confer strong fitness effects in vivo. Our findings of T6SS activity and Rhs toxin diversity suggest that T6SS-mediated competition may be an important driver of coevolution within the bee gut microbiota.

    IMPORTANCE The structure and composition of host-associated bacterial communities are of broad interest, because these communities affect host health. Bees have a simple, conserved gut microbiota, which provides an opportunity to explore interactions between species that have coevolved within their host over millions of years. This study examined the role of type VI secretion systems (T6SSs), protein complexes used to deliver toxic proteins into bacterial competitors, within the bee gut microbiota. We identified two T6SSs and diverse T6SS-associated toxins in bacterial strains from bees. Expression of these genes is increased in bacteria in the bee gut, and toxin and immunity genes demonstrate antibacterial and protective functions, respectively, when expressed in Escherichia coli. Our results suggest that coevolution among bacterial species in the bee gut has favored toxin diversification and maintenance of T6SS machinery, and demonstrate the importance of antagonistic interactions within host-associated microbial communities.
  3. Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in individual cells. Due to technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard analysis methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc combines network-regularized non-negative matrix factorization with a procedure for handling zero inflation in transcript count matrices. The matrix factorization results in a low-dimensional representation of the transcript count matrix, which imputes gene abundance for both zero and non-zero entries and can be used to cluster cells. The network regularization leverages prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be close in the low-dimensional representation. We show that netNMF-sc outperforms existing methods on simulated and real scRNA-seq data, with increasing advantage at higher dropout rates (e.g. above 60%). Furthermore, we show that the results from netNMF-sc, including estimation of gene-gene covariance, are robust to choice of network, with more representative networks leading to greater performance gains.
  4. Abstract

    Bacterial genomes evolve in complex ecosystems and are best understood in this natural context, but replicating such conditions in the lab is challenging. We used transposon sequencing to define the fitness consequences of gene disruption in the bacterium Caulobacter crescentus grown in natural freshwater, compared with axenic growth in common laboratory media. Gene disruptions in amino-acid and nucleotide sugar biosynthesis pathways and in metabolic substrate transport machinery impaired fitness in both lake water and defined minimal medium relative to complex peptone broth. Fitness in lake water was enhanced by insertions in genes required for flagellum biosynthesis and reduced by insertions in genes involved in biosynthesis of the holdfast surface adhesin. We further uncovered numerous hypothetical and uncharacterized genes for which disruption impaired fitness in lake water, defined minimal medium, or both. At the genome scale, the fitness profile of mutants cultivated in lake water was more similar to that in complex peptone broth than in defined minimal medium. Microfiltration of lake water did not significantly affect the terminal cell density or the fitness profile of the transposon mutant pool, suggesting that Caulobacter does not strongly interact with other microbes in this ecosystem on the measured timescale. Fitness of select mutants with defects in cell surface biosynthesis and environmental sensing was significantly more variable across days in lake water than in defined medium, presumably owing to day-to-day heterogeneity in the lake environment. This study reveals genetic interactions between Caulobacter and a natural freshwater environment, and provides a new avenue to study gene function in complex ecosystems.
  5. Abstract Background

    Single-cell RNA-sequencing (scRNA-seq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cell-specific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, count-based scRNA-seq data pose challenges for traditional models based on Gaussian assumptions.

    Results

    We aim to resolve these issues by developing a reduced-rank regression model that identifies low-dimensional linear associations between a large number of cell-specific covariates and high-dimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNA-seq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNA-seq dataset, a spatial gene expression dataset, and a bulk RNA-seq dataset to show its behavior in three distinct analyses.

    Conclusion

    We show that our statistical modeling approach, which is based on reduced-rank regression, captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcriptional states.

     