This content will become publicly available on March 7, 2023

Title: Model-based identification of conditionally-essential genes from transposon-insertion sequencing data
The understanding of bacterial gene function has been greatly enhanced by recent advancements in the deep sequencing of microbial genomes. Transposon insertion sequencing methods combine next-generation sequencing techniques with transposon mutagenesis to explore the essentiality of genes under different environmental conditions. We propose a model-based method that uses regularized negative binomial regression to estimate the change in transposon insertions attributable to gene-environment changes in this genetic interaction study without transformations or uniform normalization. An empirical Bayes model for estimating the local false discovery rate combines unique and total count information to test for genes that show a statistically significant change in transposon counts. When applied to RB-TnSeq (randomized barcode transposon sequencing) and Tn-seq (transposon sequencing) libraries made in strains of Caulobacter crescentus, using both total and unique count data, the model identified a set of conditionally beneficial or conditionally detrimental genes for each target condition that shed light on their functions and roles during various stress conditions.
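The local false discovery rate mentioned in the abstract can be illustrated with a generic two-group calculation. The sketch below is not the authors' empirical Bayes model, which combines unique and total count information; it only shows the basic quantity lfdr(z) = pi0 * f0(z) / f(z) on per-gene z-scores. The function names, the fixed null proportion `pi0`, the kernel bandwidth, and the simulated z-scores are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution at x."""
    u = (x - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def kernel_density(x, sample, bandwidth):
    """Gaussian kernel density estimate of the mixture density f at x."""
    return sum(normal_pdf(x, s, bandwidth) for s in sample) / len(sample)


def local_fdr(z_scores, pi0=1.0, bandwidth=0.3):
    """Two-group local FDR: pi0 * f0(z) / f(z), capped at 1.

    f0 is assumed to be the standard normal (theoretical null);
    f is estimated from all z-scores by kernel density estimation.
    """
    return [
        min(1.0, pi0 * normal_pdf(z) / kernel_density(z, z_scores, bandwidth))
        for z in z_scores
    ]


# Simulated (hypothetical) data: 450 unaffected genes and 50 genes whose
# insertion counts drop strongly under the target condition.
random.seed(0)
null_z = [random.gauss(0.0, 1.0) for _ in range(450)]
shifted_z = [random.gauss(-4.0, 1.0) for _ in range(50)]
z = null_z + shifted_z

lfdr = local_fdr(z, pi0=0.9)
mean_null = sum(lfdr[:450]) / 450
mean_shifted = sum(lfdr[450:]) / 50
print(f"mean lfdr, unaffected genes: {mean_null:.3f}")
print(f"mean lfdr, shifted genes:    {mean_shifted:.3f}")
```

Genes with a small local FDR are the ones a screen of this kind would flag as conditionally beneficial or detrimental. In practice `pi0` and the null density would be estimated from the data rather than fixed as they are here.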
Segata, Nicola
Journal Name: PLOS Computational Biology
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Background Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research. Yet, the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such “problem area” is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows probing of almost the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on the survival and growth under specific conditions provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data. Among the variety of options available, it is often difficult to identify which one can provide a reliable and reproducible analysis. Results Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures, such as determining the optimal tool parameters for the analysis and removal of contamination. Conclusions Our work provides an assessment of the currently available tools for TIS data analysis. It offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform ( ). To lower the entry barriers, we have also developed interactive tutorials explaining details of TIS data analysis procedures at .
  2. ABSTRACT Microbial communities are shaped by interactions among their constituent members. Some Gram-negative bacteria employ type VI secretion systems (T6SSs) to inject protein toxins into neighboring cells. These interactions have been theorized to affect the composition of host-associated microbiomes, but the role of T6SSs in the evolution of gut communities is not well understood. We report the discovery of two T6SSs and numerous T6SS-associated Rhs toxins within the gut bacteria of honey bees and bumble bees. We sequenced the genomes of 28 strains of Snodgrassella alvi, a characteristic bee gut microbe, and found tremendous variability in their Rhs toxin complements: altogether, these strains appear to encode hundreds of unique toxins. Some toxins are shared with Gilliamella apicola, a coresident gut symbiont, implicating horizontal gene transfer as a source of toxin diversity in the bee gut. We use data from a transposon mutagenesis screen to identify toxins with antibacterial function in the bee gut and validate the function and specificity of a subset of these toxin and immunity genes in Escherichia coli. Using transcriptome sequencing, we demonstrate that S. alvi T6SSs and associated toxins are upregulated in the gut environment. We find that S. alvi Rhs loci have a conserved architecture, consistent with the C-terminal displacement model of toxin diversification, with Rhs toxins, toxin fragments, and cognate immunity genes that are expressed and confer strong fitness effects in vivo. Our findings of T6SS activity and Rhs toxin diversity suggest that T6SS-mediated competition may be an important driver of coevolution within the bee gut microbiota. IMPORTANCE The structure and composition of host-associated bacterial communities are of broad interest, because these communities affect host health. Bees have a simple, conserved gut microbiota, which provides an opportunity to explore interactions between species that have coevolved within their host over millions of years. This study examined the role of type VI secretion systems (T6SSs)—protein complexes used to deliver toxic proteins into bacterial competitors—within the bee gut microbiota. We identified two T6SSs and diverse T6SS-associated toxins in bacterial strains from bees. Expression of these genes is increased in bacteria in the bee gut, and toxin and immunity genes demonstrate antibacterial and protective functions, respectively, when expressed in Escherichia coli. Our results suggest that coevolution among bacterial species in the bee gut has favored toxin diversification and maintenance of T6SS machinery, and demonstrate the importance of antagonistic interactions within host-associated microbial communities.
  3. Single-cell RNA-sequencing (scRNA-seq) enables high throughput measurement of RNA expression in individual cells. Due to technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard analysis methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc combines network-regularized non-negative matrix factorization with a procedure for handling zero inflation in transcript count matrices. The matrix factorization results in a low-dimensional representation of the transcript count matrix, which imputes gene abundance for both zero and non-zero entries and can be used to cluster cells. The network regularization leverages prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be close in the low-dimensional representation. We show that netNMF-sc outperforms existing methods on simulated and real scRNA-seq data, with increasing advantage at higher dropout rates (e.g. above 60%). Furthermore, we show that the results from netNMF-sc -- including estimation of gene-gene covariance -- are robust to choice of network, with more representative networks leading to greater performance gains.
  4. Abstract Summary With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. Supplementary information Supplementary data are available at Bioinformatics online.
  5. Abstract The development of single-cell RNA-sequencing (scRNA-seq) technologies has offered insights into complex biological systems at single-cell resolution. In particular, these techniques facilitate the identification of genes showing cell-type-specific differential expression (DE). In this paper, we introduce MARBLES, a novel statistical model for cross-condition DE gene detection from scRNA-seq data. MARBLES employs a Markov Random Field model to borrow information across similar cell types and utilizes cell-type-specific pseudobulk counts to account for sample-level variability. Our simulation results showed that MARBLES is more powerful than existing methods at detecting DE genes while appropriately controlling the false positive rate. Applications of MARBLES to real data identified novel disease-related DE genes and biological pathways from both a single-cell lipopolysaccharide mouse dataset with 24 381 cells and 11 076 genes and a Parkinson’s disease human dataset with 76 212 cells and 15 891 genes. Overall, MARBLES is a powerful tool to identify cell-type-specific DE genes across conditions from scRNA-seq data.