Microbial communities are often composed of taxa from different taxonomic groups. The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge-weighted graph (microbial network). A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider the problem of determining the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation-based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
Variational Approximation based Model Selection for Microbial Network Inference
Microbial associations are characterized by both direct and indirect interactions between the constituent taxa in a microbial community, and play an important role in determining the structure, organization, and function of the community. Microbial associations can be represented using a weighted graph (microbial network) whose nodes represent taxa and edges represent pairwise associations. A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa counts in each sample. However, it is known that microbial associations are impacted by environmental and/or host factors. Thus, a sample-taxa matrix generated in a microbiome study involving a wide range of values for the environmental and/or clinical metadata variables may in fact be associated with more than one microbial network. Here we consider the problem of inferring multiple microbial networks from a given sample-taxa count matrix. Each sample is a count vector assumed to be generated by a mixture model consisting of component distributions that are Multivariate Poisson Log-Normal. We present a variational Expectation Maximization algorithm for the model selection problem to infer the correct number of components of this mixture model. Our approach involves reframing the mixture model as a latent variable model, treating only the mixing coefficients as parameters, and subsequently approximating the marginal likelihood using an evidence lower bound framework. Our algorithm is evaluated on a large simulated dataset generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
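The generative assumption in the abstract above (each sample is a count vector drawn from a mixture of Multivariate Poisson Log-Normal components) can be illustrated with a minimal sketch. The function name, its parameters, and the use of NumPy are assumptions made for illustration; this is not the authors' implementation, and it covers only data simulation, not the variational EM model selection step.

```python
import numpy as np

def sample_mpln_mixture(n_samples, weights, means, covariances, seed=0):
    """Draw count vectors from a mixture of Multivariate Poisson Log-Normal components.

    Hypothetical helper: each component k has a latent Gaussian with mean means[k]
    and covariance covariances[k]; counts are Poisson with rate exp(latent).
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_taxa = len(means[0])
    # Pick a mixture component (one underlying network) for every sample.
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    counts = np.empty((n_samples, n_taxa), dtype=np.int64)
    for i, k in enumerate(labels):
        # Latent log-abundances follow the component's multivariate normal.
        latent = rng.multivariate_normal(means[k], covariances[k])
        # Observed counts are conditionally independent Poisson draws.
        counts[i] = rng.poisson(np.exp(latent))
    return counts, labels
```

In this sketch the association structure of each network would enter through the component covariance (or its inverse); a sparse inverse covariance corresponds to a sparse network.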
- Award ID(s): 2051283
- PAR ID: 10323782
- Editor(s): Singh, Mona
- Date Published:
- Journal Name: Journal of Computational Biology
- ISSN: 1066-5277
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
ABSTRACT Host-associated microbial communities are shaped by extrinsic and intrinsic factors to the holobiont organism. Environmental factors and microbe-microbe interactions act simultaneously on the microbial community structure, making the microbiome dynamics challenging to predict. The coral microbiome is essential to the health of coral reefs and sensitive to environmental changes. Here, we develop a dynamic model to determine the microbial community structure associated with the surface mucus layer (SML) of corals using temperature as an extrinsic factor and microbial network as an intrinsic factor. The model was validated by comparing the predicted relative abundances of microbial taxa to the relative abundances of microbial taxa from the sample data. The SML microbiome from Pseudodiploria strigosa was collected across reef zones in Bermuda, where inner and outer reefs are exposed to distinct thermal profiles. A shotgun metagenomics approach was used to describe the taxonomic composition and the microbial network of the coral SML microbiome. By simulating the annual temperature fluctuations at each reef zone, the model output is statistically identical to the observed data. The model was further applied to six scenarios that combined different profiles of temperature and microbial network to investigate the influence of each of these two factors on the model accuracy. The SML microbiome was best predicted by model scenarios with the temperature profile that was closest to the local thermal environment, regardless of the microbial network profile. Our model shows that the SML microbiome of P. strigosa in Bermuda is primarily structured by seasonal fluctuations in temperature at a reef scale, while the microbial network is a secondary driver. IMPORTANCE Coral microbiome dysbiosis (i.e., shifts in the microbial community structure or complete loss of microbial symbionts) caused by environmental changes is a key player in the decline of coral health worldwide. 
Multiple factors in the water column and the surrounding biological community influence the dynamics of the coral microbiome. However, by including only temperature as an external factor, our model proved to be successful in describing the microbial community associated with the surface mucus layer (SML) of the coral P. strigosa. The dynamic model developed and validated in this study is a potential tool to predict the coral microbiome under different temperature conditions.
-
We propose a structure-preserving model-reduction methodology for large-scale dynamic networks with tightly-connected components. First, the coherent groups are identified by a spectral clustering algorithm on the graph Laplacian matrix that models the network feedback. Then, a reduced network is built, where each node represents the aggregate dynamics of each coherent group, and the reduced network captures the dynamic coupling between the groups. We provide an upper bound on the approximation error when the network graph is randomly generated from a weight stochastic block model. Finally, numerical experiments align with and validate our theoretical findings.
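The two steps named in the abstract above (group identification from the graph Laplacian, then a reduced network over the groups) can be sketched for the simplest case of two groups. The function names are hypothetical, the split uses only the sign of the Fiedler vector rather than a full spectral clustering algorithm, and the reduction here merely sums edge weights between groups; it does not reproduce the aggregate dynamics or the error bound of the cited work.

```python
import numpy as np

def fiedler_partition(adjacency):
    """Split a connected weighted graph into two groups using the Fiedler vector."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)

def aggregate_weights(adjacency, labels):
    """Sum edge weights within and between groups (entry [a, b] couples group a to b)."""
    A = np.asarray(adjacency, dtype=float)
    n_groups = int(labels.max()) + 1
    indicator = np.eye(n_groups)[labels]  # one-hot group membership, shape (n, n_groups)
    return indicator.T @ A @ indicator
```

Because the sum runs over ordered node pairs, each diagonal entry of the aggregated matrix counts every within-group edge twice.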
-
We study a matrix completion problem that leverages a hierarchical structure of social similarity graphs as side information in the context of recommender systems. We assume that users are categorized into clusters, each of which comprises sub-clusters (or what we call “groups”). We consider a low-rank matrix model for the rating matrix, and a hierarchical stochastic block model that well respects practically-relevant social graphs. Under this setting, we characterize the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) as a function of the quality of graph side information (to be detailed) by proving sharp upper and lower bounds on the sample complexity. Furthermore, we develop a matrix completion algorithm and empirically demonstrate via extensive experiments that the proposed algorithm achieves the optimal sample complexity.
-
Faust, Karoline (Ed.) ABSTRACT Microbes commonly organize into communities consisting of hundreds of species involved in complex interactions with each other. 16S ribosomal RNA (16S rRNA) amplicon profiling provides snapshots that reveal the phylogenies and abundance profiles of these microbial communities. These snapshots, when collected from multiple samples, can reveal the co-occurrence of microbes, providing a glimpse into the network of associations in these communities. However, the inference of networks from 16S data involves numerous steps, each requiring specific tools and parameter choices. Moreover, the extent to which these steps affect the final network is still unclear. In this study, we perform a meticulous analysis of each step of a pipeline that can convert 16S sequencing data into a network of microbial associations. Through this process, we map how different choices of algorithms and parameters affect the co-occurrence network and identify the steps that contribute substantially to the variance. We further determine the tools and parameters that generate robust co-occurrence networks and develop consensus network algorithms based on benchmarks with mock and synthetic data sets. The Microbial Co-occurrence Network Explorer, or MiCoNE (available at https://github.com/segrelab/MiCoNE) follows these default tools and parameters and can help explore the outcome of these combinations of choices on the inferred networks. We envisage that this pipeline could be used for integrating multiple data sets and generating comparative analyses and consensus networks that can guide our understanding of microbial community assembly in different biomes. IMPORTANCE Mapping the interrelationships between different species in a microbial community is important for understanding and controlling their structure and function.
The surge in the high-throughput sequencing of microbial communities has led to the creation of thousands of data sets containing information about microbial abundances. These abundances can be transformed into co-occurrence networks, providing a glimpse into the associations within microbiomes. However, processing these data sets to obtain co-occurrence information relies on several complex steps, each of which involves numerous choices of tools and corresponding parameters. These multiple options pose questions about the robustness and uniqueness of the inferred networks. In this study, we address this workflow and provide a systematic analysis of how these choices of tools affect the final network and guidelines on appropriate tool selection for a particular data set. We also develop a consensus network algorithm that helps generate more robust co-occurrence networks based on benchmark synthetic data sets.
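As a rough illustration of how an abundance table can be turned into a co-occurrence network, the sketch below applies a centered log-ratio transform and thresholds pairwise Pearson correlations. The function names, the pseudocount, and the threshold are all assumptions chosen for the example; this is one simple choice among the many tools and parameters the abstract refers to, and it is not the MiCoNE API or its consensus algorithm.

```python
import numpy as np

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples-by-taxa count matrix."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtract each sample's mean log-abundance to reduce compositional effects.
    return logged - logged.mean(axis=1, keepdims=True)

def cooccurrence_edges(counts, threshold=0.6):
    """Return (taxon_i, taxon_j, correlation) for pairs with |correlation| >= threshold."""
    clr = clr_transform(counts)
    corr = np.corrcoef(clr, rowvar=False)  # taxa-by-taxa correlation matrix
    n_taxa = corr.shape[0]
    edges = []
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            if abs(corr[i, j]) >= threshold:
                edges.append((i, j, float(corr[i, j])))
    return edges
```

Changing the transform, the correlation measure, or the threshold generally changes the resulting edge list, which is the kind of sensitivity the study sets out to quantify.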