Pan-genome analyses of metagenome-assembled genomes (MAGs) may suffer from the known issues with MAGs: fragmentation, incompleteness and contamination. Here, we conducted a critical assessment of pan-genomics of MAGs by comparing pan-genome analysis results of complete bacterial genomes and simulated MAGs. We found that incompleteness led to significant core gene (CG) loss. The CG loss remained when using different pan-genome analysis tools (Roary, BPGA, Anvi’o) and when using a mixture of MAGs and complete genomes. Contamination had little effect on core genome size (except for Roary, owing to its gene clustering issue) but had a major influence on accessory genomes. Importantly, the CG loss was partially alleviated by lowering the CG threshold and using gene prediction algorithms that consider fragmented genes, but to a lesser degree when incompleteness was higher than 5%. The CG loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees. Our main findings were supported by a study of real MAG-isolate genome data. We conclude that lowering the CG threshold and predicting genes in metagenome mode (as Anvi’o does with Prodigal) are necessary in pan-genome analysis of MAGs. Development of new pan-genome analysis tools specifically for MAGs is needed in future studies.
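The effect of the CG threshold described in the abstract can be illustrated with a minimal sketch. It assumes each genome is reduced to a set of gene-family identifiers; the function name `core_genes`, the toy genomes and the gene labels are hypothetical and are not taken from the study or from Roary, BPGA or Anvi’o.

```python
from typing import Dict, Set


def core_genes(genomes: Dict[str, Set[str]], threshold: float = 1.0) -> Set[str]:
    """Return gene families present in at least `threshold` (fraction) of genomes."""
    if not genomes:
        return set()
    n = len(genomes)
    counts: Dict[str, int] = {}
    for genes in genomes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c / n >= threshold}


# Toy "complete" genomes: four shared families (A-D) plus a few accessory ones.
complete = {
    "g1": {"A", "B", "C", "D", "x1"},
    "g2": {"A", "B", "C", "D", "x2"},
    "g3": {"A", "B", "C", "D"},
    "g4": {"A", "B", "C", "D", "x1"},
}

# Toy "simulated MAGs": each genome randomly loses one family (hard-coded here).
lost = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}
mags = {name: genes - {lost[name]} for name, genes in complete.items()}

strict_complete = core_genes(complete, threshold=1.0)  # all four shared families
strict_mags = core_genes(mags, threshold=1.0)          # empty: every family is missing somewhere
relaxed_mags = core_genes(mags, threshold=0.75)        # families in at least 3 of 4 genomes

print(len(strict_complete), len(strict_mags), len(relaxed_mags))
```

In this toy case a strict 100% threshold loses every core gene once each genome is missing a single family, while a relaxed threshold recovers them. Real tools cluster sequences rather than compare labels, so this only sketches the counting step.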
- Award ID(s):
- 1933521
- NSF-PAR ID:
- 10372003
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Briefings in Bioinformatics
- Volume:
- 23
- Issue:
- 6
- ISSN:
- 1467-5463
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
McBain, Andrew J. (Ed.) ABSTRACT The recovery of metagenome-assembled genomes (MAGs) from metagenomic data has recently become a common task for microbial studies. The strengths and limitations of the underlying bioinformatics algorithms are well appreciated by now based on performance tests with mock data sets of known composition. However, these mock data sets do not capture the complexity and diversity often observed within natural populations, since their construction typically relies on only a single genome of a given organism. Further, it remains unclear if MAGs can recover population-variable genes (those shared by >10% but <90% of the members of the population) as efficiently as core genes (those shared by >90% of the members). To address these issues, we compared the gene variabilities of pathogenic Escherichia coli isolates from eight diarrheal samples, for which the isolate was the causative agent, against their corresponding MAGs recovered from the companion metagenomic data set. Our analysis revealed that MAGs with completeness estimates near 95% captured only 77% of the population core genes and 50% of the variable genes, on average. Further, about 5% of the genes of these MAGs were conservatively identified as missing in the isolate and were of different (non-Enterobacteriaceae) taxonomic origin, suggesting errors at the genome-binning step, even though contamination estimates based on commonly used pipelines were only 1.5%. Therefore, the quality of MAGs may often be worse than estimated, and we offer examples of how to recognize and improve such MAGs to sufficient quality by (for instance) employing only contigs longer than 1,000 bp for binning. IMPORTANCE Metagenome assembly and the recovery of metagenome-assembled genomes (MAGs) have recently become common tasks for microbiome studies across environmental and clinical settings. However, the extent to which MAGs can capture the genes of the population they represent remains speculative.
Current approaches to evaluating MAG quality are limited to the recovery and copy number of universal housekeeping genes, which represent a small fraction of the total genome, leaving the majority of the genome essentially inaccessible. If MAG quality in reality is lower than these approaches would estimate, this could have dramatic consequences for all downstream analyses and interpretations. In this study, we evaluated this issue using an approach that employed comparisons of the gene contents of MAGs to the gene contents of isolate genomes derived from the same sample. Further, our samples originated from a diarrhea case-control study, and thus, our results are relevant for recovering the virulence factors of pathogens from metagenomic data sets.
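The core and variable gene definitions quoted in this abstract (>90%, and >10% but <90%, of population members) can be sketched as follows. This is an illustration on toy data, assuming genes are already grouped into labelled families; `classify_genes`, `recovery` and all gene labels are hypothetical and do not reproduce the authors' pipeline.

```python
from collections import Counter
from typing import Dict, List, Set


def classify_genes(isolates: List[Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into core (>90%) and variable (>10% and <90%) classes."""
    n = len(isolates)
    counts = Counter(gene for genes in isolates for gene in genes)
    core = {g for g, c in counts.items() if c / n > 0.90}
    variable = {g for g, c in counts.items() if 0.10 < c / n < 0.90}
    return {"core": core, "variable": variable}


def recovery(mag: Set[str], reference: Set[str]) -> float:
    """Fraction of the reference gene set that is also found in the MAG."""
    if not reference:
        return float("nan")
    return len(mag & reference) / len(reference)


# Toy population of 10 isolates: c1 and c2 in all, v1 in 5, v2 in 3, r1 in 1.
isolates = [{"c1", "c2"} for _ in range(10)]
for i in range(5):
    isolates[i].add("v1")
for i in range(3):
    isolates[i].add("v2")
isolates[0].add("r1")  # exactly 10% of members, so neither core nor variable

classes = classify_genes(isolates)
toy_mag = {"c1", "v1"}  # hypothetical MAG gene content
core_recovery = recovery(toy_mag, classes["core"])
variable_recovery = recovery(toy_mag, classes["variable"])
print(core_recovery, variable_recovery)
```

Comparing the two recovery fractions per MAG is one simple way to see whether variable genes are captured less efficiently than core genes, which is the question the abstract raises.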
-
Advances in sequencing technologies and bioinformatics tools have dramatically increased the recovery rate of microbial genomes from metagenomic data. Assessing the quality of metagenome-assembled genomes (MAGs) is a critical step before downstream analysis. Here, we present CheckM2, an improved method of predicting genome quality of MAGs using machine learning. Using synthetic and experimental data, we demonstrate that CheckM2 outperforms existing tools in both accuracy and computational speed. In addition, CheckM2’s database can be rapidly updated with new high-quality reference genomes, including taxa represented only by a single genome. We also show that CheckM2 accurately predicts genome quality for MAGs from novel lineages, even for those with reduced genome size (for example, Patescibacteria and the DPANN superphylum). CheckM2 provides accurate genome quality predictions across bacterial and archaeal lineages, giving increased confidence when inferring biological conclusions from MAGs.
-
Microorganisms can potentially colonise volcanic rocks using the chemical energy in reduced gases such as methane, hydrogen (H2) and carbon monoxide (CO). In this study, we analysed soil metagenomes from Chilean volcanic soils, representing three different successional stages with ages of 380, 269 and 63 years, respectively. A total of 19 metagenome-assembled genomes (MAGs) were retrieved from all stages with a higher number observed in the youngest soil (1640: 2 MAGs, 1751: 1 MAG, 1957: 16 MAGs). Genomic similarity indices showed that several MAGs had amino-acid identity (AAI) values >50% to the phyla Actinobacteria, Acidobacteria, Gemmatimonadetes, Proteobacteria and Chloroflexi. Three MAGs from the youngest site (1957) belonged to the class Ktedonobacteria (Chloroflexi). Complete cellular functions of all the MAGs were characterised, including carbon fixation, terpenoid backbone biosynthesis, formate oxidation and CO oxidation. All 19 environmental genomes contained at least one gene encoding a putative carbon monoxide dehydrogenase (CODH). Three MAGs had form I coxL operon (encoding the large subunit CO-dehydrogenase). One of these MAGs (MAG-1957-2.1, Ktedonobacterales) was highly abundant in the youngest soil. MAG-1957-2.1 also contained genes encoding a [NiFe]-hydrogenase and hyp genes encoding accessory enzymes and proteins. Little is known about the Ktedonobacterales through cultivated isolates, but some species can utilise H2 and CO for growth. Our results strongly suggest that the remote volcanic sites in Chile represent a natural habitat for Ktedonobacteria and they may use reduced gases for growth.
-
Abstract Synechococcus are the most abundant cyanobacteria in high latitude regions and are responsible for an estimated 17% of annual marine net primary productivity. Despite their biogeochemical importance, Synechococcus populations have been unevenly sampled across the ocean, with most studies focused on low-latitude strains. In particular, the near absence of Synechococcus genomes from high-latitude, High Nutrient Low Chlorophyll (HNLC) regions leaves a gap in our knowledge of picocyanobacterial adaptations to iron limitation and their influence on carbon, nitrogen, and iron cycles. We examined Synechococcus populations from the subarctic North Pacific, a well-characterized HNLC region, with quantitative metagenomics. Assembly with short and long reads produced two near complete Synechococcus metagenome-assembled genomes (MAGs). Quantitative metagenome-derived abundances of these populations matched well with flow cytometry counts, and the Synechococcus MAGs were estimated to comprise >99% of the Synechococcus at Station P. Whereas the Station P Synechococcus MAGs contained multiple genes for adaptation to iron limitation, both genomes lacked genes for uptake and assimilation of nitrate and nitrite, suggesting a dependence on ammonium, urea, and other forms of recycled nitrogen leading to reduced iron requirements. A global analysis of Synechococcus nitrate reductase abundance in the TARA Oceans dataset found nitrate assimilation genes are also lower in other HNLC regions. We propose that nitrate and nitrite assimilation gene loss in Synechococcus may represent an adaptation to severe iron limitation in high-latitude regions where ammonium availability is higher. Our findings have implications for models that quantify the contribution of cyanobacteria to primary production and subsequent carbon export.
-
Abstract Oxygen deficient zones (ODZs) account for about 30% of total oceanic fixed nitrogen loss via processes including denitrification, a microbially mediated pathway proceeding stepwise from NO3− to N2. This process may be performed entirely by complete denitrifiers capable of all four enzymatic steps, but many organisms possess only partial denitrification pathways, either producing or consuming key intermediates such as the greenhouse gas N2O. Metagenomics and marker gene surveys have revealed a diversity of denitrification genes within ODZs, but whether these genes co-occur within complete or partial denitrifiers and the identities of denitrifying taxa remain open questions. We assemble genomes from metagenomes spanning the ETNP and Arabian Sea, and map these metagenome-assembled genomes (MAGs) to 56 metagenomes from all three major ODZs to reveal the predominance of partial denitrifiers, particularly single-step denitrifiers. We find niche differentiation among nitrogen-cycling organisms, with communities performing each nitrogen transformation distinct in taxonomic identity and motility traits. Our collection of 962 MAGs presents the largest collection of pelagic ODZ microorganisms and reveals a clearer picture of the nitrogen cycling community within this environment.