Abstract Exploring the diversity of diazotrophs is key to understanding their role in supplying fixed nitrogen that supports marine productivity. A nested PCR assay using the universal primer set nifH1-nifH4, which targets the nitrogenase (nifH) gene, is a widely used approach for studying marine diazotrophs by amplicon sequencing. Metagenomics, direct sequencing of DNA without PCR, has provided complementary views of the diversity of marine diazotrophs. A significant fraction of the metagenome-derived nifH sequences (e.g. Planctomycete- and Proteobacteria-affiliated) were reported to have nucleotide mismatches with the nifH1-nifH4 primers, leading to the suggestion that nifH amplicon sequencing does not detect specific diazotrophic taxa and underrepresents diazotroph diversity. Here, we report that these mismatches are mostly located in a single-base at the 5′-end of the nifH4 primer, which does not impact detection of the nifH genes. This is demonstrated by the presence of nifH genes that contain the nucleotide mismatches in a recent compilation of global ocean nifH amplicon datasets, with high relative abundances detected in a variety of samples. While the metagenome- and metatranscriptome-derived nifH genes accounted for 4.4% of the total amplicon sequence variants from the global ocean nifH amplicon database, the corresponding amplicon sequence variants can have high relative abundances (accounting for 47% of the reads in the database). These analyses underscore that nifH amplicon sequencing using the nifH1-nifH4 primers is an important tool for studying diversity of marine diazotrophs, particularly as a complement to metagenomics which can provide taxonomic and metabolic information for some dominant groups.
more »
« less
PPIT: an R package for inferring microbial taxonomy from nifH sequences
Abstract Motivation Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale. Results We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood are: (1) taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. We additionally discuss how users can apply PPIT to the analysis of other marker genes. Availability PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167. Supplementary information Supplementary data are available at Bioinformatics online.
more »
« less
- Award ID(s):
- 1634297
- PAR ID:
- 10230959
- Editor(s):
- Birol, Inanc
- Date Published:
- Journal Name:
- Bioinformatics
- ISSN:
- 1367-4803
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldomly applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. Importance Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.more » « less
-
ABSTRACT In metabarcoding studies, Linnaean taxonomy assignments of Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) underpin many downstream bioinformatics analyses and ecological interpretations of environmental DNA (eDNA) datasets. However, public molecular databases (i.e., SILVA, EUKARYOME, BOLD) for most microbial metazoan phyla (nematodes, tardigrades, kinorhynchs, etc.) are sparsely populated, negatively impacting our ability to assign ecologically meaningful taxonomy to these understudied groups. Additionally, the choice of bioinformatics parameters and computational algorithms can further affect the accuracy of eDNA taxonomy assignments. Here, we use twoin silicodatasets to show that taxonomy assignments using the 18S rRNA gene can be dramatically improved by curating Linnaean taxonomy strings associated with each reference sequence and closing phylogenetic gaps by improving taxon sampling. Using free‐living nematodes as a case study, we applied two commonly used taxonomy assignment algorithms (BLAST+ and the QIIME2 Naïve Bayes classifier) across six iterations of the SILVA 138 reference database to evaluate the precision and accuracy of taxonomy assignments. The BLAST+ top hit with a 90% sequence similarity cutoff often returned the highest percentage of correctly assigned taxonomy at the genus level, and the QIIME2 Naïve Bayes classifier performed similarly well when paired with a reference database containing corrected taxonomy strings. Our results highlight the urgent need for phylogenetically informed expansions of public reference databases (encompassing both genomes and common gene markers), focused on poorly sampled lineages that are now robustly recovered via eDNA metabarcoding approaches. Additional taxonomy curation efforts should be applied to popular reference databases such as SILVA, and taxon sampling could be rapidly improved by more frequent incorporation of newly published GenBank sequences linked to genus‐ and/or species‐level identifications.more » « less
-
Abstract Arbuscular mycorrhizal fungi (AMF; Glomeromycota) are difficult to culture; therefore, establishing a robust amplicon-based approach to taxa identification is imperative to describe AMF diversity. Further, due to low and biased sampling of AMF taxa, molecular databases do not represent the breadth of AMF diversity, making database matching approaches suboptimal. Therefore, a full description of AMF diversity requires a tool to determine sequence-based placement in the Glomeromycota clade. Nonetheless, commonly used gene regions, including the SSU and ITS, do not enable reliable phylogenetic placement. Here, we present an improved database and pipeline for the phylogenetic determination of AMF using amplicons from the large subunit (LSU) rRNA gene. We improve our database and backbone tree by including additional outgroup sequences. We also improve an existing bioinformatics pipeline by aligning forward and reverse reads separately, using a universal alignment for all tree building, and implementing a BLAST screening prior to tree building to remove non-homologous sequences. Finally, we present a script to extract AMF belonging to 11 major families as well as an amplicon sequencing variant (ASV) version of our pipeline. We test the utility of the pipeline by testing the placement of known AMF, known non-AMF, andAcaulosporasp. spore sequences. This work represents the most comprehensive database and pipeline for phylogenetic placement of AMF LSU amplicon sequences within the Glomeromycota clade.more » « less
-
Abstract Background The 16S mitochondrial rRNA gene is the most widely sequenced molecular marker in amphibian systematic studies, making it comparable to the universal CO1 barcode that is more commonly used in other animal groups. However, studies employ different primer combinations that target different lengths/regions of the 16S gene ranging from complete gene sequences (~ 1500 bp) to short fragments (~ 500 bp), the latter of which is the most ubiquitously used. Sequences of different lengths are often concatenated, compared, and/or jointly analyzed to infer phylogenetic relationships, estimate genetic divergence ( p -distances), and justify the recognition of new species (species delimitation), making the 16S gene region, by far, the most influential molecular marker in amphibian systematics. Despite their ubiquitous and multifarious use, no studies have ever been conducted to evaluate the congruence and performance among the different fragment lengths. Results Using empirical data derived from both Sanger-based and genomic approaches, we show that full-length 16S sequences recover the most accurate phylogenetic relationships, highest branch support, lowest variation in genetic distances (pairwise p -distances), and best-scoring species delimitation partitions. In contrast, widely used short fragments produce inaccurate phylogenetic reconstructions, lower and more variable branch support, erratic genetic distances, and low-scoring species delimitation partitions, the numbers of which are vastly overestimated. The relatively poor performance of short 16S fragments is likely due to insufficient phylogenetic information content. Conclusions Taken together, our results demonstrate that short 16S fragments are unable to match the efficacy achieved by full-length sequences in terms of topological accuracy, heuristic branch support, genetic divergences, and species delimitation partitions, and thus, phylogenetic and taxonomic inferences that are predicated on short 16S fragments should be interpreted with caution. However, short 16S fragments can still be useful for species identification, rapid assessments, or definitively coupling complex life stages in natural history studies and faunal inventories. While the full 16S sequence performs best, it requires the use of several primer pairs that increases cost, time, and effort. As a compromise, our results demonstrate that practitioners should utilize medium-length primers in favor of the short-fragment primers because they have the potential to markedly improve phylogenetic inference and species delimitation without additional cost.more » « less
An official website of the United States government

