<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>To Dereplicate or Not To Dereplicate?</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/24/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10170960</idno>
					<idno type="doi">10.1128/mSphere.00971-19</idno>
					<title level='j'>mSphere</title>
<idno>2379-5042</idno>
<biblScope unit="volume">5</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>Jacob T. Evans</author><author>Vincent J. Denef</author><author>Katherine McMahon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[ABSTRACT            Metagenome-assembled genomes (MAGs) expand our understanding of microbial diversity, evolution, and ecology. Concerns have been raised on how sequencing, assembly, binning, and quality assessment tools may result in MAGs that do not reflect single populations in nature. Here, we reflect on another issue, i.e., how to handle highly similar MAGs assembled from independent data sets. Obtaining multiple genomic representatives for a species is highly valuable, as it allows for population genomic analyses; however, when retaining genomes of closely related populations, it complicates MAG quality assessment and abundance inferences. We show that (i) published data sets contain a large fraction of MAGs sharing>99% average nucleotide identity, (ii) different software packages and parameters used to resolve this redundancy remove very different numbers of MAGs, and (iii) the removal of closely related genomes leads to losses of population-specific auxiliary genes. Finally, we highlight some approaches that can infer strain-specific dynamics across a sample series without dereplication.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>a fast but less accurate alignment method, such as Mash <ref type="bibr">(10)</ref>. This step is included in the tool dRep, which subsequently uses genomewide alignments using animf with mummer/nucmer to align contigs (dRep default <ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref>) or gene-based alignments using NSimScan to align open reading frames identified in each MAG (dRep gANI <ref type="bibr">[14]</ref>). Other tools, such as pyani <ref type="bibr">(15)</ref>, calculate sequence identity using BLAST-based genomewide alignments of MAG contigs. While slower, we consider it the reference for comparison due to the higher accuracy of BLAST-based alignments (16) (Fig. <ref type="figure">1</ref>). When using Mash as a step preceding pyani, we used default parameters <ref type="bibr">(10)</ref>. The computed pairwise distances then were used to cluster genomes into similar groups with hierarchical clustering using a custom python script with fcluster from SciPy (<ref type="url">http://www .scipy.org/</ref>) with a threshold of 2 (<ref type="url">https://github.com/DenefLab/Dereplication-Letter -Code</ref>). pyani then was run within each group created from the clustering (Fig. <ref type="figure">1</ref>). We used the recommended default percent sequence alignment threshold for dRep (10%) but used 75% for pyani, similar to the cutoff used for identifying orthologs using BLAST alignments. Only MAGs meeting this threshold with both comparisons from the pairwise comparison were dereplicated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>WHY DEREPLICATE?</head><p>Dereplication is the reduction of a set of genomes, typically assembled from metagenomic data, based on high sequence similarity (e.g., &#1022;99% average nucleotide identity) between these genomes. When redundancy in a database of genomes is maintained, the subsequent step of mapping sequencing reads back to this database of genomes leads to sequencing reads having multiple high-quality alignments. Depending on the software used and parameters chosen, this leads to sequencing reads either being randomly distributed across the redundant genomes, with one random alignment reported from many possible options, or being reported at all redundant locations. When using these data to make inferences about the relative abundance and population dynamics across samples, relative abundance for each taxon represented by these redundant MAGs will look artificially low, and it will appear that multiple ecologically equivalent populations co-occur. Instead, the more likely conclusion should be that one abundant population exists across all samples (Fig. <ref type="figure">2</ref>). This issue has been acknowledged in multiple studies, and authors have chosen various cutoffs to avoid this issue (e.g., &#1022;95% average nucleotide identity <ref type="bibr">[4]</ref>, &#1022;98% average nucleotide identity <ref type="bibr">[2]</ref>, &#1022;95% amino acid identity <ref type="bibr">[17]</ref>, and &#1022;99.5% amino acid identity <ref type="bibr">[3]</ref>). In our experience, the presence of multiple closely related genomes can also complicate manually curating MAGs using, for example, Anvi'o <ref type="bibr">(18)</ref>. When multiple similar genomes are present, particularly when some of the closely related MAGs are less complete, these differential coverage patterns will be less reliable due to the distribu-FIG <ref type="bibr">1</ref> Overview of dereplication approaches used in this study. All approaches first cluster similar genomes (Mash clusters are delineated with boxes) using a fast, less accurate approach (Mash), which is included in the dRep package but is a separate preprocessing step we carried out for the pyani analysis (indicated with the dotted line). Each cluster of MAGs then is separately dereplicated using pairwise alignments by identifying MAGs within each Mash cluster that share ANI above the specified threshold. These clusters are indicated by boxes, with Mash clusters split in two multiple cluster groups using the same line type (full or dashed lines). Which genomes end up in the same cluster varies depending on the approach used; only one clustering is shown. Finally, a representative MAG is selected, either as part of the package (dRep) or using a custom script (our approach that used pyani for pairwise comparisons, indicated by the dotted line), selecting the MAG with the highest estimated completion and lowest estimated contamination. tion of sequencing reads across MAGs in some parts of the genome and not in others, generating divergent differential coverage patterns for contigs originating from the same population. This could lead to the removal of parts of the genome that do in fact belong.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>WHY NOT DEREPLICATE?</head><p>Traditional population genomic analyses rely on the availability of genome sequences of many conspecific isolates. Metagenomics-based population genomic analyses could enhance cultivation-dependent methods by obtaining sequences of multiple closely related, independently sampled populations (a population being defined as individuals of the same species occurring at the same time and place). While many methodological issues remain (see reference <ref type="bibr">19)</ref>, MAGs of &#1022;99% ANI originating from multiple samples could be a valuable resource, which would be lost if we dereplicate our MAG data sets. Most approaches to dereplicate remove genomes based on the sequence identity of shared parts of the genome. As such, when removing genomes, in addition to data on single-nucleotide variation, we may lose information  <ref type="formula">3</ref>), grouped based on sequence similarity by Mash. A box outline indicates the genome was preserved after dereplication, while white space indicates it was removed. The dRep default does not remove multiple nearly identical MAGs, while dRep-gANI removes MAGs that are more distantly related than the 99% or 96.5% ANI cutoff. Black bars show the average sequence read coverage across all contigs of each MAG, ranging from 0 to 2,000, when aligning a metagenomic data set (Sequence Read Archive accession no. SRR1702559) using all genomes in the tree (none) or dereplicated genome sets using different tools. Reads were mapped to each Multi-FASTA file of retained MAGs using BWA-MEM with default parameters <ref type="bibr">(26)</ref>. Average coverage per contig was computed with pileup.sh from bbtools (<ref type="url">https://sourceforge.net/projects/bbmap/</ref>). The phylogenetic tree was created by searching for marker genes with PhyloSift (27) using its default set of marker genes. All MAGs had estimated completeness levels of &#1022;90% (3). The genes then were aligned with PhyloSift and the resulting alignments concatenated, and the tree was created with FastTree (28) using the -nt and -gtr parameters.</p><p>on variability in the auxiliary gene content among representatives from the same species. As an example, we analyzed the effect of dereplication on database auxiliary gene content using two of the most commonly used tools (dRep and redundancy removal based on pyani results). We used a set of 46 Microcystis aeruginosa MAGs that we previously generated with extensive manual curation <ref type="bibr">(9)</ref>. The ANI between pairs of these 46 genomes averages to 96.4%. Out of a total of 9,175 unique gene clusters across the 46 MAGs, dereplication led to the removal of up to 2,228 auxiliary gene clusters when using dRep gANI with a 96.5% cutoff (used for species delineation using genome sequences <ref type="bibr">[14]</ref>) (Fig. <ref type="figure">2</ref>). At the other extreme, using the dRep default, no genomes were removed from the MAG set; thus, no gene clusters were lost, while intermediate numbers of gene clusters were removed when using pyani (213) and dRep gANI (447) at 99% thresholds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VARIABLE PERFORMANCE OF COMMONLY USED SOFTWARE</head><p>As already indicated from the analysis shown in the first part of Table <ref type="table">1</ref>, different dereplication tools lead to different outcomes, even when using the same sequence identity cutoffs. Using publicly available MAG data sets (3-5), we evaluated the performance of two commonly used dereplication tools, dRep and pyani. For dRep, we used the default parameters, a cutoff of 99% (dRep default), and the gANI option, which aligns predicted open reading frames rather than the entire genomes, with cutoffs of 99% and 96.5% (dRep gANI). For pyani, we used a 99% ANI cutoff. As outlined in Fig. <ref type="figure">1</ref>, to reduce computation time, all approaches first cluster MAGs by calculating pairwise distances using Mash.</p><p>First, we performed a comprehensive analysis of a set of 7,800 genomes generated from 1,550 public metagenomes <ref type="bibr">(3)</ref>. In this study, no dereplication was done for most analyses, except for building the tree represented in Fig. <ref type="figure">2</ref> in the study originally reporting these MAGs (3). For the latter analysis by Parks and coauthors, dereplication was performed by removing genomes with an amino acid identity (AAI) of &#1350;99.5%, as calculated using CompareM (<ref type="url">https://github.com/dparks1134/CompareM</ref>), resulting in the removal of 27.5% of all MAGs. In our own analyses, relative to the pyani reference (32.9% removal), default dRep removed fewer genomes (19.3%), while the gANI dRep approach removed more MAGs (48.1% [99% ANI], 56.9% [96.5% ANI]) (Table <ref type="table">1</ref>). A closer look at one cluster of related MAGs indicated that dRep gANI regularly removed genomes that did not require removal, while dRep with default parameters did not remove a sufficient number of MAGs (Fig. <ref type="figure">2</ref>).</p><p>For a recent study that generated more than 90,000 MAGs (4), we performed our comparative dereplication analysis on the 1,952 uncultured bacterial species that were identified and examined by the authors. These were MAGs not classified at the species level in current databases that had been dereplicated by removing less complete MAGs that shared an ANI of &#1022;95% across 60% of their sequence length. In this case, pyani a Retained gene clusters of the Microcystis aeruginosa pangenome of 46 MAGs <ref type="bibr">(9)</ref>. A gene cluster consists of all genes across all genomes that had a minimum bit score of at least 0.5 when using the pangenome analysis workflow in Anvi'o <ref type="bibr">(18)</ref>. Retention of at least one representative in each gene cluster was evaluated when using different dereplication tools and ANI settings.</p><p>b Number of MAGs remaining after dereplication tools were run shows tool-dependent results of dereplication using data from three published data sets. SGB refers to numbered species-level clusters generated in analyses by Pasolli et al. <ref type="bibr">(5)</ref>.</p><p>removed four times fewer MAGs than the different implementations of dRep (Table <ref type="table">2</ref>).</p><p>In contrast to our preceding analyses, dRep default removed more MAGs than pyani, potentially because the authors had already dereplicated their MAG set at 95% ANI. Finally, we analyzed two MAG groups, clustered at the species level (95% ANI) by the authors of a recent study generating more than 150,000 MAGs <ref type="bibr">(5)</ref>. In this case, dRep default again removed fewer MAGs than pyani, while dRep using gANI removed many more MAGs (Table <ref type="table">2</ref>). Various factors may explain the differences in outcome between dereplication approaches, including the faster but less accurate alignment and percent identity calculation methods used in dRep, the diverging percent alignment thresholds used, and, potentially, different implementations of Mash in dRep and in our combined Mash-pyani approach. As for the percent alignment thresholds used, when we tried different coverage thresholds with the set of genomes represented in Fig. <ref type="figure">2</ref>, dropping the coverage threshold for pyani to 10%, used by dRep, led to the removal of one extra MAG. When doing the same for the Microcystis MAG set, we observed no change in the number of genomes removed by pyani (10 to 75% alignment thresholds when using 100% complete MAGs) (Table <ref type="table">2</ref>). Finally, dereplication tools also vary in their performance based on the completeness of the MAGs that are being compared. Previous guidelines for the use of dRep have shown that Mash, the first step in all the tested dereplication approaches (Fig. <ref type="figure">1</ref>), underestimates genome similarity for incomplete genomes and recommends against dereplicating genomes below 50% completeness (<ref type="url">https://drep.readthedocs.io/en/ latest/choosing_parameters.html</ref>). When using the pyani approach we favor, using our data set of Microcystis MAGs artificially reduced to lower completeness by randomly removing contigs, only 1 out of 46 genomes was removed at completeness levels between 10 and 60%. At 70, 80, and 90% completeness, pyani removed 2, 8, and 22 genomes at a threshold of &#1022;99% identity and alignments covering &#1022;75% of the MAG (Table <ref type="table">2</ref>). At their highest estimated completeness level, which ranged between 95 and 100%, 25 genomes were removed. While performance will likely vary depending on the data set, these results suggest that dereplication with our preferred combined Mashpyani approach is not able to prune highly incomplete genomes and will retain more genomes than optimal even for genomes that are estimated to be 80 to 90% complete. Considering the limited effect of percent alignment threshold on the pyani approach when using complete MAGs, lowering this threshold helps remove more genomes and, hence, should be lowered to 50% or less when trying to prune more incomplete MAGs (Table <ref type="table">2</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AVAILABLE APPROACHES TO LEVERAGE SAMPLING OF BETWEEN-POPULATION VARIATION</head><p>Several tools have been developed to maintain the auxiliary genomes of closely related strains while avoiding redundancy when tracking strain-resolved population dynamics in the environment using metagenomic data (reviewed in reference <ref type="bibr">19)</ref>. They typically use metagenomic data in combination with a database of genomes of closely related isolates or MAGs based on whether alleles of shared genes (StrainPhlAn <ref type="bibr">[20]</ref> and a Microcystis aeruginosa MAGs were artificially reduced in completeness by random subsampling of contigs.</p><p>The full MAGs ranged in estimated completeness between 95 and 100%. Shown are the numbers of the original 46 MAGs retained by our combined Mash-pyani approach using different percent alignment thresholds.</p><p>ConStrains <ref type="bibr">[21]</ref>), strain-specific auxiliary genes (PanPhlAn <ref type="bibr">[22]</ref>), or both are present in a sample (MIDAS <ref type="bibr">[23]</ref>). Similarly, the Anvi'o package incorporates a metapangenome workflow that reduces a set of user-defined conspecific genomes to gene clusters representing core and auxiliary genes and then estimates strain abundances across metagenomic data sets <ref type="bibr">(18)</ref>. In principle, all of these approaches avoid the issues associated with database redundancy highlighted in Table <ref type="table">1</ref> and Fig. <ref type="figure">2</ref> and the loss of population-specific auxiliary genes highlighted in Table <ref type="table">1</ref>. Although variant identification errors do remain, which are tool and likely database and metagenomic data set dependent, this has been reported to be as low as 0.1% <ref type="bibr">(20)</ref>. While potential issues with these approaches have not been fully evaluated, analyses focusing on populations where the dominant strain can be more readily resolved have been able to go as far as tracking in situ bacterial evolution in environmental biofilms and the human gut <ref type="bibr">(24,</ref><ref type="bibr">25)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CONCLUSIONS</head><p>Genome-centric metagenomics has opened a view into the undescribed branches of the tree of life. However, full awareness of the risks associated with MAGs is needed to avoid the misinterpretation of the data and populating databases with questionable genomes. Dereplication is a step carried out by many researchers as part of metagenomic informatic pipelines, but we highlight large differences between commonly used tools in how many genomes are removed. Tools designed to resolve closely related genomes in a database exist and may circumvent issues with redundancy while maximally leveraging all data from conspecific populations obtained from generating MAGs. As the ability to resolve closely related genomes is dependent on the genetic distance between genomes in the database and between database genomes and those of sampled populations, these tools need broader adaptation and evaluation to fully evaluate their accuracy. Additionally, we hope there will be further development of tools that resolve the issues highlighted in this study.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 2</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 3</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 4</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 5</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_4"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 6</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_5"><p>May/June 2020 Volume 5 Issue 3 e00971-19 msphere.asm.org 7</p></note>
		</body>
		</text>
</TEI>
