<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Mammal virus diversity estimates are unstable due to accelerating discovery effort</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10312508</idno>
					<idno type="doi">10.1098/rsbl.2021.0427</idno>
					<title level='j'>Biology Letters</title>
<idno>1744-957X</idno>
<biblScope unit="volume">18</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Rory Gibb</author><author>Gregory F. Albery</author><author>Nardus Mollentze</author><author>Evan A. Eskew</author><author>Liam Brierley</author><author>Sadie J. Ryan</author><author>Stephanie N. Seifert</author><author>Colin J. Carlson</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Host-virus association data underpin research into the distribution and eco-evolutionary correlates of viral diversity and zoonotic risk across host species. However, current knowledge of the wildlife virome is inherently constrained by historical discovery effort, and there are concerns that the reliability of ecological inference from host-virus data may be undermined by taxonomic and geographical sampling biases. Here, we evaluate whether current estimates of host-level viral diversity in wild mammals are stable enough to be considered biologically meaningful, by analysing a comprehensive dataset of discovery dates of 6571 unique mammal host-virus associations between 1930 and 2018. We show that virus discovery rates in mammal hosts are either constant or accelerating, with little evidence of declines towards viral richness asymptotes, even in highly sampled hosts. Consequently, inference of relative viral richness across host species has been unstable over time, particularly in bats, where intensified surveillance since the early 2000s caused a rapid rearrangement of species' ranked viral richness. Our results illustrate that comparative inference of host-level virus diversity across mammals is highly sensitive to even short-term changes in sampling effort. We advise caution to avoid overinterpreting patterns in current data, since it is feasible that an analysis conducted today could draw quite different conclusions than one conducted only a decade ago.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Pathogens are unevenly distributed across host species, and understanding the underlying coevolutionary processes is important for both ecological and &#169; 2022 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License <ref type="url">http://creativecommons.org/licenses/by/4.0/</ref>, which permits unrestricted use, provided the original author and source are credited.</p><p>health-motivated research. For example, data on how viral diversity is distributed across species and geographies can provide insights into biogeographical trends and anthropogenic drivers of cross-species transmission and disease emergence <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref>. Researchers have developed numerous hypotheses about the mechanisms underlying observed differences in virus diversity across hosts, from broad macroevolutionary trends (e.g. bats host a greater apparent diversity of viruses than other mammal orders <ref type="bibr">[4]</ref>) to narrower ecological associations (e.g. longer lived bats living in larger groups host a greater apparent diversity of viruses <ref type="bibr">[5]</ref>). Such work frequently analyses the number of viruses known to infect a given host species (viral richness) by synthesizing existing host-virus association data <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref>.</p><p>However, recent work has raised concerns that such datasets inspire false confidence. Although host-virus association datasets take an increasingly complete inventory of current scientific knowledge <ref type="bibr">[8]</ref>, a substantial proportion of known viruses remain excluded because of long lead times before official taxonomic recognition, which is itself not uniform across the virome <ref type="bibr">[9]</ref>. An even greater proportion of the global virome remains completely undescribed <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>, with current knowledge strongly influenced by discovery strategies <ref type="bibr">[12]</ref>. This may undermine inference about the distribution of zoonotic risk among host taxa <ref type="bibr">[9]</ref>, and multiple studies have shown that apparent patterns in zoonotic virus richness become insignificant after adjusting for total viral richness <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. Yet it remains unclear how this impacts more basic scientific questions, including those concerning macroecological patterns in species-level viral diversity.</p><p>In this study, we evaluate whether-given the limits of current data-host-level estimates of viral diversity in mammals can be considered biologically meaningful based on their temporal consistency. Even when a species' total viral diversity has been ground-truthed by thorough metagenomic sampling and rarefaction-based estimation <ref type="bibr">[15]</ref>, estimates suggest only approximately 3-7% of their viruses are captured by current host-virus association data <ref type="bibr">[10]</ref>. With such a small proportion of viruses described, it is plausible that comparative studies of viral diversity are using numbers that are both subject to change and highly sensitive to differences in sampling strategies between different host and virus groups.</p><p>We explore these questions using a dataset of 6571 mammal host-virus associations and their year of discovery (the earliest year that a virus was reported in association with a given host), representing a comprehensive inventory of known associations from 1930 to 2018 (electronic supplementary material, figure <ref type="figure">S1</ref>). We focus on wild mammals, because the historical intensity of pathogen discovery effort on domestic species could confound inference (electronic supplementary material, figure <ref type="figure">S2</ref>). First, we examine virus accumulation curves to test whether current absolute viral richness estimates in well-sampled orders and species are likely to be accurate, applying a test borrowed from research on parasite biodiversity <ref type="bibr">[16,</ref><ref type="bibr">17]</ref>: richness estimates can only be taken as 'stable'-and thus reflective of values close to the truth-if accumulation curves have passed an inflection point towards an asymptote <ref type="bibr">[18]</ref>. Alternatively, if viral diversity is still accumulating exponentially, current estimates may have little correlation to 'true' (unknown) viral richness. Second, we evaluate the temporal stability of relative viral richness estimates across wild mammals by testing the rank correlation between present-day and historical estimates. If the correlation of relative viral richness has remained fairly stable over time, this would suggest that species' viromes have been sampled proportionally, and that current data can (despite being incomplete) still provide meaningful comparative information about viral diversity across mammals.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methods (a) Mammal host-virus association data over time</head><p>We accessed mammal host-virus records (1277 mammal species and 1756 viruses, of which 1073 are currently ratified by the International Committee on the Taxonomy of Viruses, ICTV) from a comprehensive multi-source database of host-virus associations (VIRION; <ref type="url">https://github.com/viralemergence/virion</ref>). VIRION compiles data from several static data sources, the NCBI GenBank database and the USAID PREDICT project database, with host taxonomy standardized to the NCBI taxonomic backbone <ref type="bibr">[6,</ref><ref type="bibr">8]</ref>. Here, we define a host-virus association based on broad evidence of infection: either serological, polymerase chain reaction (PCR)-based, or viral isolation. Some records describe recently discovered viral strains that are not yet resolved to species level; to ensure these do not inflate viral richness estimates, we only included taxonomically resolved viruses, defined as either ratified by ICTV (n = 1073) or reconciled to the internal viral taxonomy of the PREDICT project (n = 683) <ref type="bibr">[8]</ref>.</p><p>We defined the 'discovery year' for each unique host-virus pair (n = 6571) as the earliest year a given virus was reported in a given host, based on date of publication (for literature-based records), accession (for NCBI Nucleotide and GenBank-based records), or sample collection (for records from the USAID PRE-DICT database). The full database contains data up to mid-2021; however, novel association records become notably sparser after 2018 (electronic supplementary material, figure <ref type="figure">S1</ref>), probably owing to delays between viral sampling, reporting and taxonomic assignment <ref type="bibr">[6]</ref>. We therefore excluded all post-2018 records to avoid biasing inference about virus discovery trends in recent years. To examine trends in publication effort (a proxy for sampling effort), for each host we extracted annual counts of virus-related publications (by searching for species binomial plus all synonyms and 'virus' or 'viral') from the PubMed database using the R package 'rentrez' <ref type="bibr">[19]</ref>. We visualized cumulative virus discovery curves and publication counts over time at order-level (electronic supplementary material, figures S2 and S3) and across all wild mammal species (electronic supplementary material, figures S4 and S5). With the exception of individual species-level models (electronic supplementary material, figure <ref type="figure">S6</ref>), all subsequent analyses included wild species only (n = 1246) and excluded domestic and common laboratory species (defined using metadata compiled for VIRION <ref type="bibr">[6]</ref>) (electronic supplementary material, figure <ref type="figure">S2</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(b) Modelling trends in viral discovery rates at order-and species-level</head><p>We modelled trends in viral discovery rates by fitting generalized additive models (GAMs) to annual counts of viruses discovered per taxon (1930-2018), with a nonlinear trend of year fitted using penalized thin-plate regression splines in 'mgcv' <ref type="bibr">[20]</ref>. We fitted models at order-level (including the top eight best-sampled mammal orders with the highest known viral richness: Artiodactyla, Rodentia, Carnivora, Primates, Chiroptera, Lagomorpha, Perissodactyla and Eulipotyphla), and at species-level for the top 50 most virus-rich species in our dataset. Virus discovery counts were modelled as a Poisson process for all orders except Chiroptera, Rodentia and Primates, which were modelled using a negative binomial likelihood due to high overdispersion in recent years (figure <ref type="figure">1</ref>). If discovery curves have reached an inflection point in any taxon, we would expect a consistent downward trend in discovery rates in recent years. To test this, we identified time periods showing strong evidence of either increasing or declining trends, defined as periods during which the 95% confidence interval of the first derivative of the fitted spline does not overlap zero.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(c) Evaluating the temporal stability of relative viral richness estimates across taxa</head><p>A key assumption of most ecological studies using host-virus data is that currently known differences in virome composition between species (or higher groupings) are broadly representative of 'true' underlying patterns in viral diversity. If this were the case, differences in relative viral richness across taxa would be expected to stay relatively stable over time, even as discovery effort fills the gaps in species-level virus inventories. Alternatively, uneven sampling effort across species and time may severely impact this assumption <ref type="bibr">[9]</ref>, by causing instability and rapid reordering of viral richness estimates across taxa. We tested this by calculating the rank correlation (Spearman's &#961;) of viral richness in 2018 to viral richness estimates in annual timesteps backwards to 1960 (i.e. comparing the similarity of each annual historical 'snapshot' of ranked viral richness to present-day knowledge). We conducted this analysis at several taxonomic levels, comparing viral richness at the species level (across all mammal species, and separately within each of the key orders listed above), and comparing two different metrics at family and order levels (total viral richness and mean species-level viral richness).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(d) Examining the stability of ecological inferences</head><p>As a test of how changing knowledge might impact ecological inference, we examined the relationship between order-level species richness and viral richness, using data summarized at 5-year increments between 1990 and 2020 (n = 17 orders). We aimed to replicate Mollentze &amp; Streicker's <ref type="bibr">[13]</ref> finding that, at order-level, viral richness is mainly explained by species richness (suggesting a neutral explanation for the distribution of viral diversity). We accessed mammal order species richness estimates from the International Union for Conservation of Nature and, at each focal year, calculated order-level total viral richness (based on PCR or viral isolation evidence) and virus-related citation counts using only host-virus records up to and including that year. We modelled the relationship between log species richness and viral richness, adjusting for sampling effort (log citations), by fitting generalized linear models with a negative binomial likelihood. Analyses were conducted in R v. 4.0.3 <ref type="bibr">[21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>Both cumulative discovery curves and fitted GAMs show that viral discovery in mammals is still in an upward growth phase, with little evidence of discovery rates declining towards zero (i.e. viral richness reaching an asymptote) in any group (figure <ref type="figure">1</ref>; electronic supplementary material, figure <ref type="figure">S2</ref>). This trend is mirrored in virus-related publication counts, which are exponentially increasing year-on-year across most mammal orders and covering an increasingly broad species range (electronic supplementary material, figure <ref type="figure">S3</ref>), but remain unevenly distributed across mammal groups (electronic supplementary material, figure <ref type="figure">S5</ref>). There is evidence for general upticks in discovery rates at two main historical junctures (figure <ref type="figure">1</ref>), first during the 1960s when technological improvements-including density gradient centrifugation for viral isolation, and establishment of the first human diploid fibroblast cell lines and the now-ubiquitous African green monkey kidney Vero cell line-facilitated industrial-scale production of viruses for research or vaccines <ref type="bibr">[22]</ref>. Discovery rates again increased sharply throughout the 2000s, coinciding with improvements in molecular detection techniques and nextgeneration sequencing, as well as growing funding for viral surveillance in wildlife following the 2002 SARS-CoV epidemic (an uptick in effort that was strongly focused on bats; electronic supplementary material, figure <ref type="figure">S3</ref>). The overall picture is the same at the species level, with the mean cumulative viral richness across all wild species still increasing exponentially (electronic supplementary material, figure <ref type="figure">S4</ref>) and little evidence of discovery rates declining within even highly sampled species (many of which are domestic; electronic supplementary material, figure <ref type="figure">S6</ref>). These trends are very similar when using several more conservative definitions of viral richness (viral genera, ICTV-ratified viruses, or stricter detection criteria excluding serologic detection; electronic supplementary material, figure <ref type="figure">S7</ref>). We also find no evidence that viral richness is becoming more weakly correlated to publication counts over time, as would be expected if viral diversity was reaching an asymptote in well-sampled groups (electronic supplementary material, figure <ref type="figure">S8</ref>). A consequence of this accelerating discovery trend is that inference of relative viral richness across species and higher taxonomic levels has been unstable over the last 60 years (figure <ref type="figure">2</ref>). Across all mammals, there is a consistent, gradual temporal decay in rank correlation between present-day and historical estimates of total viral richness, with species-level curves declining much more steeply than those at higher taxonomic levels (dropping to &#961; = 0.48 by 1991; figure <ref type="figure">2a</ref>). Rankings of mean species-level viral richness at order and family levels (arguably a more relevant metric when considering species contributions to community pathogen maintenance and transmission) are substantially more effort-sensitive than total viral richness, showing much steeper declines (figure <ref type="figure">2a</ref>). Within well-sampled mammal orders there is substantial variation in the historical stability of species-level relative viral richness estimates, and results before 1970 become unstable owing to data sparsity in several orders (figure <ref type="figure">2b</ref>). Notably, within Chiroptera, there has been an extremely rapid reordering of specieslevel viral richness estimates since 2000 (declining to &#961; = 0.59 by 2010 and to &#961; = 0.28 by 2001), probably owing to the ongoing intensification of research effort (electronic supplementary material, figure <ref type="figure">S3</ref>) and viral discovery (figure <ref type="figure">1</ref>) that followed the emergence of SARS-CoV <ref type="bibr">[23]</ref>. Our results show that such rapid changes in host-virus knowledge can impact inference: a positive relationship between order-level species richness and viral richness is only clearly detectable in data from 2010 onwards (electronic supplementary material, figure <ref type="figure">S9</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Our results suggest that for most mammal species, viral diversity metrics are still shifting and largely reflect historical sampling bias. Given that even well-studied species do not have fully characterized viromes, these estimates are likely to continue shifting in coming years. Inference made on them, however, might become canonical in the literature-and embed false narratives about viral ecology-if these analyses are not repeated as the virome becomes better described. The situation might be improved by massively coordinated projects aiming to accelerate viral discovery <ref type="bibr">[11,</ref><ref type="bibr">24]</ref>, provided sampling strategies are designed to be taxonomically and geographically representative. However, the rapid recharacterization of the bat virome that has occurred since the first SARS epidemic highlights a significant risk: if sampling strategies are primarily motivated by either existing (zoonotic) viral diversity estimates or health security concerns linked to specific taxa, such initiatives might only further decouple observed and true underlying viral diversity.</p><p>Indeed, the unprecedented upward trend in wildlife virus discovery effort since 2000 has been unevenly distributed taxonomically and geographically, with rodents and bats being particularly heavily sampled and showing the highest instability in richness estimates. Ungulates (Artiodactyla and Perissodactyla) are unique among mammals in that reported viral diversity among domestic species exceeds that detected in wildlife (electronic supplementary material, figure <ref type="figure">S1</ref>). Although possibly reflecting the unique ecology of farmed livestock, this more likely reflects a bias towards sampling from livestock, which poses fewer logistical hurdles than sampling from wild ungulates. Further, many viral discovery efforts focus on the detection of targeted viral taxa (e.g. family-level consensus PCR) rather than unbiased approaches that remain cost-prohibitive and analytically challenging. Such evolving detection biases-including efforts to identify bat betacoronaviruses following the emergence of SARS-CoV-2-could, for example, continue to reinforce the perception of certain host taxa as unusually virus-diverse despite inconclusive evidence <ref type="bibr">[13]</ref>. Such biases have consequences for the stability of ecological inference: our heuristic analysis demonstrates that the recently reported positive relationship between species richness and viral richness at the order-level <ref type="bibr">[13]</ref> only becomes detectable in post-2010 data, which is especially notable given that estimates of relative order-level viral diversity have been more stable than species-level metrics (figure <ref type="figure">2</ref>). It is therefore concerning that comparative studies of correlates and geographical patterns of host-virus relationships conducted in the mid-2000s might feasibly have drawn quite different conclusions than similar studies conducted now or in the future.</p><p>These problems are not necessarily surprising to virologists, who have historically been more hesitant about inference from these limited samples than ecologists, and have encouraged particular caution with respect to inference about human health risks <ref type="bibr">[9]</ref>. Multiple studies have found that correcting for undersampling undermines widespread assumptions about zoonotic risk <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>, and we suggest that future studies should similarly attempt to reject the null hypothesis that downstream patterns of zoonotic risk are a neutral consequence of total observed viral diversity. Given that present-day data are a tiny subset of the latent 'true' host-virus network, there will also be value in employing network-or measurement error-based methods that explicitly account for observation biases in analyses <ref type="bibr">[25]</ref>. Overall, because current patterns of host-level viral richness represent an unstable and biased snapshot of the mammal virome, we suggest that inference from host-virus association data needs to be carefully qualified and may not by itself be a comprehensive foundation for setting future agendas on viral zoonosis research or One Health policy.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from https://royalsocietypublishing.org/ on 05 January 2022</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>royalsocietypublishing.org/journal/rsbl Biol. Lett. 18: 20210427 Downloaded from https://royalsocietypublishing.org/ on 05 January 2022</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>royalsocietypublishing.org/journal/rsbl Biol. Lett. 18: 20210427 Downloaded from https://royalsocietypublishing.org/ on 05 January 2022</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>royalsocietypublishing.org/journal/rsbl Biol. Lett. 18: 20210427 Downloaded from https://royalsocietypublishing.org/ on 05 January 2022</p></note>
		</body>
		</text>
</TEI>
