Title: The impacts of fine-tuning, phylogenetic distance, and sample size on big-data bioacoustics
Vocalizations in animals, particularly birds, are critically important behaviors that influence their reproductive fitness. While recordings of bioacoustic data have been captured and stored in collections for decades, the automated extraction of data from these recordings has only recently been facilitated by artificial intelligence methods. These methods have yet to be evaluated with respect to the accuracy of different automation strategies and features. Here, we use a recently published machine learning framework to extract syllables from ten bird species ranging in their phylogenetic relatedness from 1 to 85 million years, to compare how phylogenetic relatedness influences accuracy. We also evaluate the utility of applying trained models to novel species. Our results indicate that model performance is best on conspecifics, with accuracy progressively decreasing as phylogenetic distance increases between taxa. However, we also find that the application of models trained on multiple distantly related species can improve the overall accuracy to levels near that of training and analyzing a model on the same species. When planning big-data bioacoustics studies, care must be taken in sample design to maximize sample size and minimize human labor without sacrificing accuracy.
Award ID(s):
2118240 2016189
PAR ID:
10404257
Author(s) / Creator(s):
; ;
Editor(s):
Staples, Anne Elizabeth
Date Published:
Journal Name:
PLOS ONE
Volume:
17
Issue:
12
ISSN:
1932-6203
Page Range / eLocation ID:
e0278522
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Stochastic models of character trait evolution have become a cornerstone of evolutionary biology in an array of contexts. While probabilistic models have been used extensively for statistical inference, they have largely been ignored for the purpose of measuring distances between phylogeny-aware models. Recent contributions to the problem of phylogenetic distance computation have highlighted the importance of explicitly considering evolutionary model parameters and their impacts on molecular sequence data when quantifying dissimilarity between trees. By comparing two phylogenies in terms of their induced probability distributions that are functions of many model parameters, these distances can be more informative than traditional approaches that rely strictly on differences in topology or branch lengths alone. Currently, however, these approaches are designed for comparing models of nucleotide substitution and gene tree distributions, and thus, are unable to address other classes of traits and associated models that may be of interest to evolutionary biologists. Here we expand the principles of probabilistic phylogenetic distances to compute tree distances under models of continuous trait evolution along a phylogeny. By explicitly considering both the degree of relatedness among species and the evolutionary processes that collectively give rise to character traits, these distances provide a foundation for comparing models and their predictions, and for quantifying the impacts of assuming one phylogenetic background over another while studying the evolution of a particular trait. We demonstrate the properties of these approaches using theory, simulations, and several empirical datasets that highlight potential uses of probabilistic distances in many scenarios. We also introduce an open-source R package named PRDATR for easy application by the scientific community for computing phylogenetic distances under models of character trait evolution.
  2. Abstract Kinship plays a fundamental role in the evolution of social systems and is considered a key driver of group living. To understand the role of kinship in the formation and maintenance of social bonds, accurate measures of genetic relatedness are critical. Genotype‐by‐sequencing technologies are rapidly advancing the accuracy and precision of genetic relatedness estimates for wild populations. The ability to assign kinship from genetic data varies depending on a species’ or population's mating system and pattern of dispersal, and empirical data from longitudinal studies are crucial to validate these methods. We use data from a long‐term behavioural study of a polygynandrous, bisexually philopatric marine mammal to measure accuracy and precision of parentage and genetic relatedness estimation against a known partial pedigree. We show that with moderate but obtainable sample sizes of approximately 4,235 SNPs and 272 individuals, highly accurate parentage assignments and genetic relatedness coefficients can be obtained. Additionally, we subsample our data to quantify how data availability affects relatedness estimation and kinship assignment. Lastly, we conduct a social network analysis to investigate the extent to which accuracy and precision of relatedness estimation improve statistical power to detect an effect of relatedness on social structure. Our results provide practical guidance for minimum sample sizes and sequencing depth for future studies, as well as thresholds for post hoc interpretation of previous analyses. 
  3. Abstract Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real-time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression, which we demonstrate has patterns of sensitivity to model misspecification similar to those of Bayesian highest posterior density (HPD) intervals; its intervals greatly overlap with HPDs but have lower precision (are more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training.
    Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
  4. Oscine songbirds learn vocalizations that function in mate attraction and territory defense. Sexual selection pressures on these learned songs could accelerate speciation. The Eastern and Spotted towhees are sister species that diverged recently (0.28 Ma) but now have partially overlapping ranges with evidence of some hybridization; widespread community-science recordings of these species, including songs within their zone of overlap and from potential hybrids, enable us to investigate whether song differentiation might facilitate their reproductive isolation. Here, we quantify 16 song features to analyze geographic variation in Spotted and Eastern towhee songs and test for species-level differences. We then use random-forest models to measure how accurately their songs can be classified by species, both within and outside the zone of overlap. While no single song feature reliably distinguishes the two species, a random-forest model trained on 16 features accurately classified 89.5% of songs; interestingly, species classification was less accurate in the zone of overlap. Finally, our analysis of the limited publicly available genetic data from each species supports the hypothesis that they are reproductively isolated. Together, our results suggest that, in combination, small variations in song features may contribute to these sister species’ ability to recognize their species-specific songs. 
  5. Abstract Pathogen spillover corresponds to the transmission of a pathogen or parasite from an original host species to a novel host species, preluding disease emergence. Understanding the interacting factors that lead to pathogen transmission in a zoonotic cycle could help identify novel hosts of pathogens and the patterns that lead to disease emergence. We hypothesize that ecological and biogeographic factors drive host encounters, infection susceptibility, and cross‐species spillover transmission. Using a rodent–ectoparasite system in the Neotropics, with shared ectoparasite associations as a proxy for ecological interaction between rodent species, we assessed relationships between rodents using geographic range, phylogenetic relatedness, and ectoparasite associations to determine the roles of generalist and specialist hosts in the transmission cycle of hantavirus. A total of 50 rodent species were ranked on their centrality in a network model based on ectoparasites sharing. Geographic proximity and phylogenetic relatedness were predictors for rodents to share ectoparasite species and were associated with shorter network path distance between rodents through shared ectoparasites. The rodent–ectoparasite network model successfully predicted independent data of seven known hantavirus hosts. The model predicted five novel rodent species as potential, unrecognized hantavirus hosts in South America. Findings suggest that ectoparasite data, geographic range, and phylogenetic relatedness of wildlife species could help predict novel hosts susceptible to infection and possible transmission of zoonotic pathogens. Hantavirus is a high‐consequence zoonotic pathogen with documented animal‐to‐animal, animal‐to‐human, and human‐to‐human transmission. Predictions of new rodent hosts can guide active epidemiological surveillance in specific areas and wildlife species to mitigate hantavirus spillover transmission risk from rodents to humans. 
    This study supports the idea that ectoparasite relationships among rodents are a proxy of host species interactions and can inform transmission cycles of diverse pathogens circulating in wildlife disease systems, including wildlife viruses with epidemic potential, such as hantavirus.