Search for: All records

Award ID contains: 2034228

  1. Abstract

    Motivation

    Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features.

    Results

    We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern.

    Availability and implementation

    TopHap is available at https://github.com/SayakaMiura/TopHap.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
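
The core idea in the Results above, restricting genomes to positions that carry common variants and then keeping only the haplotypes that are themselves common, can be illustrated with a small sketch. This is not the TopHap implementation: the function names, the single global frequency threshold, and the toy sequences are all assumptions made for illustration, and the sketch ignores the spatiotemporal slicing and bootstrap resampling that the abstract describes.

```python
from collections import Counter


def common_variant_positions(genomes, reference, min_freq):
    """Return alignment positions where some non-reference base
    occurs in at least `min_freq` of the genomes (hypothetical helper)."""
    n = len(genomes)
    positions = []
    for i, ref_base in enumerate(reference):
        # Count only bases that differ from the reference at this site.
        alt_counts = Counter(g[i] for g in genomes if g[i] != ref_base)
        if alt_counts and alt_counts.most_common(1)[0][1] / n >= min_freq:
            positions.append(i)
    return positions


def common_haplotypes(genomes, positions, min_freq):
    """Project each genome onto `positions` and keep the haplotypes
    seen in at least `min_freq` of the genomes (hypothetical helper)."""
    n = len(genomes)
    counts = Counter("".join(g[i] for i in positions) for g in genomes)
    return {hap: c for hap, c in counts.items() if c / n >= min_freq}


# Toy aligned sequences (invented data, not SARS-CoV-2).
reference = "ACGTAC"
genomes = [
    "ACGTAC",
    "ACGTAC",
    "TCGTAC",
    "TCGAAC",
    "TCGAAC",
    "ACGTAG",  # carries a rare variant at the last site
]

positions = common_variant_positions(genomes, reference, min_freq=0.3)
haplotypes = common_haplotypes(genomes, positions, min_freq=0.3)
print(positions, haplotypes)
```

In this toy case the rare variant at the last site is excluded before haplotypes are formed, so a likely sequencing error or private mutation does not create a spurious haplotype. A phylogeny would then be inferred from the retained haplotypes only, which is far fewer sequences than the original genome set.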

     
  2. Background: Among the most consequential unknowns of the devastating COVID-19 pandemic are the durability of immunity and time to likely reinfection. There are limited direct data on SARS-CoV-2 long-term immune responses and reinfection. The aim of this study is to use data on the durability of immunity among evolutionarily close coronavirus relatives of SARS-CoV-2 to estimate times to reinfection by a comparative evolutionary analysis of related viruses SARS-CoV, MERS-CoV, human coronavirus (HCoV)-229E, HCoV-OC43, and HCoV-NL63.

    Methods: We conducted phylogenetic analyses of the S, M, and ORF1b genes to reconstruct a maximum-likelihood molecular phylogeny of human-infecting coronaviruses. This phylogeny enabled comparative analyses of peak-normalised nucleocapsid protein, spike protein, and whole-virus lysate IgG antibody optical density levels, in conjunction with reinfection data on endemic human-infecting coronaviruses. We performed ancestral and descendent states analyses to estimate the expected declines in antibody levels over time, the probabilities of reinfection based on antibody level, and the anticipated times to reinfection after recovery under conditions of endemic transmission for SARS-CoV-2, as well as the other human-infecting coronaviruses.

    Findings: We obtained antibody optical density data for six human-infecting coronaviruses, extending from 128 days to 28 years after infection between 1984 and 2020. These data provided a means to estimate profiles of the typical antibody decline and probabilities of reinfection over time under endemic conditions. Reinfection by SARS-CoV-2 under endemic conditions would likely occur between 3 months and 5·1 years after peak antibody response, with a median of 16 months. This protection is less than half the duration revealed for the endemic coronaviruses circulating among humans (5-95% quantiles 15 months to 10 years for HCoV-OC43, 31 months to 12 years for HCoV-NL63, and 16 months to 12 years for HCoV-229E). For SARS-CoV, the 5-95% quantiles were 4 months to 6 years, whereas the 95% quantiles for MERS-CoV were inconsistent by dataset.

    Interpretation: The timeframe for reinfection is fundamental to numerous aspects of public health decision making. As the COVID-19 pandemic continues, reinfection is likely to become increasingly common. Maintaining public health measures that curb transmission, including among individuals who were previously infected with SARS-CoV-2, coupled with persistent efforts to accelerate vaccination worldwide, is critical to the prevention of COVID-19 morbidity and mortality.

    Funding: US National Science Foundation.
  3. Yeager, Meredith (Ed.)
    Abstract Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/). 
  4. Battistuzzi, Fabia Ursula (Ed.)
    Abstract The Molecular Evolutionary Genetics Analysis (MEGA) software has matured to contain a large collection of methods and tools of computational molecular evolution. Here, we describe new additions that make MEGA a more comprehensive tool for building timetrees of species, pathogens, and gene families using rapid relaxed-clock methods. Methods for estimating divergence times and confidence intervals are implemented to use probability densities for calibration constraints for node-dating and sequence sampling dates for tip-dating analyses. They are supported by new options for tagging sequences with spatiotemporal sampling information, an expanded interactive Node Calibrations Editor, and an extended Tree Explorer to display timetrees. Also added is a Bayesian method for estimating neutral evolutionary probabilities of alleles in a species using multispecies sequence alignments and a machine learning method to test for the autocorrelation of evolutionary rates in phylogenies. The computer memory requirements for the maximum likelihood analysis are reduced significantly through reprogramming, and the graphical user interface has been made more responsive and interactive for very big data sets. These enhancements will improve the user experience, quality of results, and the pace of biological discovery. Natively compiled graphical user interface and command-line versions of MEGA11 are available for Microsoft Windows, Linux, and macOS from www.megasoftware.net. 