Title: A Snapshot of SARS-CoV-2 Genome Availability up to April 2020 and its Implications: Data Analysis
Background: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been growing exponentially, affecting over 4 million people and causing enormous distress to economies and societies worldwide. A plethora of analyses based on viral sequences has already been published both in scientific journals and through non–peer-reviewed channels to investigate the genetic heterogeneity and spatiotemporal dissemination of SARS-CoV-2. However, a systematic investigation of phylogenetic information and sampling bias in the available data is lacking. Although the number of available genome sequences of SARS-CoV-2 is growing daily and the sequences show increasing phylogenetic information, country-specific data still present severe limitations and should be interpreted with caution.
Objective: The objective of this study was to determine the quality of the currently available SARS-CoV-2 full genome data in terms of sampling bias as well as phylogenetic and temporal signals to inform and guide the scientific community.
Methods: We used maximum likelihood–based methods to assess the presence of sufficient information for robust phylogenetic and phylogeographic studies in several SARS-CoV-2 sequence alignments assembled from GISAID (Global Initiative on Sharing All Influenza Data) data released between March and April 2020.
Results: Although the number of high-quality full genomes is growing daily, and sequence data released in April 2020 contain sufficient phylogenetic information to allow reliable inference of phylogenetic relationships, country-specific SARS-CoV-2 data sets still present severe limitations.
Conclusions: At the present time, studies assessing within-country spread or transmission clusters should be considered preliminary or hypothesis-generating at best. Hence, current reports should be interpreted with caution, and concerted efforts should continue to increase the number and quality of sequences required for robust tracing of the epidemic.
Award ID(s):
2028221
PAR ID:
10222264
Journal Name:
JMIR Public Health and Surveillance
Volume:
6
Issue:
2
ISSN:
2369-2960
Page Range / eLocation ID:
e19170
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gaglia, Marta M. (Ed.)
    Abstract: Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces “black box” models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
  2. Abstract. Motivation: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features. Results: We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern. Availability and implementation: TopHap is available at https://github.com/SayakaMiura/TopHap. Supplementary information: Supplementary data are available at Bioinformatics online.
  3. Adrish, Muhammad (Ed.)
    Mexico has experienced one of the highest COVID-19 mortality rates in the world. A delayed implementation of social distancing interventions in late March 2020 and a phased reopening of the country in June 2020 have facilitated sustained disease transmission in the region. In this study we systematically generate and compare 30-day ahead forecasts using previously validated growth models based on mortality trends from the Institute for Health Metrics and Evaluation for Mexico and Mexico City in near real-time. Moreover, we estimate reproduction numbers for SARS-CoV-2 based on methods that rely on genomic data as well as case incidence data. Subsequently, functional data analysis techniques are utilized to analyze the shapes of COVID-19 growth rate curves at the state level to characterize the spatiotemporal transmission patterns of SARS-CoV-2. The early estimates of the reproduction number for Mexico were R_t ~1.1–1.3 from the genomic and case incidence data. Moreover, the mean estimate of R_t fluctuated around ~1.0 from late July until the end of September 2020. The spatial analysis characterizes the state-level dynamics of COVID-19 into four groups with distinct epidemic trajectories based on epidemic growth rates. Our results show that the sequential mortality forecasts from the GLM and Richards model predict a downward trend in the number of deaths for all thirteen forecast periods for Mexico and Mexico City. However, the sub-epidemic and IHME models perform better, predicting a more realistic stable trajectory of COVID-19 mortality trends for the last three forecast periods (09/21-10/21, 09/28-10/27, 09/28-10/27) for Mexico and Mexico City. Our findings indicate that phenomenological models are useful tools for short-term epidemic forecasting, although forecasts need to be interpreted with caution given the dynamic implementation and lifting of social distancing measures.
  4. Incarcerated individuals are a highly vulnerable population for infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Understanding the transmission of respiratory infections within prisons and between prisons and surrounding communities is a crucial component of pandemic preparedness and response. Here, we use mathematical and statistical models to analyze publicly available data on the spread of SARS-CoV-2 reported by the Ohio Department of Rehabilitation and Corrections (ODRC). Results from mass testing conducted on April 16, 2020 were analyzed together with time of first reported SARS-CoV-2 infection among Marion Correctional Institution (MCI) inmates. Extremely rapid, widespread infection of MCI inmates was reported, with nearly 80% of inmates infected within 3 weeks of the first reported inmate case. The dynamical survival analysis (DSA) framework that we use allows the derivation of explicit likelihoods based on mathematical models of transmission. We find that these data are consistent with three non-exclusive possibilities: (i) a basic reproduction number >14 with a single initially infected inmate, (ii) an initial superspreading event resulting in several hundred initially infected inmates with a reproduction number of approximately three, or (iii) earlier undetected circulation of virus among inmates prior to April. All three scenarios attest to the vulnerabilities of prisoners to COVID-19, and the inability to distinguish among these possibilities highlights the need for improved infection surveillance and reporting in prisons. 
  5. Abstract: Since its global emergence in 2020, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused multiple epidemics in the United States. When medical treatments for the virus were still emerging and a vaccine was not yet available, state and local governments sought to limit its spread by enacting various social-distancing interventions, such as school closures and lockdowns; however, the effectiveness of these interventions was unknown. We applied an established, semimechanistic Bayesian hierarchical model of these interventions to the spread of SARS-CoV-2 from Europe to the United States, using case fatalities from February 29, 2020, up to April 25, 2020, when some states began reversing their interventions. We estimated the effects of interventions across all states, contrasted the estimated reproduction numbers before and after lockdown for each state, and contrasted the predicted number of future fatalities with the actual number of fatalities as a check of the model’s validity. Overall, school closures and lockdowns were the only interventions modeled that had a reliable impact on the time-varying reproduction number, and lockdown appears to have played a key role in reducing that number to below 1.0. We conclude that reversal of lockdown without implementation of additional, equally effective interventions will enable continued, sustained transmission of SARS-CoV-2 in the United States.