
Title: Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants

SARS-CoV-2 has been mutating since it was first sequenced in early January 2020. Here, we analyze 45,494 complete SARS-CoV-2 genome sequences from around the world to understand their mutations. Among them, 12,754 sequences are from the United States. Our analysis suggests the presence of four substrains and eleven top mutations in the United States. These eleven top mutations belong to 3 disconnected groups. The first and second groups, consisting of 5 and 8 concurrent mutations, are prevailing, while the other group, with three concurrent mutations, is gradually fading out. Moreover, we reveal that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. One of the top mutations, 27964C > T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we find that two of the four SARS-CoV-2 substrains in the United States have potentially become more infectious.
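
The mutation label used above, 27964C > T-(S24L), packs together a genome position, the reference and mutated nucleotides, and the resulting amino-acid change (here serine to leucine at residue 24). As a minimal sketch of how such labels could be split into fields for tabulation, the Python snippet below parses that notation. The names `Mutation` and `parse_mutation` are hypothetical, and the regular expression assumes every label follows the exact pattern of the single example in the abstract; it is not taken from the authors' pipeline.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    """Fields of one label such as '27964C > T-(S24L)' (hypothetical helper type)."""
    position: int         # 1-based position on the genome
    ref: str              # reference nucleotide
    alt: str              # mutated nucleotide
    protein_change: str   # amino-acid change, e.g. 'S24L'


# Assumed label layout: <position><ref> > <alt>-(<protein change>)
_LABEL = re.compile(r"^(\d+)([ACGT])\s*>\s*([ACGT])-\(([^)]+)\)$")


def parse_mutation(label: str) -> Mutation:
    """Split a mutation label into its parts; raise ValueError if it does not fit."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    position, ref, alt, protein_change = match.groups()
    return Mutation(int(position), ref, alt, protein_change)


if __name__ == "__main__":
    print(parse_mutation("27964C > T-(S24L)"))
```

Labels in other layouts (for example, without the protein change in parentheses) would need a different pattern.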

Award ID(s):
1761320 1900473
Journal Name:
Communications Biology
Nature Publishing Group
Sponsoring Org:
National Science Foundation
More Like this
  1. Yeager, Meredith (Ed.)
    Abstract Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread.
  2. Schwartz, Russell (Ed.)
    Abstract Motivation Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features. Results We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern. Availability and implementation TopHap is available at Supplementary information Supplementary data are available at Bioinformatics online.
  3. Background There are still many unanswered questions about the novel coronavirus; however, a largely underutilized source of knowledge is the millions of people who have recovered after contracting the virus. This includes a majority of undocumented cases of COVID-19, which were classified as mild or moderate and received little to no clinical care during the course of illness. Objective This study aims to document and glean insights from the experiences of individuals with a first-hand experience in dealing with COVID-19, especially the so-called mild-to-moderate cases that self-resolved while in isolation. Methods This web-based survey study called C19 Insider Scoop recruited adult participants aged 18 years or older who reside in the United States and had tested positive for COVID-19 or antibodies. Participants were recruited through various methods, including online support groups for COVID-19 survivors, advertisement in local news outlets, as well as through professional and other networks. The main outcomes measured in this study included knowledge of contraction or transmission of the virus, symptoms, and personal experiences on the road to recovery. Results A total of 72 participants (female, n=53; male, n=19; age range: 18-73 years; mean age: 41 [SD 14] years) from 22 US states were enrolled in this study. The top known source of how people contracted SARS-CoV-2, the virus known to cause COVID-19, was through a family or household member (26/72, 35%). This was followed by essential workers contracting the virus through the workplace (13/72, 18%). Participants reported up to 27 less-documented symptoms that they experienced during their illness, such as brain or memory fog, palpitations, ear pain or discomfort, and neurological problems. In addition, 47 of 72 (65%) participants reported that their symptoms lasted longer than the commonly cited 2-week period even for mild cases of COVID-19. The mean recovery time of the study participants was 4.5 weeks, and exactly one-half of participants (50%) still experienced lingering symptoms of COVID-19 after an average of 65 days following illness onset. Additionally, 37 (51%) participants reported that they experienced stigma associated with contracting COVID-19. Conclusions This study presents preliminary findings suggesting that emphasis on family or household spread of COVID-19 may be lacking and that there is a general underestimation of the recovery time even for mild cases of illness with the virus. Although a larger study is needed to validate these results, it is important to note that as more people experience COVID-19, insights from COVID-19 survivors can enable a more informed public, pave the way for others who may be affected by the virus, and guide further research.
  4. Abstract

    Given the global impact and severity of COVID-19, there is a pressing need for a better understanding of the SARS-CoV-2 genome and mutations. Multi-strain sequence alignments of coronaviruses (CoV) provide important information for interpreting the genome and its variation. We apply a comparative genomics method, ConsHMM, to the multi-strain alignments of CoV to annotate every base of the SARS-CoV-2 genome with conservation states based on sequence alignment patterns among CoV. The learned conservation states show distinct enrichment patterns for genes, protein domains, and other regions of interest. Certain states are strongly enriched or depleted of SARS-CoV-2 mutations, which can be used to predict potentially consequential mutations. We expect the conservation states to be a resource for interpreting the SARS-CoV-2 genome and mutations.

  5. COVID-19 vaccines have been authorized in multiple countries, and more are under rapid development. Careful design of a vaccine prioritization strategy across sociodemographic groups is a crucial public policy challenge given that 1) vaccine supply will be constrained for the first several months of the vaccination campaign, 2) there are stark differences in transmission and severity of impacts from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across groups, and 3) SARS-CoV-2 differs markedly from previous pandemic viruses. We assess the optimal allocation of a limited vaccine supply in the United States across groups differentiated by age and essential worker status, which constrains opportunities for social distancing. We model transmission dynamics using a compartmental model parameterized to capture current understanding of the epidemiological characteristics of COVID-19, including key sources of group heterogeneity (susceptibility, severity, and contact rates). We investigate three alternative policy objectives (minimizing infections, years of life lost, or deaths) and model a dynamic strategy that evolves with the population epidemiological status. We find that this temporal flexibility contributes substantially to public health goals. Older essential workers are typically targeted first. However, depending on the objective, younger essential workers are prioritized to control spread or seniors to directly control mortality. When the objective is minimizing deaths, relative to an untargeted approach, prioritization averts deaths on a range between 20,000 (when nonpharmaceutical interventions are strong) and 300,000 (when these interventions are weak). We illustrate how optimal prioritization is sensitive to several factors, most notably, vaccine effectiveness and supply, rate of transmission, and the magnitude of initial infections.