skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Friday, May 17 until 8:00 AM ET on Saturday, May 18 due to maintenance. We apologize for the inconvenience.

Search for: All records

Award ID contains: 2028280

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Proteins are direct products of the genome and metabolites are functional products of interactions between the host and other factors such as environment, disease state, clinical information, etc. Omics data, including proteins and metabolites, are useful in characterizing biological processes underlying COVID-19 along with patient data and clinical information, yet few methods are available to effectively analyze such diverse and unstructured data. Using an integrated approach that combines proteomics and metabolomics data, we investigated the changes in metabolites and proteins in relation to patient characteristics (e.g., age, gender, and health outcome) and clinical information (e.g., metabolic panel and complete blood count test results). We found significant enrichment of biological indicators of lung, liver, and gastrointestinal dysfunction associated with disease severity using publicly available metabolite and protein profiles. Our analyses specifically identified enriched proteins that play a critical role in responses to injury or infection within these anatomical sites, but may contribute to excessive systemic inflammation within the context of COVID-19. Furthermore, we have used this information in conjunction with machine learning algorithms to predict the health status of patients presenting symptoms of COVID-19. This work provides a roadmap for understanding the biochemical pathways and molecular mechanisms that drive disease severity, progression, and treatment of COVID-19.

    more » « less
  2. Abstract

    SARS-CoV-2 (CoV) is the etiological agent of the COVID-19 pandemic and evolves to evade both host immune systems and intervention strategies. We divided the CoV genome into 29 constituent regions and applied novel analytical approaches to identify associations between CoV genomic features and epidemiological metadata. Our results show that nonstructural protein 3 (nsp3) and Spike protein (S) have the highest variation and greatest correlation with the viral whole-genome variation. S protein variation is correlated with nsp3, nsp6, and 3′-to-5′ exonuclease variation. Country of origin and time since the start of the pandemic were the most influential metadata associated with genomic variation, while host sex and age were the least influential. We define a novel statistic—coherence—and show its utility in identifying geographic regions (populations) with unusually high (many new variants) or low (isolated) viral phylogenetic diversity. Interestingly, at both global and regional scales, we identify geographic locations with high coherence neighboring regions of low coherence; this emphasizes the utility of this metric to inform public health measures for disease spread. Our results provide a direction to prioritize genes associated with outcome predictors (e.g., health, therapeutic, and vaccine outcomes) and to improve DNA tests for predicting disease status.

    more » « less
  3. Abstract

    The performance of computational methods and software to identify differentially expressed features in single‐cell RNA‐sequencing (scRNA‐seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on their own assumptions to model scRNA‐seq expression features. To model the technological variability in cross‐platform scRNA‐seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA‐seq expression profiles across experimental platforms induced by platform‐ and gene‐specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero‐inflated Tweedie model that allows zero probability mass to exceed a traditional Tweedie distribution to model zero‐inflated scRNA‐seq data with excessive zero counts. Using both synthetic and published plate‐ and droplet‐based scRNA‐seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state‐of‐the‐art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open‐source software (R/Bioconductor package) is available at

    more » « less
  4. Abstract Respiratory syncytial virus (RSV) bronchiolitis is not only the leading cause of hospitalization in U.S. infants, but also a major risk factor for asthma development. While emerging evidence suggests clinical heterogeneity within RSV bronchiolitis, little is known about its biologically-distinct endotypes. Here, we integrated clinical, virus, airway microbiome (species-level), transcriptome, and metabolome data of 221 infants hospitalized with RSV bronchiolitis in a multicentre prospective cohort study. We identified four biologically- and clinically-meaningful endotypes: A) clinical classic microbiome M. nonliquefaciens inflammation IFN-intermediate , B) clinical atopic microbiome S. pneumoniae / M. catarrhalis inflammation IFN-high , C) clinical severe microbiome mixed inflammation IFN-low , and D) clinical non-atopic microbiome M.catarrhalis inflammation IL-6 . Particularly, compared with endotype A infants, endotype B infants—who are characterized by a high proportion of IgE sensitization and rhinovirus coinfection, S. pneumoniae/M. catarrhalis codominance, and high IFN-α and -γ response—had a significantly higher risk for developing asthma (9% vs. 38%; OR, 6.00: 95%CI, 2.08–21.9; P = 0.002). Our findings provide an evidence base for the early identification of high-risk children during a critical period of airway development. 
    more » « less
  5. Coelho, Luis Pedro (Ed.)
    It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or compositional measurements. Here we introduce an optimized combination of novel and established methodology to assess multivariable association of microbial community features with complex metadata in population-scale observational studies. Our approach, MaAsLin 2 (Microbiome Multivariable Associations with Linear Models), uses generalized linear and mixed models to accommodate a wide variety of modern epidemiological studies, including cross-sectional and longitudinal designs, as well as a variety of data types (e.g., counts and relative abundances) with or without covariates and repeated measurements. To construct this method, we conducted a large-scale evaluation of a broad range of scenarios under which straightforward identification of meta-omics associations can be challenging. These simulation studies reveal that MaAsLin 2’s linear model preserves statistical power in the presence of repeated measures and multiple covariates, while accounting for the nuances of meta-omics features and controlling false discovery. We also applied MaAsLin 2 to a microbial multi-omics dataset from the Integrative Human Microbiome (HMP2) project which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel diseases (IBD) across multiple time points and omics profiles. 
    more » « less
  6. Abstract In response to the COVID-19 outbreak, scientists and medical researchers are capturing a wide range of host responses, symptoms and lingering postrecovery problems within the human population. These variable clinical manifestations suggest differences in influential factors, such as innate and adaptive host immunity, existing or underlying health conditions, comorbidities, genetics and other factors—compounding the complexity of COVID-19 pathobiology and potential biomarkers associated with the disease, as they become available. The heterogeneous data pose challenges for efficient extrapolation of information into clinical applications. We have curated 145 COVID-19 biomarkers by developing a novel cross-cutting disease biomarker data model that allows integration and evaluation of biomarkers in patients with comorbidities. Most biomarkers are related to the immune (SAA, TNF-∝ and IP-10) or coagulation (D-dimer, antithrombin and VWF) cascades, suggesting complex vascular pathobiology of the disease. Furthermore, we observe commonality with established cancer biomarkers (ACE2, IL-6, IL-4 and IL-2) as well as biomarkers for metabolic syndrome and diabetes (CRP, NLR and LDL). We explore these trends as we put forth a COVID-19 biomarker resource ( that will help researchers and diagnosticians alike. 
    more » « less
  7. Wren, Jonathan (Ed.)
    Abstract Motivation The discovery of biologically interpretable and clinically actionable communities in heterogeneous omics data is a necessary first step toward deriving mechanistic insights into complex biological phenomena. Here, we present a novel clustering approach, omeClust, for community detection in omics profiles by simultaneously incorporating similarities among measurements and the overall complex structure of the data. Results We show that omeClust outperforms published methods in inferring the true community structure as measured by both sensitivity and misclassification rate on simulated datasets. We further validated omeClust in diverse, multiple omics datasets, revealing new communities and functionally related groups in microbial strains, cell line gene expression patterns and fetal genomic variation. We also derived enrichment scores attributable to putatively meaningful biological factors in these datasets that can serve as hypothesis generators facilitating new sets of testable hypotheses. Availability and implementation omeClust is open-source software, and the implementation is available online at Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  8. Systemic lupus erythematosus (SLE) is a chronic autoimmune disease characterized by the production of autoantibodies predominantly to nuclear material. Many aspects of disease pathology are mediated by the deposition of nucleic acid containing immune complexes, which also induce the type 1interferon response, a characteristic feature of SLE. Notably, SLE is remarkably heterogeneous, with a variety of organs involved in different individuals, who also show variation in disease severity related to their ancestries. Here, we probed one potential contribution to disease heterogeneity as well as a possible source of immunoreactive nucleic acids by exploring the expression of human endogenous retroviruses (HERVs). We investigated the expression of HERVs in SLE and their potential relationship to SLE features and the expression of biochemical pathways, including the interferon gene signature (IGS). Towards this goal, we analyzed available and new RNA-Seq data from two independent whole blood studies using Telescope. We identified 481 locus specific HERV encoding regions that are differentially expressed between case and control individuals with only 14% overlap of differentially expressed HERVs between these two datasets. We identified significant differences between differentially expressed HERVs and non-differentially expressed HERVs between the two datasets. We also characterized the host differentially expressed genes and tested their association with the differentially expressed HERVs. We found that differentially expressed HERVs were significantly more physically proximal to host differentially expressed genes than non-differentially expressed HERVs. Finally, we capitalized on locus specific resolution of HERV mapping to identify key molecular pathways impacted by differential HERV expression in people with SLE. 
    more » « less