This content will become publicly available on November 14, 2025

Title: GeM-LR: Discovering predictive biomarkers for small datasets in vaccine studies
Despite significant progress in vaccine research, the level of protection provided by vaccination can vary significantly across individuals. As a result, understanding immunologic variation across individuals in response to vaccination is important for developing next-generation efficacious vaccines. Accurate outcome prediction and identification of predictive biomarkers would represent a significant step towards this goal. Moreover, in early phase vaccine clinical trials, small datasets are prevalent, raising the need and challenge of building a robust and explainable prediction model that can reveal heterogeneity in small datasets. We propose a new model named Generative Mixture of Logistic Regression (GeM-LR), which combines characteristics of both a generative and a discriminative model. In addition, we propose a set of model selection strategies to enhance the robustness and interpretability of the model. GeM-LR extends a linear classifier to a non-linear classifier without losing interpretability and empowers the notion of predictive clustering for characterizing data heterogeneity in connection with the outcome variable. We demonstrate the strengths and utility of GeM-LR by applying it to data from several studies. GeM-LR achieves better prediction results than other popular methods while providing interpretations at different levels.
Award ID(s):
2205004
PAR ID:
10628524
Author(s) / Creator(s):
Editor(s):
Finley, Stacey D
Publisher / Repository:
PLOS
Date Published:
Journal Name:
PLOS Computational Biology
Volume:
20
Issue:
11
ISSN:
1553-7358
Page Range / eLocation ID:
e1012581
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. To design effective vaccine policies, policymakers need detailed data about who has been vaccinated, who is holding out, and why. However, existing data in the US are insufficient: reported vaccination rates are often delayed or not granular enough, and surveys of vaccine hesitancy are limited by high-level questions and self-report biases. Here we show how search engine logs and machine learning can help to fill these gaps, using anonymized Bing data from February to August 2021. First, we develop a vaccine intent classifier that accurately detects when a user is seeking the COVID-19 vaccine on Bing. Our classifier demonstrates strong agreement with CDC vaccination rates, while preceding CDC reporting by 1–2 weeks, and estimates more granular ZIP-level rates, revealing local heterogeneity in vaccine seeking. To study vaccine hesitancy, we use our classifier to identify two groups, vaccine early adopters and vaccine holdouts. We find that holdouts, compared to early adopters matched on covariates, are 67% likelier to click on untrusted news sites, and are much more concerned about vaccine requirements, development, and vaccine myths. Even within holdouts, clusters emerge with different concerns and openness to the vaccine. Finally, we explore the temporal dynamics of vaccine concerns and vaccine seeking, and find that key indicators predict when individuals convert from holding out to seeking the vaccine.
  2. Machine learning (ML) methods, such as artificial neural networks (ANN), k-nearest neighbors (kNN), random forests (RF), support vector machines (SVM), and boosted decision trees (DTs), may offer stronger predictive performance than more traditional, parametric methods, such as linear regression, multiple linear regression, and logistic regression (LR), for specific mapping and modeling tasks. However, this increased performance is often accompanied by increased model complexity and decreased interpretability, resulting in critiques of their “black box” nature, which highlights the need for algorithms that can offer both strong predictive performance and interpretability. This is especially true when the global model and predictions for specific data points need to be explainable in order for the model to be of use. Explainable boosting machines (EBMs), an augmentation and refinement of generalized additive models (GAMs), have been proposed as an empirical modeling method that offers both interpretable results and strong predictive performance. The trained model can be graphically summarized as a set of functions relating each predictor variable to the dependent variable along with heat maps representing interactions between selected pairs of predictor variables. In this study, we assess EBMs for predicting the likelihood or probability of slope failure occurrence based on digital terrain characteristics in four separate Major Land Resource Areas (MLRAs) in the state of West Virginia, USA and compare the results to those obtained with LR, kNN, RF, and SVM. EBM provided predictive accuracies comparable to RF and SVM and better than LR and kNN. Interpretability comes from the generated functions and visualizations for each predictor variable and for the included interactions between pairs of predictor variables, from the estimation of variable importance based on average mean absolute scores, and from the scores provided for each predictor variable for new predictions. However, additional work is needed to quantify how these outputs may be impacted by variable correlation, inclusion of interaction terms, and large feature spaces. Further exploration of EBM is merited for geohazard mapping and modeling in particular and spatial predictive mapping and modeling in general, especially when the value or use of the resulting predictions would be greatly enhanced by improved global interpretability and by the availability of prediction explanations at each cell or aggregating unit within the mapped or modeled extent.
  3. Wallqvist, Anders (Ed.)
    The SARS-CoV-2 pandemic has generated a considerable number of infections and associated morbidity and mortality across the world. Recovery from these infections, combined with the onset of large-scale vaccination, has led to rapidly changing population-level immunological landscapes. In turn, these complexities have highlighted a number of important unknowns related to the breadth and strength of immunity following recovery or vaccination. Using simple mathematical models, we investigate the medium-term impacts of waning immunity against severe disease on immuno-epidemiological dynamics. We find that uncertainties in the duration of severity-blocking immunity (imparted by either infection or vaccination) can lead to a large range of medium-term population-level outcomes (i.e. infection characteristics and immune landscapes). Furthermore, we show that epidemiological dynamics are sensitive to the strength and duration of underlying host immune responses; this implies that determining infection levels from hospitalizations requires accurate estimates of these immune parameters. More durable vaccines both reduce these uncertainties and alleviate the burden of SARS-CoV-2 in pessimistic outcomes. However, heterogeneity in vaccine uptake drastically changes immune landscapes toward larger fractions of individuals with waned severity-blocking immunity. In particular, if hesitancy is substantial, more robust vaccines have almost no effects on population-level immuno-epidemiology, even if vaccination rates are compensatorily high among vaccine-adopters. This pessimistic scenario for vaccination heterogeneity arises because those few individuals that are vaccine-adopters are so readily re-vaccinated that the duration of vaccinal immunity has no appreciable consequences on their immune status. Furthermore, we find that this effect is heightened if vaccine-hesitants have increased transmissibility (e.g. due to riskier behavior). Overall, our results illustrate the necessity to characterize both transmission-blocking and severity-blocking immune time scales. Our findings also underline the importance of developing robust next-generation vaccines with equitable mass vaccine deployment.
  4. In a survey and three experiments (one preregistered with a nationally representative sample), we examined whether vaccination requirements are likely to backfire, as commonly feared. We investigated whether, relative to encouraging free choice in vaccination, requiring a vaccine weakens or strengthens vaccination intentions, both in general and among individuals with a predisposition to experience psychological reactance. In the four studies, compared to free choice, requirements strengthened vaccination intentions across racial and ethnic groups, across studies, and across levels of trait psychological reactance. The results consistently suggest that fears of a backlash against vaccine mandates may be unfounded and that requirements will promote COVID-19 vaccine uptake in the United States.
  5. Motivation: Predictive biological signatures provide utility as biomarkers for disease diagnosis and prognosis, as well as prediction of responses to vaccination or therapy. These signatures are identified from high-throughput profiling assays through a combination of dimensionality reduction and machine learning techniques. The genes, proteins, metabolites, and other biological analytes that compose signatures also generate hypotheses on the underlying mechanisms driving biological responses, thus improving biological understanding. Dimensionality reduction is a critical step in signature discovery to address the large number of analytes in omics datasets, especially for multi-omics profiling studies with tens of thousands of measurements. Latent factor models, which can account for the structural heterogeneity across diverse assays, effectively integrate multi-omics data and reduce dimensionality to a small number of factors that capture correlations and associations among measurements. These factors provide biologically interpretable features for predictive modeling. However, multi-omics integration and predictive modeling are generally performed independently in sequential steps, leading to suboptimal factor construction. Combining these steps can yield better multi-omics signatures that are more predictive while still being biologically meaningful. Results: We developed a supervised variational Bayesian factor model that extracts multi-omics signatures from high-throughput profiling datasets that can span multiple data types. Signature-based multiPle-omics intEgration via lAtent factoRs (SPEAR) adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. The method improves the reconstruction of underlying factors in synthetic examples and prediction accuracy of coronavirus disease 2019 severity and breast cancer tumor subtypes. Availability and implementation: SPEAR is a publicly available R-package hosted at https://bitbucket.org/kleinstein/SPEAR.