Title: Utilizing logistic regression to compare risk factors in disease modeling with imbalanced data: a case study in vitamin D and cancer incidence

Imbalanced data, a common challenge in statistical analyses of clinical trial datasets and in disease modeling, refers to the scenario in which one class significantly outnumbers the other in a binary classification problem. This imbalance can bias model performance toward the majority class and distort the apparent relative importance of predictive variables. Despite its prevalence, the existing literature lacks comprehensive studies of methodologies for handling imbalanced data effectively. In this study, we discuss the binary logistic model and its limitations on imbalanced data, where model performance tends to be biased toward the majority class. We propose a novel undersampling approach to addressing imbalanced data and apply it to publicly available data from the VITAL trial, a large-scale clinical trial of vitamin D and omega-3 fatty acid supplementation, to investigate the relationship between vitamin D and cancer incidence in sub-populations defined by race/ethnicity and by demographic factors such as body mass index (BMI), age, and sex. Our results demonstrate a significant improvement in cancer-incidence prediction after our undersampling method is applied to the dataset. Both epidemiological and laboratory studies have suggested that vitamin D may lower cancer incidence and mortality, but inconsistent and conflicting findings have been reported, owing in part to the difficulty of conducting large-scale clinical trials. We also fit logistic regression models within each ethnic sub-population to determine the impact of demographic factors on cancer incidence, with a particular focus on the role of vitamin D. This study provides a framework for using classification models to understand relative variable importance when dealing with imbalanced data.
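To make the general approach concrete, below is a minimal sketch, assuming random undersampling of the majority class before fitting a binary logistic regression with scikit-learn. The synthetic data and feature roles are illustrative stand-ins for VITAL fields, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 4 predictors (e.g., vitamin D, BMI, age, sex)
# and a rare binary outcome (~4% positive), mimicking cancer incidence.
n = 10_000
X = rng.normal(size=(n, 4))
logit = -3.5 + 1.0 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the majority class in the training set to a 1:1 ratio.
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

model = LogisticRegression().fit(X_tr[idx], y_tr[idx])
print("balanced accuracy:", balanced_accuracy_score(y_te, model.predict(X_te)))
print("odds ratios:", np.exp(model.coef_).round(2))  # relative variable importance
```

Evaluating with balanced accuracy rather than raw accuracy matters here: a classifier that always predicts the majority class would score ~96% accuracy on this data while learning nothing about the rare outcome.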

Award ID(s): 2150280
NSF-PAR ID: 10488837
Publisher / Repository: Frontiers
Journal Name: Frontiers in Oncology
Volume: 13
ISSN: 2234-943X
Sponsoring Org: National Science Foundation
More Like this
  1. Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph data due to their ability to learn a concise representation of the data by integrating node attributes and link information in a principled fashion. However, despite their promise, several practical challenges must be overcome to use them effectively for node classification problems. In particular, current approaches are vulnerable to different kinds of biases inherent in the graph data. First, if the class distribution is imbalanced, the GNN's loss function is biased towards classifying the majority class correctly rather than the minority class, which hurts the performance of the latter. Second, due to the homophily effect, the learned representation and subsequent downstream tasks may favor certain demographic groups over others when applied to social network data. To mitigate such biases, we propose a novel framework called Fairness-Aware Cost Sensitive Graph Convolutional Network (FACS-GCN) for classifying nodes in networks with skewed class distributions. Our approach combines a cost-sensitive exponential loss with an adversarial learning component to alleviate the ill effects of both biases. The framework employs a stagewise additive modeling approach to ensure there is no significant loss in accuracy when imparting fairness into the GNN. Experimental results on six benchmark graph datasets demonstrate the effectiveness of FACS-GCN against comparable baseline methods in terms of promoting fairness while maintaining high model accuracy on the majority of the datasets.
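As a hedged illustration of the cost-sensitive ingredient only (the adversarial fairness component and stagewise modeling are omitted), the PyTorch sketch below implements a class-weighted exponential loss; the weights and tensors are invented for the example and this is not the full FACS-GCN framework.

```python
import torch

def cost_sensitive_exp_loss(logits, labels, class_weights):
    """Class-weighted exponential loss: mean of w_y * exp(-y * f(x)).

    logits: (N,) raw scores f(x); labels: (N,) ints in {0, 1};
    class_weights: (2,) tensor of per-class misclassification costs.
    """
    y = 2.0 * labels.float() - 1.0   # map {0, 1} -> {-1, +1}
    w = class_weights[labels]        # higher cost for the minority class
    return (w * torch.exp(-y * logits)).mean()

# Example: the minority class (label 1) costs 9x more to misclassify.
logits = torch.randn(8, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
weights = torch.tensor([1.0, 9.0])
loss = cost_sensitive_exp_loss(logits, labels, weights)
loss.backward()
print(float(loss))
```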
  2. Heterogeneity among Alzheimer's disease (AD) patients confounds clinical trial patient selection and therapeutic efficacy evaluation. This work defines separable AD clinical sub-populations using unsupervised machine learning. Clustering (t-SNE followed by k-means) of patient features and association rule mining (ARM) was performed on the ADNIMERGE dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Patient sociodemographics, brain imaging, biomarkers, cognitive tests, and medication usage were included for analysis. Four AD clinical sub-populations were identified using between-cluster mean fold changes [cognitive performance, brain volume]: cluster-1 represented the least severe disease [+17.3, +13.3]; cluster-0 [−4.6, +3.8] and cluster-3 [+10.8, −4.9] represented mid-severity sub-populations; cluster-2 represented the most severe disease [−18.4, −8.4]. ARM assessed frequently occurring pharmacologic substances within the 4 sub-populations. No drug class was associated with the least severe AD (cluster-1), likely due to lesser antecedent disease. Anti-hyperlipidemic drugs were associated with cluster-0 (mid-severity, higher volume). Interestingly, the antioxidants vitamins C and E were associated with cluster-3 (mid-severity, higher cognition). Antidepressants such as Zoloft were associated with the most severe disease (cluster-2). Vitamin D is protective for AD, but ARM identified significant underutilization across all AD sub-populations. Identification and feature characterization of four distinct AD sub-population "clusters" using standard clinical features enhances future clinical trial selection criteria and cross-study comparative analysis.
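A minimal sketch of the clustering pipeline described above (t-SNE embedding followed by k-means with k = 4), using synthetic features in place of the ADNIMERGE fields; cluster characterization here uses mean deviations on standardized features rather than the paper's fold changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # stand-in for patient features

X_std = StandardScaler().fit_transform(X)
emb = TSNE(n_components=2, random_state=0).fit_transform(X_std)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)

# Characterize each cluster by the mean deviation of its first few
# standardized features from the overall mean (which is zero after scaling).
for k in range(4):
    print(k, X_std[labels == k].mean(axis=0)[:3].round(2))
```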
  3. Kretzschmar, Mirjam E. (Ed.)
    Background Development of an effective antiviral drug for Coronavirus Disease 2019 (COVID-19) is a global health priority. Although several candidate drugs have been identified through in vitro and in vivo models, consistent and compelling evidence from clinical studies is limited. The lack of evidence from clinical trials may stem in part from imperfect trial design. We investigated how clinical trials for antivirals should be designed, focusing especially on the sample size of randomized controlled trials. Methods and findings A modeling study was conducted to help understand the reasons behind inconsistent clinical trial findings and to design better clinical trials. We first analyzed longitudinal viral load data for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) without antiviral treatment using a within-host virus dynamics model. The fitted viral loads were categorized into 3 groups by a clustering approach. Comparison of the estimated parameters showed that the 3 distinct groups were characterized by different virus decay rates (p-value < 0.001). The mean decay rates were 1.17 d−1 (95% CI: 1.06 to 1.27 d−1), 0.777 d−1 (0.716 to 0.838 d−1), and 0.450 d−1 (0.378 to 0.522 d−1) for the 3 groups, respectively. Such heterogeneity in virus dynamics could be a confounding variable if it is associated with treatment allocation in compassionate use programs (i.e., observational studies). Subsequently, we mimicked randomized controlled trials of antivirals by simulation. An antiviral effect causing a 95% to 99% reduction in viral replication was added to the model. To be realistic, we assumed that randomization and treatment are initiated with some time lag after symptom onset. Using the duration of virus shedding as an outcome, the sample size needed to detect a statistically significant mean difference between the treatment and placebo groups (1:1 allocation) was 13,603 and 11,670 per group (for antiviral effects of 95% and 99%, respectively) if all patients were enrolled regardless of the timing of randomization. The sample size was reduced to 584 and 458, respectively, if only patients treated within 1 day of symptom onset were enrolled. We confirmed that the sample size was similarly reduced when using cumulative viral load on a log scale as the outcome. We used a conventional virus dynamics model, which may not fully reflect the detailed mechanisms of SARS-CoV-2 viral dynamics. The model needs to be calibrated in terms of both parameter settings and model structure, which would yield more reliable sample size calculations. Conclusions In this study, we found that associations estimated in observational studies can be biased by the large heterogeneity in viral dynamics among infected individuals, and that statistically significant effects in randomized controlled trials may be difficult to detect because of small sample sizes. The sample size can be reduced dramatically by recruiting patients immediately after symptom onset. We believe this is the first study to investigate the design of clinical trials for antiviral treatment using a viral dynamics model.
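To illustrate why between-patient heterogeneity inflates the required sample size, here is a toy calculation, assuming exponential viral-load decay at the three group rates reported above and an illustrative starting load, detection limit, and effect size; it is not the authors' full within-host model or simulation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def shedding_days(decay, v0=6.0, limit=2.0):
    # Days for log10 viral load to fall from v0 to the detection limit
    # under exponential decay at rate `decay` per day (both values assumed).
    return (v0 - limit) * np.log(10) / decay

# Heterogeneous population: equal mix of the three decay-rate groups above.
decays = rng.choice([1.17, 0.777, 0.450], size=200_000)
sigma = shedding_days(decays).std()  # heterogeneity inflates the outcome SD

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    # Two-sample formula: n = 2 * ((z_{1-a/2} + z_{1-b}) * sigma / delta)^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(2 * (z * sigma / delta) ** 2))

for delta in (0.5, 1.0, 2.0):  # assumed mean reduction in shedding days
    print(f"effect {delta} d -> n per group {n_per_group(delta, sigma)}")
```

The sample-size formula scales with (sigma/delta) squared, so the same mixing of fast and slow decayers that confounds observational comparisons also multiplies the enrollment a trial needs.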
  4. Background: Enzyme activity is at the center of all biological processes. When these activities are misregulated by changes in sequence, expression, or activity, pathologies emerge. Misregulation of protease enzymes such as Matrix Metalloproteinases and Cathepsins plays a key role in the pathophysiology of cancer. We describe here a novel class of graphene-based, cost-effective biosensors that can detect altered protease activation in a blood sample from early stage lung cancer patients. Methods: The Gene Expression Omnibus (GEO) tool was used to identify proteases differentially expressed in lung cancer and matched normal tissue. Biosensors were assembled on a graphene backbone annotated with one of a panel of fluorescently tagged peptides. The graphene quenches fluorescence until the peptide is either cleaved by active proteases or altered by post-translational modification. 19 protease biosensors were evaluated on 431 commercially collected serum samples from non-lung cancer controls (69%) and pathologically confirmed lung cancer cases (31%) tested over two independent cohorts. Serum was incubated with each of the 19 biosensors and enzyme activity was measured indirectly as a continuous variable by a fluorescence plate reader. Analysis was performed using Emerge, a proprietary predictive and classification modeling system based on massively parallel evolving "Turing machine" algorithms. Each analysis stratified allocation into training and testing sets, and reserved an out-of-sample validation set for reporting. Results: 256 clinical samples were initially evaluated, including 35% cancer cases evenly distributed across stages I (29%), II (26%), III (24%), and IV (21%). The controls included common comorbidities in the at-risk population such as COPD, chronic bronchitis, and benign nodules (19%). Using the Emerge classification analysis, biosensor biomarkers alone (no clinical factors) demonstrated Sensitivity (Se.) = 92% (CI 82%-99%) and Specificity (Sp.) = 82% (CI 69%-91%) in the out-of-sample set. An independent cohort of 175 clinical cases (age 67±8, 52% male) focused on early detection (26% cancer, 70% Stage I, 30% Stage II/III) was similarly evaluated. Classification showed Se. = 100% (CI 79%-100%) and Sp. = 93% (CI 80%-99%) in the out-of-sample set. For the entire dataset of 175 samples, Se. = 100% (CI 92%-100%) and Sp. = 97% (CI 92%-99%) was observed. Conclusions: Lung cancer can be treated if it is diagnosed while still localized. Despite clear data showing that screening for lung cancer by Low Dose Computed Tomography (LDCT) is effective, screening compliance remains very low. Protease biosensors provide a cost-effective additional specialized tool with high sensitivity and specificity for detection of early stage lung cancer. A large prospective trial of at-risk smokers with follow-up is being conducted to evaluate a commercial version of this assay.
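The Emerge classifier itself is proprietary and is not reproduced here, but the reported operating characteristics follow from standard confusion-matrix arithmetic. Below is a small sketch, assuming hypothetical out-of-sample confusion counts, that computes sensitivity and specificity with 95% Wilson intervals via statsmodels.

```python
from statsmodels.stats.proportion import proportion_confint

def rate_with_ci(successes, total):
    lo, hi = proportion_confint(successes, total, method="wilson")
    return successes / total, lo, hi

# Hypothetical confusion counts: TP, FN, TN, FP (illustrative only).
tp, fn, tn, fp = 58, 5, 136, 30
se = rate_with_ci(tp, tp + fn)  # sensitivity with 95% Wilson CI
sp = rate_with_ci(tn, tn + fp)  # specificity with 95% Wilson CI
print("Se: %.2f (%.2f-%.2f)" % se)
print("Sp: %.2f (%.2f-%.2f)" % sp)
```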
  5. When faced with severely imbalanced binary classification problems, we often train models on bootstrapped data in which the number of instances of each class occur in a more favorable ratio, often equal to one. We view algorithmic inequity through the lens of imbalanced classification: In order to balance the performance of a classifier across groups, we can bootstrap to achieve training sets that are balanced with respect to both labels and group identity. For an example problem with severe class imbalance—prediction of suicide death from administrative patient records—we illustrate how an equity‐directed bootstrap can bring test set sensitivities and specificities much closer to satisfying the equal odds criterion. In the context of naïve Bayes and logistic regression, we analyse the equity‐weighted bootstrap, demonstrating that it works by bringing odds ratios close to one, and linking it to methods involving intercept adjustment, thresholding, and weighting.
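A minimal sketch of the equity-directed bootstrap idea, assuming a binary label y and a binary group attribute g: resample so that every (group, label) cell appears equally often in the training set. The cell size and data generation are invented for illustration; this is not the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def equity_bootstrap(y, g, n_per_cell=500):
    """Indices resampled so each (group, label) cell appears equally often."""
    idx = []
    for gv in np.unique(g):
        for yv in np.unique(y):
            cell = np.flatnonzero((g == gv) & (y == yv))
            idx.append(rng.choice(cell, size=n_per_cell, replace=True))
    return np.concatenate(idx)

# Synthetic data: rare outcome whose base rate differs across two groups.
n = 20_000
g = rng.integers(0, 2, size=n)
X = np.column_stack([rng.normal(size=n), g])
y = (rng.random(n) < np.where(g == 1, 0.02, 0.08)).astype(int)

idx = equity_bootstrap(y, g)
model = LogisticRegression().fit(X[idx], y[idx])
# Balanced (group, label) cells pull fitted odds toward one, which the paper
# links to intercept adjustment, thresholding, and reweighting.
print(model.intercept_.round(2), np.exp(model.coef_).round(2))
```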