- Award ID(s):
- 2046059
- PAR ID:
- 10329622
- Date Published:
- Journal Name:
- Remote sensing
- Volume:
- 13
- Issue:
- 24
- ISSN:
- 2072-4292
- Page Range / eLocation ID:
- 4991
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- Background: Predictive models utilizing social determinants of health (SDH), demographic data, and local weather data were trained to predict missed imaging appointments (MIA) among breast imaging patients at the Boston Medical Center (BMC). Patients were characterized by many different variables, including social needs, demographics, imaging utilization, appointment features, and weather conditions on the date of the appointment. Methods: This HIPAA-compliant retrospective cohort study was IRB approved. Informed consent was waived. After data preprocessing, the dataset contained 9,970 patients and 36,606 appointments from 1/1/2015 to 12/31/2019. We identified 57 potentially impactful variables used in the initial prediction model and assessed each patient for MIA. We then developed a parsimonious model via recursive feature elimination, which identified the 25 most predictive variables. We utilized linear and nonlinear models, including support vector machines (SVM), logistic regression (LR), and random forest (RF), to predict MIA and compared their performance. Results: The highest-performing full model was the nonlinear RF, which achieved an area under the ROC curve (AUC) of 76% and an average F1 score of 85%. Models limited to the most predictive variables attained AUC and F1 scores comparable to those of models with all variables included. The variables most predictive of missed appointments included timing, prior appointment history, referral department of origin, and socioeconomic factors such as household income and access to caregiving services. Conclusions: Prediction of MIA with the data available is inherently limited by the complex, multifactorial nature of MIA. However, the algorithms presented achieved acceptable performance and demonstrated that socioeconomic factors were useful predictors of MIA. In contrast with non-modifiable demographic factors, SDH can be addressed to decrease the incidence of MIA.
- Water sustainability in the built environment requires an accurate estimation of residential water end uses (e.g., showers, toilets, and faucets). In this study, we evaluate the performance of four models (Random Forest, RF; Support Vector Machines, SVM; Logistic Regression, Log‐reg; and Neural Networks, NN) for residential water end‐use classification using actual (measured) and synthetic labeled data sets. We generated synthetic labeled data using Conditional Tabular Generative Adversarial Networks. We then utilized grid search to train each model with its respective optimized hyperparameters. The RF model exhibited the best performance overall, while the Log‐reg model had the shortest execution times under different balanced and imbalanced (based on number of events per class) synthetic data scenarios, making it a computationally efficient alternative to RF for specific end uses. The NN model exhibited high performance with the tradeoff of longer execution times compared to the other classification models. In the balanced data set scenario, all models achieved closely aligned F1‐scores, ranging from 0.83 to 0.90. However, when faced with imbalanced data reflective of actual conditions, both the SVM and Log‐reg models showed inferior performance compared to the RF and NN models. Overall, we concluded that decision tree‐based models emerge as the optimal choice for classification tasks in the context of water end‐use data. Our study advances residential smart water metering systems by creating synthetic labeled end‐use data and providing insight into the strengths and weaknesses of various supervised machine learning classifiers for end‐use identification.
- Prediction of student performance in introductory programming courses can assist struggling students and improve their persistence. At the same time, the prediction must be transparent so that instructors and students can use its results effectively. Explainable Machine Learning models can help students and instructors gain insights into students’ different programming behaviors and problem-solving strategies that can lead to good or poor performance. This study develops an explainable model that predicts students’ performance based on programming assignment submission information. We extract different data-driven features from students’ programming submissions and employ a stacked ensemble model to predict students’ final exam grades. We use SHAP, a game-theory-based framework, to explain the model’s predictions and help stakeholders understand the impact of different programming behaviors on students’ success. Moreover, we analyze the impact of important features and utilize a combination of descriptive statistics and mixture models to identify different profiles of students based on their problem-solving patterns to bolster explainability. The experimental results suggest that our model significantly outperforms other Machine Learning models, including KNN, SVM, XGBoost, Bagging, Boosting, and Linear regression. Our explainable and transparent model can help explain students’ common problem-solving patterns in relation to their level of expertise, enabling effective intervention and adaptive support for students.
- Cellular automata (CA) are important tools that provide insight into urbanization dynamics and possible future patterns. The calibration process is the core theme of these models. This study compares the performance of two common machine‐learning classifiers, random forest (RF) and support vector machines (SVM), to calibrate CA. It focuses on the sensitivity analysis of the sample size and the number of input variables for each classifier. We applied the models to the Wallonia region (Belgium) as a case study to demonstrate the performance of each classifier. The results highlight that RF produces a land‐use pattern that simulates the observed pattern more precisely than SVM, especially with a low sample size, which is important for study areas with low levels of land‐use change. Although zoning information notably enhances the accuracy of SVM‐based probability maps, it only marginally influences the RF‐derived probability maps. In the case of the SVM, the CA model did not significantly improve with increased sample size: performance with the 5,000 sample size was better than with the 15,000 sample size. The RF‐driven CA performed best with a large sample size and with zoning information excluded.
- Objective: Sudden unexpected death in epilepsy (SUDEP) is the leading cause of epilepsy-related mortality. Although much effort has gone into identifying clinical risk factors for SUDEP in the literature, there are few validated methods to predict individual SUDEP risk. Prolonged postictal EEG suppression (PGES) is a potential SUDEP biomarker, but its occurrence is infrequent and requires epilepsy monitoring unit admission. We use machine learning methods to examine SUDEP risk using interictal EEG and ECG recordings from SUDEP cases and matched living epilepsy controls. Methods: This multicenter, retrospective cohort study examined interictal EEG and ECG recordings from 30 SUDEP cases and 58 age-matched living epilepsy patient controls. We trained machine learning models with interictal EEG and ECG features to predict the retrospective SUDEP risk for each patient. We assessed cross-validated classification accuracy and the area under the receiver operating characteristic curve (AUC). Results: The logistic regression (LR) classifier produced the overall best performance, outperforming the support vector machine (SVM), random forest (RF), and convolutional neural network (CNN). Among the 30 patients with SUDEP [14 females; mean age (SD), 31 (8.47) years] and 58 living epilepsy controls [26 females (43%); mean age (SD), 31 (8.5) years], the LR model achieved a median AUC of 0.77 [interquartile range (IQR), 0.73–0.80] in five-fold cross-validation using the interictal alpha and low gamma power ratio of the EEG and heart rate variability (HRV) features extracted from the ECG. The LR model achieved a mean AUC of 0.79 in leave-one-center-out prediction. Conclusions: Our results support that machine learning-driven models may quantify SUDEP risk for epilepsy patients. Future refinements in our model may help predict individualized SUDEP risk and help clinicians correlate predictive scores with the clinical data. Low-cost and noninvasive interictal biomarkers of SUDEP risk may help clinicians to identify high-risk patients and initiate preventive strategies.