This content will become publicly available on September 1, 2026

Title: Optimizing models for the prediction of one step ahead extreme flows to wastewater treatment plants using different synthetic sampling methods
High-flow events that significantly impact Water Resource Recovery Facility (WRRF) operations are rare, but accurately predicting these flows could improve treatment operations. Data-driven modeling approaches could be used; however, because such events are infrequent, they provide limited data from which to learn meaningful patterns. The performance of a statistical model (logistic regression) and two machine learning (ML) models (support vector machine and random forest) was evaluated for predicting high-flow events one day ahead at two plants located in different parts of the United States, Northern Virginia and the Gulf Coast of Texas, with combined and separate sewers, respectively. We compared baseline models (no synthetic data added) to models trained with synthetic data added from two different sampling techniques (SMOTE and ADASYN) that increased the representation of rare events in the training data. Both techniques increased the sample size of the very high-flow class, but ADASYN, which focuses on generating synthetic samples near decision boundaries, led to greater improvements in model performance (reduced misclassification rates). Random forest combined with ADASYN achieved the best overall performance for both plants, demonstrating its robustness in identifying one-day-ahead extreme flow events to treatment plants. These results suggest that combining sampling techniques with ML has the potential to significantly improve the modeling of high-flow events at treatment plants. Our work will prove useful in building reliable predictive models that can inform management decisions needed for the better control of treatment operations.
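The difference between the two sampling techniques named in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration of the general SMOTE and ADASYN ideas, not the authors' implementation: the function names (`k_nearest`, `smote`, `adasyn_weights`), the two-feature toy data, and the choice of `k=3` are all assumptions made for the example.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return indices of the k points closest to points[idx], excluding itself."""
    dists = sorted(
        (math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx
    )
    return [j for _, j in dists[:k]]


def smote(minority, n_new, k=3, rng=None):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(i, minority, k))
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def adasyn_weights(minority, majority, k=3):
    """ADASYN-style weights: for each minority sample, the share of majority
    points among its k nearest neighbors, normalized to sum to 1. Samples
    near the class boundary receive more of the synthetic budget."""
    labeled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    ratios = []
    for p in minority:
        nearest = sorted(
            (math.dist(p, q), label) for q, label in labeled if q is not p
        )[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample touches the majority class
        return [1 / len(minority)] * len(minority)
    return [r / total for r in ratios]


# Hypothetical toy data: two made-up features per day.
majority = [(1.0, 2.0), (1.5, 2.5), (2.0, 3.0), (2.5, 5.5), (2.8, 5.8), (3.2, 6.3)]
minority = [(5.0, 9.0), (5.5, 9.5), (6.0, 9.2), (3.0, 6.0)]

synthetic = smote(minority, n_new=5)
weights = adasyn_weights(minority, majority)
```

In this toy data, the last minority sample sits among majority points, so the ADASYN-style weighting assigns it the largest share of new samples, whereas the SMOTE-style sampler picks seed points uniformly. A full ADASYN implementation would then generate samples per seed point in proportion to these weights.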
Award ID(s):
1931937
PAR ID:
10617747
Author(s) / Creator(s):
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
Journal of Environmental Management
Volume:
392
Issue:
C
ISSN:
0301-4797
Page Range / eLocation ID:
126592
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Metamodels can address some of the limitations of complex simulation models by formulating a mathematical relationship between input parameters and simulation model outcomes. Our objective was to develop and compare the performance of a machine learning (ML)–based metamodel against a conventional metamodeling approach in replicating the findings of a complex simulation model. Methods: We constructed 3 ML-based metamodels using random forest, support vector regression, and artificial neural networks and a linear regression-based metamodel from a previously validated microsimulation model of the natural history of hepatitis C virus (HCV) consisting of 40 input parameters. Outcomes of interest included societal costs and quality-adjusted life-years (QALYs), the incremental cost-effectiveness ratio (ICER) of HCV treatment versus no treatment, cost-effectiveness analysis curve (CEAC), and expected value of perfect information (EVPI). We evaluated metamodel performance using root mean squared error (RMSE) and Pearson’s R² on the normalized data. Results: The R² values for the linear regression metamodel for QALYs without treatment, QALYs with treatment, societal cost without treatment, societal cost with treatment, and ICER were 0.92, 0.98, 0.85, 0.92, and 0.60, respectively. The corresponding R² values for our ML-based metamodels were 0.96, 0.97, 0.90, 0.95, and 0.49 for support vector regression; 0.99, 0.83, 0.99, 0.99, and 0.82 for artificial neural network; and 0.99, 0.99, 0.99, 0.99, and 0.98 for random forest. Similar trends were observed for RMSE. The CEAC and EVPI curves produced by the random forest metamodel matched the results of the simulation output more closely than the linear regression metamodel. Conclusions: ML-based metamodels generally outperformed traditional linear regression metamodels at replicating results from complex simulation models, with random forest metamodels performing best. 
Highlights: Decision-analytic models are frequently used by policy makers and other stakeholders to assess the impact of new medical technologies and interventions. However, complex models can impose limitations on conducting probabilistic sensitivity analysis and value-of-information analysis, and may not be suitable for developing online decision-support tools. Metamodels, which accurately formulate a mathematical relationship between input parameters and model outcomes, can replicate complex simulation models and address the above limitation. The machine learning–based random forest model can outperform linear regression in replicating the findings of a complex simulation model. Such a metamodel can be used for conducting cost-effectiveness and value-of-information analyses or developing online decision support tools. 
  2. Background and Objectives: Sepsis is a leading cause of mortality in intensive care units (ICUs). The development of a robust prognostic model utilizing patients’ clinical data could significantly enhance clinicians’ ability to make informed treatment decisions, potentially improving outcomes for septic patients. This study aims to create a novel machine-learning framework for constructing prognostic tools capable of predicting patient survival or mortality outcome. Methods: A novel dataset is created using concatenated triples of static data, temporal data, and clinical outcomes to expand data size. This structured input trains five machine learning classifiers (KNN, Logistic Regression, SVM, RF, and XGBoost) with advanced feature engineering. Models are evaluated on an independent cohort using AUROC and a new metric, 𝛾, which incorporates the F1 score, to assess discriminative power and generalizability. Results: We developed five prognostic models using the concatenated triple dataset with 10 dynamic features from patient medical records. Our analysis shows that the Extreme Gradient Boosting (XGBoost) model (AUROC = 0.777, F1 score = 0.694) and the Random Forest (RF) model (AUROC = 0.769, F1 score = 0.647), when paired with an ensemble under-sampling strategy, outperform other models. The RF model improves AUROC by 6.66% and reduces overfitting by 54.96%, while the XGBoost model shows a 0.52% increase in AUROC and a 77.72% reduction in overfitting. These results highlight our framework’s ability to enhance predictive accuracy and generalizability, particularly in sepsis prognosis. Conclusion: This study presents a novel modeling framework for predicting treatment outcomes in septic patients, designed for small, imbalanced, and high-dimensional datasets. By using temporal feature encoding, advanced sampling, and dimension reduction techniques, our approach enhances standard classifier performance. 
The resulting models show improved accuracy with limited data, offering valuable prognostic tools for sepsis management. This framework demonstrates the potential of machine learning in small medical datasets. 
  3. Background: Poststroke recovery depends on multiple factors and varies greatly across individuals. Using machine learning models, this study investigated the independent and complementary prognostic role of different patient-related factors in predicting response to language rehabilitation after a stroke. Methods: Fifty-five individuals with chronic poststroke aphasia underwent a battery of standardized assessments and structural and functional magnetic resonance imaging scans, and received 12 weeks of language treatment. Support vector machine and random forest models were constructed to predict responsiveness to treatment using pretreatment behavioral, demographic, and structural and functional neuroimaging data. Results: The best prediction performance was achieved by a support vector machine model trained on aphasia severity, demographics, measures of anatomic integrity and resting-state functional connectivity (F1=0.94). This model resulted in a significantly superior prediction performance compared with support vector machine models trained on all feature sets (F1=0.82, P <0.001) or a single feature set (F1 range=0.68–0.84, P <0.001). Across random forest models, training on resting-state functional magnetic resonance imaging connectivity data yielded the best F1 score (F1=0.87). Conclusions: While behavioral, multimodal neuroimaging data and demographic information carry complementary information in predicting response to rehabilitation in chronic poststroke aphasia, functional connectivity of the brain at rest after stroke is a particularly important predictor of responsiveness to treatment, both alone and combined with other patient-related factors. 
  4. The challenges inherent in field validation data and real-world light detection and ranging (lidar) collections make it difficult to assess the best algorithms for using lidar to characterize forest stand volume. Here, we demonstrate the use of synthetic forest stands and simulated terrestrial laser scanning (TLS) for the purpose of evaluating which machine learning algorithms, scanning configurations, and feature spaces can best characterize forest stand volume. The random forest (RF) and support vector machine (SVM) algorithms generally outperformed k-nearest neighbor (kNN) for estimating plot-level vegetation volume regardless of the input feature space or number of scans. Also, the measures designed to characterize occlusion using spherical voxels generally provided higher predictive performance than measures that characterized the vertical distribution of returns using summary statistics by height bins. Given the difficulty of collecting a large number of scans to train models, and of collecting accurate and consistent field validation data, we argue that synthetic data offer an important means to parameterize models and determine appropriate sampling strategies. 
  5. Predicting short-term traffic volume is essential to improve transportation systems management and operations (TSMO) and the overall efficiency of traffic networks. Real-time data collected from Internet of Things (IoT) devices can be used to predict traffic volume. More specifically, the Automated Traffic Signal Performance Measures (ATSPM) data contain high-fidelity traffic data at multiple intersections and can reveal the spatio-temporal patterns of traffic volume for each signal. In this study, we have developed a machine learning-based approach using the data collected from ATSPM sensors of a corridor in Orlando, FL to predict future hourly traffic. The hourly predictions are calculated based on the previous six hours' volume seen at the selected intersections. Additional factors that play an important role in traffic fluctuations include peak hours, day of the week, and holidays, among others. Multiple machine learning models are applied to the dataset to determine the model with the best performance. Random Forest, XGBoost, and LSTM models show the best performance in predicting hourly traffic volumes. 
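The setup described in item 5, predicting an hour's volume from the previous six hours, amounts to a sliding-window transformation of the series. A minimal sketch follows, assuming a plain list of hourly counts; the function name `make_lagged` and the sample values are hypothetical and do not come from the study.

```python
def make_lagged(series, n_lags=6):
    """Turn an hourly series into (previous n_lags values, next value) pairs."""
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(series[t - n_lags:t])  # the n_lags hours before hour t
        targets.append(series[t])              # the volume to predict at hour t
    return features, targets


# Hypothetical hourly vehicle counts at one intersection.
hourly_volume = [120, 135, 150, 210, 260, 240, 200, 180, 170]
X, y = make_lagged(hourly_volume)
```

Each row of `X` could then be extended with calendar indicators such as day of the week or a holiday flag before being passed to a regression model.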