High-flow events that significantly impact Water Resource Recovery Facility (WRRF) operations are rare, but accurately predicting them could improve treatment operations. Data-driven modeling approaches could be used; however, because such events are infrequent, they provide limited data from which to learn meaningful patterns. The performance of a statistical model (logistic regression) and two machine learning (ML) models (support vector machine and random forest) was evaluated for predicting high-flow events one day ahead at two plants located in different parts of the United States, Northern Virginia and the Gulf Coast of Texas, with combined and separate sewers, respectively. We compared baseline models (no synthetic data added) to models trained with synthetic data generated by two sampling techniques (SMOTE and ADASYN) that increase the representation of rare events in the training data. Both techniques increased the sample size of the very-high-flow class, but ADASYN, which focuses on generating synthetic samples near decision boundaries, led to greater improvements in model performance (reduced misclassification rates). Random forest combined with ADASYN achieved the best overall performance for both plants, demonstrating its robustness in identifying extreme flow events at treatment plants one day ahead. These results suggest that combining sampling techniques with ML has the potential to significantly improve the modeling of high-flow events at treatment plants. This work can support the development of reliable predictive models that inform management decisions for better control of treatment operations.
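The oversampling idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it is a simplified SMOTE-style interpolation written with the Python standard library only, and the function name `smote_like`, the parameter `k`, and the toy data are all hypothetical. ADASYN differs mainly in that it draws more synthetic points around minority samples that have many majority-class neighbours; that weighting is not implemented here.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE-style sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class (hypothetical values standing in for rare high-flow days).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=6)
print(len(new_points), "synthetic samples generated")
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the rare class. In practice the synthetic samples would be added to the training split only, never to the evaluation data.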
Directly and Simultaneously Expressing Absolute and Relative Treatment Effects in Medical Data Models and Applications
Logistic regression is widely used in the analysis of medical data with binary outcomes to study treatment effects through (absolute) treatment effect parameters in the model. However, logistic regression models do not include parameters that indicate relative treatment effects. This omission can hamper the efficient modeling of treatment effects and lead to wrong conclusions about them. This paper introduces an enhanced logistic regression model that offers a new way of studying treatment effects by measuring relative changes in the treatment effects while retaining the way in which logistic regression models them. The new model, called the Absolute and Relative Treatment Effects (AbRelaTEs) model, can be viewed as a generalization of logistic regression with greater flexibility, interpretability, and applicability in real data applications than logistic regression. The AbRelaTEs model is capable of modeling significant treatment effects in absolute terms, relative terms, or both. The new model can be easily implemented using statistical software, with the logistic regression model treated as a special case. As a result, classical logistic regression models can be replaced by the AbRelaTEs model to gain greater applicability and to provide a new benchmark model for more efficiently studying treatment effects in clinical trials, economic development, and many applied areas. Moreover, the estimators of the coefficients are consistent and asymptotically normal under regularity conditions. In both simulation and real data applications, the model provides results that are both significant and more meaningful.
- Award ID(s): 2012298
- PAR ID: 10340962
- Date Published:
- Journal Name: Entropy
- Volume: 23
- Issue: 11
- ISSN: 1099-4300
- Page Range / eLocation ID: 1517
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract. Background: Logistic regression (LR) is a widely used classification method for modeling binary outcomes in many medical data classification tasks. Researchers who collect and combine datasets from various data custodians and jurisdictions can greatly benefit from the increased statistical power to support their analysis goals. However, combining data from different sources creates serious privacy concerns that need to be addressed. Methods: In this paper, we propose two privacy-preserving protocols for performing logistic regression, using the Newton–Raphson method to estimate the parameters. Our proposals are based on secure Multi-Party Computation (MPC) and are tailored to the honest-majority and dishonest-majority security settings. Results: The proposed protocols are evaluated on both synthetic and real-world datasets in terms of efficiency and accuracy, and are compared with ordinary logistic regression. The experimental results demonstrate that the proposed protocols are highly efficient and accurate. Conclusions: Our work introduces two iterative algorithms that enable the distributed training of a logistic regression model in a privacy-preserving manner. The implementation results show that our algorithms can handle large datasets from multiple sources.
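For readers unfamiliar with the estimation step named in this abstract, the following is a minimal sketch of ordinary (non-private) Newton–Raphson fitting of a logistic regression, using only the Python standard library. It does not implement any secure multi-party computation and is not the authors' protocol; the function names (`fit_logistic_newton`, `solve`) and the toy data are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression.

    X: list of feature rows (include a leading 1.0 for the intercept).
    y: list of 0/1 labels.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        probs = [sigmoid(sum(b * v for b, v in zip(beta, row))) for row in X]
        # Score vector: X^T (y - p)
        grad = [sum(row[j] * (yi - pi) for row, yi, pi in zip(X, y, probs))
                for j in range(p)]
        # Information matrix: X^T W X with W = diag(p * (1 - p))
        H = [[sum(row[j] * row[k] * pi * (1.0 - pi) for row, pi in zip(X, probs))
              for k in range(p)] for j in range(p)]
        step = solve(H, grad)  # Newton step: H^{-1} * score
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy, non-separable data (hypothetical): one feature plus an intercept column.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
X = [(1.0, x) for x in xs]
beta = fit_logistic_newton(X, y)
print("intercept, slope:", beta)
```

Each iteration needs only the score vector and the information matrix, both of which are sums over records. That additive structure is what makes the update a natural candidate for computation over data split across several parties, although the secure aggregation itself is outside this sketch.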
-
Abstract. Facing the escalating effects of climate change, it is critical to improve the prediction and understanding of the hurricane evacuation decisions made by households in order to enhance emergency management. Existing studies in this area have often relied on psychology-driven linear models, which frequently show limitations in practice. The present study proposes a novel interpretable machine learning approach that predicts household-level evacuation decisions from easily accessible demographic and resource-related predictors, in contrast to existing models that mainly rely on psychological factors. An enhanced logistic regression model (that is, an interpretable machine learning approach) was developed for accurate predictions by automatically accounting for nonlinearities and interactions (that is, univariate and bivariate threshold effects). Specifically, nonlinearity and interaction detection were enabled by low-depth decision trees, which offer a transparent model structure and robustness. A survey dataset collected in the aftermath of Hurricanes Katrina and Rita, two of the most intense tropical storms of the last two decades, was employed to test the new methodology. The findings show that, when predicting households’ evacuation decisions, the enhanced logistic regression model outperformed previous linear models in terms of both model fit and predictive capability. This outcome suggests that the proposed methodology could provide a new tool and framework for emergency management authorities to improve the prediction of evacuation traffic demands in a timely and accurate manner.
-
We propose a new estimator for average causal effects of a binary treatment with panel data in settings with general treatment patterns. Our approach augments the popular two‐way‐fixed‐effects specification with unit‐specific weights that arise from a model for the assignment mechanism. We show how to construct these weights in various settings, including the staggered adoption setting, where units opt into the treatment sequentially but permanently. The resulting estimator converges to an average (over units and time) treatment effect under the correct specification of the assignment model, even if the fixed‐effect model is misspecified. We show that our estimator is more robust than the conventional two‐way estimator: it remains consistent if either the assignment mechanism or the two‐way regression model is correctly specified. In addition, the proposed estimator performs better than the two‐way‐fixed‐effect estimator if the outcome model and assignment mechanism are locally misspecified. This strong robustness property underlines and quantifies the benefits of modeling the assignment process and motivates using our estimator in practice. We also discuss an extension of our estimator to handle dynamic treatment effects.
-
Imbalanced data, a common challenge in statistical analyses of clinical trial datasets and disease modeling, refers to the scenario where one class significantly outnumbers the other in a binary classification problem. This imbalance can bias model performance in favor of the majority class and distort the understanding of the relative importance of predictive variables. Despite the prevalence of imbalanced data, the existing literature lacks comprehensive studies that elucidate methodologies to handle it effectively. In this study, we discuss the binary logistic model and its limitations when dealing with imbalanced data. We propose a novel approach to addressing imbalanced data and apply it to publicly available data from the VITAL trial, a large-scale clinical trial that examines the effects of vitamin D and omega-3 fatty acids, to investigate the relationship between vitamin D and cancer incidence in sub-populations based on race/ethnicity and demographic factors such as body mass index (BMI), age, and sex. Our results demonstrate a significant improvement in cancer incidence prediction after our undersampling method is applied to the data set. Both epidemiological and laboratory studies have suggested that vitamin D may lower the occurrence and death rate of cancer, but inconsistent and conflicting findings have been reported due to the difficulty of conducting large-scale clinical trials. We also use logistic regression within each ethnic sub-population to determine the impact of demographic factors on cancer incidence, with a particular focus on the role of vitamin D. This study provides a framework for using classification models to understand relative variable importance when dealing with imbalanced data.
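The abstract above does not describe its undersampling method, so the sketch below shows only the generic idea of random undersampling, which is an assumption and not the authors' approach: every class is reduced to the size of the smallest class by sampling without replacement. The function name `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Balance classes by randomly discarding rows from the larger classes."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for lab, members in by_class.items():
        # Sample without replacement so that no row is duplicated.
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(lab)
    return out_rows, out_labels

# Toy imbalanced data (hypothetical): 9 negatives and 3 positives.
rows = [(float(i),) for i in range(12)]
labels = [0] * 9 + [1] * 3
bal_rows, bal_labels = random_undersample(rows, labels)
print(Counter(bal_labels))
```

Undersampling discards majority-class information, so it is usually applied to the training data only, with the evaluation data left at its original class proportions.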