Title: Comparative Analysis of Supervised Classification Algorithms for Residential Water End Uses
Abstract Water sustainability in the built environment requires an accurate estimation of residential water end uses (e.g., showers, toilets, faucets, etc.). In this study, we evaluate the performance of four models (Random Forest, RF; Support Vector Machines, SVM; Logistic Regression, Log‐reg; and Neural Networks, NN) for residential water end‐use classification using actual (measured) and synthetic labeled data sets. We generated synthetic labeled data using Conditional Tabular Generative Adversarial Networks. We then used grid search to tune the hyperparameters of each model and trained each model with its optimized values. The RF model exhibited the best overall performance, while the Log‐reg model had the shortest execution times under different balanced and imbalanced (based on number of events per class) synthetic data scenarios, making it a computationally efficient alternative to RF for specific end uses. The NN model exhibited high performance with the tradeoff of longer execution times compared to the other classification models. In the balanced data set scenario, all models achieved closely aligned F1‐scores, ranging from 0.83 to 0.90. However, when faced with imbalanced data reflective of actual conditions, both the SVM and Log‐reg models showed inferior performance compared to the RF and NN models. Overall, we concluded that decision tree‐based models emerge as the optimal choice for classification tasks in the context of water end‐use data. Our study advances residential smart water metering systems by creating synthetic labeled end‐use data and providing insight into the strengths and weaknesses of various supervised machine learning classifiers for end‐use identification.
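The abstract reports F1-scores under balanced and imbalanced class scenarios but does not state how scores were averaged across end-use classes. As a minimal illustration of why that choice matters when classes are imbalanced, the sketch below computes per-class, macro-averaged, and support-weighted F1 in plain Python. The labels and predictions are hypothetical and are not taken from the study; the function names are likewise illustrative.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that appears."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true events per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Hypothetical imbalanced sample: 8 faucet events and 2 toilet events,
# with one toilet event misclassified as a faucet event.
y_true = ["faucet"] * 8 + ["toilet"] * 2
y_pred = ["faucet"] * 8 + ["faucet", "toilet"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))     # 0.804
print(round(weighted_f1(y_true, y_pred), 3))  # 0.886
```

In this toy case the weighted score is higher than the macro score because the well-classified majority class dominates the average, which is one reason a single F1 figure can hide weak performance on rare end uses.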
Award ID(s): 1847404
PAR ID: 10577146
Author(s) / Creator(s):
Publisher / Repository: DOI PREFIX: 10.1029
Date Published:
Journal Name: Water Resources Research
Volume: 60
Issue: 6
ISSN: 0043-1397
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Trees in proximity to power lines can cause significant damage to utility infrastructure during storms, leading to substantial economic and societal costs. This study investigated the effectiveness of non-parametric machine learning algorithms in modeling tree-related outage risks to distribution power lines at a finer spatial scale. We used a vegetation risk model (VRM) comprising 15 predictor variables derived from roadside tree data, landscape information, vegetation management records, and utility infrastructure data. We evaluated the VRM’s performance using decision tree (DT), random forest (RF), k-Nearest Neighbor (k-NN), extreme gradient boosting (XGBoost), and support vector machine (SVM) techniques. The RF algorithm demonstrated the highest performance with an accuracy of 0.753, an AUC-ROC of 0.746, precision of 0.671, and an F1-score of 0.693. The SVM achieved the highest recall value of 0.727. Based on the overall performance, the RF emerged as the best machine learning algorithm, whereas the DT was the least suitable. The DT reported the lowest run times for both hyperparameter optimization (3.93 s) and model evaluation (0.41 s). XGBoost and the SVM exhibited the highest run times for hyperparameter tuning (9438.54 s) and model evaluation (112 s), respectively. The findings of this study are valuable for enhancing the resilience and reliability of the electric grid. 
  2. Abstract Water monitoring in households provides occupants and utilities with key information to support water conservation and efficiency in the residential sector. High costs, intrusiveness, and practical complexity limit appliance-level monitoring via sub-meters on every water-consuming end use in households. Non-intrusive machine learning methods have emerged as promising techniques to analyze observed data collected by a single meter at the inlet of the house and estimate the disaggregated contribution of each water end use. While fine temporal resolution data allow for more accurate end-use disaggregation, there is an inevitable increase in the amount of data that needs to be stored and analyzed. To explore this tradeoff and advance previous studies based on synthetic data, we first collected 1 s resolution indoor water use data from a residential single-point smart water metering system installed at a four-person household, as well as ground-truth end-use labels based on a water diary recorded over a 4-week study period. Second, we trained a supervised machine learning model (random forest classifier) to classify six water end-use categories across different temporal resolutions and two different model calibration scenarios. Finally, we evaluated the results based on three different performance metrics (micro, weighted, and macro F1 scores). Our findings show that data collected at 1- to 5-s intervals allow for better end-use classification (weighted F-score higher than 0.85), particularly for toilet events; however, certain water end uses (e.g., shower and washing machine events) can still be predicted with acceptable accuracy even at coarser resolutions, up to 1 min, provided that these end-use categories are well represented in the training dataset. Overall, our study provides insights for further water sustainability research and widespread deployment of smart water meters. 
  3. Abstract Spectroscopic techniques generate one-dimensional spectra with distinct peaks and specific widths in the frequency domain. These features act as unique identities for material characteristics. Deep neural networks (DNNs) have recently been considered a powerful tool for automatically categorizing experimental spectral data by supervised classification to evaluate material characteristics. However, most existing work assumes balanced spectral data among the classes in the training data, contrary to actual experiments, where the spectral data are usually imbalanced. Imbalanced training data degrade supervised classification performance, hindering understanding of phase behavior, specifically the sol-gel transition (gelation) of soft materials and glycomaterials. To address this issue, this paper applies a novel data augmentation method based on a generative adversarial network (GAN) proposed by the authors in their prior work. The method consists of three DNNs: the generator, discriminator, and classifier. It generates samples that are not only authentic but also emphasize the differentiation between material characteristics, providing balanced training data and improving the classification results. To demonstrate the effectiveness of the proposed method, actual imbalanced spectral data from Pluronic F-127 hydrogel and Alpha-Cyclodextrin hydrogel are used to classify the phases of the data. On average, our approach improves on existing data augmentation methods by 8.8%, 6.4%, and 6.2% in the classifier’s F-score, Precision, and Recall, respectively. Based on these validated results, we expect the method to find broader application in addressing imbalanced measurement data across diverse domains in materials science and chemical engineering.
  4. Abstract The illegal timber trade has a significant impact on the survival of endangered tropical hardwood species like Dalbergia spp. (rosewood), a genus protected worldwide under the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). Due to the increased threat to Dalbergia spp. and the lack of action to reduce it, port‐of‐entry analysis methods are required to identify Dalbergia spp. Handheld laser‐induced breakdown spectroscopy (LIBS) has been shown to be capable of identifying species and establishing provenance of Dalbergia spp. and other tropical hardwoods, but analysis methods for this work have yet to be investigated in detail. The present work investigates five well‐known algorithms (partial least squares discriminant analysis (PLS‐DA), classification and regression trees (CART), k‐nearest neighbor (k‐NN), random forest (RF), and support vector machine (SVM)), two training/test set sampling regimes, and data collection at two signal‐to‐noise (S/N) ratios to assess the potential for handheld LIBS analyses. Additionally, imbalanced classes are addressed. For this application, SVM and RF yield nearly identical results (though RF takes nearly 100 times longer to compute), while the S/N ratio has a significant effect on model success, all else being equal. It was found that forming a training set with replicate low S/N analyses can perform as well as higher precision training sets for true prediction, even if the predicted samples have low signal to noise! This work confirms handheld LIBS analyzers can provide a viable method for classification of hardwood species, even within the same genus.
  5. Martin, Andreas; Hinkelmann, Knut; Fill, Hans-Georg; Gerber, Aurona; Lenat, Doug; Stolle, Reinhard; Harmelen, Frank van (Ed.)
    Record linkage, often called entity resolution or de-duplication, refers to identifying the same entities across one or more databases. As the amount of data generated grows at an exponential rate, it becomes increasingly important to be able to integrate data from several sources to perform richer analysis. In this paper, we present an open-source, comprehensive, end-to-end hybrid record linkage framework that combines automatic and manual review processes. Using this framework, we train several models based on different machine learning algorithms, such as random forests, linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on Recall, F1-score (quality of linkages), and the number of uncertain pairs, that is, the number of pairs that need manual review. We also apply our trained models to a new dataset to test how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding of names, on model performance. Using n2v results in a smaller manual review set with a slightly lower F1-score. Overall, the SVM models performed best in all experiments.