Predicting Geothermal Favorability in the Western United States by Using Machine Learning: Addressing Challenges and Developing Solutions
Previous moderate- and high-temperature geothermal resource assessments of the western United States utilized weight-of-evidence and logistic regression methods to estimate resource favorability, but these analyses relied upon some expert decisions. While expert decisions can add confidence to aspects of the modeling process by ensuring only reasonable models are employed, expert decisions also introduce human bias into assessments. This bias presents a source of error that may affect the performance of the models and resulting resource estimates. Our study aims to reduce expert input through robust data-driven analyses and better-suited data science techniques, with the goals of saving time, reducing bias, and improving predictive ability. We present six favorability maps for geothermal resources in the western United States created using two strategies applied to three modern machine learning algorithms (logistic regression, support-vector machines, and XGBoost). To provide a direct comparison to previous assessments, we use the same input data as the 2008 U.S. Geological Survey (USGS) conventional moderate- to high-temperature geothermal resource assessment. The six new favorability maps required far less expert decision-making, but broadly agree with the previous assessment. Although the 2008 assessment employed linear methods, the non-linear machine learning algorithms (i.e., support-vector machines and XGBoost) produced greater agreement with the previous assessment than the linear machine learning algorithm (i.e., logistic regression). It is not surprising that geothermal systems depend on non-linear combinations of features, and we postulate that the expert decisions during the 2008 assessment accounted for system non-linearities.
Two substantial challenges arise when applying machine learning algorithms to predict geothermal resource favorability. First, class imbalance is severe (i.e., there are very few known geothermal systems compared to the large area considered). Second, although known geothermal systems provide positive labels, all other sites have an unknown status (i.e., they are unlabeled) rather than a negative label (i.e., the known/proven absence of a geothermal resource). We address both challenges through a custom undersampling strategy that can be used with any algorithm, and we evaluate the resulting models using F1 scores.
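As a rough illustration of the idea, and not the study's actual procedure, the following minimal Python sketch undersamples unlabeled sites as provisional negatives and computes an F1 score. The function names, the `ratio` parameter, and the toy label counts are assumptions made for this example; the abstract does not specify how the custom undersampling strategy is implemented.

```python
import random

def undersample(labels, ratio=1.0, seed=0):
    """Keep every positive and a random subset of unlabeled sites.

    labels: list of 1 (known system) or 0 (unlabeled, treated here as a
            provisional negative).
    ratio:  hypothetical number of sampled unlabeled sites per positive.
    Returns the sorted indices of the retained training sites.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    unlabeled = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(unlabeled), int(round(ratio * len(positives))))
    return sorted(positives + rng.sample(unlabeled, k))

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: 5 known systems among 1,000 sites (made-up numbers).
labels = [1] * 5 + [0] * 995
train_idx = undersample(labels, ratio=2.0)
```

Because the sampling step only selects training rows, it can be placed in front of any classifier, which matches the abstract's statement that the strategy is algorithm-agnostic.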
- Award ID(s): 1850404
- PAR ID: 10386337
- Date Published:
- Journal Name: Forty Seventh Workshop on Geothermal Reservoir Engineering
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Selecting negative training sites is an important challenge to resolve when utilizing machine learning (ML) for predicting hydrothermal resource favorability because ideal models would discriminate between hydrothermal systems (positives) and all types of locations without hydrothermal systems (negatives). The Nevada Machine Learning project (NVML) fit an artificial neural network to identify areas favorable for hydrothermal systems by selecting 62 negative sites where the research team had confidence that no hydrothermal resource exists. Herein, we compare the implications of the expert selection of negatives (i.e., the NVML strategy) with a random sample strategy, where it is assumed that areas outside the favorable structural ellipses defined by NVML are negative. Because hydrothermal systems are sparse, it is highly probable that, in the absence of a favorable geological structure, hydrothermal favorability is low. We compare three training strategies: 1) the positive and negative labeled examples from NVML; 2) the positive examples from NVML with randomly selected negatives in equal frequency as NVML; and 3) the positive examples from NVML with randomly selected negatives reflecting the expected natural distribution of hydrothermal systems relative to the total area. We apply these training strategies to the NVML feature data (input data) using two ML algorithms (XGBoost and logistic regression) to create six favorability maps for hydrothermal resources. When accounting for the expected natural distribution of hydrothermal systems, we find that XGBoost performs better than the NVML neural network and its negatives. Model validation was less reliable using F1 scores, a common performance metric, than comparing probability estimates at known positives, likely because of the extreme natural class imbalance and the lack of negatively labeled sites. This work demonstrates that expert selection of negatives for training in NVML likely imparted modeling bias. 
Accounting for the sparsity of hydrothermal systems and all the types of locations without hydrothermal systems allows us to create better models for predicting hydrothermal resource favorability.
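The difference between the second and third training strategies comes down to how many random negatives are drawn. A minimal sketch follows, assuming hypothetical inputs: the positive count, the candidate cell list, and the 1 % prevalence are invented for illustration, and only the count of 62 negatives comes from the abstract.

```python
import random

def negatives_for_prevalence(n_pos, prevalence):
    """Number of negatives needed so positives make up `prevalence` of the set."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return round(n_pos * (1.0 - prevalence) / prevalence)

def sample_negatives(candidate_cells, n_neg, seed=0):
    """Draw random negatives from cells assumed to lie outside favorable ellipses."""
    rng = random.Random(seed)
    return rng.sample(candidate_cells, min(n_neg, len(candidate_cells)))

# Hypothetical inputs: 20 positives and 5,000 candidate cells.
n_pos = 20
candidates = list(range(5000))

# Strategy 2: random negatives in the same number as the expert-selected set (62).
same_count = sample_negatives(candidates, 62)

# Strategy 3: random negatives sized to an assumed natural prevalence of 1 %.
natural = sample_negatives(candidates, negatives_for_prevalence(n_pos, 0.01))
```

With these made-up numbers, the third strategy trains on far more negatives than the second, which is the sense in which it reflects the sparsity of hydrothermal systems.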
-
Recent advances in machine learning (ML) for identifying areas favorable to hydrothermal systems indicate that the resolution of feature data must improve before ML can reliably produce better models. Herein, we consider the value of adding new features or replacing other, low-value features with new input features in existing ML pipelines. Our previous work identified stress and seismicity as having less value than the other feature types (i.e., heat flow, distance to faults, and distance to magmatic activity) for the 2008 USGS hydrothermal energy assessment; hence, a fundamental question is whether adding new but partially correlated features will improve the resulting models for hydrothermal favorability. Therefore, we add new maps for shear strain rate and dilation strain rate to fit logistic regression and XGBoost models, resulting in new 7-feature models that are compared to the old 5-feature models. Because these new features share a degree of correlation with the original, relatively uninformative stress and seismicity features, we also consider replacing the two lower-value features with the two new features, creating new 5-feature models. Adding the new features improves the predictive skill of the new 7-feature model over that of the old 5-feature model, but that improvement is not statistically significant because the new features are correlated with the old features and, consequently, do not present considerable new information. However, the new 5-feature XGBoost model has a statistically significant increase in predictive skill for known positives over the old 5-feature model at p = 0.06. This improved performance is due to the lower-dimensional feature space of the former than that of the latter. 
In higher-dimensional feature space, relationships between features and the presence or absence of hydrothermal systems are harder to discern (i.e., the 7-feature model likely suffers from the “curse of dimensionality”).
-
We train five models using two machine learning (ML) regression algorithms (i.e., linear regression and XGBoost) to predict hydrothermal upflow in the Great Basin. Feature data are extracted from datasets supporting the INnovative Geothermal Exploration through Novel Investigations Of Undiscovered Systems project (INGENIOUS). The label data (the reported convective signals) are extracted from measured thermal gradients in wells by comparing the total estimated heat flow at the wells to the modeled background conductive heat flow. That is, the reported convective signal is the difference between the background conductive heat flow and the well heat flow. The reported convective signals contain outliers that may affect upflow prediction, so the influence of outliers is tested by constructing models for two cases: 1) using all the data (i.e., -91 to 11,105 mW/m²), and 2) truncating the range of labels to include only reported convective signals between -25 and 200 mW/m². Because hydrothermal systems are sparse, models that predict high convective signal in smaller areas better match the natural frequency of hydrothermal systems. Early results demonstrate that XGBoost outperforms linear regression. For XGBoost using the truncated range of labels, half of the high reported signals are within < 3 % of the highest predictions. For XGBoost using the entire range of labels, half of the high reported signals are in < 13 % of the highest predictions. While this implies that the truncated regression is superior, the all-data model better predicts the locations of power-producing systems (i.e., the operating power plants are in a smaller fraction of the study area given by the highest predictions). Even though the models generally predict greater hydrothermal upflow for higher reported convective signals than for lower reported convective signals, both XGBoost models consistently underpredict the magnitude of higher signals. 
This behavior is attributed to low resolution/granularity of input features compared with the scale of a hydrothermal upflow zone (a few km or less across). Trouble estimating exact values while still reliably predicting high versus low convective signals suggests that a future strategy such as ranked ordinal regression (e.g., classifying into ordered bins for low, medium, high, and very high convective signal) might fit better models, since doing so reduces problems introduced by outliers while preserving the property of larger versus smaller signals.
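The two label treatments discussed here, truncation and ordered binning, can be sketched in a few lines of Python. The truncation limits of -25 and 200 mW/m² come from the abstract; the bin edges below are hypothetical thresholds chosen only to show the mechanics, since the abstract names the four bins but not their boundaries.

```python
from bisect import bisect_right

# Hypothetical bin edges in mW/m^2 (assumed for illustration only).
BIN_EDGES = [25.0, 100.0, 200.0]
BIN_NAMES = ["low", "medium", "high", "very high"]

def truncate(signals, lo=-25.0, hi=200.0):
    """Keep only reported convective signals within [lo, hi] (case 2 above)."""
    return [s for s in signals if lo <= s <= hi]

def to_ordinal(signal):
    """Map a convective signal to an ordered bin index (0 = low, 3 = very high)."""
    return bisect_right(BIN_EDGES, signal)

# Made-up signals spanning the reported range of -91 to 11,105 mW/m^2.
signals = [-91.0, -10.0, 30.0, 150.0, 11105.0]
kept = truncate(signals)
bins = [BIN_NAMES[to_ordinal(s)] for s in signals]
```

Binning keeps the extreme value of 11,105 in the data as "very high" rather than discarding it, which is how an ordinal formulation could limit the influence of outliers while preserving their rank.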
-
In this paper, we aim to address a relevant estimation problem that aviation professionals encounter in their daily operations. Specifically, aircraft load planners require information on the expected number of checked bags for a flight several hours prior to its scheduled departure to properly palletize and load the aircraft. However, the checked baggage prediction problem has not been sufficiently studied in the literature, particularly at the flight level. Existing prediction approaches have not properly accounted for the different impacts of overestimating and underestimating checked baggage volumes on airline operations. Therefore, we propose a custom loss function, in the form of a piecewise quadratic function, which aligns with airline operations practice and utilizes machine learning algorithms to optimize checked baggage predictions incorporating the new loss function. We consider multiple linear regression, LightGBM, and XGBoost, as supervised learning algorithms. We apply our proposed methods to baggage data from a major airline and additional data from various U.S. government agencies. We compare the performance of the three customized supervised learning algorithms. We find that the two gradient boosting methods (i.e., LightGBM and XGBoost) yield higher accuracy than the multiple linear regression; XGBoost outperforms LightGBM while LightGBM requires much less training time than XGBoost. We also investigate the performance of XGBoost on samples from different categories and provide insights for selecting an appropriate prediction algorithm to improve baggage prediction practices. Our modeling framework can be adapted to address other prediction challenges in aviation, such as predicting the number of standby passengers or no-shows.
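One possible shape for such a loss is an asymmetric squared error, sketched below in Python. This is an assumption for illustration, not the paper's function: the abstract says only that the loss is piecewise quadratic, so the weights and the choice to penalize underestimates more heavily are hypothetical. The gradient and second derivative are included because gradient boosting libraries generally need both for a custom objective.

```python
def asymmetric_quadratic_loss(y_true, y_pred, under_weight=2.0, over_weight=1.0):
    """Squared error with a different weight on each side of zero residual.

    under_weight applies when the prediction is at or below the actual value,
    over_weight when it is above. Both weights are hypothetical.
    """
    residual = y_pred - y_true
    weight = over_weight if residual > 0 else under_weight
    return weight * residual * residual

def grad_hess(y_true, y_pred, under_weight=2.0, over_weight=1.0):
    """First and second derivatives of the loss with respect to y_pred."""
    residual = y_pred - y_true
    weight = over_weight if residual > 0 else under_weight
    return 2.0 * weight * residual, 2.0 * weight

# Made-up flight: 100 checked bags actually arrived.
under = asymmetric_quadratic_loss(100.0, 90.0)   # predicted 10 too few
over = asymmetric_quadratic_loss(100.0, 110.0)   # predicted 10 too many
```

With these assumed weights, missing by 10 bags on the low side costs twice as much as missing by 10 on the high side, so a model trained on this loss would tend to shift its predictions upward.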