Predicting large hydrothermal systems

Mordensky, Stanley Paul; Burns, Erick; DeAngelo, Jacob; Lipor, John

We train five models using two machine learning (ML) regression algorithms (i.e., linear regression and XGBoost) to predict hydrothermal upflow in the Great Basin. Feature data are extracted from datasets supporting the INnovative Geothermal Exploration through Novel Investigations Of Undiscovered Systems project (INGENIOUS). The label data (the reported convective signals) are extracted from measured thermal gradients in wells by comparing the total estimated heat flow at the wells to the modeled background conductive heat flow. That is, the reported convective signal is the difference between the background conductive heat flow and the well heat flow. The reported convective signals contain outliers that may affect upflow prediction, so the influence of outliers is tested by constructing models for two cases: 1) using all the data (i.e., -91 to 11,105 mW/m2), and 2) truncating the range of labels to include only reported convective signals between -25 and 200 mW/m2. Because hydrothermal systems are sparse, models that predict high convective signal in smaller areas better match the natural frequency of hydrothermal systems. Early results demonstrate that XGBoost outperforms linear regression. For XGBoost using the truncated range of labels, half of the high reported signals are within < 3 % of the highest predictions. For XGBoost using the entire range of labels, half of the high reported signals are in < 13 % of the highest predictions. While this implies that the truncated regression is superior, the all-data model better predicts the locations of power-producing systems (i.e., the operating power plants are in a smaller fraction of the study area given by the highest predictions). Even though the models generally predict greater hydrothermal upflow for higher reported convective signals than for lower reported convective signals, both XGBoost models consistently underpredict the magnitude of higher signals. This behavior is attributed to low resolution/granularity of input features compared with the scale of a hydrothermal upflow zone (a few km or less across). Trouble estimating exact values while still reliably predicting high versus low convective signals suggests that a future strategy such as ranked ordinal regression (e.g., classifying into ordered bins for low, medium, high, and very high convective signal) might fit better models, since doing so reduces problems introduced by outliers while preserving the property of larger versus smaller signals.

More Like this