Cursed? Why one does not simply add new data sets to supervised geothermal machine learning models

Mordensky, Stanley Paul; Burns, Erick; Lipor, John; DeAngelo, Jacob

Recent advances in machine learning (ML) identifying areas favorable to hydrothermal systems indicate that the resolution of feature data remains a subject of necessary improvement before ML can reliably produce better models. Herein, we consider the value of adding new features or replacing other, low-value features with new input features in existing ML pipelines. Our previous work identified stress and seismicity as having less value than the other feature types (i.e., heat flow, distance to faults, and distance to magmatic activity) for the 2008 USGS hydrothermal energy assessment; hence, a fundamental question regards if the addition of new but partially correlated features will improve resulting models for hydrothermal favorability. Therefore, we add new maps for shear strain rate and dilation strain rate to fit logistic regression and XGBoost models, resulting in new 7-feature models that are compared to the old 5-feature models. Because these new features share a degree of correlation with the original relatively uninformative stress and seismicity features, we also consider replacement of the two lower-value features with the two new features, creating new 5-feature models. Adding the new features improves the predictive skill of the new 7-feature model over that of the old 5-feature model; albeit, that improvement is not statistically significant because the new features are correlated with the old features and, consequently, the new features do not present considerable new information. However, the new 5-feature XGBoost model has a statistically significant increase in predictive skill for known positives over the old 5-feature model at p = 0.06. This improved performance is due to the lower-dimensional feature space of the former than that of the latter. In higher-dimensional feature space, relationships between features and the presence or absence of hydrothermal systems are harder to discern (i.e., the 7-feature model likely suffers from the “curse of dimensionality”).

More Like this