This content will become publicly available on May 4, 2026

Title: Stuck at Home: Machine‐Learning Models Predicting Solute Concentrations of One Stream Failed to Predict Solute Concentrations in Other Streams
ABSTRACT Machine‐learning models have been surprisingly successful at predicting stream solute concentrations, even for solutes without dedicated sensors. It would be extremely valuable if these models could predict solute concentrations in streams beyond the one in which they were trained. We assessed the generalisability of random forest models by training them in one or more streams and testing them in another. Models were made using grab sample and sensor data from 10 New Hampshire streams and rivers. As observed in previous studies, models trained in one stream were capable of accurately predicting solute concentrations in that stream. However, models trained on one stream produced inaccurate predictions of solute concentrations in other streams, with the exception of solutes measured by dedicated sensors (i.e., nitrate and dissolved organic carbon). Using data from multiple watersheds improved model results, but model performance was still worse than using the mean of the training dataset (Nash–Sutcliffe Efficiency < 0). Our results demonstrate that machine‐learning models thus far reliably predict solute concentrations only where trained, as differences in solute concentration patterns and sensor‐solute relationships limit their broader applicability.
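The Nash–Sutcliffe Efficiency (NSE) threshold cited in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes the textbook definition of NSE, which benchmarks predictions against the mean of the observed values being evaluated; the function name `nash_sutcliffe` and the sample concentrations are hypothetical. An NSE of 1 is a perfect fit, 0 matches a constant mean prediction, and a negative value means the model does worse than that mean.

```python
from statistics import fmean


def nash_sutcliffe(observed, predicted):
    """Return the Nash-Sutcliffe Efficiency of predictions against observations.

    Textbook definition (an assumption; the paper may compute it differently):
    NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = fmean(observed)
    # Squared error of the model predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Squared error of always predicting the observed mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("NSE is undefined when all observations are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical solute concentrations (arbitrary units), for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]      # model reproduces every observation
mean_only = [2.5, 2.5, 2.5, 2.5]    # constant prediction at the observed mean
reversed_trend = [4.0, 3.0, 2.0, 1.0]  # model gets the pattern backwards

nse_perfect = nash_sutcliffe(observed, perfect)
nse_mean = nash_sutcliffe(observed, mean_only)
nse_bad = nash_sutcliffe(observed, reversed_trend)
```

Under this definition, the cross-stream result reported in the abstract (NSE < 0) corresponds to the third case: the transferred model's squared error exceeds that of a constant mean prediction.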
Award ID(s):
2401760 2215300 2129383
PAR ID:
10599852
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Hydrological Processes
Volume:
39
Issue:
5
ISSN:
0885-6087
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Stream solute monitoring has produced many insights into ecosystem and Earth system functions. Although new sensors have provided novel information about the fine‐scale temporal variation of some stream water solutes, we lack adequate sensor technology to gain the same insights for many other solutes. We used two machine learning algorithms – Support Vector Machine and Random Forest – to predict concentrations at 15‐min resolution for 10 solutes, of which eight lack specific sensors. The algorithms were trained with data from intensive stream sensing and manual stream sampling (weekly) for four full years in a hydrologic reference stream within the Hubbard Brook Experimental Forest in New Hampshire, USA. The Random Forest algorithm was slightly better at predicting solute concentrations than the Support Vector Machine algorithm (Nash‐Sutcliffe efficiencies ranged from 0.35 to 0.78 for Random Forest compared to 0.29 to 0.79 for Support Vector Machine). Solute predictions were most sensitive to the removal of fluorescent dissolved organic matter, pH and specific conductance as independent variables for both algorithms, and least sensitive to dissolved oxygen and turbidity. The predicted concentrations of calcium and monomeric aluminium were used to estimate catchment solute yield, which changed most dramatically for aluminium because it concentrates with stream discharge. These results show great promise for using a combined approach of stream sensing and intensive stream discrete sampling to build information about the high‐frequency variation of solutes for which an appropriate sensor or proxy is not available. 
  2. Abstract Surface runoff and infiltrated water en route to the stream interact with dynamic landscape properties, ranging from vegetation and microbial activities to soil and geological attributes. Stream solute concentrations are highly variable and interconnected due to these interactions, flow paths, and residence times, and often exhibit hysteresis with flow. Significant unknowns remain about how point measurements of stream solute chemistry reflect interdependent hydrobiogeochemical and physical processes, and how signatures are encapsulated as nonlinear dynamical relationships between variables. We take a Machine Learning (ML) approach to understand and capture these dynamical relationships and improve predictions of solutes at short and long time scales. We introduce a physical process‐based “flow‐gate” into a Long Short‐Term Memory (LSTM) model, which enables the model to learn hysteresis behaviors if they exist. Further, we use information‐theoretic metrics to detect how solutes are interdependent and iteratively select source solutes that best predict a given target solute concentration. The “flow‐gate LSTM” model improves model predictions (1%–32% decreases in RMSE) relative to the standard LSTM model for all nine solutes included in the study. The predictive improvements from the flow‐gate LSTM model highlight the importance of lagged concentration and discharge relationships for certain solutes. They also indicate a potential limitation in the traditional LSTM model approach since flow rates are always provided as input sources, but this information is not fully utilized. This work provides a starting point for a predictive understanding of geochemical interdependencies using machine‐learning approaches and highlights potential improvements in model architecture.
  3. Solute concentrations in stream water vary with discharge in patterns that record complex feedbacks between hydrologic and biogeochemical processes. In a comparison of headwater catchments underlain by shale in Pennsylvania, USA (Shale Hills) and Wales, UK (Plynlimon), dissimilar concentration-discharge behaviors are best explained by contrasting landscape distributions of soil solution chemistry – especially dissolved organic carbon (DOC) – that have been established by patterns of vegetation. Specifically, elements that are concentrated in organic-rich soils due to biotic cycling (Mn, Ca, K) or that form strong complexes with DOC (Fe, Al) are spatially heterogeneous in pore waters because organic matter is heterogeneously distributed across the catchments. These solutes exhibit non-chemostatic "bioactive" behavior in the streams, and solute concentrations either decrease (Shale Hills) or increase (Plynlimon) with increasing discharge. In contrast, solutes that are concentrated in soil minerals and form only weak complexes with DOC (Na, Mg, Si) are spatially homogeneous in pore waters across each catchment. These solutes are chemostatic in that their stream concentrations vary little with stream discharge, likely because these solutes are released quickly from exchange sites in the soils during rainfall events. Differences in the hydrologic connectivity of organic-rich soils to the stream drive differences in concentration behavior between catchments. As such, in catchments where soil organic matter (SOM) is dominantly in lowlands (e.g., Shale Hills), bioactive elements are released to the stream early during rainfall events, whereas in catchments where SOM is dominantly in uplands (e.g., Plynlimon), bioactive elements are released later during rainfall events. The distribution of vegetation and SOM across the landscape is thus a key component for predictive models of solute transport in headwater catchments. 
  4. Abstract Understanding controls on solute export to streams is challenging because heterogeneous catchments can respond uniquely to drivers of environmental change. To understand general solute export patterns, we used a large‐scale inductive approach to evaluate concentration–discharge (C–Q) metrics across catchments spanning a broad range of catchment attributes and hydroclimatic drivers. We leveraged paired C–Q data for 11 solutes from CAMELS‐Chem, a database built upon an existing dataset of catchment and hydroclimatic attributes from relatively undisturbed catchments across the contiguous USA. Because C–Q relationships with Q thresholds reflect a shift in solute export dynamics and are poorly characterized across solutes and diverse catchments, we analysed C–Q relationships using Bayesian segmented regression to quantify Q thresholds in the C–Q relationship. Threshold responses were rare, representing only 12% of C–Q relationships, 56% of which occurred for solutes predominantly sourced from bedrock. Further, solutes were dominated by one or two C–Q patterns that reflected vertical solute–source distributions. Specifically, solutes predominantly sourced from bedrock had diluting C–Q responses in 43%–70% of catchments, and solutes predominantly sourced from soils had more enrichment responses in 35%–51% of catchments. We also linked C–Q relationships to catchment and hydroclimatic attributes to understand controls on export patterns. The relationships were generally weak despite the diversity of solutes and attribute types considered. However, catchment and hydroclimatic attributes in the central USA typically drove the most divergent export behaviour for solutes. Further, we illustrate how our inductive approach generated new hypotheses that can be tested at discrete, representative catchments using deductive approaches to better understand the processes underlying solute export patterns. Finally, given these long‐term C–Q relationships are from minimally disturbed catchments, our findings can be used as benchmarks for change in more disturbed catchments.
  5. Abstract Stream fluxes are commonly reported without a complete accounting for uncertainty in the estimates, which makes it difficult to evaluate the significance of findings or to identify where to direct efforts to improve monitoring programs. At the Hubbard Brook Experimental Forest in the White Mountains of New Hampshire, USA, stream flow has been monitored continuously and solute concentrations have been sampled approximately weekly in small, gaged headwater streams since 1963, yet comprehensive uncertainty analyses have not been reported. We propagated uncertainty in the stage height–discharge relationship, watershed area, analytical chemistry, the concentration–discharge relationship used to interpolate solute concentrations, and the streamflow gap‐filling procedure to estimate uncertainty for both streamflow and solute fluxes for a recent 6‐year period (2013–2018) using a Monte Carlo approach. As a percentage of solute fluxes, uncertainty was highest for NH₄⁺ (34%), total dissolved nitrogen (8.8%), NO₃⁻ (8.1%), and K⁺ (7.4%), and lowest for dissolved organic carbon (3.7%), SO₄²⁻ (4.0%), and Mg²⁺ (4.4%). In units of flux, uncertainties were highest for solutes in highest concentration (Si, DOC, SO₄²⁻, and Na⁺) and lowest for those lowest in concentration (H⁺ and NH₄⁺). Laboratory analysis of solute concentration was a greater source of uncertainty than streamflow for solute flux, with the exception of DOC. Our results suggest that uncertainty in solute fluxes could be reduced with more precise measurements of solute concentrations. Additionally, more discharge measurements during high flows are needed to better characterize the stage‐discharge relationship. Quantifying uncertainty in streamflow and element export is important because it allows for determination of significance of differences in fluxes, which can be used to assess watershed response to disturbance and environmental change.