Title: Predicting high‐frequency variation in stream solute concentrations with water quality sensors and machine learning
Abstract

Stream solute monitoring has produced many insights into ecosystem and Earth system functions. Although new sensors have provided novel information about the fine‐scale temporal variation of some stream water solutes, we lack adequate sensor technology to gain the same insights for many other solutes. We used two machine learning algorithms – Support Vector Machine and Random Forest – to predict concentrations at 15‐min resolution for 10 solutes, of which eight lack specific sensors. The algorithms were trained with data from intensive stream sensing and weekly manual stream sampling over four full years in a hydrologic reference stream within the Hubbard Brook Experimental Forest in New Hampshire, USA. The Random Forest algorithm was slightly better at predicting solute concentrations than the Support Vector Machine algorithm (Nash‐Sutcliffe efficiencies ranged from 0.35 to 0.78 for Random Forest compared to 0.29 to 0.79 for Support Vector Machine). For both algorithms, solute predictions were most sensitive to the removal of fluorescent dissolved organic matter, pH and specific conductance as independent variables, and least sensitive to the removal of dissolved oxygen and turbidity. The predicted concentrations of calcium and monomeric aluminium were used to estimate catchment solute yield, which changed most dramatically for aluminium because its concentration increases with stream discharge. These results show great promise for combining stream sensing with intensive discrete stream sampling to build information about the high‐frequency variation of solutes for which an appropriate sensor or proxy is not available.
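
The Nash‐Sutcliffe efficiency (NSE) reported above compares the squared prediction errors with the variance of the observations: a value of 1 indicates a perfect match, and a value of 0 means the predictions are no better than the observed mean. The following is a minimal sketch of that standard formula, not the authors' code; the function name and the sample concentrations are hypothetical and chosen only for illustration.

```python
def nash_sutcliffe_efficiency(observed, predicted):
    """Return NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    observed = [float(v) for v in observed]
    predicted = [float(v) for v in predicted]
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")

    mean_observed = sum(observed) / len(observed)
    # Sum of squared errors between observations and model predictions.
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Sum of squared deviations of the observations from their own mean.
    total_ss = sum((o - mean_observed) ** 2 for o in observed)
    if total_ss == 0.0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1.0 - residual_ss / total_ss


# Hypothetical weekly grab-sample concentrations (mg/L) and model predictions.
observed_mg_l = [1.0, 2.0, 3.0]
predicted_mg_l = [2.0, 2.0, 2.0]  # always predicts the observed mean
nse = nash_sutcliffe_efficiency(observed_mg_l, predicted_mg_l)
```

In this toy case the predictions equal the observed mean, so the efficiency is 0; negative values indicate predictions that perform worse than the mean.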

 
Award ID(s):
1637685 1907683
NSF-PAR ID:
10452592
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Hydrological Processes
Volume:
35
Issue:
1
ISSN:
0885-6087
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or another model-fitting technique, such as multiple linear regression, to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. 
We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set. We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture model termed split-NN that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. 
Depending on the training configuration, compared to random forest the split-NN method reduced error 0 %–11 % for NO2 and 6 %–13 % for O3. 
  2. Abstract

    Glacierized coastal catchments of the Gulf of Alaska (GoA) are undergoing rapid hydrologic fluctuations in response to climate change. These catchments deliver dissolved and suspended inorganic and organic matter to nearshore marine environments; however, they are relatively understudied, and little is known about total solute and particulate fluxes to the ocean. We present hydrologic, physical, and geochemical data collected during April–October 2019–2021 from 10 streams along gradients of glacial fed to non‐glacial (i.e., precipitation) fed, in one Southcentral and one Southeast Alaska region. Hydrologic data reveal that glaciers drive the seasonal runoff patterns. The δ¹⁸O signature and specific conductance show distinctive seasonal variations in stream water sources between the study regions, apparently due to the large amounts of rain in Southeast Alaska. Total dissolved solids concentrations and yields were elevated in the Southcentral region due to lithologic influence on dissolved loads; however, the hydroclimate is the primary driver of the timing of dissolved and suspended yields. We show that the yields of dissolved organic carbon are higher and that the δ¹³C of POC is enriched in the Southeast streams, illustrating contrasts in organic carbon export across the GoA. Finally, we illustrate how future yields of solutes and sediments to the GoA may change as watersheds evolve from glacial influenced to precipitation dominated. This integrated analysis provides insights into how watershed characteristics beyond glacier coverage control properties of freshwater inputs to the GoA and the importance of expanding study regions to multiple hydroclimate regimes.

     
  3. Abstract. Solute concentrations in stream water vary with discharge in patterns that record complex feedbacks between hydrologic and biogeochemical processes. In a comparison of three shale-underlain headwater catchments located in Pennsylvania, USA (the forested Shale Hills Critical Zone Observatory), and Wales, UK (the peatland-dominated Upper Hafren and forest-dominated Upper Hore catchments in the Plynlimon forest), dissimilar concentration–discharge (CQ) behaviors are best explained by contrasting landscape distributions of soil solution chemistry – especially dissolved organic carbon (DOC) – that have been established by patterns of vegetation and soil organic matter (SOM). Specifically, elements that are concentrated in organic-rich soils due to biotic cycling (Mn, Ca, K) or that form strong complexes with DOC (Fe, Al) are spatially heterogeneous in pore waters because organic matter is heterogeneously distributed across the catchments. These solutes exhibit non-chemostatic behavior in the streams, and solute concentrations either decrease (Shale Hills) or increase (Plynlimon) with increasing discharge. In contrast, solutes that are concentrated in soil minerals and form only weak complexes with DOC (Na, Mg, Si) are spatially homogeneous in pore waters across each catchment. These solutes are chemostatic in that their stream concentrations vary little with stream discharge, likely because these solutes are released quickly from exchange sites in the soils during rainfall events. Furthermore, concentration–discharge relationships of non-chemostatic solutes changed following tree harvest in the Upper Hore catchment in Plynlimon, while no changes were observed for chemostatic solutes, underscoring the role of vegetation in regulating the concentrations of certain elements in the stream. These results indicate that differences in the hydrologic connectivity of organic-rich soils to the stream drive differences in concentration behavior between catchments. 
As such, in catchments where SOM is dominantly in lowlands (e.g., Shale Hills), we infer that non-chemostatic elements associated with organic matter are released to the stream early during rainfall events, whereas in catchments where SOM is dominantly in uplands (e.g., Plynlimon), these non-chemostatic elements are released later during rainfall events. The distribution of SOM across the landscape is thus a key component for predictive models of solute transport in headwater catchments.

     
  4. Abstract

    Stream fluxes are commonly reported without a complete accounting for uncertainty in the estimates, which makes it difficult to evaluate the significance of findings or to identify where to direct efforts to improve monitoring programs. At the Hubbard Brook Experimental Forest in the White Mountains of New Hampshire, USA, stream flow has been monitored continuously and solute concentrations have been sampled approximately weekly in small, gaged headwater streams since 1963, yet comprehensive uncertainty analyses have not been reported. We propagated uncertainty in the stage height–discharge relationship, watershed area, analytical chemistry, the concentration–discharge relationship used to interpolate solute concentrations, and the streamflow gap‐filling procedure to estimate uncertainty for both streamflow and solute fluxes for a recent 6‐year period (2013–2018) using a Monte Carlo approach. As a percentage of solute fluxes, uncertainty was highest for NH₄⁺ (34%), total dissolved nitrogen (8.8%), NO₃⁻ (8.1%), and K⁺ (7.4%), and lowest for dissolved organic carbon (3.7%), SO₄²⁻ (4.0%), and Mg²⁺ (4.4%). In units of flux, uncertainties were highest for solutes in highest concentration (Si, DOC, SO₄²⁻, and Na⁺) and lowest for those lowest in concentration (H⁺ and NH₄⁺). Laboratory analysis of solute concentration was a greater source of uncertainty than streamflow for solute flux, with the exception of DOC. Our results suggest that uncertainty in solute fluxes could be reduced with more precise measurements of solute concentrations. Additionally, more discharge measurements during high flows are needed to better characterize the stage‐discharge relationship. Quantifying uncertainty in streamflow and element export is important because it allows for determination of significance of differences in fluxes, which can be used to assess watershed response to disturbance and environmental change.

     
  5. Abstract

    Synoptic sampling of streams is an inexpensive way to gain insight into the spatial distribution of dissolved constituents in the subsurface critical zone. Few spatial synoptics have focused on urban watersheds, although this approach is useful in urban areas where monitoring wells are uncommon. Baseflow stream sampling was used to quantify spatial variability of water chemistry in a highly developed Piedmont watershed in suburban Baltimore, MD, having no permitted point discharges. Six synoptic surveys were conducted from 2014 to 2016 after an average of 10 days of no rain, when stream discharge was composed of baseflow from groundwater. Samples collected every 50 m over 5 km were analyzed for nitrate, sulfate, chloride, fluoride, and water stable isotopes. Longitudinal spatial patterns differed across constituents for each survey, but the pattern for each constituent varied little across synoptics. Results suggest a spatially heterogeneous, three‐dimensional pattern of localized groundwater contaminant zones steadily contributing solutes to the stream network, where high concentrations result from current and legacy land use practices. By contrast, observations from 35 point piezometers indicate that sparse groundwater measurements are not a good predictor of baseflow stream chemistry in this geologic setting. Cross‐covariance analysis of stream solute concentrations with groundwater model/backward particle tracking results suggests that spatial changes in baseflow solute concentrations are associated with urban features such as impervious surface area, fill, and leaking potable water and sanitary sewer pipes. Predicted subsurface residence times suggest that legacy solute sources drive baseflow stream chemistry in the urban critical zone.

     