
Title: Evaluating and improving the reliability of gas-phase sensor system calibrations across new locations for ambient measurements and personal exposure monitoring
Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or another model-fitting technique, such as multiple linear regression, to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set.
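Concretely, a single-site field calibration of this kind can be sketched as an ordinary least-squares fit of reference concentrations on the raw signal and ambient covariates. The data, coefficients, and variable names below are synthetic placeholders for illustration, not values from the study:

```python
import numpy as np

# Hedged sketch of single-site field calibration: the raw sensor signal plus
# temperature and relative humidity are regressed against reference-monitor
# concentrations. All numbers here are invented for illustration.
rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(10, 35, n)            # ambient temperature, deg C
rh = rng.uniform(20, 90, n)              # relative humidity, %
no2_ref = rng.uniform(5, 60, n)          # "reference monitor" NO2, ppb
# Simulated raw sensor output with temperature/humidity interference
raw = 0.02 * no2_ref + 0.003 * temp - 0.001 * rh + rng.normal(0, 0.01, n)

# Multiple linear regression: NO2 ~ raw + temp + rh + intercept
X = np.column_stack([raw, temp, rh, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, no2_ref, rcond=None)
rmse = np.sqrt(np.mean((X @ coef - no2_ref) ** 2))
```

In a real deployment the same fitted coefficients would then be applied to raw signals at a new location, which is exactly where the transfer problem described above appears: the fit can absorb site-specific correlations between pollutant levels and the covariates.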
We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture, termed split-NN, that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. Depending on the training configuration, compared to random forest the split-NN method reduced error by 0 %–11 % for NO2 and 6 %–13 % for O3.
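The two-stage idea can be sketched as a forward pass: a per-sensor affine correction normalizes sensor-to-sensor variation, then a single network shared across all sensors maps corrected signals to a concentration. The layer sizes, weights, and the specific affine form below are illustrative assumptions; the paper's actual split-NN architecture may differ:

```python
import numpy as np

# Minimal forward-pass sketch of a two-stage "split" network.
# Stage 1: one (scale, offset) pair per sensor package corrects
#          sensor-to-sensor variation.
# Stage 2: a small shared network is trained on the pooled data of
#          all sensors. Weights here are random placeholders.
rng = np.random.default_rng(1)
n_sensors, n_features, hidden = 9, 4, 8

# Stage 1 parameters: per-sensor, per-feature scale and offset
scale = rng.normal(1.0, 0.05, (n_sensors, n_features))
offset = rng.normal(0.0, 0.05, (n_sensors, n_features))

# Stage 2 parameters: shared across every sensor package
W1 = rng.normal(0, 0.5, (n_features, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.5, (hidden, 1))
b2 = np.zeros(1)

def split_nn(x, sensor_id):
    """x: (batch, n_features) raw signals from one sensor package."""
    z = x * scale[sensor_id] + offset[sensor_id]   # sensor-specific correction
    h = np.maximum(0.0, z @ W1 + b1)               # shared hidden layer (ReLU)
    return h @ W2 + b2                             # predicted concentration

x = rng.normal(0, 1, (16, n_features))
y = split_nn(x, sensor_id=3)                       # shape (16, 1)
```

The design point is that stage 2 sees the combined data of all nine packages, so the shared weights are trained on a far larger and more diverse sample than any single package could provide.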
Journal Name:
Atmospheric Measurement Techniques
Page Range / eLocation ID:
4211 to 4239
Sponsoring Org:
National Science Foundation
More Like this
  1. Low-cost sensors enable finer-scale spatiotemporal measurements within the existing methane (CH4) monitoring infrastructure and could help cities mitigate CH4 emissions to meet their climate goals. While initial studies of low-cost CH4 sensors have shown potential for effective CH4 measurement at ambient concentrations, sensor deployment remains limited due to questions about interferences and calibration across environments and seasons. This study evaluates sensor performance across seasons with specific attention paid to the sensor's understudied carbon monoxide (CO) interferences and environmental dependencies through long-term ambient co-location in an urban environment. The sensor was first evaluated in a laboratory using chamber calibration and co-location experiments, and then in the field through two 8-week co-locations with a reference CH4 instrument. In the laboratory, the sensor was sensitive to CH4 concentrations below ambient background concentrations. Different sensor units responded similarly to changing CH4, CO, temperature, and humidity conditions but required individual calibrations to account for differences in sensor response factors. When deployed in-field, co-located with a reference instrument near Baltimore, MD, the sensor captured diurnal trends in hourly CH4 concentration after corrections for temperature, absolute humidity, CO concentration, and hour of day. Variable performance was observed across seasons with the sensor performing well (R2 = 0.65; percent bias 3.12%; RMSE 0.10 ppm) in the winter validation period and less accurately (R2 = 0.12; percent bias 3.01%; RMSE 0.08 ppm) in the summer validation period where there was less dynamic range in CH4 concentrations.
The results highlight the utility of sensor deployment in more variable ambient CH4 conditions and demonstrate the importance of accounting for temperature and humidity dependencies as well as co-located CO concentrations with low-cost CH4 measurements. We show this can be addressed via Multiple Linear Regression (MLR) models accounting for key covariates to enable urban measurements in areas with CH4 enhancement. Together with individualized calibration prior to deployment, the sensor shows promise for use in low-cost sensor networks and represents a valuable supplement to existing monitoring strategies to identify CH4 hotspots.
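A minimal version of such an MLR correction, with synthetic hourly data and hour-of-day encoded as sine/cosine terms (an assumption for illustration; the study's exact covariate encoding may differ):

```python
import numpy as np

# Hedged sketch of the MLR correction described above: CH4 estimated from
# the raw sensor response with temperature, absolute humidity, CO, and
# hour-of-day covariates. Data and coefficients are synthetic.
rng = np.random.default_rng(2)
n = 720                                      # ~1 month of hourly samples
hour = np.arange(n) % 24
temp = 15 + 10 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, n)
abs_hum = rng.uniform(5, 15, n)              # absolute humidity, g/m^3
co = rng.uniform(0.1, 0.6, n)                # ppm
ch4_ref = 1.9 + 0.5 * rng.random(n)          # ppm, reference instrument
# Simulated raw response with temperature, humidity, and CO interference
raw = ch4_ref + 0.01 * temp + 0.02 * abs_hum + 0.3 * co + rng.normal(0, 0.02, n)

X = np.column_stack([raw, temp, abs_hum, co,
                     np.sin(2 * np.pi * hour / 24),   # hour-of-day terms
                     np.cos(2 * np.pi * hour / 24),
                     np.ones(n)])
coef, *_ = np.linalg.lstsq(X, ch4_ref, rcond=None)
rmse = np.sqrt(np.mean((X @ coef - ch4_ref) ** 2))
```

As the abstract notes, each sensor unit would need its own fitted coefficients, since response factors differ between units.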
  2. Abstract

    Carbon fluxes in terrestrial ecosystems and their response to environmental change are a major source of uncertainty in the modern carbon cycle. The National Ecological Observatory Network (NEON) presents the opportunity to merge eddy covariance (EC)-derived fluxes with CO2 isotope ratio measurements to gain insights into carbon cycle processes. Collected continuously and consistently across >40 sites, NEON EC and isotope data facilitate novel integrative analyses. However, currently provisioned atmospheric isotope data are uncalibrated, greatly limiting the ability to perform cross-site analyses. Here, we present two approaches to calibrating NEON CO2 isotope ratios, along with an R package to calibrate NEON data. We find that calibrating CO2 isotopologues independently yields a lower δ13C bias (<0.05‰) and higher precision (<0.40‰) than directly correcting δ13C with linear regression (bias: <0.11‰, precision: 0.42‰), but with slightly higher error and lower precision in calibrated CO2 mole fraction. The magnitude of the corrections to δ13C and CO2 mole fractions varies substantially by site, underscoring the need for users to apply a consistent calibration framework to data in the NEON archive. Post-calibration data sets show that site mean annual δ13C correlates negatively with precipitation, temperature, and aridity, but positively with elevation. Forested and agricultural ecosystems exhibit larger gradients in CO2 and δ13C than other sites, particularly during the summer and at night. The overview and analysis tools developed here will facilitate cross-site analysis using NEON data, provide a model for other continental-scale observational networks, and enable new advances leveraging the isotope ratios of specific carbon fluxes.
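The isotopologue-independent approach amounts to correcting the 12CO2 and 13CO2 mole fractions separately and then recomputing δ13C from the standard definition against VPDB. A toy numeric sketch (the gains, offsets, and mole fractions are invented; the VPDB ratio is the standard reference value):

```python
import numpy as np

# Sketch of isotopologue-independent calibration: apply separate gain/offset
# corrections to 12CO2 and 13CO2, then recompute delta-13C against VPDB.
# Gains, offsets, and readings below are illustrative placeholders.
R_VPDB = 0.0111797               # 13C/12C ratio of the VPDB standard

def delta13C(c12, c13):
    """delta-13C in per mil from isotopologue mole fractions."""
    return (c13 / c12 / R_VPDB - 1.0) * 1000.0

# "True" air with delta-13C = -8.5 permil, 12CO2 in umol/mol
c12_true = np.array([395.0, 405.0, 415.0])
c13_true = c12_true * R_VPDB * (1 - 8.5 / 1000)

# Analyzer applies a slightly different gain/offset to each isotopologue
c12_meas = 1.002 * c12_true + 0.3
c13_meas = 0.998 * c13_true - 0.001

# Calibration coefficients, as would be derived from reference tanks,
# invert the analyzer response per isotopologue
c12_cal = (c12_meas - 0.3) / 1.002
c13_cal = (c13_meas + 0.001) / 0.998
d13c = delta13C(c12_cal, c13_cal)   # recovers -8.5 permil
```

Because the ratio is recomputed after correction, a gain error that differs between isotopologues is removed, which a single linear regression on δ13C itself can only approximate.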

  3. Background: As software development becomes more interdependent, unique relationships among software packages arise and form complex software ecosystems. Aim: We aim to better understand the behavior of these ecosystems through the lens of software supply chains and to model how the software dependency network affects changes in downloads of JavaScript packages. Method: We analyzed 12,999 popular packages in NPM, between 01-December-2017 and 15-March-2018, using Linear Regression and Random Forest models, and examined the effects of predictors representing different aspects of the software dependency supply chain on changes in the number of downloads for a package. Result: Preliminary results suggest that the count and downloads of upstream and downstream runtime dependencies have a strong effect on the change in downloads, with packages having fewer, more popular packages as dependencies (upstream or downstream) likely to see an increase in downloads. This suggests that in order to interpret the number of downloads for a package properly, it is necessary to take into account the peculiarities of the supply chain (both upstream and downstream) of that package. Conclusion: Future work is needed to identify the effects of added, deleted, and unchanged dependencies for different types of packages, e.g., build tools and test tools.
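The regression setup described can be sketched on synthetic data; the predictor names, effect signs, and magnitudes below are illustrative assumptions, not estimates from the NPM study:

```python
import numpy as np

# Illustrative-only sketch: change in log downloads modeled from
# supply-chain predictors. Synthetic data, not the NPM data set.
rng = np.random.default_rng(5)
n = 1000
n_upstream = rng.poisson(5, n)            # count of runtime dependencies
n_downstream = rng.poisson(3, n)          # count of dependent packages
dep_popularity = rng.uniform(0, 1, n)     # mean popularity of dependencies
# Assumed direction of effects, mirroring the abstract's finding that
# fewer, more popular dependencies associate with download growth
delta_log_dl = (0.3 * dep_popularity - 0.02 * n_upstream
                + 0.05 * n_downstream + rng.normal(0, 0.1, n))

X = np.column_stack([n_upstream, n_downstream, dep_popularity, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, delta_log_dl, rcond=None)
```

On data generated this way, the fitted coefficients recover the assumed signs: negative for upstream dependency count, positive for downstream count and dependency popularity.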
  4. Abstract

    Due to climate change and rapid urbanization, the Urban Heat Island (UHI) effect, featuring significantly higher temperatures in metropolitan areas than surrounding areas, has caused negative impacts on urban communities. Temporal granularity is often limited in UHI studies based on satellite remote sensing data, which typically has multi-day frequency coverage of a particular urban area. This low temporal frequency has restricted the development of models for predicting UHI. To resolve this limitation, this study has developed a cyber-based geographic information science and systems (cyberGIS) framework encompassing multiple machine learning models for predicting UHI with high-frequency urban sensor network data combined with remote sensing data focused on Chicago, Illinois, from 2018 to 2020. Enabled by rapid advances in urban sensor network technologies and high-performance computing, this framework is designed to predict UHI in Chicago with fine spatiotemporal granularity based on environmental data collected with the Array of Things (AoT) urban sensor network and Landsat-8 remote sensing imagery. Our computational experiments revealed that a random forest regression (RFR) model outperforms other models, with a prediction accuracy (mean absolute error) of 0.45 °C in 2020 and 0.8 °C in 2018 and 2019. Humidity, distance to the geographic center, and PM2.5 concentration are identified as important factors contributing to model performance. Furthermore, we estimate UHI in Chicago with 10-min temporal frequency and 1-km spatial resolution on the hottest day in 2018. It is demonstrated that the RFR model can accurately predict UHI at fine spatiotemporal scales with high-frequency urban sensor network data integrated with satellite remote sensing data.
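A minimal sketch of such an RFR setup, using scikit-learn and synthetic data in place of the AoT and Landsat-8 inputs (the feature set and functional form below are assumptions, kept only to mirror the factors the abstract names as important):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hedged sketch: UHI intensity predicted from humidity, distance to the
# city center, and PM2.5, evaluated with MAE as in the study. Synthetic data.
rng = np.random.default_rng(3)
n = 400
humidity = rng.uniform(30, 90, n)          # %
dist_km = rng.uniform(0, 20, n)            # distance to geographic center
pm25 = rng.uniform(5, 40, n)               # ug/m^3
# Invented UHI signal: stronger near the center, damped by humidity
uhi = 3.0 - 0.1 * dist_km - 0.02 * humidity + 0.03 * pm25 + rng.normal(0, 0.2, n)

X = np.column_stack([humidity, dist_km, pm25])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:300], uhi[:300])              # train on first 300 samples
mae = mean_absolute_error(uhi[300:], model.predict(X[300:]))
```

With high-frequency sensor inputs, the same fitted model can be evaluated at each timestep and grid cell to produce the fine-grained UHI maps described above.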

  5. Background

    Metamodels can address some of the limitations of complex simulation models by formulating a mathematical relationship between input parameters and simulation model outcomes. Our objective was to develop and compare the performance of a machine learning (ML)–based metamodel against a conventional metamodeling approach in replicating the findings of a complex simulation model.

    Methods
    We constructed 3 ML-based metamodels using random forest, support vector regression, and artificial neural networks, and a linear regression-based metamodel, from a previously validated microsimulation model of the natural history of hepatitis C virus (HCV) consisting of 40 input parameters. Outcomes of interest included societal costs and quality-adjusted life-years (QALYs), the incremental cost-effectiveness ratio (ICER) of HCV treatment versus no treatment, the cost-effectiveness acceptability curve (CEAC), and the expected value of perfect information (EVPI). We evaluated metamodel performance using root mean squared error (RMSE) and Pearson's R2 on the normalized data.
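The metamodel comparison can be sketched with a toy "simulator" standing in for the microsimulation model (10 parameters instead of 40, and an invented nonlinear outcome; scikit-learn is assumed here purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hedged sketch of metamodeling: a toy nonlinear "simulation model" maps
# input parameters to an outcome; linear regression and random forest
# metamodels are fit to replicate it and compared with R2.
rng = np.random.default_rng(4)
n, p = 1000, 10
params = rng.uniform(0, 1, (n, p))
# Stand-in "simulation outcome" (e.g., QALYs) with strong nonlinearity
outcome = (3.0 * np.sin(4 * np.pi * params[:, 0])
           + 2.0 * params[:, 1] * params[:, 2])

train, test = slice(0, 800), slice(800, None)
lr = LinearRegression().fit(params[train], outcome[train])
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(params[train], outcome[train])
r2_lr = r2_score(outcome[test], lr.predict(params[test]))
r2_rf = r2_score(outcome[test], rf.predict(params[test]))
```

On a nonlinear response like this, the random forest metamodel replicates the toy simulator's output far more closely than linear regression, mirroring the pattern reported in the results below.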

    Results
    The R2 values for the linear regression metamodel for QALYs without treatment, QALYs with treatment, societal cost without treatment, societal cost with treatment, and ICER were 0.92, 0.98, 0.85, 0.92, and 0.60, respectively. The corresponding R2 values for our ML-based metamodels were 0.96, 0.97, 0.90, 0.95, and 0.49 for support vector regression; 0.99, 0.83, 0.99, 0.99, and 0.82 for artificial neural network; and 0.99, 0.99, 0.99, 0.99, and 0.98 for random forest. Similar trends were observed for RMSE. The CEAC and EVPI curves produced by the random forest metamodel matched the results of the simulation output more closely than the linear regression metamodel.

    Conclusions
    ML-based metamodels generally outperformed traditional linear regression metamodels at replicating results from complex simulation models, with random forest metamodels performing best.

    Highlights
    Decision-analytic models are frequently used by policy makers and other stakeholders to assess the impact of new medical technologies and interventions. However, complex models can impose limitations on conducting probabilistic sensitivity analysis and value-of-information analysis, and may not be suitable for developing online decision-support tools. Metamodels, which accurately formulate a mathematical relationship between input parameters and model outcomes, can replicate complex simulation models and address these limitations. The machine learning–based random forest model can outperform linear regression in replicating the findings of a complex simulation model. Such a metamodel can be used for conducting cost-effectiveness and value-of-information analyses or developing online decision support tools.
