
Title: Evaluating and improving the reliability of gas-phase sensor system calibrations across new locations for ambient measurements and personal exposure monitoring
Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or other model fitting technique such as multiple linear regression to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood.
We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set. We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture model termed split-NN that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. 
Depending on the training configuration, compared to random forest the split-NN method reduced error 0 %–11 % for NO2 and 6 %–13 % for O3.
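The abstract notes that bias dominated transfer error and that transferability might be improved by detecting and correcting for bias. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' method or their split-NN model. It fits a multiple linear regression calibration at one simulated site, transfers it to a second simulated site with a different sensor offset, and subtracts a bias estimated from a short colocation window. The function names (`fit_mlr`, `predict_mlr`, `simulate_site`, `run_demo`), the synthetic signal model, and every numeric value are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y ~ [1, X]; returns coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Apply a fitted calibration model to raw sensor features."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def simulate_site(rng, n, temp_mean, conc_mean, offset):
    """Hypothetical site: the raw signal depends on concentration, temperature,
    and a site-specific offset (all values are made up for illustration)."""
    temp = rng.normal(temp_mean, 3.0, n)
    conc = np.clip(rng.normal(conc_mean, 8.0, n), 0.0, None)
    raw = 2.0 * conc + 0.5 * temp + offset + rng.normal(0.0, 1.0, n)
    return np.column_stack([raw, temp]), conc


def rmse(pred, ref):
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


def run_demo(seed=0, window=50):
    rng = np.random.default_rng(seed)
    # Site A is the training (colocation) site; site B is the transfer site.
    X_a, y_a = simulate_site(rng, 2000, temp_mean=18.0, conc_mean=30.0, offset=10.0)
    X_b, y_b = simulate_site(rng, 2000, temp_mean=30.0, conc_mean=45.0, offset=20.0)

    coef = fit_mlr(X_a, y_a)
    pred_b = predict_mlr(coef, X_b)

    # Estimate bias from a short colocation window at site B, then
    # evaluate on the remaining samples only.
    bias_est = np.mean(pred_b[:window] - y_b[:window])
    rest_pred, rest_ref = pred_b[window:], y_b[window:]
    corrected = rest_pred - bias_est

    return {
        "bias_before": float(np.mean(rest_pred - rest_ref)),
        "bias_after": float(np.mean(corrected - rest_ref)),
        "rmse_before": rmse(rest_pred, rest_ref),
        "rmse_after": rmse(corrected, rest_ref),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.2f}")
```

In this toy setup the transfer error is almost entirely a constant offset, so a single subtraction removes most of it. Real deployments would likely need a reference measurement at the new site and may show bias that varies with conditions, which this sketch does not model.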
Journal Name:
Atmospheric Measurement Techniques
Page Range or eLocation-ID:
4211 to 4239
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: As software development becomes more interdependent, unique relationships among software packages arise and form complex software ecosystems. Aim: We aim to understand the behavior of these ecosystems better through the lens of software supply chains and model how the effects of software dependency network affect the change in downloads of Javascript packages. Method: We analyzed 12,999 popular packages in NPM, between 01-December-2017 and 15-March-2018, using Linear Regression and Random Forest models and examined the effects of predictors representing different aspects of the software dependency supply chain on changes in numbers of downloads for a package. Result: Preliminary results suggest that the count and downloads of upstream and downstream runtime dependencies have a strong effect on the change in downloads, with packages having fewer, more popular packages as dependencies (upstream or downstream) likely to see an increase in downloads. This suggests that in order to interpret the number of downloads for a package properly, it is necessary to take into account the peculiarities of the supply chain (both upstream and downstream) of that package. Conclusion: Future work is needed to identify the effects of added, deleted, and unchanged dependencies for different types of packages, e.g. build tools, test tools.
  2. High-quality temperature data at a finer spatio-temporal scale is critical for analyzing the risk of heat exposure and hazards in urban environments. The variability of urban landscapes makes cities a challenging environment for quantifying heat exposure. Most of the existing heat hazard studies have inherent limitations on two fronts; first, the spatio-temporal granularities are too coarse, and second, the inability to track the ambient air temperature (AAT) instead of land surface temperature (LST). Overcoming these limitations requires developing models for mapping the variability in heat exposure in urban environments. We investigated an integrated approach for mapping urban heat hazards by harnessing a diverse set of high-resolution measurements, including both ground-based and satellite-based temperature data. We mounted vehicle-borne mobile sensors on city buses to collect high-frequency temperature data throughout 2018 and 2019. Our research also incorporated key biophysical parameters and Landsat 8 LST data into Random Forest regression modeling to map the hyperlocal variability of heat hazard over areas not covered by the buses. The vehicle-borne temperature sensor data showed large temperature differences within the city, with the largest variations of up to 10 °C and morning-afternoon diurnal changes at a magnitude around 20 °C. Random Forest modeling on noontime (11:30 am – 12:30 pm) data to predict AAT produced accurate results with a mean absolute error of 0.29 °C and successfully showcased the enhanced granularity in urban heat hazard mapping. These maps revealed well-defined hyperlocal variabilities in AAT, which were not evident with other research approaches. Urban core and dense residential areas revealed larger than 5 °C AAT differences from their nearby green spaces.
The sensing framework developed in this study can be easily implemented in other urban areas, and findings from this study will be beneficial in understanding the heat vulnerabilities of individual communities. It can be used by the local government to devise targeted hazard mitigation efforts such as increasing green space, developing better heat safety policies, and exposure warning for workers.
  3. Timely updates of carbon stock distribution are needed to better understand the impacts of deforestation and degradation on forest carbon stock dynamics. This research aimed to explore an approach for estimating aboveground carbon density (ACD) in the Brazilian Amazon through integration of MODIS (moderate resolution imaging spectroradiometer) and a limited number of light detection and ranging (LiDAR) data samples using linear regression (LR) and random forest (RF) algorithms, respectively. Airborne LiDAR data at 23 sites across the Brazilian Amazon were collected and used to calculate ACD. The ACD estimation model, which was developed by Longo et al. in the same study area, was used to map ACD distribution in the 23 sites. The LR and RF methods were used to develop ACD models, in which the samples extracted from LiDAR-estimated ACD were used as dependent variables and MODIS-derived variables were used as independent variables. The evaluation of modeling results indicated that ACD can be successfully estimated with a coefficient of determination of 0.67 and root mean square error of 4.18 kg C/m2 using RF based on spectral indices. The mixed pixel problem in MODIS data is a major factor in ACD overestimation, while cloud contamination and data saturation are major factors in ACD underestimation. These uncertainties in ACD estimation using MODIS data make it difficult to examine annual ACD dynamics of degradation and growth; however, this method can be used to examine the deforestation-induced ACD loss.
  4. Predictions from species distribution models (SDMs) are commonly used in support of environmental decision-making to explore potential impacts of climate change on biodiversity. However, because future climates are likely to differ from current climates, there has been ongoing interest in understanding the ability of SDMs to predict species responses under novel conditions (i.e., model transferability). Here, we explore the spatial and environmental limits to extrapolation in SDMs using forest inventory data from 11 model algorithms for 108 tree species across the western United States. Algorithms performed well in predicting occurrence for plots that occurred in the same geographic region in which they were fitted. However, a substantial portion of models performed worse than random when predicting for geographic regions in which algorithms were not fitted. Our results suggest that for transfers in geographic space, no specific algorithm was better than another as there were no significant differences in predictive performance across algorithms. There were significant differences in predictive performance for algorithms transferred in environmental space with GAM performing best. However, the predictive performance of GAM declined steeply with increasing extrapolation in environmental space relative to other algorithms. The results of this study suggest that SDMs may be limited in their ability to predict species ranges beyond the environmental data used for model fitting. When predicting climate-driven range shifts, extrapolation may also not reflect important biotic and abiotic drivers of species ranges, and thus further misrepresent the realized shift in range. Future studies investigating transferability of process based SDMs or relationships between geodiversity and biodiversity may hold promise.
  5. Given the urgency of climate change, development of fast and reliable methods is essential to understand urban building energy use in the sector that accounts for 40% of total energy use in USA. Although machine learning (ML) methods may offer promise and are less difficult to develop, discrepancies in methods, results, and recommendations have emerged that require attention. Existing research also shows inconsistencies related to integrating climate change models into energy modeling. To address these challenges, four models: random forest (RF), extreme gradient boosting (XGBoost), single regression tree, and multiple linear regression (MLR), were developed using the Commercial Building Energy Consumption Survey dataset to predict energy use intensity (EUI) under projected heating and cooling degree days by the Intergovernmental Panel on Climate Change (IPCC) across the USA during the 21st century. The RF model provided better performance and reduced the mean absolute error by 4%, 11%, and 12% compared to XGBoost, single regression tree, and MLR, respectively. Moreover, using the RF model for climate change analysis showed that office buildings’ EUI will increase by between 8.9% and 63.1% compared to 2012 baseline for different geographic regions between 2030 and 2080. One region is projected to experience an EUI reduction of almost 1.5%. Finally, good data enhance the predicting ability of ML; therefore, comprehensive regional building datasets are crucial to assess counteraction of building energy use in the face of climate change at finer spatial scale.