This content will become publicly available on February 1, 2026

Title: A Systematic Study of Popular Software Packages and AI/ML Models for Calibrating In Situ Air Quality Data: An Example with Purple Air Sensors
Accurate air pollution monitoring is critical to understand and mitigate the impacts of air pollution on human health and ecosystems. Due to the limited number and geographical coverage of advanced, highly accurate sensors monitoring air pollutants, many low-cost and low-accuracy sensors have been deployed. Calibrating low-cost sensors is essential to fill the geographical gap in sensor coverage. We systematically examined how different machine learning (ML) models and open-source packages could help improve the accuracy of particulate matter (PM) 2.5 data collected by Purple Air sensors. Eleven ML models and five packages were examined. This systematic study found that both models and packages impacted accuracy, while the random training/testing split ratio (e.g., 80/20 vs. 70/30) had minimal impact (0.745% difference for R2). Long Short-Term Memory (LSTM) models trained in RStudio and TensorFlow excelled, with high R2 scores of 0.856 and 0.857 and low Root Mean Squared Errors (RMSEs) of 4.25 µg/m3 and 4.26 µg/m3, respectively. However, LSTM models may be too slow (1.5 h) or computation-intensive for applications with fast response requirements. Tree-boosted models including XGBoost (0.7612, 5.377 µg/m3) in RStudio and Random Forest (RF) (0.7632, 5.366 µg/m3) in TensorFlow offered good performance with shorter training times (<1 min) and may be suitable for such applications. These findings suggest that AI/ML models, particularly LSTM models, can effectively calibrate low-cost sensors to produce precise, localized air quality data. This research is among the most comprehensive studies on AI/ML for air pollutant calibration. We also discussed limitations, applicability to other sensors, and the explanations for good model performances. This research can be adapted to enhance air quality monitoring for public health risk assessments, support broader environmental health initiatives, and inform policy decisions.
Award ID(s):
1841520
PAR ID:
10574327
Publisher / Repository:
Sensors
Journal Name:
Sensors
Volume:
25
Issue:
4
ISSN:
1424-8220
Page Range / eLocation ID:
1028
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Breathing in fine particulate matter of diameter less than 2.5 µm (PM2.5) greatly increases an individual’s risk of cardiovascular and respiratory diseases. As climate change progresses, extreme weather events, including wildfires, are expected to increase, exacerbating air pollution. However, models often struggle to capture extreme pollution events due to the rarity of high PM2.5 levels in training datasets. To address this, we implemented cluster-based undersampling and trained Transformer models to improve extreme event prediction using various cutoff thresholds (12.1 µg/m3 and 35.5 µg/m3) and partial sampling ratios (10/90, 20/80, 30/70, 40/60, 50/50). Our results demonstrate that the 35.5 µg/m3 threshold, paired with a 20/80 partial sampling ratio, achieved the best performance, with an RMSE of 2.080, MAE of 1.386, and R2 of 0.914, particularly excelling in forecasting high PM2.5 events. Overall, models trained on augmented data significantly outperformed those trained on original data, highlighting the importance of resampling techniques in improving air quality forecasting accuracy, especially for high-pollution scenarios. These findings provide critical insights into optimizing air quality forecasting models, enabling more reliable predictions of extreme pollution events. By advancing the ability to forecast high PM2.5 levels, this study contributes to the development of more informed public health and environmental policies to mitigate the impacts of air pollution, and advances the technology for building better air quality digital twins.
  2. The emergence of low-cost air quality sensors as viable tools for the monitoring of air quality at population and individual levels necessitates the evaluation of these instruments. The Flow air quality tracker, a product of Plume Labs, is one such sensor. To evaluate these sensors, we assessed 34 of them in a controlled laboratory setting by exposing them to PM10 and PM2.5 and compared the response with Plantower A003 measurements. The overall coefficient of determination (R2) of measured PM2.5 was 0.76 and of PM10 it was 0.73, but the Flows’ accuracy improved after each introduction of incense. Overall, these findings suggest that the Flow can be a useful air quality monitoring tool in air pollution areas with higher concentrations, when incorporated into other monitoring frameworks and when used in aggregate. The broader environmental implications of this work are that it is possible for individuals and groups to monitor their individual exposure to particulate matter pollution. 
  3. Short-term exposure to fine particulate matter (PM2.5) pollution is linked to numerous adverse health effects. Pollution episodes, such as wildfires, can lead to substantial increases in PM2.5 levels. However, sparse regulatory measurements provide an incomplete understanding of pollution gradients. Here, we demonstrate an infrastructure that integrates community-based measurements from a network of low-cost PM2.5 sensors with rigorous calibration and a Gaussian process model to understand neighborhood-scale PM2.5 concentrations during three pollution episodes (July 4, 2018, fireworks; July 5 and 6, 2018, wildfire; Jan 3−7, 2019, persistent cold air pool, PCAP). The firework/wildfire events included 118 sensors in 84 locations, while the PCAP event included 218 sensors in 138 locations. The model results accurately predict reference measurements during the fireworks (n: 16, hourly root-mean-square error, RMSE, 12.3−21.5 μg/m3, n(normalized)-RMSE: 9−24%), the wildfire (n: 46, RMSE: 2.6−4.0 μg/m3; nRMSE: 13.1−22.9%), and the PCAP (n: 96, RMSE: 4.9−5.7 μg/m3; nRMSE: 20.2−21.3%). They also revealed dramatic geospatial differences in PM2.5 concentrations that are not apparent when only considering government measurements or viewing the US Environmental Protection Agency’s AirNow’s visualizations. Complementing the PM2.5 estimates and visualizations are highly resolved uncertainty maps. Together, these results illustrate the potential for low-cost sensor networks that combined with a data-fusion algorithm and appropriate calibration and training can dynamically and with improved accuracy estimate PM2.5 concentrations during pollution episodes. These highly resolved uncertainty estimates can provide a much-needed strategy to communicate uncertainty to end users.
  4. The use of air quality monitoring networks to inform urban policies is critical especially where urban populations are exposed to unprecedented levels of air pollution. High costs, however, limit city governments’ ability to deploy reference grade air quality monitors at scale; for instance, only 33 reference grade monitors are available for the entire territory of Delhi, India, spanning 1500 sq km with 15 million residents. In this paper, we describe a high-precision spatio-temporal prediction model that can be used to derive fine-grained pollution maps. We utilize two years of data from a low-cost monitoring network of 28 custom-designed low-cost portable air quality sensors covering a dense region of Delhi. The model uses a combination of message-passing recurrent neural networks combined with conventional spatio-temporal geostatistics models to achieve high predictive accuracy in the face of high data variability and intermittent data availability from low-cost sensors (due to sensor faults, network, and power issues). Using data from reference grade monitors for validation, our spatio-temporal pollution model can make predictions within 1-hour time-windows at 9.4, 10.5, and 9.6% Mean Absolute Percentage Error (MAPE) over our low-cost monitors, reference grade monitors, and the combined monitoring network respectively. These accurate fine-grained pollution sensing maps provide a way forward to build citizen-driven low-cost monitoring systems that detect hazardous urban air quality at fine-grained granularities.
  5. Low-cost sensors (LCSs) emerge as a popular tool for urban micro-climate studies by offering dense observational coverage. This study evaluates the performance of PurpleAir (PA) sensors for ambient temperature monitoring—a key but underexplored aspect of their use. While widely used for particulate matter, PA sensors’ temperature data remain underutilized and lack thorough validation. For the first time, this research evaluates their accuracy by comparing PA temperature measurements with collocated high-precision temperature data loggers across a dense urban network in a humid subtropical U.S. county. Results show a moderate correlation with reference data (r = 0.86) but an average overestimation of 3.77 °C, indicating PA sensors are better suited for identifying temperature trends but not for precise applications like extreme heat events. We also developed and compared eight calibration methods to create a replicable model using readily available crowdsourced data. The best-performing model reduced RMSE and MAE by 51% and 47%, respectively, and achieved an R2 of 0.89 compared to the uncalibrated scenario. Finally, the practical application of PA temperature data for identifying heat wave events was investigated, including an assessment of associated uncertainties. In sum, this work provides a crucial evaluation of PA’s temperature monitoring capabilities, offering a pathway for improved heat mapping, multi-hazard vulnerability assessments, and public health interventions in the development of climate-resilient cities. 