This content will become publicly available on February 1, 2026

Title: A Systematic Study of Popular Software Packages and AI/ML Models for Calibrating In Situ Air Quality Data: An Example with Purple Air Sensors
Accurate air pollution monitoring is critical to understanding and mitigating the impacts of air pollution on human health and ecosystems. Because advanced, highly accurate sensors for air pollutants are limited in number and geographical coverage, many low-cost, lower-accuracy sensors have been deployed. Calibrating low-cost sensors is essential to fill the geographical gap in sensor coverage. We systematically examined how different machine learning (ML) models and open-source packages could help improve the accuracy of fine particulate matter (PM2.5) data collected by Purple Air sensors. Eleven ML models and five packages were examined. This systematic study found that both models and packages impacted accuracy, while the random training/testing split ratio (e.g., 80/20 vs. 70/30) had minimal impact (0.745% difference for R²). Long Short-Term Memory (LSTM) models trained in RStudio and TensorFlow excelled, with high R² scores of 0.856 and 0.857 and low Root Mean Squared Errors (RMSEs) of 4.25 µg/m³ and 4.26 µg/m³, respectively. However, LSTM models may be too slow (1.5 h) or computation-intensive for applications with fast response requirements. Tree-based ensemble models, including XGBoost (R² = 0.7612, RMSE = 5.377 µg/m³) in RStudio and Random Forest (RF) (R² = 0.7632, RMSE = 5.366 µg/m³) in TensorFlow, offered good performance with shorter training times (<1 min) and may be suitable for such applications. These findings suggest that AI/ML models, particularly LSTM models, can effectively calibrate low-cost sensors to produce precise, localized air quality data. This research is among the most comprehensive studies on AI/ML for air pollutant calibration. We also discussed limitations, applicability to other sensors, and the explanations for good model performance. This research can be adapted to enhance air quality monitoring for public health risk assessments, support broader environmental health initiatives, and inform policy decisions.
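The two accuracy metrics reported in the abstract (R² and RMSE) and the comparison of random split ratios can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the helper names (`rmse`, `r2_score`, `random_split`, `evaluate_split`) are hypothetical, and a simple linear fit stands in for the study's ML models. It is not the authors' pipeline and does not reproduce their numbers.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def random_split(n_samples, test_fraction, seed):
    """Return (train_idx, test_idx) for a random split of n_samples rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


def evaluate_split(raw, reference, test_fraction, seed=1):
    """Fit a linear calibration on the training rows, score on the test rows."""
    train, test = random_split(len(raw), test_fraction, seed)
    # Stand-in model (assumption): reference = slope * raw + intercept.
    slope, intercept = np.polyfit(raw[train], reference[train], deg=1)
    calibrated = slope * raw[test] + intercept
    return r2_score(reference[test], calibrated), rmse(reference[test], calibrated)


# Synthetic data (assumption): a low-cost sensor that over-reads a reference
# monitor with a constant offset and Gaussian noise, all in µg/m³.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 50.0, size=500)
raw = 1.6 * reference + 3.0 + rng.normal(0.0, 4.0, size=500)

for test_fraction in (0.2, 0.3):  # 80/20 and 70/30 splits
    r2, err = evaluate_split(raw, reference, test_fraction)
    print(f"test fraction {test_fraction:.1f}: R2 = {r2:.3f}, RMSE = {err:.2f} µg/m³")
```

Running both split ratios on the same data is one way to check how sensitive the scores are to the split, which is the kind of comparison the abstract summarizes with its 0.745% figure.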
Award ID(s):
1841520
PAR ID:
10574327
Publisher / Repository:
Sensors
Date Published:
Journal Name:
Sensors
Volume:
25
Issue:
4
ISSN:
1424-8220
Page Range / eLocation ID:
1028
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Breathing in fine particulate matter of diameter less than 2.5 µm (PM2.5) greatly increases an individual’s risk of cardiovascular and respiratory diseases. As climate change progresses, extreme weather events, including wildfires, are expected to increase, exacerbating air pollution. However, models often struggle to capture extreme pollution events due to the rarity of high PM2.5 levels in training datasets. To address this, we implemented cluster-based undersampling and trained Transformer models to improve extreme event prediction using various cutoff thresholds (12.1 µg/m³ and 35.5 µg/m³) and partial sampling ratios (10/90, 20/80, 30/70, 40/60, 50/50). Our results demonstrate that the 35.5 µg/m³ threshold, paired with a 20/80 partial sampling ratio, achieved the best performance, with an RMSE of 2.080, MAE of 1.386, and R² of 0.914, particularly excelling in forecasting high PM2.5 events. Overall, models trained on augmented data significantly outperformed those trained on original data, highlighting the importance of resampling techniques in improving air quality forecasting accuracy, especially for high-pollution scenarios. These findings provide critical insights into optimizing air quality forecasting models, enabling more reliable predictions of extreme pollution events. By advancing the ability to forecast high PM2.5 levels, this study contributes to the development of more informed public health and environmental policies to mitigate the impacts of air pollution, and advances the technology for building better air quality digital twins.
  2. The emergence of low-cost air quality sensors as viable tools for the monitoring of air quality at population and individual levels necessitates the evaluation of these instruments. The Flow air quality tracker, a product of Plume Labs, is one such sensor. To evaluate these sensors, we assessed 34 of them in a controlled laboratory setting by exposing them to PM10 and PM2.5 and compared the response with Plantower A003 measurements. The overall coefficient of determination (R²) was 0.76 for measured PM2.5 and 0.73 for PM10, but the Flows’ accuracy improved after each introduction of incense. Overall, these findings suggest that the Flow can be a useful air quality monitoring tool in areas with higher pollutant concentrations, when incorporated into other monitoring frameworks and when used in aggregate. The broader environmental implications of this work are that it is possible for individuals and groups to monitor their individual exposure to particulate matter pollution.
  3. Short-term exposure to fine particulate matter (PM2.5) pollution is linked to numerous adverse health effects. Pollution episodes, such as wildfires, can lead to substantial increases in PM2.5 levels. However, sparse regulatory measurements provide an incomplete understanding of pollution gradients. Here, we demonstrate an infrastructure that integrates community-based measurements from a network of low-cost PM2.5 sensors with rigorous calibration and a Gaussian process model to understand neighborhood-scale PM2.5 concentrations during three pollution episodes (July 4, 2018, fireworks; July 5 and 6, 2018, wildfire; Jan 3−7, 2019, persistent cold air pool, PCAP). The firework/wildfire events included 118 sensors in 84 locations, while the PCAP event included 218 sensors in 138 locations. The model results accurately predict reference measurements during the fireworks (n: 16, hourly root-mean-square error, RMSE: 12.3−21.5 μg/m³; normalized RMSE, nRMSE: 9−24%), the wildfire (n: 46, RMSE: 2.6−4.0 μg/m³; nRMSE: 13.1−22.9%), and the PCAP (n: 96, RMSE: 4.9−5.7 μg/m³; nRMSE: 20.2−21.3%). They also revealed dramatic geospatial differences in PM2.5 concentrations that are not apparent when only considering government measurements or viewing the US Environmental Protection Agency’s AirNow visualizations. Complementing the PM2.5 estimates and visualizations are highly resolved uncertainty maps. Together, these results illustrate the potential for low-cost sensor networks that, combined with a data-fusion algorithm and appropriate calibration and training, can estimate PM2.5 concentrations dynamically and with improved accuracy during pollution episodes. These highly resolved uncertainty estimates can provide a much-needed strategy to communicate uncertainty to end users.
  4. The use of air quality monitoring networks to inform urban policies is critical, especially where urban populations are exposed to unprecedented levels of air pollution. High costs, however, limit city governments’ ability to deploy reference grade air quality monitors at scale; for instance, only 33 reference grade monitors are available for the entire territory of Delhi, India, spanning 1500 sq km with 15 million residents. In this paper, we describe a high-precision spatio-temporal prediction model that can be used to derive fine-grained pollution maps. We utilize two years of data from a monitoring network of 28 custom-designed, low-cost portable air quality sensors covering a dense region of Delhi. The model combines message-passing recurrent neural networks with conventional spatio-temporal geostatistics models to achieve high predictive accuracy in the face of high data variability and intermittent data availability from low-cost sensors (due to sensor faults, network, and power issues). Using data from reference grade monitors for validation, our spatio-temporal pollution model can make predictions within 1-hour time-windows at 9.4, 10.5, and 9.6% Mean Absolute Percentage Error (MAPE) over our low-cost monitors, reference grade monitors, and the combined monitoring network, respectively. These accurate fine-grained pollution sensing maps provide a way forward to build citizen-driven low-cost monitoring systems that detect hazardous urban air quality at fine granularity.
  5. Belmont County, Ohio is heavily dominated by unconventional oil and gas development that results in high levels of ambient air pollution. Residents here chose to work with a national volunteer network to develop a method of participatory science to answer questions about the association between the health of their community and pollution exposure from the many industrial point sources in the county and the surrounding area and river valley. After first directing their questions to the government agencies responsible for permitting and protecting public health, residents noted the lack of detailed data and understanding of the impact of these industries. These residents and environmental advocates are using the resulting science to open a dialogue with the EPA in the hope of ultimately developing, in collaboration, air quality standards that better protect public health. Results from comparing measurements from a citizen-led participatory low-cost, high-density air pollution sensor network of 35 particulate matter and 25 volatile organic compound sensors against regulatory monitors show low correlations (consistently R² < 0.55). This network analysis, combined with complementary models of emission plumes, is revealing the inadequacy of the sparse regulatory air pollution monitoring network in the area and opening many avenues for public health officials to further verify people’s experiences and act in the interest of residents’ health with enforcement and informed permitting practices. Further, the collaborative best practices developed by this study serve as a launchpad for other community science efforts looking to monitor local air quality in response to industrial growth.