<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>An Environmental Data Collection for COVID-19 Pandemic Research</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10208492</idno>
					<idno type="doi">10.3390/data5030068</idno>
					<title level='j'>Data</title>
<idno>2306-5729</idno>
<biblScope unit="volume">5</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>Qian Liu</author><author>Wei Liu</author><author>Dexuan Sha</author><author>Shubham Kumar</author><author>Emily Chang</author><author>Vishakh Arora</author><author>Hai Lan</author><author>Yun Li</author><author>Zifu Wang</author><author>Yadong Zhang</author><author>Zhiran Zhang</author><author>Jackson T. Harris</author><author>Srikar Chinala</author><author>Chaowei Yang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The COVID-19 viral disease surfaced at the end of 2019 and quickly spread across the globe. To rapidly respond to this pandemic and offer data support for various communities (e.g., decision-makers in health departments and governments, researchers in academia, public citizens), the National Science Foundation (NSF) spatiotemporal innovation center constructed a spatiotemporal platform with various task forces including international researchers and implementation strategies. Compared to similar platforms that only offer viral and health data, this platform treats the virus-related environmental data collection (EDC) as an important component for the geospatial analysis of the pandemic. The EDC contains environmental factors that are either proven or have the potential to influence the spread and virulence of COVID-19 or the impact of the pandemic on human health (e.g., temperature, humidity, precipitation, air quality index and pollutants, nighttime light (NTL)). In this platform/framework, environmental data are processed and organized across multiple spatiotemporal scales for a variety of applications (e.g., global mapping of daily temperature, humidity, precipitation, correlation of the pandemic to the mean values of climate and weather factors by city). This paper introduces the raw input data, construction and metadata of reprocessed data, and data storage, as well as the sharing and quality control methodologies of the COVID-19 related environmental data collection.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>center (<ref type="url">https://covid-19.stcenter.net/</ref>) <ref type="bibr">[22]</ref>, with standardized spatiotemporal data structures at multiple spatiotemporal scales.</p><p>This paper offers a comprehensive description of the COVID-19 related environmental data collection. The paper is organized as follows: Section 2 introduces the raw data, derived values, and metadata of the collection; Section 3 describes how the derived values are produced and how the data are processed and stored; Section 4 illustrates the data publishing method and provides download addresses; and finally, Section 5 introduces the data quality control method.</p><p>Temperature and humidity have been shown to be closely related to the spread and control of COVID-19 <ref type="bibr">[23,</ref><ref type="bibr">24]</ref>. Our data collection includes reanalyzed temperature and humidity from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), providing historical and current values that researchers and decision-makers can use to estimate and predict the spreading trends and patterns of the pandemic. MERRA-2 provides data dating back to 1980 with a spatial resolution of ~50 km. It includes advances that enable the assimilation of modern hyperspectral radiance and microwave observations, along with GPS-Radio Occultation datasets, as well as additional advances in both the Goddard Earth Observing System (GEOS) model and the Gridpoint Statistical Interpolation (GSI) assimilation system <ref type="bibr">[25]</ref>. The influence and spreading trend of the pandemic can only be accurately analyzed and predicted when climatological factors are removed from the data. The long-term availability of MERRA-2 allows researchers to exclude such factors, for example seasonal cycles and trends, from COVID-19 related analyses. Detailed information is shown in Table <ref type="table">1</ref>.</p></div>
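<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how a multi-year record can be used to remove the seasonal cycle, the following minimal sketch subtracts a calendar-day climatology from a daily series. It is not the procedure used for the collection: the function name daily_anomalies and the sample values are hypothetical, only the seasonal cycle (not a long-term trend) is removed, and plain Python is used so that the example is self-contained.</p>
<p><code lang="python"><![CDATA[
```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_anomalies(series):
    """Subtract the multi-year mean of each calendar day from a daily series.

    `series` maps datetime.date -> value (e.g. daily mean 2-m temperature in K).
    Returns a dict with the same keys holding value minus climatology.
    """
    # Group all values that fall on the same (month, day) across years.
    by_day = defaultdict(list)
    for d, value in series.items():
        by_day[(d.month, d.day)].append(value)

    # Climatology: multi-year mean for each calendar day.
    climatology = {key: mean(values) for key, values in by_day.items()}

    # Anomaly: observed value minus the climatological value for that day.
    return {d: v - climatology[(d.month, d.day)] for d, v in series.items()}


# Hypothetical values for one calendar day observed in three years.
sample = {
    date(2018, 3, 1): 280.0,
    date(2019, 3, 1): 282.0,
    date(2020, 3, 1): 287.0,
}
anomalies = daily_anomalies(sample)
```
]]></code></p>
<p>With a record as long as that of MERRA-2, the climatology for each calendar day would be averaged over several decades rather than the three years used in this toy input.</p></div>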
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">IMERG Precipitation Estimation</head><p>Precipitation is an important climate and weather factor that influences the moisture and humidity of the Earth. Although no study has yet been published on the relationship between COVID-19 and precipitation, precipitation plays a role in the spread of other infectious diseases <ref type="bibr">[26]</ref>. Our data collection therefore includes the Integrated Multi-satellitE Retrievals for GPM (IMERG) product as a potentially related data source for the study of COVID-19. The IMERG precipitation estimate is a satellite-observation-based rainfall measurement that provides global coverage with a spatial resolution of 10 km and a temporal resolution of 30 min <ref type="bibr">[27,</ref><ref type="bibr">28]</ref>. It also provides historical datasets since 2014, which allow researchers to investigate seasonal trends and to study the relationship between precipitation and COVID-19 more accurately. Detailed information is shown in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.">NPP/VIIRS Nighttime Light Radiance</head><p>Nighttime light reflects human activities and economic conditions. Therefore, the impact of COVID-19 on humans can be detected through the investigation and analysis of the radiance values of nighttime light images <ref type="bibr">[5]</ref>. Our data sharing platform collects and processes NASA's Suomi-NPP VIIRS-DNB (VNP46A1) data, archived at NASA's LAADS DAAC data center (<ref type="url">https://ladsweb.modaps.eosdis.nasa.gov/</ref>), for the periods before, during, and after the quarantine policies affecting specific regions. The data have a spatial resolution of 500 m and a daily temporal resolution, and are preprocessed using NASA's Black Marble algorithm <ref type="bibr">[29]</ref>. The nighttime light data allow researchers in an even broader range of domains, such as remote sensing, economics, the humanities, urban planning, and medicine, to carry out their specific studies related to COVID-19. Detailed information is shown in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.4.">Aura-OMI Air Pollution Observation</head><p>Air pollutants such as NO2 are important indicators of economic activity <ref type="bibr">[30]</ref> and influence the mortality of COVID-19 <ref type="bibr">[8]</ref>. The collection of air pollution data is therefore crucial for studying both the economic impact of COVID-19 and the spread of the virus. The Ozone Monitoring Instrument (OMI) flies on the National Aeronautics and Space Administration's Earth Observing System Aura satellite, launched in July 2004 <ref type="bibr">[31]</ref>. OMI has a spatial resolution of 25 km and covers the globe once a day. It measures criteria pollutants such as NO, O3, NO2 and SO2. The US Environmental Protection Agency (EPA) has designated these atmospheric constituents as posing serious threats to human health and agricultural productivity. Many countries take these pollutants into account in the pollution index used to evaluate air quality. These measurements track industrial pollution and biomass burning, and hence can be used to evaluate pollution levels and emission changes at large scales, such as global and country-wide. The outbreak of COVID-19 has forced many countries to shut down industrial activities. As a result, the amounts of various pollutants released into the environment were significantly reduced <ref type="bibr">[32]</ref>. The spatiotemporal distribution of OMI data therefore changes dynamically as COVID-19 spreads. Detailed information is shown in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.5.">Ground-Based Air Quality Data</head><p>Ground-based air quality data are derived from national and regional meteorological and environmental protection departments all over the world. Currently, the data collection includes air quality data released by the China Environmental Monitoring Centre (<ref type="url">http://www.cnemc.cn/sssj/</ref>) and the United States Environmental Protection Agency (<ref type="url">https://www.epa.gov/outdoor-air-quality-data</ref>). Ground-based air quality data are generally published in the form of a daily report, across an ambient air quality monitoring network covering four scales: country, province/state, city, and/or county. Concentrations and Air Quality Indices (AQI) of O3, NO2, SO2, PM10, CO and PM2.5 can be obtained from the Chinese and American data sources. We will continue to acquire ground-based air quality data from other countries and regions, e.g., the UK, the EU, Canada, and Australia. Detailed information is shown in Table <ref type="table">1</ref>.</p><p>Acquisition Method Description: All hourly 2-m specific humidity values (the "QV2M" variable in the original dataset) for each day are averaged and stored in the "daily_QV2M" variable of the derived daily 2-m specific humidity dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Derived Product and Metadata</head><p>(2) Daily Reanalyzed 2-m Air Temperature. Frequency: daily mean value. Spatial Grid: 2D, single-level, full horizontal resolution. Granule size: ~836 k. Dimensions: longitude = 576, latitude = 361. Acquisition Method Description: All hourly 2-m air temperature values (the "T2M" variable in the original dataset) for each day are averaged and stored in the "daily_T2M" variable of the derived daily 2-m air temperature data.</p><p>(3) Daily Precipitation. Frequency: daily mean value. Spatial Grid: 2D, single-level, full horizontal resolution. Granule size: ~25.9 MB. Dimensions: longitude = 3600, latitude = 1800. Acquisition Method Description: All half-hourly calibrated precipitation values (the "precipitationCal" variable in the original dataset) for each day are averaged and stored in the "daily_precipitation" variable of the derived daily precipitation data.</p><p>(4) Monthly Nighttime Light Radiance. Frequency: monthly mean value. Spatial Grid: 1D. Granule size: varies according to spatial coverage (China: 1.5 G). Dimensions: number of pixels. Acquisition Method Description: All cloudless daily nighttime light radiance values over the target region (the "DNB_At_Sensor_Radiance_500m" variable in the original dataset) within each month are averaged and stored in the "monthly_mean_radiance" variable of the derived monthly nighttime light radiance data.</p><p>(5) Metadata. The metadata of the daily/monthly global environmental factors are listed in Table <ref type="table">2</ref>.</p><p>The city-level daily statistics for Temperature/Humidity/Precipitation/NO2 tropospheric vertical column density (TVCD) are obtained through the following steps. First, the reprocessed data (e.g., temperature, humidity, precipitation from Section 2.2.1) are converted from NetCDF to GeoTIFF. Second, the vector boundary of each city is obtained by linking to the "GID_2" field in the administrative boundary map. 
Third, all pixel values within the vector boundary of each city are collected into a statistical array. Fourth, the maximum, minimum, and average values of Temperature/Humidity/Precipitation/NO2 TVCD are calculated from the statistical array of each city, and the results are exported to a CSV file with the variable names "Max", "Min" and "Mean".</p><p>Name: Province-/State-level daily statistics for Temperature/Humidity/Precipitation. Format: CSV File. Contains information: Province/State code (GID_1), Max value, Mean value, Min value. Acquisition Method Description: The province-/state-level daily statistics for Temperature/Humidity/Precipitation are obtained through a procedure similar to that used at the city level, except that the vector boundary of each province/state is obtained by linking to the "GID_1" field in the administrative boundary map.</p><p>(3) Environmental Factors at the Country Level. Name: Country-level daily statistics for Temperature/Humidity/Precipitation. Format: CSV File. Contains information: Country code (GID_0), Max value, Mean value, Min value. Acquisition Method Description: The country-level daily statistics for Temperature/Humidity/Precipitation are obtained through a procedure similar to that used at the city level, except that the vector boundary of each country is obtained by linking to the "GID_0" field in the administrative boundary map.</p><p>(4) Metadata. The metadata of the environmental factors at multiple administrative levels are listed in Table <ref type="table">3</ref>.</p></div>
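<div xmlns="http://www.tei-c.org/ns/1.0"><p>The daily averaging described for the derived products can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: the function name daily_mean, the fill value -9999.0, and the rule of skipping missing cells are hypothetical, and nested Python lists stand in for the gridded arrays read from the original files.</p>
<p><code lang="python"><![CDATA[
```python
def daily_mean(time_steps, fill_value=None):
    """Average a list of 2-D grids (lists of rows) cell by cell.

    Cells equal to `fill_value` are treated as missing and skipped; a cell
    that is missing in every time step keeps `fill_value` in the output.
    """
    n_rows = len(time_steps[0])
    n_cols = len(time_steps[0][0])
    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            valid = [g[i][j] for g in time_steps if g[i][j] != fill_value]
            row.append(sum(valid) / len(valid) if valid else fill_value)
        result.append(row)
    return result


# Hypothetical 1 x 2 grid with three time steps of "T2M" (K);
# -9999.0 marks a gap in the second step (assumed fill value).
t2m_steps = [
    [[280.0, 290.0]],
    [[282.0, -9999.0]],
    [[284.0, 292.0]],
]
daily_t2m = daily_mean(t2m_steps, fill_value=-9999.0)
```
]]></code></p>
<p>For the real products, each day would contribute 24 hourly grids of 361 by 576 cells for temperature and humidity, or 48 half-hourly grids of 1800 by 3600 cells for precipitation, and an array library would replace the explicit loops.</p></div>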
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Spatiotemporal Aggregation and Collocation</head><p>For the reprocessed environmental data (e.g., temperature, humidity, precipitation), it is necessary to relate the data in both time and space. This study statistically analyzes the environmental characteristics at daily and monthly scales and at different administrative levels based on vector boundaries.</p><p>As shown in Figure <ref type="figure">1</ref>, global maps of daily average factors are generated by aggregating the hourly and half-hourly data along the temporal dimension for each spatial location, and the means are written to NetCDF files.</p><p>The collocation with different administrative levels is implemented in the Python programming language. The open-source libraries GDAL and netCDF4 are used to convert the reprocessed data (e.g., temperature, humidity, precipitation) from NetCDF to GeoTIFF. Using open-source libraries such as "geopandas", "shapely" and "rasterio", the vector boundaries of the different administrative levels (country, province/state and county/city) are applied as masks, and the GeoTIFF pixels covered by each mask are collected into a statistical array. For each pixel array, the statistical characteristics (maximum, minimum, and average) are extracted with the NumPy scientific computing library under the specified calculation conditions, and the results are exported to a CSV file for storage. The specific procedure is shown in Figure <ref type="figure">1</ref>.</p></div>
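<div xmlns="http://www.tei-c.org/ns/1.0"><p>The masking and statistics step can be sketched as follows. This is a simplified illustration, not the pipeline itself: it assumes the boundary has already been rasterized into a boolean mask (in the described workflow that mask would come from the vector boundary via the geospatial libraries), and the function names zonal_stats and write_stats_csv, the sample grid, and the unit identifier are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import csv
import io


def zonal_stats(grid, mask, nodata=None):
    """Return (max, min, mean) of the grid cells where mask is True."""
    values = [
        value
        for grid_row, mask_row in zip(grid, mask)
        for value, inside in zip(grid_row, mask_row)
        if inside and value != nodata
    ]
    if not values:
        return None  # no valid pixel falls inside this unit
    return max(values), min(values), sum(values) / len(values)


def write_stats_csv(stats_by_unit, handle, id_field="GID_2"):
    """Write one row per administrative unit with Max, Min and Mean columns."""
    writer = csv.writer(handle)
    writer.writerow([id_field, "Max", "Min", "Mean"])
    for unit_id, stats in sorted(stats_by_unit.items()):
        if stats is not None:
            writer.writerow([unit_id, *stats])


# Hypothetical 2 x 3 raster and a mask for one city (True = inside boundary).
grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
city_mask = [[True, True, False], [False, True, False]]
stats = {"USA.1.1_1": zonal_stats(grid, city_mask)}  # hypothetical unit id

buffer = io.StringIO()
write_stats_csv(stats, buffer)
csv_text = buffer.getvalue()
```
]]></code></p>
<p>Passing id_field="GID_1" or id_field="GID_0" gives the corresponding header for the province/state and country levels; only the mask changes between levels.</p></div>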
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Collocating Environmental Factors with COVID-19 Case Data</head><p>The proposed environmental data collection is integrated and published together with the GMU STC Center data cube so that it can be associated with COVID-19 case data. The data cube structure is established and used to represent factors and values from a spatiotemporal perspective. Because the target regions span multiple scales, the dataset is divided by country and region at the first level, and the administrative scales are archived and shared under distinct regional folders. Daily reports and time-series summary reports are processed and published for each country and administrative level. For example, the United States folder includes an administrative 1 folder for the state-level dataset and an administrative 2 folder for the county-level dataset. Under the USA administrative 1 folder, a group of CSV files forms the daily report dataset: each file holds one day's extracted and processed environmental values for every state. The summary report keeps only the most recently updated files, one per factor, each recording the time series of values for every state.</p></div>
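<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relationship between the daily reports and the per-factor summary report can be sketched as a simple pivot. The sketch below is illustrative only: the in-memory layout, the function name build_summary, the unit identifiers, and the values are hypothetical, and the actual files on the platform may be organized differently.</p>
<p><code lang="python"><![CDATA[
```python
from collections import defaultdict


def build_summary(daily_reports, factor):
    """Pivot daily reports into one time series per administrative unit.

    `daily_reports` maps an ISO date string to {unit_id: {factor: value}}.
    Returns {unit_id: [(date, value), ...]} ordered by date.
    """
    summary = defaultdict(list)
    for day in sorted(daily_reports):  # ISO dates sort chronologically
        for unit_id, factors in daily_reports[day].items():
            if factor in factors:
                summary[unit_id].append((day, factors[factor]))
    return dict(summary)


# Hypothetical daily reports for two states on two days (values in K).
reports = {
    "2020-04-02": {
        "USA.47_1": {"temperature": 285.1},
        "USA.5_1": {"temperature": 290.4},
    },
    "2020-04-01": {
        "USA.47_1": {"temperature": 284.0},
        "USA.5_1": {"temperature": 289.9},
    },
}
temperature_summary = build_summary(reports, "temperature")
```
]]></code></p>
<p>Rebuilding the summary whenever a new daily report arrives matches the description that only the latest summary file per factor is kept.</p></div>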
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Data Computing and Storage on AWS Cloud Platform</head><p>Cloud computing is becoming the standard approach to the processing, storage, access, and management of large-scale remotely sensed (RS) imagery datasets. Many cloud platform providers offer users a "pay as you go" service to support customized computing needs. For example, Amazon Web Services (AWS), Microsoft Azure, and Google's Compute Engine provide IaaS (Infrastructure as a Service), PaaS (Platform as a Service), or SaaS (Software as a Service). In this study, AWS was adopted as the cloud platform to support elastic storage and processing tasks for the nighttime light radiance, temperature, humidity, pollutant, and precipitation datasets. Data were scraped automatically from multiple RS data portals, stored in a storage-optimized virtual instance, and published to AWS S3 distributed storage. Using more than one hundred computing cores and two hundred gigabytes of memory, a multi-task Python-based processing workflow was deployed to mine those datasets and produce COVID-19 related results from the perspective of RS observation. A distributed computing approach is being developed to accommodate global-scale, multi-source RS data processing in a single run with a reasonable processing time.</p></div>
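<div xmlns="http://www.tei-c.org/ns/1.0"><p>A multi-task workflow of this kind can be outlined with the Python standard library. The sketch below is only a schematic: process_granule is a hypothetical placeholder for the real per-file work, the file names are invented, and a thread pool is used for simplicity where a process pool or a distributed framework would be the natural choice for CPU-bound processing on many cores.</p>
<p><code lang="python"><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor


def process_granule(name):
    """Hypothetical placeholder for per-file work (e.g. a daily mean)."""
    return name, len(name)


def process_all(granule_names, max_workers=4):
    """Run process_granule over all names concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_granule, granule_names))


# Invented file names standing in for downloaded granules.
results = process_all(["granule_a.nc", "granule_b.nc4"])
```
]]></code></p>
<p>Because each granule is processed independently, the same structure scales by raising max_workers or by distributing the list of granules across machines.</p></div>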
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Quality Control</head><p>To provide reliable environmental data sources to the geospatial and COVID-19 communities, populated data are evaluated in three dimensions, namely data integrity, consistency, and validity, to ensure high-quality data publishing.</p><p>Raw data selection, cleaning and qualification: The first and crucial step in creating a high-quality COVID-19 related environmental data collection is to select proper raw input data. To guarantee this, we first review as much of the literature on COVID-19 related research as we can to decide which environmental factors should be included in the final collection and which data sources researchers usually work with. We then sift through the potential data sources and choose, for each environmental factor, the one that is most frequently adopted, stable and authoritative. In the data processing step, we filter out all invalid and unreasonable values as well as variables that are not related to COVID-19.</p><p>Data integrity: This means that populated data should be comprehensive. A thorough check is applied to time-series data to make sure that the data contain all historical data stored in the data sources. In addition, since daily gridded environmental data are mapped to an administrative-level shapefile to provide regional environmental data, the integrity check ensures that the generated data are provided for each unit (e.g., counties in the US) at a given administrative level whenever the data are available in the source files.</p><p>Data consistency: This requires that data in our repository are consistent with other sources. On the one hand, extracted data should be consistent with the values in the data sources; on the other hand, regional derived values (e.g., country-level monthly mean temperature) should be consistent with global and temporal distributions. For example, the mean temperature at a location in winter is lower than the mean temperature there in summer. 
Precipitation values are relatively larger, and precipitation is more frequent, in the Intertropical Convergence Zone (ITCZ) and the South Pacific Convergence Zone (SPCZ).</p><p>Data validity: This dimension concerns data reliability. The data sources are provided together with the populated data, so that data consumers can investigate the sources themselves to assess validity.</p></div>
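<div xmlns="http://www.tei-c.org/ns/1.0"><p>The integrity checks and the filtering of unreasonable values can be sketched as small helper functions. This is a minimal illustration rather than the checks actually run on the collection: the function names, the sample dates, the unit identifiers, and any numeric bounds passed to is_reasonable are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
from datetime import date, timedelta


def missing_dates(available, start, end):
    """Integrity: list the dates in [start, end] absent from `available`."""
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=k) for k in range(n_days))
    return [d for d in expected if d not in available]


def missing_units(expected_ids, report_ids):
    """Integrity: administrative units expected but absent from a report."""
    return sorted(set(expected_ids) - set(report_ids))


def is_reasonable(value, lower, upper):
    """Cleaning: True when the value lies within the assumed physical bounds."""
    return lower <= value <= upper


# Hypothetical inputs: one day and one unit are missing.
have = {date(2020, 3, 1), date(2020, 3, 3)}
gaps = missing_dates(have, date(2020, 3, 1), date(2020, 3, 3))
absent = missing_units(["A", "B", "C"], ["A", "C"])
```
]]></code></p>
<p>An empty result from both integrity functions indicates that every expected day and every administrative unit is present; non-empty results point to the specific files or units to re-check against the source.</p></div>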
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>Our proposed data collection encompasses COVID-19 related environmental datasets that serve as a data basis and reference for users in broader communities (e.g., governmental and urban planning departments, meteorological and climatological scientists, medical and disease control researchers). It offers an alternative to data collection efforts that provide only virus case data. The proposed collection is associated with the COVID-19 gateway of GMU's NSF Spatiotemporal Center and is stored on a stable and highly available AWS server to provide multiple-scale spatiotemporal data at high acquisition speed <ref type="bibr">[33]</ref>. The collection includes various data types and features, including temperature, humidity, air quality, nighttime light and precipitation.</p><p>The raw datasets are automatically downloaded from the data sources using Python programs, and the derived values are produced as soon as the newest raw data are released. This procedure guarantees the timeliness of the collection.</p><p>The proposed framework is a growing data collection whose content is extended according to the needs and requirements of users and the evolution of the pandemic. For example, the team is working to automatically match OMI NO2 data to administrative shapefiles and to provide country- and county-level NO2 information to the communities. These NO2 data are expected to contribute to the Earth data aspects of big spatiotemporal data analytics in fighting the COVID-19 pandemic <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>.</p></div></body>
		</text>
</TEI>
