<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>MT-HCCAR: Multi-task Deep Learning with Hierarchical Classification and Attention-Based Regression for Cloud Property Retrieval</title></titleStmt>
			<publicationStmt>
				<publisher>Springer Nature Switzerland</publisher>
				<date>09/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10577782</idno>
					<idno type="doi">10.1007/978-3-031-70381-2_1</idno>
					
					<author>Xingyan Li</author><author>Andrew M Sayer</author><author>Ian T Carroll</author><author>Xin Huang</author><author>Jianwu Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Clouds are integral components of the Earth system, wielding substantial influence over the planet's energy dynamics, climate regulation, and the hydrological cycle <ref type="bibr">[24]</ref>. Satellites have long been an indispensable tool to help us understand our environment. A prominent category of these instruments, commonly referred to as imagers, passively collect measurements of the Earth across various combinations of ultraviolet (UV), visible (VIS), near-infrared (NIR), shortwave infrared (SWIR), and thermal infrared (TIR) wavelength ranges. These measurements of reflected solar and/or emitted thermal radiation in different spectral bands undergo routine processing by algorithms to convert them into geophysical parameters of interest (atmospheric and surface characteristics) in a process known as "retrieval". This is typically done pixel-by-pixel for each satellite image. For clouds, several key properties are targeted, including a cloud mask (which distinguishes cloud-covered pixels from cloud-free ones), thermodynamic phase (indicating whether the cloud comprises liquid water droplets or ice crystals), and cloud optical thickness (COT), which is a measure of both the amount of light scattered by a cloud and the quantity of liquid or ice within it. The routine retrieval of geophysical parameters is essential for advancing our understanding of Earth's climate <ref type="bibr">[13]</ref>.</p><p>One practical motivation for our work is NASA's Plankton, Aerosol, Cloud, ocean Ecosystem (PACE) mission <ref type="bibr">[28]</ref>, which launched in February 2024. Existing NASA cloud masking algorithms are not directly applicable to PACE's main sensor, the Ocean Color Instrument (OCI) <ref type="bibr">[5]</ref>, because its spectral bands differ from those of the sensors for which these algorithms were developed. 
OCI has similarities with some previous spaceborne imagers such as MODIS <ref type="bibr">[3]</ref>, VIIRS <ref type="bibr">[6]</ref>, and ABI <ref type="bibr">[1]</ref> sensor types. However, two key differences are that 1) OCI has continuous hyperspectral coverage in the UV-NIR, plus discrete SWIR bands, while the others are only multi-spectral (up to a dozen or so discrete relevant bands); and 2) OCI lacks TIR bands. Some of the most commonly used cloud masking tests for those sensors are therefore not applicable, and adapting a subset of those tests would miss out on OCI's unique abilities, so a new approach is warranted.</p><p>Numerous studies have explored machine learning techniques to extract cloud properties from satellite sensor data. These include both retrieval of a single cloud property <ref type="bibr">[17,</ref><ref type="bibr">25,</ref><ref type="bibr">12]</ref>, and simultaneous retrievals of multiple cloud properties <ref type="bibr">[26,</ref><ref type="bibr">14]</ref>. However, challenges persist in this area. To begin with, it's unclear how incorporating atmospheric domain knowledge, such as physical relationships between cloud properties, alongside advanced machine learning techniques, enhances retrieval accuracy. Furthermore, despite the deployment of numerous satellite sensors for similar cloud retrieval tasks (as discussed in Section 3), the generalizability, especially from an Earth science perspective, of employing a unified machine learning architecture across different sensors remains uncertain.</p><p>To address the above challenges we introduce MT-HCCAR, an end-to-end Multi-Task Learning (MTL) model for cloud masking, cloud phase prediction, and COT regression. Our contributions are fourfold. 
First, we incorporate hierarchical classification to capture the hierarchical relationship between cloud masking and cloud phase classification, enhancing prediction performance as demonstrated in the comparison experiment with baseline methods and ablation study. Second, we employ a cross-attention module to improve COT regression accuracy by leveraging similarities across the classification and regression networks. Third, we conduct quantitative experiments to demonstrate the advantages of our MT-HCCAR model over state-of-the-art baselines and its generalizability across three different satellite sensors. Finally, we perform quantitative and qualitative evaluation of performance variations across sensors from an Earth science perspective. Our implementation is publicly available at GitHub <ref type="bibr">[2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>In recent years, various machine learning techniques, such as Random Forest (RF), Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN), have been employed for retrieving different cloud properties. Among them, most of the work targets cloud detection <ref type="bibr">[17,</ref><ref type="bibr">25]</ref>, cloud phase <ref type="bibr">[11]</ref>, and cloud thickness <ref type="bibr">[21]</ref>. While these machine learning approaches for cloud property retrieval predominantly leverage spectral features, their main limitations lie in two aspects: 1) many studies do not consider background knowledge, such as the task order (e.g., cloud mask prediction, cloud phase prediction, and COT prediction), and 2) several studies conduct different tasks independently, lacking knowledge sharing between classification and regression tasks. These limitations underscore the need for more integrated and informed methodologies in cloud property retrieval studies.</p><p>Multi-task learning (MTL) is proving valuable in Earth science and remote sensing by jointly enhancing performance across diverse remote-sensing tasks through shared features. Studies have shown that classification and regression tasks can be implemented together to improve model performance <ref type="bibr">[15,</ref><ref type="bibr">7]</ref>. There are also studies that apply MTL specifically to cloud property retrieval. Yang et al. <ref type="bibr">[29]</ref> developed an MLP-based method to retrieve cloud macrophysical parameters (cloud mask, cloud top temperature, and cloud top height) using Himawari-8 satellite data. Wang et al. <ref type="bibr">[26]</ref> proposed TIR-CNN based on the U-Net model to retrieve cloud properties including cloud mask, COT, effective particle radius (CER), and cloud top height (CTH) from thermal infrared radiometry. 
The architecture consists of encoding and decoding layers, convolutional blocks, batch normalization layers, and leaky Rectified Linear Units (ReLU). The results of applying the model to thermal infrared spectral data from MODIS are used to compare model performance for daytime and nighttime data.</p><p>While the above studies offer valuable insights that are beneficial to our design, our work cannot be compared with most of them directly via experiments. This is because the OCI data we use lacks some of the spectral bands, such as TIR bands, that those methods rely on. Consequently, we do not compare our model with architectures employing convolutional layers, such as CNNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Statement and Data Simulation</head><p>Cloud property determination requires two operations: simulation and retrieval. Simulation is a forward process mapping from the Earth's state (cloud and surface properties) to satellite direct observations (the reflected light signal the satellites see). Retrieval is an inverse process, determining the geophysical quantities from the satellite observations. Figure <ref type="figure">1</ref> illustrates these two processes with colors showing differences in the wavelength band of the three sensors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Radiative Transfer Simulation</head><p>The simulation process provides training data for our cloud property retrieval model. This study was conducted before the launch of PACE, when the spacecraft had not yet entered routine operations. Therefore, simulated datasets for OCI, VIIRS, and ABI were utilized to evaluate the sensors and architectures against a standardized source, enabling direct comparisons. This approach ensures consistency as the satellites are tested on identical simulated scenes with the same viewing geometries, which would not be feasible using real data.</p><p>Radiative transfer (RT) models describe the scattering and absorption processes that affect the propagation of light through Earth's atmosphere and surface. RT is the forward model that maps surface and atmospheric properties to the spectral intensities seen by the space-borne sensors. We account for realistic variations in the properties of clouds (phase, microphysical, optical, and vertical structure), the atmospheric profile (aerosol, gas, Rayleigh scattering, temperature, and pressure), surface reflectance, and solar/observation geometry. Further details are given in Sayer et al. <ref type="bibr">[20]</ref>, with two modifications: 1) simulating cloud-free conditions as well as single-layer clouds, and 2) utilizing 20 different surface reflectance classes included in the libRadtran RT model <ref type="bibr">[10]</ref> (including various land surface spectra, water, and snow/sea ice). The simulation does not include multi-layer or mixed-phase cloud systems, aligning with most satellite retrieval processing algorithms. However, future work may extend the network to include such systems. We generate a dataset of 250,000 simulations for model training and evaluation. 
We simulate from the UV to SWIR and convolve these simulations with the solar spectrum from <ref type="bibr">[9]</ref> and sensor relative spectral response functions for OCI <ref type="bibr">[5]</ref>, ABI <ref type="bibr">[1]</ref>, and VIIRS <ref type="bibr">[6]</ref> in order to generate the spectral top of atmosphere reflectance signal that the instruments would observe. This provides our simulated observations and reference truth (cloud classification and COT), along with band centers for each sensor. For OCI, there are 233 bands in total, including 226 hyperspectral bands ranging from 350 to 890 nm with 2.5 nm spacing, and seven discrete NIR/SWIR bands centered near 940, 1040, 1250, 1378, 1620, 2130, and 2260 nm. VIIRS has 10 bands centered near 412, 445, 488, 555, 672, 865, 1240, 1380, 1610, and 2250 nm. ABI has 6 bands centered near 471, 640, 860, 1370, 1600, and 2200 nm.</p><p>Each ABI band has a close analog in VIIRS, and similarly, each VIIRS band has a counterpart in OCI, thus forming a trio representing increasing complexity. Another commonly used sensor, MODIS, has similar spectral coverage to VIIRS, so we do not include it. Both ABI and VIIRS have TIR bands, but these are omitted from consideration as our primary focus is on developing a new methodology for OCI, which lacks TIR bands. Application to ABI and VIIRS demonstrates the broader applicability of our new architectures.</p></div>
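<div xmlns="http://www.tei-c.org/ns/1.0"><p>The claim that each ABI band has a close analog in VIIRS can be checked directly from the band centers listed above. The following is a minimal sketch, not part of the published implementation: the helper names (nearest_band, match_bands) are hypothetical, and "close" is left to the reader's judgment by simply reporting the offset in nm.</p>
<p><![CDATA[
```python
# Band centers (nm) exactly as listed in the text above.
ABI_BANDS = [471, 640, 860, 1370, 1600, 2200]
VIIRS_BANDS = [412, 445, 488, 555, 672, 865, 1240, 1380, 1610, 2250]


def nearest_band(center_nm, candidates):
    """Return (closest candidate center, absolute offset in nm)."""
    best = min(candidates, key=lambda c: abs(c - center_nm))
    return best, abs(best - center_nm)


def match_bands(source, target):
    """Map every source band center to its nearest target band center."""
    return {s: nearest_band(s, target) for s in source}


# Hypothetical usage: pair every ABI band with its nearest VIIRS band.
abi_to_viirs = match_bands(ABI_BANDS, VIIRS_BANDS)
for abi, (viirs, offset) in abi_to_viirs.items():
    print(f"ABI {abi} nm -> VIIRS {viirs} nm (offset {offset} nm)")
```
]]></p>
<p>With the centers above, the largest ABI-to-VIIRS offset is 50 nm (2200 nm versus 2250 nm), and the smallest is 5 nm (860 nm versus 865 nm).</p></div>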
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Cloud Property Retrieval</head><p>Our primary objective is to accurately model the base-10 logarithm of COT (throughout the remaining text, COT means this logarithm unless explicitly described as the "original" COT) for pixels labeled as cloudy. We work in log space for COT as this has a more linear relationship with the brightness seen by the satellite than the original COT. Simultaneously, the model should be able to accurately classify the cloud mask and phase [cloudy, cloud-free, cloudy-liquid, cloudy-ice] for each pixel in the dataset to aid COT prediction. Therefore, our problem consists of two tasks: 1) a classification task to predict cloud mask and phase, and 2) a regression task to predict COT values. We denote the labels as <formula notation="TeX">\bar{C}</formula> = 'cloud-free', <formula notation="TeX">C</formula> = 'cloudy', <formula notation="TeX">CL</formula> = 'cloudy-liquid', and <formula notation="TeX">CI</formula> = 'cloudy-ice', and use bold notation for arrays. The details of our study are outlined below.</p><p>Model input. 1) Input features are represented as a matrix <formula notation="TeX">\mathbf{X}</formula> with one row per pixel and M columns, where M is the dimension of the available features. The exact input feature variables include: i) surface pressure in millibar (mbar), ii) total column water vapor content in millimeters (mm), iii) total column ozone content in Dobson units (DU), iv) types of Earth surface with 4 categories (land, snow, desert, and ocean/water), v) top of atmosphere reflectance at different wavelengths collected by each spaceborne sensor, vi) viewing zenith angle, solar zenith angle, and the relative azimuth angle, in degrees. 2) Cloud mask/phase labels <formula notation="TeX">\mathbf{l}^{cls}</formula> are used to train the classification task of the model. The set of possible label values is <formula notation="TeX">l = \{\bar{C}, C, CL, CI\}</formula>. There is a hierarchical relationship between the labels, as pixels covered by liquid cloud or ice cloud are both cloudy pixels. Also, there is no coexistence between liquid cloud and ice cloud in our data. Thus, CL and CI are treated as two subclasses of the label C = 'cloudy'. 
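3) True COT values <formula notation="TeX">\mathbf{y}^{cot}</formula> are used to train the regression task of the model. COT values are not available for pixels with no cloud cover. That is, if pixel i is cloud-free (<formula notation="TeX">l^{cls}_i = 0</formula>), then <formula notation="TeX">y^{cot}_i</formula> = N/A.</p><p>Model output. 1) Predicted cloud mask/phase class <formula notation="TeX">\hat{\mathbf{l}}^{cls}</formula>, with probabilities <formula notation="TeX">\mathbf{u}</formula> of each pixel belonging to each of the four classes. 2) Predicted COT value <formula notation="TeX">\hat{\mathbf{y}}^{cot}</formula> for each pixel. 3) Model architecture <formula notation="TeX">\mathcal{M}</formula> with parameters &#946;. The predicted values are generated by the model: <formula notation="TeX">[\hat{\mathbf{y}}^{cot}, \hat{\mathbf{l}}^{cls}] = \mathcal{M}(\mathbf{X} \mid \beta)</formula>.</p></div>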
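<div xmlns="http://www.tei-c.org/ns/1.0"><p>The target construction described in this section can be sketched for a single pixel as follows. This is an illustrative sketch only: the function name make_targets, the dictionary layout, and the integer encodings (0 for cloud-free, 1 for cloudy; 0 for liquid, 1 for ice) are assumptions for the example and are not taken from the released code. Python's None stands in for the N/A value.</p>
<p><![CDATA[
```python
import math

CLOUD_FREE = "cloud-free"
CLOUDY_LIQUID = "cloudy-liquid"
CLOUDY_ICE = "cloudy-ice"


def make_targets(phase, original_cot=None):
    """Build hierarchical labels and the log10 COT target for one pixel.

    Returns a dict with:
      mask  - 0 for cloud-free, 1 for cloudy (assumed encoding)
      phase - None for cloud-free, 0 for liquid, 1 for ice (assumed encoding)
      cot   - base-10 logarithm of the original COT, or None (N/A)
    """
    if phase == CLOUD_FREE:
        # COT is not defined where there is no cloud.
        return {"mask": 0, "phase": None, "cot": None}
    if phase not in (CLOUDY_LIQUID, CLOUDY_ICE):
        raise ValueError(f"unknown label: {phase!r}")
    if original_cot is None or original_cot <= 0:
        raise ValueError("cloudy pixels need a positive original COT")
    return {
        "mask": 1,
        "phase": 0 if phase == CLOUDY_LIQUID else 1,
        "cot": math.log10(original_cot),
    }


# Hypothetical pixels: one clear-sky, one ice cloud with original COT = 100.
clear_pixel = make_targets(CLOUD_FREE)
ice_pixel = make_targets(CLOUDY_ICE, 100.0)
```
]]></p>
<p>Keeping the mask and phase labels separate mirrors the hierarchy in the text: the phase label only exists for pixels whose mask label is cloudy.</p></div>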
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">MT-HCCAR Model</head><p>Our objective is to train a deep learning model to accomplish two tasks: 1) the classification of cloud mask and phase for each pixel based on its reflectance values, and 2) the subsequent prediction of COT values for pixels classified as cloudy. Toward this objective, our proposed MT-HCCAR model is illustrated in Figure <ref type="figure">2</ref>, and the subsections below describe each of its modules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Encoder-Decoder Sub-Network</head><p>An encoder-decoder sub-network, containing an encoder module and a decoder module, is employed in our model to learn latent feature parameters that can be shared for other learning tasks. To integrate the COT regression task and cloud mask/phase prediction task, we adopt the soft-parameter sharing approach of MTL where the shared parameters are derived by the encoder-decoder. Previous studies <ref type="bibr">[22,</ref><ref type="bibr">16]</ref> have used similar encoder-decoder techniques to transform original features into more relevant features for better downstream prediction accuracy. In our model, the decoder reconstructs the input feature <formula notation="TeX">X</formula> into <formula notation="TeX">\hat{X}</formula>, and a loss between <formula notation="TeX">X</formula> and <formula notation="TeX">\hat{X}</formula> is minimized during training to limit the distortion of the input features throughout the feature extraction layers.</p><p>The shared encoder is formed by the first three dense layers, each of which is followed by a ReLU activation function. This feature extractor is a wide-to-narrow structure where the number of kernels at each layer decreases gradually. The decoder and reconstruction branch is narrow-to-wide and further improves the ability of the encoder to learn features for different tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Hierarchical Classification (HC) Sub-Network</head><p>Utilizing the shared parameters acquired by the encoder-decoder sub-network, MT-HCCAR incorporates a hierarchical classification sub-network. This sub-network comprises a cloud mask classification module and a cloud phase classification module, facilitating cloud mask and phase predictions. To enhance the physical interpretability of the model, the architecture is crafted to mirror human cognitive processes in understanding cloud labels. Fundamentally, to predict cloud phase and COT values from an Earth science perspective, the model must discern between cloudy and non-cloudy pixels before further classifying liquid-phase and ice-phase pixels and predicting COT values for cloudy pixels.</p><p>Hierarchical classification has the capability to categorize instances according to label levels, forming a tree-like structure where each label functions as a node on the tree <ref type="bibr">[18]</ref>. The HC network used in MT-HCCAR consists of two classifiers <formula notation="TeX">C_{Mask}</formula> and <formula notation="TeX">C_{Phase}</formula>, the first of which distinguishes cloudy pixels from cloud-free pixels, and the output is uncertainties for the two labels </p></div>
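<div xmlns="http://www.tei-c.org/ns/1.0"><p>One common way to combine the outputs of a two-level label tree is to multiply probabilities along each path from the root. The sketch below illustrates that idea for the mask and phase classifiers. It is an assumption made for illustration, not the combination rule confirmed by this paper; the function names and the 0.5 decision threshold are likewise hypothetical.</p>
<p><![CDATA[
```python
def hierarchical_probs(p_cloudy, p_liquid_given_cloudy):
    """Combine mask-level and phase-level outputs into per-label probabilities.

    p_cloudy:              probability that the pixel is cloudy (mask level)
    p_liquid_given_cloudy: probability of liquid phase, given a cloudy pixel
    """
    for p in (p_cloudy, p_liquid_given_cloudy):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return {
        "cloud-free": 1.0 - p_cloudy,
        "cloudy": p_cloudy,
        # Leaf probabilities follow the path root -> cloudy -> phase.
        "cloudy-liquid": p_cloudy * p_liquid_given_cloudy,
        "cloudy-ice": p_cloudy * (1.0 - p_liquid_given_cloudy),
    }


def predict_label(p_cloudy, p_liquid_given_cloudy, threshold=0.5):
    """Decide the mask first; only cloudy pixels receive a phase label."""
    if p_cloudy < threshold:
        return "cloud-free"
    return "cloudy-liquid" if p_liquid_given_cloudy >= threshold else "cloudy-ice"


# Hypothetical classifier outputs for one pixel.
probs = hierarchical_probs(0.8, 0.25)
label = predict_label(0.8, 0.25)
```
]]></p>
<p>Under this rule the cloud-free, cloudy-liquid, and cloudy-ice probabilities sum to one, and the two phase probabilities sum to the cloudy probability, which matches the subclass relationship between the labels.</p></div>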
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Classification Assisted Regression Sub-Network based on Cross Attention Mechanism (CAR)</head><p>Taking the encoded feature set and the cloud mask classification results as input, our regression sub-network CAR predicts the COT value of each cloudy pixel. Two novel design choices improve COT prediction accuracy. First, instead of a direct regression module, an auxiliary coarse classification module is added to predict which sub-range of COT each pixel falls into. Second, a residual-based cross-attention mechanism inspired by <ref type="bibr">[23]</ref> is developed to enable the regression module and the auxiliary classification module to share relevant correlations and insights.</p><p>As illustrated in Figure <ref type="figure">2</ref>, there is a connection between internal features from the auxiliary classifier and those from the regression network, facilitated by a residual-based cross-attention module. The close alignment in tasks between the auxiliary classifier and the regression network enables the cross-attention mechanism to selectively attend to features relevant to one task while simultaneously executing the other. This contrasts with relationships involving the regression task and other tasks, such as cloud phase classification, where associations are forged through joint optimization of their respective losses during training iterations, rather than through the shared utilization of internal features.</p><p>Our auxiliary coarse classification involves discretizing continuous values into coarse groups, serving as a preprocessing step before regression to align pixels with similar characteristics in the feature space. Moreover, the auxiliary classifier assigns greater importance to the regression task during model training to get accurate COT predictions. 
In MT-HCCAR, an auxiliary classifier is employed to categorize continuous COT values into three distinct levels: thin cloud, moderate cloud, and thick cloud. Specifically, we define thin cloud pixels as those with logarithmic COT values within the range of [-1.5, 0], moderate cloud in [0, 1], and thick cloud in [1, 2.5]. Given the analogous nature of the tasks in COT coarse classification and COT regression, the features extracted by the auxiliary classifier are integrated with those obtained from the regression network.</p><p>To learn the joint features more effectively, we utilize cross-attention layers to enhance the integration of features from both regression layers and coarse classification layers, facilitating deeper feature learning and more cohesive feature amalgamation. Adding a residual connection to the cross-attention mechanism helps ensure that the module has a stable output; related research such as Wang et al. <ref type="bibr">[27]</ref> proposes a non-local operation network that uses a residual connection to insert blocks into a network. With the residual connection in our module <formula notation="TeX">A</formula>, the output is <formula notation="TeX">A(\vartheta_1, \vartheta_2) = W_z \, y_A(\vartheta_1, \vartheta_2) + \vartheta_1</formula>, where, as shown in Figure <ref type="figure">2</ref>, <formula notation="TeX">\vartheta_1</formula> is the feature from the regression network, <formula notation="TeX">\vartheta_2</formula> is the feature from the auxiliary classification module, <formula notation="TeX">W_z</formula> is the weight matrix to map <formula notation="TeX">y_A</formula> to <formula notation="TeX">\vartheta_2</formula>, and <formula notation="TeX">y_A</formula> denotes the computation within the cross-attention mechanism. The calculation of <formula notation="TeX">y_A</formula> involves three variables: a query (Q), a key (K), and a value (V) <ref type="bibr">[23]</ref>. They are calculated based on corresponding weights matrices W</p></div>
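<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coarse thickness groups used by the auxiliary classifier can be derived from the logarithmic COT with a simple binning step, sketched below. The ranges are those stated above. Because the stated intervals share their endpoints, the sketch assumes that a value exactly at 0 or 1 goes to the higher group; this tie-breaking rule and the names cot_bin and one_hot are assumptions for illustration only.</p>
<p><![CDATA[
```python
import bisect

LOG_COT_MIN, LOG_COT_MAX = -1.5, 2.5   # overall range stated in the text
INTERIOR_EDGES = [0.0, 1.0]            # thin | moderate | thick
GROUP_NAMES = ["thin", "moderate", "thick"]


def cot_bin(log_cot):
    """Return the coarse thickness group for a log10 COT value.

    Assumption: boundary values (exactly 0 or 1) fall into the higher group.
    """
    if not LOG_COT_MIN <= log_cot <= LOG_COT_MAX:
        raise ValueError(f"log10 COT {log_cot} is outside [-1.5, 2.5]")
    # bisect_right counts how many interior edges are <= log_cot.
    return GROUP_NAMES[bisect.bisect_right(INTERIOR_EDGES, log_cot)]


def one_hot(group):
    """Encode a group name as a 3-element target vector."""
    return [1 if name == group else 0 for name in GROUP_NAMES]


# Hypothetical log10 COT values for four cloudy pixels.
groups = [cot_bin(v) for v in (-0.5, 0.0, 0.7, 1.8)]
```
]]></p>
<p>The resulting group labels serve as classification targets for the auxiliary classifier, while the continuous value remains the regression target.</p></div>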
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Model Training of MT-HCCAR</head><p>The loss function <formula notation="TeX">L_{MT\text{-}HCCAR}</formula> for training the MT-HCCAR model is formulated as the weighted sum of four components: a hierarchical classification loss <formula notation="TeX">L_{HC}</formula>, a regression loss <formula notation="TeX">L_{CAR}</formula>, a reconstruction loss <formula notation="TeX">L_{Rec}</formula>, and a Lasso regularization loss <formula notation="TeX">L_{Lasso}</formula>. That is, <formula notation="TeX">L_{MT\text{-}HCCAR} = L_{HC} + L_{CAR} + L_{Rec} + L_{Lasso}</formula>. The four components are calculated using different rules.</p><p>Among the four components, the reconstruction loss <formula notation="TeX">L_{Rec}</formula> and the Lasso regularization loss <formula notation="TeX">L_{Lasso}</formula> can be directly calculated. The reconstruction loss <formula notation="TeX">L_{Rec}</formula> describes the difference between the input features <formula notation="TeX">X</formula> and the reconstruction of the input generated through the encoder and the decoder. The loss function is a mean squared error (MSE) between <formula notation="TeX">D(X) = \hat{X}</formula> and <formula notation="TeX">X</formula>, which means <formula notation="TeX">L_{Rec} = \sum_{i=1}^{n} (X_i - \hat{X}_i)^2</formula>. The Lasso regularization loss <formula notation="TeX">L_{Lasso}</formula> is an additional penalty that regularizes the training process by increasing the sparsity of the model. <formula notation="TeX">L_{Lasso}</formula> = &#955; </p><p>The CAR loss <formula notation="TeX">L_{CAR}</formula> is the sum of the cross-entropy loss from the auxiliary classifier <formula notation="TeX">C_{Aux}</formula> and the L1 loss from regression. That is, <formula notation="TeX">L_{CAR} = L_R + L_{C_{Aux}}</formula>. Given the predicted COT values <formula notation="TeX">\hat{y}^{cot}</formula> and the sigmoid outputs of thickness-group uncertainties <formula notation="TeX">[\hat{u}_{thin}, \hat{u}_{mod}, \hat{u}_{thick}]</formula> from the auxiliary classifier, the two components are</p></div>
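<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the structure of the combined loss concrete, the sketch below assembles the components in plain Python for a single example. It is a simplified illustration under stated assumptions, not the training code: the reconstruction term follows the squared-difference sum given above, the Lasso term is assumed to be the usual L1 norm of the weights scaled by a coefficient lam, the L1 regression loss is assumed to be averaged over pixels, and the cross-entropy is assumed to be the negative log of the score assigned to the true thickness group. All function names are hypothetical.</p>
<p><![CDATA[
```python
import math


def reconstruction_loss(x, x_hat):
    """Sum of squared differences between input features and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def lasso_penalty(weights, lam):
    """Assumed form: lam times the L1 norm of the model weights."""
    return lam * sum(abs(w) for w in weights)


def l1_regression_loss(y_true, y_pred):
    """Mean absolute error between true and predicted log10 COT (assumed mean)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(scores, true_index, eps=1e-12):
    """Negative log of the score assigned to the true group (assumed form)."""
    return -math.log(max(scores[true_index], eps))


def car_loss(y_true, y_pred, group_scores, true_group_index):
    """Regression loss plus auxiliary classification loss."""
    return l1_regression_loss(y_true, y_pred) + cross_entropy(
        group_scores, true_group_index
    )


def total_loss(l_hc, l_car, l_rec, l_lasso):
    """Sum of the four components, as in the formula above."""
    return l_hc + l_car + l_rec + l_lasso


# Hypothetical numbers for one training step.
l_rec = reconstruction_loss([1.0, 2.0], [1.0, 4.0])
l_lasso = lasso_penalty([1.0, -2.0, 0.5], lam=0.1)
l_car = car_loss([1.0, 2.0], [1.5, 1.0], [0.2, 0.5, 0.3], true_group_index=1)
loss = total_loss(0.4, l_car, l_rec, l_lasso)
```
]]></p>
<p>In a deep-learning framework these terms would be computed on tensors so that gradients flow through all sub-networks jointly; the scalar version here only shows how the components add up.</p></div>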
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>To evaluate our model's performance, we compare MT-HCCAR with baseline methods and conduct an ablation study across OCI, VIIRS, and ABI datasets. Given that our dataset comprises independent pixels, we choose two baseline methods from prior research with comparable task objectives and data formats for our experiments: 1) Chen et al. (2018) <ref type="bibr">[8]</ref>: An MLP-based method for cloud property retrieval. We use 1 hidden layer with 10 nodes, the same configuration the authors used. 2) Liu et al. (2022) <ref type="bibr">[17]</ref>: An RF-based method for cloud detection. We apply one RF to cloud masking and another RF with the same configuration to cloud phase classification. The parameters involved in the method are ntrees = 100 and mdepth = 10.</p><p>Besides the above two baseline methods, four ablation study models are also used to evaluate the effectiveness of different modules in our model. The comparison of the four models and our MT-HCCAR is shown in Table <ref type="table">1</ref>. The detailed differences can be found in the supplementary material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experiment Setup</head><p>Our model was implemented using the Python deep-learning library PyTorch. All baseline models and proposed models were tested on a single GPU on Kaggle. The dataset comprises satellite data simulations for the OCI, VIIRS, and ABI instruments. Each of the three datasets encompasses N = 250,000 instances. We split data into 62.5%, 22.5%, and 10.0% for training, validation, and test sets, respectively. All models are trained with the same hyperparameters, including learning rate = 1e-5, batch size = 64, and training epochs = 500. All experiment results are mean values of metrics from a 10-fold cross-validation.</p><p>Simulations of OCI, VIIRS, and ABI are utilized instead of observations as our labeled training dataset principally because there is no suitable comprehensive reference truth dataset. For example, while the standard retrieval products from these sensors could be used to train networks, these products have known limitations, so there would be a risk of the model learning the artifacts in these products. In a similar vein, comparing the results of applying the trained NN to real satellite observations against these standard products will be instructive for getting a general sense of reasonableness, but not for gauging absolute performance. It is expected that, by using a more realistic training set of simulations than was used to develop the physically-based retrievals (which include many simplifying assumptions made out of computational necessity decades ago when these approaches were developed), the NN model should outperform them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Evaluation Metrics</head><p>We use three types of metrics to evaluate our work, including metrics for the classification task, metrics for the regression task, and metrics from an Earth science perspective.</p><p>Metrics for classification performance. We use accuracy and average precision to evaluate classification performance. Cloud masking accuracy (<formula notation="TeX">ACC_{bi}</formula>) is the fraction of correct predictions of the two top-level categories [cloudy, cloud-free] by our model. The area under the precision-recall curve (<formula notation="TeX">AU(PRC)_c</formula>) is calculated for each label. The weighted area under the precision-recall curve (<formula notation="TeX">AU(PRC)_w</formula>) is the weighted mean of precisions across all labels at each threshold h for class i, with the weight being the difference between the recall at threshold h and the recall at threshold h - 1, where the number of thresholds approaches infinity. The weighted precision and recall are:</p><p>Metrics for regression performance. We use MSE and <formula notation="TeX">R^2</formula> to evaluate regression performance. Mean squared error (MSE) is the average squared difference between the true and predicted COT values.</p><p>The coefficient of determination (<formula notation="TeX">R^2</formula>) measures how close the predicted value is to the true value: <formula notation="TeX">R^2 = 1 - \frac{\sum_{i=1}^{n} (y^{cot}_i - \hat{y}^{cot}_i)^2}{\sum_{i=1}^{n} (y^{cot}_i - \bar{y}^{cot})^2}</formula>, where <formula notation="TeX">\bar{y}^{cot}</formula> is the mean value of all true COT values <formula notation="TeX">[y^{cot}_1, y^{cot}_2, \ldots, y^{cot}_n]</formula>.</p><p>Earth science metrics. Besides the above metrics, we also evaluate how a machine learning model could be used for actual satellite missions. The fraction of pixels meeting PACE goals (FMG) is an evaluation metric defined by the PACE Validation Plan <ref type="bibr">[4]</ref> based on scientific requirements and expectations. For each pixel, the relative error</p><p>Then FMG represents the percentage of pixels whose relative error is less than 0.25 (for liquid clouds) or 0.35 (for ice clouds). The Validation Plan defines a predictive model with satisfactory performance as one where this goal is met for 65% of cloudy pixels. 
Note that this mission goal only applies to cases where the true original COT value is 5 or more (log10 COT &gt; 0.7).</p></div>
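<div xmlns="http://www.tei-c.org/ns/1.0"><p>The regression metrics and the FMG criterion can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used for the reported tables. In particular, the relative error is assumed here to be the absolute difference between predicted and true original COT divided by the true original COT; the thresholds (0.25 for liquid, 0.35 for ice) and the restriction to true original COT of 5 or more follow the text. Function and argument names are hypothetical.</p>
<p><![CDATA[
```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


REL_ERROR_LIMIT = {"liquid": 0.25, "ice": 0.35}


def fmg(log_cot_true, log_cot_pred, phases, min_original_cot=5.0):
    """Fraction of eligible cloudy pixels meeting the PACE goal.

    Inputs are log10 COT values; the relative error is evaluated on the
    original (linear) COT scale, which is an assumption of this sketch.
    """
    eligible = 0
    met = 0
    for t, p, phase in zip(log_cot_true, log_cot_pred, phases):
        true_cot = 10.0 ** t
        pred_cot = 10.0 ** p
        if true_cot < min_original_cot:
            continue  # the mission goal does not apply to thinner clouds
        eligible += 1
        rel_error = abs(pred_cot - true_cot) / true_cot
        if rel_error < REL_ERROR_LIMIT[phase]:
            met += 1
    return met / eligible if eligible else float("nan")


# Hypothetical pixels: the last one is excluded (true original COT = 1).
y_true = [1.0, 1.0, 2.0, 0.0]
y_pred = [1.0, 1.2, 2.1, 0.5]
phases = ["liquid", "liquid", "ice", "liquid"]
score = fmg(y_true, y_pred, phases)
```
]]></p>
<p>In this toy example, three pixels are eligible and two of them meet their phase-specific threshold, so the FMG is 2/3, which would satisfy the 65% requirement.</p></div>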
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Comparison with Baseline Models</head><p>Tables <ref type="table">2</ref> and <ref type="table">3</ref> present a comprehensive evaluation of our proposed model MT-HCCAR against two baseline methods, demonstrating superior performance across both classification and regression metrics. Specifically, in the assessment of classification tasks, <formula notation="TeX">ACC_{bi}</formula> pertains to cloud masking performance, while <formula notation="TeX">AU(PRC)_w</formula> and <formula notation="TeX">AU(PRC)_c</formula> quantify the performance of both cloud masking and cloud phase classification.</p><p>Upon close examination of Table <ref type="table">2</ref>, the RF classifier by <ref type="bibr">[17]</ref> attains superior <formula notation="TeX">ACC_{bi}</formula>, <formula notation="TeX">AU(PRC)_w</formula>, and <formula notation="TeX">AU(PRC)_c</formula> in comparison to the SEQ model and the simplest MTL-based model, MT-CR, within the ablation study. However, MTL-based models featuring the HC module, including MT-HCR, MT-HCCR, and MT-HCCAR, surpass the performance of the RF classifier in these metrics. Shifting the focus to the results of the COT regression task in Table <ref type="table">3</ref>, the MLP-based baseline method introduced by <ref type="bibr">[8]</ref> produces results on par with the SEQ model but significantly inferior to MTL-based models. This finding emphasizes the effectiveness of MTL-based models for COT regression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Ablation Study</head><p>The ablation study with five models from SEQ to MT-HCCAR for OCI, as depicted in the lower part of Tables <ref type="table">2</ref> and <ref type="table">3</ref>, highlights the effectiveness of our model in improving both classification and regression performance. The comparative analysis demonstrates significant improvements across all metrics from SEQ to MT-CR, which shows the usefulness of the MTL structure. For MTL methods, MT-HCR outperforms MT-CR in binary classification (cloudy, cloud-free) and refines liquid and ice cloud phase classification, leading to enhancements in regression metrics. This supports the effectiveness of the HC module. Moreover, despite focusing on regression, the introduction of CAR in MT-HCCAR maintains the performance of classification tasks. The efficacy of the CAR module is further validated by comparing MT-HCCAR to MT-HCR, indicating improvements in all metrics. Additionally, the adoption of the cross-attention module is substantiated by superior performance in MT-HCCAR compared to MT-HCCR, emphasizing its role in facilitating information exchange between hierarchical classification and regression.</p><p>To confirm the generalizability of our proposed model, we conducted a further ablation study with VIIRS and ABI. As shown in Table <ref type="table">4</ref>, the application of MTL-based models reveals a similar trend of performance enhancement when integrating HC and CAR modules. Notably, MT-HCCAR achieves the highest performance across both classification and regression tasks among the MTL models across OCI, VIIRS, and ABI datasets, underscoring the generalization capabilities of the introduced model MT-HCCAR. 
The detailed results of cross-validation and model selection are in the supplementary material.</p><p>To visually illustrate the improvements facilitated by the HC and CAR modules, Figure <ref type="figure">3</ref> presents a scatter plot showing the distribution of true COT values against predicted COT values for all instances in the test set. Integration of HC and CAR modules results in more accurate predictions compared to MT-CR and MT-HCR. Notably, the predicted Probability Density Function (PDF) aligns closely with the true distribution, indicating improved fidelity to actual COT values. Additionally, scatter points cluster more tightly along the diagonal line, confirming the model's enhanced precision when HC and CAR modules are incorporated into the model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Earth Science Evaluation</head><p>Quality of performance. Satellite cloud masks are generally evaluated through comparison against ground-based, airborne, or spaceborne observations from radar/lidar sensors that have much greater sensitivity for cloud detection than the imagers in question. Accuracies tend to depend on the surface type and illumination conditions; for daytime scenes such as those simulated here, reported accuracies for NASA's widely-used MODIS cloud mask are 0.850, 0.778, and 0.894. Chen et al. <ref type="bibr">[8]</ref> developed an NN cloud mask for MODIS based on RT simulations and found an accuracy of 0.985 on training data but lower accuracies in the range of 0.739-0.885 on real data depending on region and time of year. Although real-world conditions are more complex than a simulation, the results suggest our models are competitive with current approaches. Note that cloud phase is less commonly validated in this way. Obtaining a validation-grade measurement of COT presents significant challenges, with reference datasets lacking comprehensive coverage <ref type="bibr">[19]</ref>. Consequently, existing comparisons primarily focus on evaluating agreement among different remotely sensed datasets. Our network's performance suggests its potential to fulfill the objectives outlined in the Validation Plan <ref type="bibr">[4]</ref>. Notably, PACE's allowance for larger uncertainty in ice cloud measurements stems from the anticipated difficulties arising from variations in ice crystal properties in real-world scenarios. However, our results show that the uncertainty in ice COT can be smaller than that for liquid COT. Traditional methods often assume specific ice crystal properties, resulting in increased uncertainty in retrieved ice COT. 
Our findings indicate that satellite measurements inherently contain information about these properties, which neural networks are proficient at learning.</p><p>Performance of different satellite sensors. We also compared how the models perform for different satellite sensors based on the results in Tables 2, 3 and 4. The classification tasks exhibit high accuracy across all sensors, with virtually indistinguishable performance from a scientific perspective. The fundamental nature of cloud masking in Earth science prompts many sensors to incorporate a common set of bands proven to be effective for this purpose, with additional bands often designed for diverse applications. For instance, OCI's hyperspectral bands support the measurement of different ocean plankton species <ref type="bibr">[28]</ref>, revealing subtle spectral differences not readily discernible in multispectral data. In the regression task, OCI and VIIRS outperform ABI, indicating the utility of additional bands in predicting COT. The unexpectedly slightly superior performance of VIIRS over OCI may suggest that OCI's substantially larger feature space makes finding an optimal solution during training more challenging or could be attributed to stochastic variation. Adding training epochs for OCI may lead to better results, considering the number of bands.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>In this study, we present MT-HCCAR, an end-to-end multi-task learning model tailored for cloud property retrieval on a simulated OCI satellite dataset in the PACE project, tackling tasks including cloud masking, cloud phase classification, and COT prediction. The model is applied to simulated datasets for three sensors (OCI, VIIRS, and ABI) to examine its generalization. Comparative analyses against two baseline methods and ablation studies underscore the effectiveness of the HC module and the CAR module, enhancing performance in both classification and regression tasks. The ablation study establishes MT-HCCAR's superior performance across different datasets and multiple evaluation metrics. The positive results suggest our model's capability to address real-world challenges in cloud property retrieval and other multi-task applications. Future research endeavors will involve applying the model to spatial or temporal OCI images, co-located in space and time with VIIRS and ABI post-PACE launch, to assess its performance and consistency and to enable detailed comparisons with deep learning models and non-machine-learning approaches.</p><p>from the National Science Foundation (NSF) and grant 80NSSC21M0027 from the National Aeronautics and Space Administration (NASA).</p></div></body>
		</text>
</TEI>
