<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>CollabLearn: An Uncertainty-Aware Crowd-AI Collaboration System for Cultural Heritage Damage Assessment</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/09/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10309195</idno>
					<idno type="doi">10.1109/TCSS.2021.3109143</idno>
					<title level='j'>IEEE Transactions on Computational Social Systems</title>
<idno type="issn">2373-7476</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yang Zhang</author><author>Ruohan Zong</author><author>Ziyi Kou</author><author>Lanyu Shang</author><author>Dong Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Cultural heritage sites are precious and fragile resources that hold significant historical, aesthetic, and social values in our society. However, increasingly frequent and severe natural and man-made disasters repeatedly strike cultural heritage sites and cause significant damage. In this paper, we focus on a cultural heritage damage assessment (CHDA) problem where the goal is to accurately locate the damaged area of a cultural heritage site using the imagery data posted on social media during a disaster event by exploring the collective strengths of both AI and human intelligence from crowdsourcing systems. Unlike other infrastructure-based solutions, social media platforms provide a more pervasive and scalable solution to acquire timely cultural heritage damage information during disaster events. Our work is motivated by the limitation of current AI solutions that fail to accurately model the complex cultural heritage damage due to the lack of essential human cultural knowledge to differentiate various damage types and identify the actual causes of the damage. Two critical technical challenges exist in solving our problem: i) it is challenging to effectively detect the problematic cultural heritage damage estimations of AI in the absence of ground truth labels; ii) it is non-trivial to acquire accurate cultural background knowledge from the potentially unreliable crowd workers to effectively address the failure cases of AI. To address the above challenges, we develop CollabLearn, an uncertainty-aware crowd-AI collaborative assessment system that explicitly explores the human intelligence from crowdsourcing systems to identify and fix AI failure cases and boost the damage assessment accuracy in CHDA applications. 
The evaluation results on real-world datasets show that CollabLearn consistently outperforms both the state-of-the-art AI-only and crowd-AI hybrid baselines in accurately assessing the damage of several world-renowned cultural heritage sites in recent disaster events.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>constantly strike cultural heritage sites and cause significant damage <ref type="bibr">[2]</ref>. For example, thousands of heritage places in Syria have suffered significant damage from conflict, looting, and the cessation of official protection since 2011. This paper focuses on an emerging application, cultural heritage damage assessment (CHDA), that aims to protect and conserve cultural heritage sites. The objective of CHDA applications is to accurately locate the damaged areas of a cultural heritage site by exploring the imagery data posted on social media during a disaster event. Unlike other infrastructure-based solutions (e.g., using surveillance cameras, drones, and satellites), social media platforms provide an infrastructure-free solution that is more pervasive and scalable for acquiring timely damage information about cultural heritage sites during disaster events <ref type="bibr">[3]</ref>-<ref type="bibr">[5]</ref>. The assessment information can then be leveraged by government agencies and organizations to take conservation and recovery actions at the sites and save them from further damage.</p><p>Recent progress in AI and image processing has been made towards addressing the disaster damage assessment (DDA) problem <ref type="bibr">[3]</ref>, <ref type="bibr">[6]</ref>-<ref type="bibr">[11]</ref>. In particular, deep learning based DDA solutions significantly reduce labeling costs while providing reasonable assessment accuracy compared to traditional domain-expert-based solutions <ref type="bibr">[7]</ref>. Compared to the DDA problem, which primarily focuses on identifying disaster-related damage from social media images, the CHDA problem is more challenging due to the high complexity of cultural heritage damage and the lack of cultural background knowledge in AI-based DDA solutions <ref type="bibr">[12]</ref>. Fig. 
<ref type="figure">1</ref> shows a few examples of failure scenarios when current AI-based solutions are applied to assess the damaged areas of cultural heritage sites. For example, the damage areas detected by AI algorithms for the tower in (A), the stone lions in (B), and the castle in (C) are actually part of the cultural and artistic designs that are often observed at cultural heritage sites <ref type="bibr">[13]</ref>. Meanwhile, the damage areas detected by AI algorithms for the stair flight in (D), the stone wall in (E), and the tiles in (F) are caused by long-term aging effects, which AI algorithms often confuse with damage caused by recent disasters. In contrast, humans are often observed to perform better at identifying the damage of cultural heritage sites where AI solutions fail. The reason is intuitive: humans normally have certain cultural background knowledge and a reasonable understanding of the complex scenes in cultural heritage sites, which together help them make a better judgment in CHDA applications. However, solutions that fully depend on human efforts are expensive in terms of both time and cost, and are not scalable enough to handle the large amount of social media data inputs during disaster events <ref type="bibr">[6]</ref>.</p><p>(Fig. 1 caption: The areas in red color indicate the damage areas detected by the deep learning based damage assessment scheme for CHDA.)</p><p>In this paper, we develop an integrated crowd-AI collaboration system to solve the cultural heritage damage assessment problem by exploring the collective strength of both AI and human intelligence. In particular, our goal is to achieve a win-win objective between AI and human intelligence by effectively leveraging the high detection efficiency of AI solutions to automatically process the vast amount of cultural heritage site images and explicitly exploring human intelligence to identify and fix the failure cases of AI in CHDA applications. 
To obtain timely and scalable human intelligence, we leverage widely adopted open crowdsourcing platforms (e.g., Amazon Mechanical Turk (AMT)), which offer a large pool of crowd workers who are available 24/7 at reasonable cost <ref type="bibr">[14]</ref>. We refer to the human intelligence acquired from the crowdsourcing platform as crowd intelligence (CI). The design of such a crowd-AI collaboration system is a non-trivial task due to two critical technical challenges that are elaborated below.</p><p>Identification of AI Failure Cases. The first challenge lies in how to accurately identify the failure cases of AI damage assessment solutions without knowing the ground truth labels of images a priori. One straightforward solution to this problem is to directly ask the crowd workers to examine every output of the AI solutions to identify and fix the failure cases shown in Fig. <ref type="figure">1</ref>. However, such an approach is impractical due to the heavy labor costs and low efficiency, especially in the context of massive social media data inputs. Some initial efforts were made to address this issue by only selecting the imagery data with complicated image properties (e.g., images with complex contents and color distributions) for crowd labeling, under the assumption that AI solutions are more likely to fail when the image is complex <ref type="bibr">[15]</ref>. However, such an assumption does not always hold for cultural heritage damage, as AI may also fail when the input image is relatively simple (e.g., the color distributions in Fig. 1(D) are quite simple). Recent work on uncertainty-aware AI solutions (e.g., query-by-committee, dropout) could also potentially be applied to detect the failure cases of AI <ref type="bibr">[11]</ref>, <ref type="bibr">[16]</ref>. 
Those approaches often leverage a committee of different AI models or different instances of the same model to identify the problematic cases based on the consensus from the outputs of the committee members. However, those approaches will fail when all members of the committee happen to make similar mistakes on the same input <ref type="bibr">[17]</ref>. Therefore, effectively detecting the failure cases of AI in the absence of ground truth labels remains a challenging question in CHDA applications.</p><p>Imperfect Crowd Intelligence. The second challenge lies in how to acquire accurate crowd intelligence from potentially unreliable crowd workers to fix the failure cases of AI. Unlike labels annotated by domain experts, labels from crowd workers can be uncertain and inconsistent <ref type="bibr">[18]</ref>. Such inconsistency is especially salient in CHDA applications due to the intricate nature of cultural heritage site damage. For example, in Fig. <ref type="figure">2</ref>, we observe that the damage areas identified by different crowd workers are not always consistent. In particular, workers 1 and 2 in (B) and (C) believe the stone pillar is damaged, while worker 3 in (D) thinks the stone pillar remained intact during the disaster event. Such uncertain and inconsistent crowd labels present a critical challenge to current active learning based AI systems that rely on accurate human labels to troubleshoot and retrain the AI models to optimize model performance <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>. In particular, the imperfect crowd intelligence could potentially collapse the AI model during the model retraining process <ref type="bibr">[21]</ref>. Several recent efforts have been made on training AI models with imperfect labels <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>. 
However, those models are designed for structural image processing tasks (e.g., segmenting structural magnetic resonance imaging data) with limited errors in the training labels annotated by domain experts, and they cannot be directly applied to handle complex social media images with uncertain and inconsistent crowd labels. Therefore, leveraging the imperfect crowd intelligence to effectively address the failure cases of the AI model remains a non-trivial question in CHDA applications.</p><p>To address the above challenges, we develop CollabLearn, an uncertainty-aware crowd-AI collaboration system that explicitly explores the imperfect crowd intelligence to identify and fix the AI failure cases in CHDA applications. In particular, CollabLearn jointly models the uncertainty from both AI and crowd intelligence under a unified framework to solve the CHDA problem. To address the first challenge, we develop an uncertainty-aware deep damage assessment model to quantify the uncertainty of the estimated damage areas and detect the failure cases of AI. To address the second challenge, we design a novel crowd-AI fusion model that integrates the uncertainty of both AI models and crowd responses into a holistic estimation framework that addresses the failure cases of AI and improves the overall damage assessment accuracy in CHDA. To the best of our knowledge, CollabLearn is the first integrated crowd-AI collaboration system that explicitly explores the collective power of uncertain AI models and imperfect crowd intelligence under the same analytical framework to address the CHDA problem.</p><p>We evaluate CollabLearn using a set of real-world CHDA datasets from seven world-renowned cultural heritage sites that were recently damaged. 
The evaluation results show that CollabLearn consistently outperforms both state-of-the-art AI-only approaches and crowd-AI baselines in correctly identifying cultural heritage damage across diverse types of cultural heritage sites and evaluation scenarios. We summarize our main contributions as follows:</p><p>&#8226; We study an important cultural heritage damage assessment (CHDA) problem that aims to protect and conserve cultural heritage sites by exploring the collective power of uncertain AI models and imperfect crowd intelligence.</p><p>&#8226; We develop CollabLearn, the first uncertainty-aware crowd-AI collaboration system in CHDA applications to address two important technical challenges, i.e., identification of AI failure cases and imperfect crowd intelligence, under a unified analytical framework. &#8226; We perform extensive experiments to evaluate CollabLearn through real-world case studies from seven world-renowned and recently damaged cultural heritage sites, and the results demonstrate clear performance gains of our CollabLearn scheme compared to state-of-the-art baselines.</p><p>The rest of this paper is organized as follows: we first review the related work in Section II. In Section III, we formally define our crowd-AI cultural heritage damage assessment problem. The proposed CollabLearn framework is elaborated in Section IV. Experiments and evaluation results are presented in Section V. Finally, we conclude the paper in Section VI.</p></div>
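The uncertainty-driven routing of suspected AI failure cases to crowd workers discussed in the introduction can be sketched in a few lines. This is an illustrative sketch only, in the spirit of the dropout-based committee approaches cited above, and not the paper's actual UDDA model; all function names and the 0.05 threshold are assumptions.

```python
from statistics import pvariance

def pixel_uncertainty(prob_maps):
    """Per-pixel predictive variance across T stochastic (e.g.,
    dropout-enabled) forward passes of a damage segmentation model.
    prob_maps: list of T flat lists of per-pixel damage probabilities."""
    # Group the T predictions for each pixel, then take the variance.
    return [pvariance(p) for p in zip(*prob_maps)]

def select_for_crowd(images_probs, threshold=0.05):
    """Return indices of images whose mean per-pixel uncertainty exceeds
    the threshold -- the suspected AI failure cases to send in a crowd
    query (threshold is an illustrative assumption)."""
    flagged = []
    for idx, prob_maps in enumerate(images_probs):
        u = pixel_uncertainty(prob_maps)
        if sum(u) / len(u) > threshold:
            flagged.append(idx)
    return flagged
```

As the introduction notes, such committee-style signals can still miss failures when all stochastic passes agree on the same wrong answer, which is precisely the gap the crowd-AI collaboration is meant to close.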
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Crowdsourcing</head><p>Crowdsourcing has emerged as a new application paradigm where individual workers collaborate to address challenging problems <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>. Examples of crowdsourcing applications include enhancing driver situation awareness using participatory sensing <ref type="bibr">[26]</ref>, monitoring infectious disease outbreaks using real-time mobile crowdsensing <ref type="bibr">[27]</ref>, detecting ongoing cyber-attacks using social media feeds <ref type="bibr">[28]</ref>, and obtaining situational awareness in disaster response using social sensing <ref type="bibr">[29]</ref>. A comprehensive summary of crowdsourcing applications can be found in <ref type="bibr">[30]</ref>. Several key challenges exist in current crowdsourcing applications, including data reliability, incentive design, data scarcity, human-computer interaction, and privacy protection <ref type="bibr">[31]</ref>-<ref type="bibr">[34]</ref>. However, leveraging the imperfect crowd intelligence to identify and fix the AI failure cases in CHDA applications remains a challenging problem in crowdsourcing applications. In this paper, we develop the CollabLearn scheme to address this problem by designing a novel crowd-AI collaboration system to boost the CHDA performance. Our work is also related to recent efforts in obtaining reliable information from unreliable crowdsourced data <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>. However, those solutions primarily focus on fusing the labels from different crowd workers and do not explore the collaboration between crowd intelligence and AI, which often leads to suboptimal system performance <ref type="bibr">[37]</ref>. 
In contrast, our CollabLearn develops a unified analytical framework to explicitly model the uncertainty of both AI models and crowd responses to address the failure cases of AI and optimize the performance of CHDA applications.</p></div>
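The label-fusion approaches mentioned above, which combine the marks of different crowd workers, can be illustrated with a minimal per-pixel voting sketch. The confidence weighting below loosely mirrors the idea of confidence-aware fusion but is an illustrative assumption, not the paper's ICKF estimator; the function name is hypothetical.

```python
def fuse_crowd_masks(masks, confidences=None):
    """Fuse N binary damage masks (flat lists of 0/1 pixel labels) from
    N crowd workers into one mask.

    With no confidences this is plain majority voting; with per-worker
    self-reported confidences in [0, 1], each worker's vote is weighted
    accordingly (an assumed, simplified weighting scheme)."""
    n = len(masks)
    w = confidences if confidences is not None else [1.0] * n
    fused = []
    for pixel_votes in zip(*masks):  # the N votes for one pixel
        score = sum(wi * v for wi, v in zip(w, pixel_votes))
        # Mark the pixel damaged if the weighted "damaged" votes exceed
        # half the total weight (ties resolve to "intact").
        fused.append(1 if score * 2 > sum(w) else 0)
    return fused
```

For example, with three workers, weighting the votes by confidence lets a single confident worker outvote two unconfident ones, which plain majority voting cannot do.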
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Social Media based Damage Assessment</head><p>Previous efforts have been made to address the damage assessment problem using social media data <ref type="bibr">[3]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[38]</ref>-<ref type="bibr">[42]</ref>. For example, Li et al. proposed a deep convolutional network approach to classify the severity levels of the damage based on social media images during natural disasters <ref type="bibr">[7]</ref>. Mouzannar et al. developed a deep learning framework that utilizes heterogeneous social media data to obtain situation awareness in disaster response via multimodal convolutional neural networks <ref type="bibr">[38]</ref>. Kumar et al. designed an end-to-end social media image processing and analytical model to identify disaster damage images on social media using deep neural networks <ref type="bibr">[3]</ref>. However, those approaches cannot be directly applied to solve our CHDA problem due to the complex nature of cultural heritage damage and the lack of cultural background knowledge in those AI-based solutions. There also exist a couple of initial efforts that leverage human intelligence to identify and address the failure cases of AI in disaster damage assessment <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>. However, those human-assisted AI systems often rely on accurate human labels to troubleshoot and retrain the AI models to optimize model performance. The inconsistent and uncertain crowd labels could cause a potential model collapse in the retraining process of those models. In contrast, this paper explores the uncertainty of both AI models and crowd responses and integrates them into a holistic uncertainty-aware estimation framework to address the failure cases of AI in CHDA applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Crowd-AI Hybrid Systems</head><p>Our work belongs to the growing trend of designing crowd-AI hybrid systems to solve complex real-world problems <ref type="bibr">[11]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[43]</ref>-<ref type="bibr">[46]</ref>. For example, Jarrett et al. developed an elastic crowd-AI learning framework that introduces a task complexity index to optimize the integration of AI and crowd intelligence to improve the overall task performance in a mobile face recognition application <ref type="bibr">[15]</ref>. Sener et al. proposed a deep core-set selection approach that collects crowd labels from a subset of representative images to retrain the AI models to improve the overall accuracy in natural scene image classification tasks <ref type="bibr">[43]</ref>. Zhang et al. designed a crowd-AI hybrid system that leverages crowd intelligence to retrain the AI models and combines crowd labels with AI outputs to troubleshoot and tune the performance of AI algorithms in disaster damage assessment applications <ref type="bibr">[11]</ref>. Guo et al. designed a crowd-AI hybrid question-answering system in smart home applications by analyzing the content from camera streams captured by smart IoT devices <ref type="bibr">[44]</ref>. Yang et al. proposed an interactive framework that leverages crowdsourcing platforms and a deep probabilistic model to denoise the data in movie reviews and news articles <ref type="bibr">[45]</ref>. Current crowd-AI solutions often rely on a committee of different AI models to identify the problematic cases when those models do not agree with each other. However, those approaches are likely to fail when all members of the committee happen to make similar mistakes on the same input due to the lack of cultural background knowledge <ref type="bibr">[12]</ref>. 
More importantly, we observe that current crowd-AI approaches often retrain the AI models with additional labels from the crowd workers to improve their performance. However, we find that such a retraining mechanism does not work well with the imperfect labels on cultural heritage damage images obtained from the crowd due to the complex nature of cultural heritage damage, which can be easily confused with specific cultural and artistic designs and long-term aging effects that are often observed at cultural heritage sites. In contrast, CollabLearn is the first crowd-AI collaboration system that explicitly explores the collective power of uncertain AI models and imperfect crowd intelligence to boost the assessment accuracy in CHDA applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Deep Learning-based Image Processing and Analytics</head><p>Our work is also related to deep learning techniques that automate intelligent image processing and analytics in many real-world applications <ref type="bibr">[47]</ref>. For example, Wang et al. proposed a semantic re-ranking framework that leverages the deep features extracted by convolutional neural networks to improve the sketch-based image retrieval performance <ref type="bibr">[8]</ref>. Ronneberger et al. designed a skip-connected convolutional neural network that utilizes both contracting and expanding paths to enable cross-layer information transmission for biomedical image segmentation <ref type="bibr">[48]</ref>. Xie et al. proposed an image classification framework to classify the building damage status during a natural disaster from satellite radar images via ensemble models and deep learning networks <ref type="bibr">[9]</ref>. Zhu et al. developed a multimodal hypergraph learning approach that leverages vertices and hyperedges in hypergraphs to capture the complex similarities between different landmarks in content-based landmark image searching <ref type="bibr">[10]</ref>. Li et al. proposed a deep feature aggregation framework that aggregates discriminative features from different sub-networks to achieve fast model convergence for semantic image segmentation <ref type="bibr">[49]</ref>. While the above solutions focus on developing deep learning models to optimize the performance of specific applications, they are not designed to accurately detect the failure cases of the deep learning models in the absence of ground truth labels. In contrast, CollabLearn designs an uncertainty-aware deep damage assessment network to accurately quantify the uncertainty of the estimated results and detect the failure cases of the deep learning models in CHDA applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Social Computing</head><p>Our work is also related to recent advances in social computing, which have been successfully applied in many application domains such as human-robot interaction, Internet-of-Things (IoT), public health, and information diffusion <ref type="bibr">[50]</ref>-<ref type="bibr">[53]</ref>. For example, Erol et al. proposed an affection-based perception system that enables social robots to recognize human emotion states to improve personalization in human-robot interaction <ref type="bibr">[50]</ref>. Liu et al. introduced an edge-cloud collaborative computing system to improve energy efficiency and reduce system latency in face detection and recognition using field-programmable gate array-based CNN accelerators <ref type="bibr">[51]</ref>. Zhu et al. developed an attentive deep recurrent framework for daily mental-state monitoring of depression patients by examining the dynamics of human blood vascular systems using photoplethysmography <ref type="bibr">[52]</ref>. Dong et al. designed a social media information flow model to track the information spread during disaster events and study the influence of different social media user groups on disaster information dissemination <ref type="bibr">[53]</ref>. To the best of our knowledge, CollabLearn is the first crowd-AI collaboration system that explicitly explores the uncertainty of both AI and crowd responses in a unified analytical framework to address a real-world issue that has an important social impact: cultural heritage protection and conservation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROBLEM DESCRIPTION</head><p>In this section, we formally define our crowd-AI cultural heritage damage assessment problem. We first define a few key terms used in the problem formulation.</p><p>Definition 1: Cultural Heritage Damage Images (X): We define X to be the set of cultural heritage damage images posted on social media (e.g., Twitter), where each image captures a specific scene of a damaged cultural heritage site as shown in Fig. <ref type="figure">3</ref>. In particular, we collect the cultural heritage damage images from social media sites using a cultural heritage imagery data crawler tool <ref type="bibr">[3]</ref>, where each collected image contains the damage of a cultural heritage site from a recent damaging event (e.g., disaster, war). In addition, we define X = {X_1, X_2, ..., X_A} as the set of collected cultural heritage damage images, where A represents the number of collected images.</p><p>Definition 2: Actual Damage Area (D): We define D as the actual damage areas in cultural heritage site images (e.g., the red color areas as shown in Fig. <ref type="figure">3</ref>). In particular, we define D = {D_1, D_2, ..., D_A} to represent the actual damage areas in all collected images, where D_a represents the actual damage area of the a-th image.</p><p>Definition 3: Estimated Damage Area by AI (D̂^AI): We define D̂^AI as the damage areas estimated by the AI module of the crowd-AI collaboration system for the cultural heritage site images. In particular, we define D̂^AI_a as the estimated damage area of the a-th image.</p><p>Definition 4: Marked Damage Area by Crowd (D̂^CI): We define D̂^CI as the damage area annotated by the crowd workers from the crowdsourcing platforms (e.g., Amazon Mechanical Turk). 
In particular, we define D̂^CI_a to represent the marked damage area by a crowd worker for the cultural heritage image X_a.</p><p>Definition 5: Crowd Query (Q): We define a crowd query to be a crowdsourcing task where our crowd-AI collaboration system decides to send a set of cultural heritage damage images to the crowdsourcing platform, where each image in the crowd query is marked by a set of N crowd workers on the damage area in the image as follows: <formula notation="TeX">\widehat{D}^{CI}_{a} = \{\widehat{D}^{CI}_{a}(1), \widehat{D}^{CI}_{a}(2), ..., \widehat{D}^{CI}_{a}(N)\}</formula></p><p>where D̂^CI_a(n) indicates the damage area marked by the n-th crowd worker for the image X_a. We note that the damage areas marked by different crowd workers in each crowd query could be uncertain and inconsistent due to the complex nature of the cultural heritage damage and the uncertainty of the crowd workers, as shown in Fig. <ref type="figure">2</ref>.</p><p>Definition 6: Crowd Query Ratio (&#952;): We define &#952; to be an application-specific parameter that specifies the percentage of cultural heritage damage images that are sent in a crowd query, which is often decided by the performance-budget tradeoff of a CHDA application. In other words, a total of &#952; &#8226; A images will be sent for the crowd to mark in a crowd query.</p><p>Definition 7: Identified Damage Area by Crowd-AI Collaboration System (D̂): We define D̂ to be the final identified damage area from our crowd-AI collaboration system, leveraging both the estimated damage areas generated by the AI module (D̂^AI) and the marked damage areas returned by crowd queries (D̂^CI). In particular, we define D̂_a to represent the final identified damage area for the collected image X_a.</p><p>The goal of our paper is to accurately assess the damage of cultural heritage sites by identifying the damage areas in images of the sites through the collective intelligence of both AI and crowd. 
Given the above definitions, we formally define our problem as follows: <formula notation="TeX">\arg\max_{\widehat{D}} \sum_{a=1}^{A} \Gamma(\widehat{D}_{a}, D_{a})</formula></p><p>where &#915;(&#8226;) represents the quantitative metrics (e.g., IoU and DSC <ref type="bibr">[54]</ref>) that measure the similarity between the identified and actual damage areas (D̂_a and D_a) of an image. This problem is challenging due to the difficulty of effectively detecting the failure cases of AI in the absence of ground truth labels and the imperfect knowledge obtained from the crowdsourcing platform. In this paper, we develop the CollabLearn framework to address these challenges, as elaborated in the next section.</p><p>IV. SOLUTION</p></div>
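The similarity metrics Γ(·) used in the problem formulation, IoU and DSC, are standard overlap measures for binary masks. A straightforward reference implementation (a generic sketch over flattened 0/1 pixel lists, not the paper's evaluation code):

```python
def iou(pred, truth):
    """Intersection over Union between two binary damage masks,
    given as flat lists of 0/1 pixel labels."""
    inter = sum(p & t for p, t in zip(pred, truth))
    union = sum(p | t for p, t in zip(pred, truth))
    return inter / union if union else 1.0  # both empty: perfect match

def dsc(pred, truth):
    """Dice similarity coefficient: 2*|A ∩ B| / (|A| + |B|)."""
    inter = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * inter / total if total else 1.0
```

Both scores lie in [0, 1] and equal 1 only when the identified damage area exactly matches the actual one; DSC weights the intersection more heavily than IoU, so it is the more forgiving of the two.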
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Overview of CollabLearn Framework</head><p>CollabLearn is an uncertainty-aware crowd-AI collaboration system that addresses the cultural heritage damage assessment problem formulated above. The overview of CollabLearn is shown in Fig. <ref type="figure">4</ref>. In particular, it consists of two modules:</p><p>&#8226; Uncertainty-aware Deep Damage Assessment (UDDA):</p><p>First, the UDDA module designs a novel deep damage assessment model to accurately quantify the uncertainty of the estimated damage areas and detect the failure cases of AI in CHDA applications. In particular, the UDDA module designs a duo-branch deep estimation network that contains two parallel output branches to simultaneously generate the damage area estimations together with the quantification of the estimation uncertainty under a unified network architecture. More importantly, to ensure the accuracy of the uncertainty quantification, we design an uncertainty-aware loss function to model the error of the estimated damage area and accurately quantify the uncertainty of the estimation within the deep network optimization process. &#8226; Imperfect Crowd Knowledge Fusion (ICKF): Second, the ICKF module develops a confidence-aware estimation framework to explicitly model the uncertainty of both AI models and crowd responses to address the failure cases of AI and optimize the performance of CollabLearn. In particular, the ICKF module first designs a novel crowd annotation portal on Amazon Mechanical Turk that allows the crowd workers to document their confidence in their marked damage areas, which is essential for obtaining accurate crowd intelligence given the complex cultural heritage damage and the unvetted nature of the crowd workers. 
More importantly, we further design a novel confidence-aware maximum likelihood estimation model to leverage the inconsistent crowd responses with different confidence levels to derive the accurate damage areas of cultural heritage sites and fix the failure cases of AI.</p><p>B. Uncertainty-Aware Deep Damage Assessment (UDDA)</p><p>In this subsection, we present the uncertainty-aware deep damage assessment network architecture in CollabLearn to estimate the damage area in each cultural heritage image and quantify the uncertainty of the estimation results. In particular, our uncertainty-aware deep damage assessment network architecture consists of two network components: an encoder network (EN) and an assessment network (AN). The EN is first used to extract both high-level (e.g., objects and patterns) and low-level (e.g., colors and textures) damage-related visual features from the cultural heritage images. The AN is then used to explicitly identify the damage areas and quantify the uncertainty of the estimation results using the multi-level visual features extracted by EN. To the best of our knowledge, the UDDA is the first end-to-end AI-based damage assessment approach that designs a multi-branch uncertainty estimation network architecture to detect the failure cases of AI in CHDA applications in the absence of ground truth labels.</p><p>We first define a key concept for our UDDA module as follows:</p><p>Definition 8: Damage Estimation Uncertainty Matrix (M): We first consider the error between the actual and estimated damage area by AI as follows: <formula notation="TeX">\epsilon_{a} = L_{CE}(D_{a}, \widehat{D}^{AI}_{a})</formula></p><p>where L_CE represents the cross-entropy loss <ref type="bibr">[55]</ref> that measures the error between the actual and estimated damage area of a cultural heritage image, D_a is the actual damage area for an image X_a (Definition 2), and D̂^AI_a is the estimated damage area for X_a (Definition 3). 
We observe that such an error often follows a Gaussian distribution <ref type="bibr">[56]</ref>:</p><p>where M a represents the estimation uncertainty matrix that indicates the standard deviation of the cross-entropy loss at all pixels for the cultural heritage damage image X a . Specifically, we define M = {M 1 , M 2 , ..., M A } to be the set of damage estimation uncertainty matrices for all cultural heritage images in a CHDA application.</p><p>Given the above definition, we formally define the encoder network EN and the assessment network AN in our UDDA module as follows:</p><p>Definition 9: Encoder Network (EN): We define EN as a mapping network that extracts multi-level damage-related visual features from the cultural heritage imagery data as follows:</p><p>where V X represents the extracted damage-related visual features. We show an example of EN in part (A) of Fig. <ref type="figure">5</ref>. It contains a stack of ImageNet pre-trained convolutional layers for damage-related visual feature extraction. This is done to ensure the mapping network is capable of accurately identifying the complex visual features of an input cultural heritage image. In addition, we enable skip connections in the encoder network (i.e., the dotted lines in Fig. <ref type="figure">5</ref>), which are used to forward different levels of damage-related visual features extracted by EN to AN . The different levels of visual features can then be utilized by AN to effectively identify the damage area of each cultural heritage image.</p><p>Definition 10: Assessment Network (AN): We define AN as a generation network that estimates the damage area for each cultural heritage image and infers the estimation uncertainty matrix using the damage-related visual features V X extracted by EN :</p><p>where D AI is the estimated damage area generated by the assessment network and M is the set of damage estimation uncertainty matrices defined above. We show an example of AN in part (B) of Fig. 
<ref type="figure">5</ref>. In particular, the assessment network consists of a set of deconvolutional layers that explicitly identify the damage area of an image by gradually examining the damage-related visual features. In addition, AN also includes a set of convolutional layers that fuse different levels of visual features extracted by EN through skip-connections. This is done to ensure both high-level (e.g., objects and patterns) and low-level (e.g., colors and textures) damage-related visual features are successfully forwarded from EN to AN for accurate damage area estimation. The key novelty of the AN lies in the parallel output branch design where each branch contains a convolutional layer and a sigmoid layer as shown in Fig. <ref type="figure">5</ref>. This design provides the estimation of the damage area together with the quantification of the estimation uncertainty under an end-to-end network architecture. Given the two network architectures above, our next question is how to define a loss function for our network to generate the damage assessment results and the estimation uncertainty matrix to quantify the accuracy of the results. To that end, we define two sets of loss functions in our model. In particular, we first consider the assessment loss for the EN and AN as follows:</p><p>where L Assess EN,AN represents the assessment loss function for EN and AN . L CE represents the cross-entropy loss that measures the difference between the actual and estimated damage area of cultural heritage images. The goal of this loss function is to check if AN can accurately estimate the damage area of images using the visual features captured by EN.</p><p>Next, recall that the difference between the actual and estimated damage area (i.e., L CE (AN (EN (X)), D)) follows the Gaussian distribution (i.e., N (0, M 2 )) in Definition 8. 
We can derive the log-likelihood function for L CE (AN (EN (X)), D) as follows:</p><p>Therefore, we define our uncertainty loss function as the negation of the log-likelihood function as follows:</p><p>By minimizing the loss function L Uncertain EN,AN in our deep damage assessment network, we can obtain the uncertainty matrices M that maximize the log-likelihood function log L(0, M ; L CE (AN (EN (X)), D)) defined above.</p><p>We then combine the above two sets of loss functions to derive the final loss function L Final EN,AN that generates the damage estimation results and the uncertainty matrix of the estimation for the UDDA module as follows:</p><p>where L Final EN,AN is a summation of L Assess EN,AN and L Uncertain EN,AN . For L Assess EN,AN , we follow the standard cross-entropy loss design that translates the matrix to a score by calculating the mean value of all the elements in the matrix. For L Uncertain EN,AN , we translate the matrix to a score by calculating the L2 norm of the matrix.</p><p>Using the above loss function, we can learn the optimal network instances (i.e., EN * , AN * ) using the RMSprop optimizer <ref type="bibr">[57]</ref>. Finally, we use EN * and AN * to estimate the damage areas and the estimation uncertainty matrices for all input cultural heritage damage images X as follows:</p><p>Given the estimated damage area and the associated uncertainty matrix learned by our UDDA module, our next step is to use them to determine the failure cases of the AI model and send the identified failure cases to the crowdsourcing platforms to obtain crowd intelligence. In particular, a higher uncertainty value in the uncertainty matrix indicates that the AI model is more uncertain about its estimation, and the estimated damage area is therefore more likely to be inaccurate. 
Therefore, we define an uncertainty score &#934; to determine which cultural heritage images should be added to the crowd query Q as follows:</p><p>Definition 11: Uncertainty Score &#934;: We define &#934; a to represent the uncertainty score of a cultural heritage image X a as follows:</p><p>where mean(&#8226;) indicates the mean value of all elements in a matrix and M a is the uncertainty matrix of the image X a . Finally, we sort the uncertainty scores of all cultural heritage images and add the top &#952; &#183; A ranked images to the crowd query Q (&#952; refers to the crowd query ratio in Definition 6 and A is the number of studied images). For the images that are not added to the crowd query Q, we use the damage area estimated by our AI module D AI as the output D for those images.</p></div>
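To make Definitions 8-11 concrete, the combined loss of Equation (10) and the uncertainty-score ranking of Definition 11 can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the function names, array shapes, and clipping constant are illustrative, and the additive constants of the Gaussian log-likelihood are dropped; it is not the paper's implementation.

```python
import numpy as np

def final_loss(d_est, m_est, d_true, eps=1e-7):
    """Sketch of the combined loss in Equation (10): assessment + uncertainty terms.

    d_est  -- per-pixel damage probability (H x W), first output branch of AN
    m_est  -- per-pixel uncertainty matrix M (H x W), second output branch of AN
    d_true -- ground-truth damage mask (H x W) with 0/1 entries
    """
    d_est = np.clip(d_est, eps, 1 - eps)
    # Per-pixel cross-entropy between actual and estimated damage area (L_CE).
    l_ce = -(d_true * np.log(d_est) + (1 - d_true) * np.log(1 - d_est))
    # Assessment loss: the paper reduces the CE matrix to its mean value.
    l_assess = l_ce.mean()
    # Uncertainty loss: negative Gaussian log-likelihood of the CE error under
    # L_CE ~ N(0, M^2) per pixel (Definition 8), constants dropped; the paper
    # reduces this matrix to a score via its L2 norm.
    l_uncertain = np.log(m_est) + l_ce ** 2 / (2 * m_est ** 2)
    return l_assess + np.linalg.norm(l_uncertain)

def select_crowd_query(uncertainty_matrices, theta):
    """Definition 11: rank images by mean uncertainty, pick the top theta * A."""
    scores = np.array([m.mean() for m in uncertainty_matrices])
    k = int(theta * len(uncertainty_matrices))
    return sorted(np.argsort(-scores)[:k].tolist())
```

A confident estimate (small cross-entropy error with small M) yields a low loss, while a large error paired with a small claimed uncertainty is penalized heavily, which is what forces M to track the actual estimation error.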
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Imperfect Crowd Knowledge Fusion (ICKF)</head><p>In the previous subsection, we presented the UDDA module that identifies the failure cases of AI. Our next question is how to acquire accurate crowd intelligence from the potentially unreliable crowd workers to fix the failure cases of AI. We note that many current active learning-based crowd-AI approaches often retrain the AI models with additional labels from the annotators to improve their performance. However, we found that such a retraining mechanism does not work well with the imperfect labels on cultural heritage damage images obtained from the crowd. In particular, we compare the performance of three representative deep learning-based damage assessment baselines (i.e., UNet <ref type="bibr">[48]</ref>, FCN <ref type="bibr">[58]</ref>, and DFANet <ref type="bibr">[49]</ref>) together with our UDDA module with and without retraining using the imperfect crowd labels. The results are shown in Fig. <ref type="figure">6</ref>. We observe that the performance of all schemes decreases after they are retrained with the imperfect crowd labels. The reason is that the imperfect crowd labels can force the AI models to learn incorrect visual characteristics of the damage areas (e.g., mistakenly learn the visual characteristics of intact areas as evidence for damage areas). The results validate our hypothesis that simply retraining AI models with imperfect labels from the crowd may lead to suboptimal performance of the models.</p><p>To address the above challenge, we design a crowd-AI fusion module that integrates the uncertainty estimation of the AI module and the imperfect crowd responses into a holistic estimation framework to improve the overall damage assessment accuracy. 
In particular, we first design a crowd annotation portal on Amazon Mechanical Turk that allows the crowd workers to document their confidence in their marked damage areas, as shown in Fig. <ref type="figure">7</ref>. (Fig. <ref type="figure">6</ref>: Impact of Imperfect Crowd Intelligence on AI Models.) Such a confidence-aware design is important due to the complex cultural heritage damage and the imperfect nature of individual crowd workers. We note that different crowd workers could mark different areas as the damage area and the same worker could express different levels of confidence for the marked areas. Our next question is how to obtain reliable crowd intelligence by leveraging the inconsistent and uncertain responses from the individually unreliable crowd workers. To that end, we first define a key term as follows:</p><p>Definition 12: Inferred Damage Area (D CI ): We define D CI to represent the damage area inferred by our ICKF module using the responses from the crowd query Q. In particular, we define D CI a to be the inferred damage area for cultural heritage image X a in the crowd query Q, which is inferred from the conditional probability Pr(D CI a (1), ..., D CI a (N ), CF a (1), ..., CF a (N ) | D CI a ) (Equation (13)), where D CI a (n) represents the damage area marked by crowd worker n in a crowd query (Definition 5) for cultural heritage image X a and CF a (n) indicates the associated confidence level for D CI a (n) as shown in Fig. 7. Our goal here is to estimate the likelihood of each pixel of an image being a part of the damage area in X a given the crowd responses and the associated confidence levels, which collectively help infer D CI a . 
In particular, we first define a likelihood function L(&#8486;; O, Z) as follows:</p><p>L(&#8486;; O, Z) = L(&#8486;; ((D CI a (1), ..., D CI a (N )), (CF a (1), ..., CF a (N ))), D CI a ) = &#8719;_{p=1}^{P} { [ &#8719;_{n=1}^{N} &#8719;_{c=1}^{C} &#945;_{n,c}^{U_{n,p} &amp;&amp; V^c_{n,p}} &#215; (1 - &#8721;_{c=1}^{C} &#945;_{n,c})^{(1 - U_{n,p})} ] &#215; d &#215; z_p + [ &#8719;_{n=1}^{N} &#8719;_{c=1}^{C} &#946;_{n,c}^{U_{n,p} &amp;&amp; V^c_{n,p}} &#215; (1 - &#8721;_{c=1}^{C} &#946;_{n,c})^{(1 - U_{n,p})} ] &#215; (1 - d) &#215; (1 - z_p) }    (14)</p><p>The above likelihood function represents the joint likelihood of the observed data O (i.e., the damage areas marked by different workers with different confidence levels) and the hidden variables Z (i.e., the damage area of an image) given the estimated parameters &#8486;. The detailed explanations of the parameters of the likelihood function are summarized in Table <ref type="table">I</ref>. The objective of our problem is to infer the accurate damage area D CI a by deriving the values of the hidden variable z p , which indicates whether a specific pixel p of an image belongs to the damage area. In particular, the formulated problem can be solved using expectation maximization (EM). However, one key issue with the EM algorithm is that it is often sensitive to the initialization of the model parameters, which may lead the algorithm to a suboptimal solution.</p><p>To address this problem, we leverage the uncertainty estimation generated by our UDDA module to provide the EM algorithm with a better parameter initialization that maximizes the chance of the algorithm reaching an optimized solution. 
In particular, we first define a key term as follows:</p><p>Definition 13: Reliable AI Estimation Area (&#948; a ): We define &#948; a to represent the sub-area in a cultural heritage image X a with the top k percent lowest uncertainty values in the estimation uncertainty matrix M a , which indicates that the AI module is certain about its estimation results in &#948; a . The value of k is often set to be small (e.g., 10 in our experiments) in practice to ensure the estimation results from the AI module are reliable. Leveraging &#948; a , we then set the value of z p for each pixel p within &#948; a to be the same as the assessment result from the UDDA module (i.e., z p is set to 1 if the pixel is estimated by AI to be a part of the damage area and 0 otherwise) in both the initialization and the iterative process of the EM algorithm. We can then infer the damage area D CI a of an image in the crowd query from the learned z p of each pixel. In particular, we examine z p for all pixels in an image and set all pixels p with z p &gt; 0.5 as the inferred damage area. 
Finally, we use the inferred damage area D CI to replace the estimated damage area D AI generated by the UDDA module for all images in the crowd query Q.</p><p>Table I. Notations and Definitions/Explanations:
P: number of pixels in a cultural heritage image;
N: number of crowd workers for each crowd query;
C: number of confidence levels in our crowdsourcing portal;
&#945; n,c : conditional probability that a crowd worker n marks a pixel to be a part of the damage area with a confidence level of c given the pixel is a part of the damage area;
&#946; n,c : conditional probability that a crowd worker n marks a pixel to be a part of the damage area with a confidence level of c given the pixel is not a part of the damage area;
U n,p : indicator variable that is set to be 1 when a crowd worker n marks a pixel p to be a part of the damage area and is set to be 0 otherwise;
V c n,p : indicator variable that is set to be 1 when a crowd worker n reports a pixel p to be a part of the damage area with a confidence level of c and is set to be 0 otherwise.</p></div>
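The confidence-aware fusion described above can be illustrated with a simplified EM sketch. Everything below is our own assumption rather than the paper's implementation: the function and parameter names, the symmetric treatment of unmarked pixels, and the fixed iteration count are illustrative, and the likelihood follows the two-branch mixture structure of Equation (14) with the reliable-area pinning of Definition 13.

```python
import numpy as np

def ickf_em_fusion(marks, conf, z_ai, reliable, n_levels=3, iters=50, eps=1e-9):
    """Simplified EM sketch of the ICKF fusion (Equation (14), Definition 13).

    marks    -- U[n, p]: 1 if worker n marked pixel p as damaged, else 0
    conf     -- confidence level (0 .. n_levels-1) of each mark; ignored where U = 0
    z_ai     -- AI damage estimate per pixel (0/1) from the UDDA module
    reliable -- boolean mask of the reliable AI estimation area (delta_a)
    Returns the inferred damage mask D_CI (pixels with z_p > 0.5).
    """
    n_workers, n_pixels = marks.shape
    # V[n, c, p] = 1 iff worker n marked pixel p with confidence level c.
    v = np.stack([marks * (conf == c) for c in range(n_levels)], axis=1)
    alpha = np.zeros((n_workers, n_levels))  # P(mark with conf c | pixel damaged)
    beta = np.zeros((n_workers, n_levels))   # P(mark with conf c | pixel intact)
    z = z_ai.astype(float)  # initialize the hidden variables from the AI estimate
    for _ in range(iters):
        # Pin pixels in the reliable AI estimation area to the AI assessment.
        z[reliable] = z_ai[reliable]
        # M-step: re-estimate the per-worker reliability parameters.
        for n in range(n_workers):
            alpha[n] = (v[n] * z).sum(axis=1) / (z.sum() + eps)
            beta[n] = (v[n] * (1 - z)).sum(axis=1) / ((1 - z).sum() + eps)
        d = z.mean()  # prior probability that a pixel is damaged
        # E-step: posterior probability of each pixel being part of the damage area.
        log_a = np.full(n_pixels, np.log(d + eps))
        log_b = np.full(n_pixels, np.log(1 - d + eps))
        for n in range(n_workers):
            level = np.clip(conf[n], 0, n_levels - 1)
            log_a += np.log(np.where(marks[n] == 1, alpha[n][level],
                                     1 - alpha[n].sum()) + eps)
            log_b += np.log(np.where(marks[n] == 1, beta[n][level],
                                     1 - beta[n].sum()) + eps)
        z = 1.0 / (1.0 + np.exp(log_b - log_a))
    z[reliable] = z_ai[reliable]
    return z > 0.5
```

Because the first M-step runs on the AI-initialized (and pinned) hidden variables, the worker reliability parameters start from the UDDA estimate rather than from an arbitrary guess, which is the role the reliable AI estimation area plays in the paper.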
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Summary of CollabLearn Framework</head><p>Finally, we summarize the CollabLearn framework in Algorithm 1. In particular, CollabLearn includes three main phases in performing the crowd-AI based CHDA as follows:</p><p>&#8226; Model training phase: The objective of this phase is to train an optimized uncertainty-aware deep damage assessment (UDDA) network (i.e., EN * and AN * ) that will be used in the later phases to detect the failure cases of AI and infer accurate labels on damage areas from the crowd responses. In particular, our framework leverages labeled data to train the UDDA network by optimizing the final loss function (Equation (<ref type="formula">10</ref>)) using the RMSprop optimizer <ref type="bibr">[57]</ref>.</p><p>&#8226; AI troubleshooting phase: Given the learned optimized EN * and AN * , our objective in this phase is to identify the failure cases of AI by selecting the images with high uncertainty scores &#934; and adding those images to the crowd query Q. Note that CollabLearn does not involve any network training during the AI troubleshooting phase. Instead, it utilizes the learned network instances (EN * and AN * ) obtained from the model training phase to identify the damaged area of cultural heritage sites and generate the uncertainty estimation of the inferred damage area. In addition, for the images that are not added to Q, we take the damage area estimated by our AI module D AI as the output D of our CollabLearn framework.</p><p>&#8226; Crowd knowledge fusion phase: For the images in the crowd query Q, we first obtain the crowd responses O from the crowdsourcing platform. The objective of this phase is to integrate the uncertainty matrices M from the UDDA module with the imperfect crowd responses to infer the accurate damage area D CI . The D CI will be used as the output D of our CollabLearn framework for all cultural heritage images in the crowd query Q.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 CollabLearn Framework Summary</head><p>Model Training Phase
1: initialize EN (Definition 9)
2: initialize AN (Definition 10)
3: for each epoch do
4:   for each batch do
5:     optimize EN and AN (Equation (10))
6:   end for
7: end for
8: obtain EN * and AN *
AI Troubleshooting Phase
9: obtain D AI and M using EN * and AN * (Equation (11))
10: for a in [1, 2, ..., A] do
11:   if &#934; a in top &#952; &#183; A then
12:     add X a to Q
13:   else
14:     set D AI a as D a
15:     add D a to D
16:   end if
17: end for
Crowd Knowledge Fusion Phase
18: for each X a in Q do
19:   obtain O from crowdsourcing platform
20:   derive D CI a by solving Equation (14) using EM
21:   set D CI a as D a
22:   add D a to D
23: end for
24: output D</p></div>
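The control flow of Algorithm 1 can be sketched as a small driver. The AI, crowd, and fusion calls are stubbed out as injected functions; every helper name below is an illustrative assumption, not part of the paper's implementation.

```python
# Minimal driver mirroring the AI troubleshooting and crowd knowledge fusion
# phases of Algorithm 1 (the model training phase is assumed already done).
def collab_learn(images, uncertainty_fn, ai_fn, crowd_fn, fuse_fn, theta):
    # AI troubleshooting phase: score every image by its uncertainty
    # (Definition 11) and add the top theta fraction to the crowd query Q.
    scores = {a: uncertainty_fn(x) for a, x in enumerate(images)}
    k = int(theta * len(images))
    query = set(sorted(scores, key=scores.get, reverse=True)[:k])
    output = {}
    for a, x in enumerate(images):
        if a in query:
            # Crowd knowledge fusion phase: fuse the imperfect crowd responses
            # with the AI estimate and its uncertainty via the ICKF module.
            output[a] = fuse_fn(crowd_fn(x), ai_fn(x), scores[a])
        else:
            # Low-uncertainty images keep the AI estimate as the output D.
            output[a] = ai_fn(x)
    return output
```

The design point this makes explicit is that no network training happens after the model training phase: the troubleshooting and fusion phases only consume the learned estimators.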
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EVALUATION</head><p>In this section, we evaluate the performance of the CollabLearn framework using real-world datasets on cultural heritage damage collected from seven different recently damaged cultural heritage sites. The results show that CollabLearn consistently outperforms the state-of-the-art AI-only and crowd-AI hybrid baselines in terms of cultural heritage damage assessment accuracy under various application scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Dataset</head><p>Cultural Heritage Damage Dataset: In our evaluation, we use a real-world dataset on cultural heritage damage collected by <ref type="bibr">[3]</ref> <ref type="foot">foot_0</ref> . In particular, the dataset consists of social media images collected from seven different recently damaged cultural heritage sites as shown in Fig. <ref type="figure">9</ref>.</p><p>These cultural heritage damages have a diversified set of damage characteristics (e.g., damage types, affected areas, and building characteristics), which creates a challenging evaluation scenario for studying the cultural heritage damage assessment problem. The ground-truth damage area in each cultural heritage image is manually annotated by domain experts using the image polygonal annotation tool Labelme 2 . In particular, for the seven recently damaged cultural heritage sites studied in our experiment, we first collect validation images of the intact cultural heritage sites from online data sources (e.g., Google, Wikipedia). We then compare the damage images of the cultural heritage sites with the validation images to determine the ground-truth damage area in each image. In addition, we show a few examples of this ground-truth damage area annotation process in Fig. <ref type="figure">8</ref>. Please note that the above ground-truth dataset is used for evaluation purposes only and is often not available to the crowd-AI system due to the heavy labor costs and low efficiency of domain experts. In addition, we randomly sample the training and testing data from the dataset by setting the ratio of training to testing data to 1:1. Such a ratio is set to ensure that all compared schemes can be evaluated with a sufficient amount of testing data. 
A large testing set also makes it more challenging for all crowd-AI schemes (including CollabLearn) to identify the AI failure cases <ref type="bibr">[59]</ref>.</p><p>The training dataset is used to train all compared AI models for cultural heritage damage assessment. In our experiments, we also study the robustness of the CollabLearn scheme and the baselines by varying the ratio between the training and testing data.</p><p>Amazon Mechanical Turk Platform: To obtain the crowd intelligence, we utilize the Amazon Mechanical Turk (AMT) platform. In our experiment, each image in a crowd query is marked by three independent crowd workers. To ensure the crowd label quality, we select crowd workers who have an overall task approval rate greater than 95% and have completed at least 1000 approved tasks to participate in our crowd query task. We pay $0.20 to each worker per image in our experiment. In each crowd query task, we ask the crowd workers to mark the damage area for each image in the crowd query together with their confidence in their marked damage areas, as shown in Fig. <ref type="figure">7</ref> in Section IV-C. In our experiments, we vary the crowd query ratio (Definition 6) from 10% to 25%. We also set the number of crowd workers who respond to each queried image to be 3 for all compared schemes, which achieves a reasonable balance between the number of crowd labels and the query cost. We have followed the corresponding IRB protocol of this research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Baselines and Experiment Settings</head><p>We compare CollabLearn with a set of representative AI and crowd-AI baselines that are widely used in the literature for damage assessment using social media images.</p><p>Fig. 9. Cultural Heritage Damage Sites in Our Dataset.</p><p>&#8226; AI Baselines: 1) UNet <ref type="bibr">[48]</ref>: a widely-used deep neural network approach that utilizes both contracting and expanding paths to enable cross-layer information transmission for desirable cultural heritage damage assessment accuracy. 2) FCN <ref type="bibr">[58]</ref>: a deep learning model that utilizes fully convolutional neural networks to map the cultural heritage imagery data into a latent deep feature space to infer the damage area. 3) Attention UNet <ref type="bibr">[60]</ref>: a recent deep convolutional model that integrates the U-Net model with an attention gate mechanism to improve the model sensitivity in detecting cultural heritage damages. 4) DFANet <ref type="bibr">[49]</ref>: a deep segmentation network that aggregates discriminative features from different sub-networks to achieve a fast model convergence speed for the cultural heritage damage assessment task.</p><p>&#8226; Crowd-AI Hybrid Baselines: 1) Hybrid Para <ref type="bibr">[15]</ref>: an elastic crowd-AI learning architecture that allocates the imagery data with complex image properties (e.g., images with large sizes and complex color distributions) to the crowd to label in order to improve the overall assessment accuracy of the CHDA application. 2) Deep Active <ref type="bibr">[43]</ref>: a deep active learning-based crowd-AI system that utilizes the deep features extracted from each cultural heritage image to identify representative ones for crowd labeling to retrain the AI models for performance optimization. 3) CrowdLearn <ref type="bibr">[11]</ref>: a recent crowd-AI framework that explores crowd intelligence and AI by directly combining crowd labels with AI outputs to improve the accuracy of the estimated labels on cultural heritage damage.</p><p>Table II. Performance Comparisons on Damage Assessment Accuracy (IoU / DSC at each crowd query ratio &#952;):
Category | Algorithm | &#952; = 10% | &#952; = 15% | &#952; = 20% | &#952; = 25%
Random | Random | 0.2050 / 0.3277 | 0.1989 / 0.3198 | 0.2154 / 0.3294 | 0.2023 / 0.3251
AI-Only | UNet | 0.4115 / 0.5540 | 0.4500 / 0.5915 | 0.4244 / 0.5562 | 0.4928 / 0.6311
AI-Only | FCN | 0.4069 / 0.5507 | 0.4384 / 0.5797 | 0.4662 / 0.6058 | 0.4879 / 0.6269
AI-Only | AttentionUNet | 0.3329 / 0.4535 | 0.3633 / 0.4835 | 0.3684 / 0.4911 | 0.3962 / 0.5179
AI-Only | DFANet | 0.3839 / 0.5539 | 0.3692 / 0.5382 | 0.3677 / 0.5371 | 0.3810 / 0.5510
Crowd-AI | Hybrid Para | 0.4963 / 0.6278 | 0.5027 / 0.6337 | 0.5162 / 0.6457 | 0.5205 / 0.6489
Crowd-AI | Deep Active | 0.3856 / 0.5263 | 0.3970 / 0.5389 | 0.4273 / 0.5706 | 0.4580 / 0.5970
Crowd-AI | CrowdLearn | 0.4894 / 0.6248 | 0.4987 / 0.6322 | 0.5074 / 0.6406 | 0.5171 / 0.6504
Our Model | CollabLearn | 0.5298 / 0.6580 | 0.5490 / 0.6763 | 0.5581 / 0.6838 | 0.5686 / 0.6924</p><p>To ensure a fair comparison, the inputs to all compared schemes are set to be the same, which include 1) the collected social media images, 2) the ground-truth labels of images in the training dataset, and 3) the labeled images from crowd workers. In particular, we retrain the AI baselines using the labels returned by the crowd for a fair comparison. In addition, we also consider a random baseline, which estimates the damage area for each image by randomly determining whether each pixel in the image is a part of the damaged area or not.</p><p>In our experiments, we implement our model using the PyTorch 1.1.0 library and train our model using NVIDIA Quadro RTX 6000 GPUs. All hyper-parameters are optimized using the RMSprop optimizer <ref type="bibr">[57]</ref>. In particular, we set the learning rate to 10^-4, set the batch size to 10, and train the model over 300 epochs.</p><p>To evaluate the performance of all compared schemes, we adopt two representative metrics that are widely used to study the performance of object detection in the image processing and computer vision community <ref type="bibr">[54]</ref>. 
In particular, the two metrics measure the overlap between the estimated and actual damage area: 1) Intersection over Union (IoU) = Intersection / Union, and 2) Dice Similarity Coefficient (DSC) = 2 &#215; Intersection / (Intersection + Union), where Intersection and Union represent the intersection and union between the inferred damage area and the actual damage area, respectively. Intuitively, a higher IoU or DSC value indicates a better performance in identifying the damage area of a cultural heritage image.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4"><p><ref type="url">https://pytorch.org/</ref></p></note>
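The two metrics can be sketched in a few lines of pure Python (the function name and the convention for two all-empty masks are our own assumptions):

```python
def iou_dsc(pred, truth):
    """IoU and DSC between a predicted and an actual binary damage mask.

    pred, truth -- iterables of 0/1 pixel labels (flattened masks).
    """
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    # DSC = 2*Intersection / (Intersection + Union), which equals the usual
    # Dice form 2|A&B| / (|A| + |B|) since |A| + |B| = |A&B| + |A|B|.
    return inter / union, 2 * inter / (inter + union)
```

Note that DSC is always at least as large as IoU for the same pair of masks, which is consistent with the DSC columns dominating the IoU columns in Table II.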
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Evaluation Results</head><p>1) Performance Comparisons on Cultural Heritage Damage Assessment: In the first set of experiments, we evaluate the accuracy of all compared schemes in estimating the damage area of a cultural heritage image. In particular, we study the performance of all compared schemes by varying the crowd query ratio &#952; (the percentage of images that are sent to the crowdsourcing platform for labeling, defined in Definition 6) from 10% to 25%, which achieves a reasonable tradeoff between the number of crowd responses and the query cost. In particular, we set the lower bound of the crowd query ratio in our experiment to be 10% to ensure that CollabLearn can acquire sufficient crowd labels to fix the failure cases of AI. In addition, we set the upper bound of the crowd query ratio to be 25% because the performance of CollabLearn plateaus when the crowd query ratio reaches 25%, and further increasing the crowd query ratio and query cost will not further improve the performance of CollabLearn. The evaluation results are presented in Table <ref type="table">II</ref>. We observe that the CollabLearn scheme consistently outperforms all compared baselines. For example, the performance gain of CollabLearn compared to the best-performing baseline (i.e., Hybrid Para) when the crowd query ratio &#952; = 25% is 4.81% and 4.35% on IoU and DSC, respectively. Such performance gains mainly come from the fact that our CollabLearn scheme carefully explores the uncertainty of both the deep learning model and the crowd intelligence under a holistic estimation framework and collectively improves the overall damage assessment accuracy. We also observe that the performance of our CollabLearn scheme improves when the crowd query ratio increases. This is because with a larger crowd query ratio, more problematic AI cases will be identified by the UDDA module and fixed by the crowd via the ICKF module of our solution. 
The above results demonstrate the effectiveness of our CollabLearn in leveraging the imperfect crowd intelligence to carefully address the failure cases of AI and boost the accuracy of CHDA applications.</p><p>Fig. 11. Effectiveness of Uncertainty-Aware Damage Assessment (DCI (DSC)); panels (a)-(d) correspond to crowd query ratios of 10%, 15%, 20%, and 25%.</p><p>2) Effectiveness of AI Failure Detection: In the second set of experiments, we evaluate the effectiveness of our CollabLearn in identifying the failure cases of the AI model. In particular, we first introduce a new metric, the Detection Efficiency Index (DCI), defined as DCI(&#8710;) = &#8710;(nonselected) / &#8710;(selected). Here, selected indicates the set of images that the scheme estimated as the failure cases of AI, which are selected for the crowd query, and nonselected indicates the remaining images that are not selected for the crowd query. &#8710; indicates the mean value of the IoU or DSC of the images in the set. Intuitively, a higher DCI value indicates that a crowd-AI scheme is more effective in identifying the AI failure cases (i.e., the images selected for the crowd query have much lower IoU/DSC values compared to the ones that are not selected). The results are shown in Fig. <ref type="figure">10</ref> and Fig. <ref type="figure">11</ref>. Note that we only compare our CollabLearn with the crowd-AI hybrid baselines because the current AI-only solutions often do not have a mechanism for detecting their failure cases. We observe that CollabLearn clearly outperforms all compared schemes by achieving the highest DCI in all evaluation scenarios. 
The above results further validate the effectiveness of the UDDA module in our CollabLearn framework.</p><p>3) Convergence Study of Imperfect Crowd Knowledge Fusion: In the third set of experiments, we evaluate the convergence of the ICKF module of our CollabLearn framework in inferring the accurate damage area of images from the imperfect crowd responses. In particular, we show the convergence of our ICKF module in learning the inferred damage areas over different iterations. The results are shown in Fig. <ref type="figure">12</ref>. We observe that our CollabLearn can quickly boost the assessment performance and remain stable afterward. The results are similar across different metrics and crowd query ratios. Such results illustrate the effectiveness of the ICKF module in CollabLearn in leveraging the uncertainty estimation from both AI and crowd responses to derive accurate labels on damage areas of the queried images and improve the overall performance of CHDA applications.</p><p>We also compare the performance of CollabLearn with the best-performing baselines from both the AI-only and Crowd-AI Hybrid categories (i.e., UNet from the AI-only and Hybrid Para from the Crowd-AI) while varying the ratio of training to testing data. The results are shown in Fig. <ref type="figure">13</ref> and Fig. <ref type="figure">14</ref>  <ref type="foot">5</ref> . We observe that the performance of our CollabLearn scheme is relatively stable as the training to testing data ratio changes under different crowd query settings. The results demonstrate the robustness of our scheme over various evaluation settings. We also observe that CollabLearn consistently outperforms the best-performing baselines on different evaluation metrics, which further demonstrates the effectiveness of CollabLearn in optimizing the performance of CHDA applications under a unified crowd-AI analytical framework.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>This paper presents the CollabLearn framework to address the cultural heritage damage assessment (CHDA) problem by exploring the collective intelligence of both AI and the crowd under a unified analytical framework. CollabLearn addresses two key challenges: the identification of AI failure cases without ground truth labels and the fusion of imperfect crowd intelligence. We develop an uncertainty-aware crowd-AI collaboration system that explicitly models the uncertainty of both AI models and crowd responses in a principled estimation framework and explores their complementary strengths to improve the overall performance of CHDA applications. The results on real-world cultural heritage damage assessment applications show that CollabLearn consistently outperforms both AI-only and crowd-AI hybrid baselines in terms of cultural heritage damage assessment accuracy. We believe CollabLearn will provide useful insights into exploring the integrated power of uncertain AI models and imperfect crowd intelligence to boost the performance of a diversified set of complex intelligent computing systems (e.g., intelligent transportation, smart health, and social AI) where AI and crowd intelligence are melded into a collaborative and mutually beneficial paradigm.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://crisisnlp.qcri.org/heritage</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1"><p>Note that we start from &#951; = 0.5 to ensure that both CollabLearn and the compared baselines can be evaluated with a sufficient amount of testing data.</p></note>
		</body>
		</text>
</TEI>
