<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Driver Drowsiness Behavior Detection and Analysis Using Vision-Based Multimodal Features for Driving Safety</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/14/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10341176</idno>
					<idno type="doi">10.4271/2020-01-1211</idno>
					<title level='j'>SAE Technical Paper Series</title>
<idno>0148-7191</idno>
<biblScope unit="volume">1</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Rui Li</author><author>Howard Brand</author><author>Aditya Gopinath</author><author>Srivatsav Kamarajugadda</author><author>Liang Yang</author><author>Weitian Wang</author><author>Bing Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The concrete aging problem has gained more attention in recent years as more bridges and tunnels in the United States lack proper maintenance. Though the Federal Highway Administration requires these public concrete structures to be inspected regularly, on-site manual inspection by human operators is time-consuming and labor-intensive. Conventional approaches to concrete inspection, using RGB image-based thresholding methods, cannot determine metric information or accurate location information for the assessed defects. To address this challenge, we propose a deep neural network (DNN) based concrete inspection system using a quadrotor flying robot (referred to as CityFlyer) mounted with an RGB-D camera. The inspection system introduces several novel modules. Firstly, a visual-inertial fusion approach is introduced to perform camera and robot positioning and 3D metric reconstruction of the structure. The reconstructed map is used to retrieve the location and metric information of the defects. Secondly, we introduce a DNN model, namely AdaNet, to detect concrete spalling and cracking, with the capability of maintaining robustness under various distances between the camera and the concrete surface. In order to train the model, we craft a new dataset, i.e., the concrete structure spalling and cracking (CSSC) dataset, which is released publicly to the research community. Finally, we introduce a 3D semantic mapping method using the annotated framework to reconstruct the concrete structure for visualization. We performed comparative studies and demonstrated that our AdaNet can achieve 8.41% higher detection accuracy than ResNets and VGGs. Moreover, we conducted five field tests, of which three are manual hand-held tests and two are drone-based field tests. These results indicate that our system is capable of performing metric field inspection, and can serve as an effective tool for civil engineers.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. Introduction</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Structural health monitoring (SHM) plays a significant role in performance evaluation and condition assessment for the nation's highway transportation assets. SHM can improve the operational safety and longevity of these assets through data-driven analysis and decision-making. The Federal Highway Administration (FHWA) of the U.S. Department of Transportation (DOT) launched a Long-Term Bridge Performance (LTBP) program in 2015 to facilitate SHM by collecting critical performance data <ref type="bibr">[1]</ref>. According to the FHWA's latest bridge element inspection manual <ref type="bibr">[2]</ref>, the New York bridge inspection manual <ref type="bibr">[3]</ref>, and the tunnel operations, maintenance, inspection, and evaluation (TOMIE) manual <ref type="bibr">[4]</ref>, it is crucial to identify, measure, and evaluate condition states during routine inspection of bridges and tunnels. Such condition states include concrete spall (delamination, patched area), exposed rebar, cracking, abrasion (wear), and other damage.</p><p>Several robotic inspection systems have been developed for automated concrete inspection. Lim et al. <ref type="bibr">[5]</ref> proposed a visual pavement crack inspection and mapping system using a mobile robot platform. The robot used a camera to perform visual inspection using an edge detection algorithm with a machine learning method. Lidar was used for location tagging and mapping. With the support of the FHWA LTBP program, Prasanna et al. <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref> proposed an autonomous bridge deck inspection mobile robotic system using a mono-visual camera, ground penetrating radar, and acoustic sensors. The robot was developed to detect cracks in pavements, which are relatively planar surfaces. Unmanned aerial vehicles (UAVs) have also been deployed for bridge visual inspection <ref type="bibr">[8]</ref>. 
UAVs can perform remote inspection of areas that are not accessible to human operators. However, none of these robotic inspection systems could retrieve metric information about the defects, such as width, length, and area. Also, though these robotic systems used GPS to obtain location information, the resulting positions were not accurate enough to build a 3D map for visualization, nor were the systems applicable in GPS-denied areas.</p><p>To facilitate automatic inspection, acoustic sensors <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>, ground penetrating radar <ref type="bibr">[11]</ref>, and visual cameras <ref type="bibr">[12]</ref>-<ref type="bibr">[14]</ref> have been the three most commonly used sensors in the civil engineering community over the past decade. For visual camera-based inspection, previous research mainly focused on entropy or intensity thresholding methods that highlight high-contrast, visually distinct areas. These methods include edge detection, fast Fourier transform (FFT), and fast Haar transform (FHT). Besides pure thresholding methods, researchers also introduced new detection algorithms that combine image segmentation, image thresholding (such as Otsu's method <ref type="bibr">[15]</ref>), and morphology operations <ref type="bibr">[15]</ref> to produce high-quality detection results. Histogram analysis and automatic peak detection approaches were also used for visual inspection <ref type="bibr">[16]</ref>. A crack-defragmentation approach for fragment grouping and fragment connection was proposed in <ref type="bibr">[17]</ref>, and an artificial neural network (ANN) was introduced for crack detection and classification. 
However, these methods only work well on simple, clear surfaces and cannot indicate defect categories.</p><p>In this paper, we propose an automatic robotic system for concrete structure visual inspection, using an RGB-D camera with a deep neural network and an RGB-D reconstruction method to build a 3D map with defects highlighted. This is illustrated in Fig. <ref type="figure">1</ref>. Unlike previous research, which only performed crack or spalling detection using pure RGB images, we introduce an RGB-D visual simultaneous localization and mapping (SLAM) method for structure reconstruction and combine it with a deep neural network to recognize and highlight defects. The defects are registered and labeled in the 3D map to reveal their physical location in the 3D structure model, facilitating condition assessment. Furthermore, we introduce a depth-adaptive window size predictor based on depth in-painting to predict the optimized sliding window size. Then, a sliding-window-based multi-resolution detection model is used to detect the defect area. Finally, to visualize the defects, we introduce a conditional random field (CRF) method to perform 2D to 3D registration and fusion.</p><p>Extending our preliminary work <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, which used VGGs with a fixed sliding window size to solve the detection problem, we propose a depth-adaptive model to optimize the detection. To summarize, our main contributions are:</p><p>1) A high-quality labeled dataset for crack and spalling detection, which is the first publicly available dataset for visual inspection of concrete structures. It has 522 (labeled) crack images and 298 spalling images, and over 10 000 field-collected images from the concrete structure.</p><p>2) A robotic inspection system with visual-inertial fusion to obtain pose estimation using an RGB-D camera and an IMU. 
The visual-inertial system has a 100 Hz pose estimation rate to enable online navigation and 3D mapping.</p><p>3) A depth in-painting model that allows depth hole in-painting in an end-to-end approach with real-time performance.</p><p>4) A multi-resolution model that adapts to image resolution changes and allows accurate defect detection in the field.</p><p>We propose a novel robot inspection system using the CityFlyer <ref type="bibr">[20]</ref>, which consists of a control and mission module (CMM), a visual-inertial positioning module, and a deep inspection and 3D registration module, as illustrated in Fig. <ref type="figure">2</ref>. The CMM implements autonomous navigation and is developed on the Robot Operating System (ROS) platform. The CMM receives visual-inertial odometry (VIO) as feedback to navigate the CityFlyer. The VIO has a 100 Hz frame rate that meets state control requirements and also decreases the frame-to-frame pose estimation error. Meanwhile, the concrete defect prediction output is registered to 3D space using the depth information, and the target defect's 3D location and surface normal <ref type="bibr">[21]</ref> are used to navigate the CityFlyer to the best viewing angle. By navigating the CityFlyer to a front-view perspective of the target defect area, our system can achieve better inspection data acquisition. The visual-inertial positioning module fuses the output of visual odometry and IMU propagation to achieve real-time pose estimation of the CityFlyer. We use an ASUS Xtion Pro RGB-D camera as the visual perception unit to perform pose estimation and 3D perception. Its data sheet is listed in Table <ref type="table">I</ref>. The IMU is a Phidgets Spatial sensor. VIO fusion proceeds in the following steps. First, RGB and depth images are used to estimate the pose of the UAV, using feature matching and optimization approaches <ref type="bibr">[22]</ref>. 
Second, we implement a multi-state extended Kalman filter (MS-EKF <ref type="bibr">[23]</ref>) to fuse IMU state propagation and the visual odometry observation, allowing real-time positioning and control at 100 Hz. Note that we perform an off-line calibration to obtain the transformation between the camera and the CityFlyer body.</p><p>The adaptive defect detection and 3D registration module addresses the problem of providing metric information during inspection, allowing civil engineers to perform condition evaluation <ref type="bibr">[4]</ref> with context on the spatial characteristics and location of the defects. An AdaNet with depth in-painting and a multi-resolution approach is proposed to improve defect detection accuracy. We first introduce a depth-varying sliding window size optimizer. Then, the detection result is registered and fused in a 3D map for visualization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. Concrete Inspection Method</head><p>This section discusses the DNN model-based concrete inspection method, which identifies the 2D regions of defects from RGB image inputs. Inspired by feature pyramids <ref type="bibr">[24]</ref>, we propose a Multi-resolution DetectionNet that takes multi-resolution RGB image inputs to detect concrete defects. Moreover, we introduce a depth-adaptive sliding-window size selection method that adjusts the bounding box size based on the distance to the surface. In the rest of this section, we provide a comprehensive theoretical analysis of the model, and we also compare the detection performance of our AdaNet with ResNets <ref type="bibr">[25]</ref>, VGGs <ref type="bibr">[26]</ref>, and AlexNet <ref type="bibr">[27]</ref>.</p><p>For visual inspection, we treat the concrete defect detection task as a multi-class classification problem. The input images fall into three categories: crack, spalling, and background. Each image is associated with a ground truth label. The detection goal is to find a mapping function from images to labels that minimizes a pre-defined loss. We encode the label of each image as an integer starting from 0 and running up to one less than the number of classes. In this paper, we define the crack images' label as 1, the spalling images' label as 2, and the background images' label as 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Data Preparation and Augmentation</head><p>No publicly accessible concrete defect dataset was available to train our model, let alone an RGB-D dataset with depth information. To train the inspection model for defect detection, we developed a new concrete structure spalling and cracking (CSSC) dataset. We held discussions with civil engineers to catalog the terminology used in concrete defect assessment applications. This provided key terms for image search engines and allowed us to mine images from the search results. The terms used for web-based data mining are listed below:</p><p>1) Concrete spalling/Rebar: Concrete spalling, concrete rebar, concrete delamination, concrete bridge spalling, concrete column spalling, concrete spalling from fire, concrete spalling repair, and concrete wall.</p><p>2) Concrete Crack: Concrete crack, crack repair, concrete scaling, concrete crazing, and concrete crazing texture.</p><p>We searched for image data through Google, Yahoo, Bing, and Flickr, and collected concrete crack images and concrete spalling images. For spalling images, we further added images collected from the field for training and validation purposes.</p><p>After assembling the crack and spalling images, we annotated them using Photoshop. Some of the annotated images are shown in Figs. <ref type="figure">3</ref> and <ref type="figure">4</ref>. For spalling images, we annotated the exposed rebars and the regions (contours) of spalling damage. These are two regions of interest for civil engineering diagnosis, with areas of exposed rebar indicating more serious degradation. Examples of exposed rebar and spalling contour annotation are shown in Fig. <ref type="figure">4</ref>. 
For concrete cracks, the annotators were asked to carefully annotate the entire crack areas in order to develop a binary mask as a ground truth (shown in Fig. <ref type="figure">3</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TABLE I ASUS Xtion Pro Data Sheet</head><p>Since our AdaNet is a sliding-window detector, we randomly crop the images around the regions of interest (ROIs) using two size settings: 100 &#215; 100 and 130 &#215; 130. This is illustrated in Fig. <ref type="figure">5</ref>. For each cropped image, we determine whether it is a defect or background image via the rule defined in (<ref type="formula">1</ref>). We first count the number of pixels, <formula notation="TeX">n_d</formula>, located inside the defect region in its corresponding label image (which has <formula notation="TeX">n</formula> pixels in total). Then, if the number of defect pixels is greater than or equal to a pre-defined threshold, <formula notation="TeX">\tau n</formula> (where <formula notation="TeX">\tau</formula> represents an empirical percentage threshold value), we declare the cropped image a defect image and label it with 1 (crack) or 2 (spalling). If there are no defect pixels, i.e., <formula notation="TeX">n_d = 0</formula>, we classify the cropped image as background and label it with 0. Note that a cropped image is discarded if the number of defect pixels is between zero and the threshold, i.e., <formula notation="TeX">0 &lt; n_d &lt; \tau n</formula>.</p><p><formula notation="TeX">\ell = \begin{cases} c, &amp; n_d \ge \tau n \\ 0, &amp; n_d = 0 \\ \varnothing, &amp; 0 &lt; n_d &lt; \tau n \end{cases} \qquad (1)</formula></p><p>where <formula notation="TeX">c</formula> denotes the category of the image and <formula notation="TeX">\varnothing</formula> denotes that the image is not used for training. In this paper, the threshold for cracks is set separately for the 100 &#215; 100 and 130 &#215; 130 crop sizes, and a further threshold is set for concrete spalls to obtain spall sub-images. These values are selected to balance the constraints of dataset size and data quality for better detection accuracy.</p></div>
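<div xmlns="http://www.tei-c.org/ns/1.0"><p>The patch labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name <code>label_patch</code>, the mask representation (nested lists in which nonzero entries mark defect pixels), and the threshold value 0.05 in the usage lines are all assumptions made for the example.</p><p><![CDATA[
```python
from typing import Optional, Sequence

BACKGROUND, CRACK, SPALLING = 0, 1, 2


def label_patch(mask: Sequence[Sequence[int]], defect_class: int,
                threshold: float) -> Optional[int]:
    """Label one cropped patch from its binary annotation mask.

    mask         -- rows of 0/1 values; nonzero marks a defect pixel
    defect_class -- CRACK or SPALLING, inherited from the source image
    threshold    -- fraction of defect pixels needed to keep the patch
                    (hypothetical parameter, not a value from the paper)
    Returns the class label, or None when the patch is discarded.
    """
    total = sum(len(row) for row in mask)
    defect = sum(1 for row in mask for px in row if px)
    if defect == 0:
        return BACKGROUND      # no defect pixels at all
    if defect >= threshold * total:
        return defect_class    # enough defect pixels to count as a defect patch
    return None                # in-between patch: not used for training


if __name__ == "__main__":
    patch = [[0] * 10 for _ in range(10)]
    print(label_patch(patch, CRACK, 0.05))   # empty mask
    for col in range(6):
        patch[0][col] = 1                    # 6 of 100 pixels are defect pixels
    print(label_patch(patch, CRACK, 0.05))
```
]]></p><p>Returning <code>None</code> for the in-between case keeps ambiguous patches out of both the defect and background classes, which mirrors the discard rule in the text.</p></div>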
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Depth In-Painting Model</head><p>Commercial RGB-D cameras normally output incomplete depth images when no reflected ray returns from certain viewing angles. The regions with missing depth in the image are referred to as "holes" <ref type="bibr">[28]</ref>. Holes degrade the quality of the 3D reconstruction of a structure and of the 3D metric measurements. Fig. <ref type="figure">6</ref> illustrates some examples of the occurrence of empty regions in depth image data within a sliding window. Inspired by <ref type="bibr">[29]</ref>, we introduce a depth in-painting model (named InpaintNet), which is illustrated in Fig. <ref type="figure">7</ref>. InpaintNet is based on U-Net <ref type="bibr">[30]</ref>, which has an auto-encoder framework with five groups of down-convolutions for the encoder and five groups of up-convolutions for the decoder. Each group has two convolutional layers, and each layer has the same number of channels as U-Net. InpaintNet is composed of two U-Net frameworks connected in parallel: one learns a surface normal embedding from RGB images, and the other performs depth in-painting from depth and surface normal embeddings. The depth in-painting framework feeds depth images to an encoder to form depth embeddings. The depth embeddings are then concatenated with the surface normal embeddings. The decoder of the depth in-painting network decodes complete depth images from the combined depth and surface normal embeddings.</p><p>In this paper, we have neither ground truth depth nor ground truth surface normal data for training. We therefore use classical, computationally expensive approaches to estimate the complete depth and surface normal images. These approaches, though they cannot run in real time, generate accurate estimates of the complete depth and surface normal images to use as ground truth for the neural network. 
The neural network then has the benefit of being able to estimate complete depth and surface normal images in real time. Inspired by <ref type="bibr">[31]</ref>, we use a bilateral filter with color guidance to complete the depth images. The bilateral filtering approach is slower than using a neural network model. For the surface normal, we first apply a Sobel filter to estimate the gradients of the estimated depth images in the x and y directions, &#8710;(x) and &#8710;(y); the surface normal of each pixel is then <formula notation="TeX">N_I = \Delta(x) \times \Delta(y)</formula>.</p></div>
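<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Sobel-based surface normal estimate described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the depth image is a nested list of already completed depth values, gradients are taken per pixel step (no scaling by camera intrinsics), and the normal is formed as the cross product of the two tangent vectors (1, 0, dz/dx) and (0, 1, dz/dy). The names <code>sobel_at</code> and <code>surface_normals</code> are hypothetical.</p><p><![CDATA[
```python
import math

# Standard 3x3 Sobel kernels for the x (column) and y (row) directions.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_at(depth, kernel, r, c):
    """3x3 correlation at (r, c), divided by 8 so a unit ramp has slope 1."""
    acc = 0.0
    for i in range(3):
        for j in range(3):
            acc += kernel[i][j] * depth[r + i - 1][c + j - 1]
    return acc / 8.0


def surface_normals(depth):
    """Unit normals for the interior pixels of a completed depth image.

    Returns a dict mapping (row, col) to a normal (nx, ny, nz).
    """
    rows, cols = len(depth), len(depth[0])
    normals = {}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dzdx = sobel_at(depth, SOBEL_X, r, c)
            dzdy = sobel_at(depth, SOBEL_Y, r, c)
            # (1, 0, dzdx) x (0, 1, dzdy) = (-dzdx, -dzdy, 1)
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            normals[(r, c)] = (nx / norm, ny / norm, nz / norm)
    return normals


if __name__ == "__main__":
    # Depth increasing by one unit per column: a plane tilted about the y axis.
    ramp = [[float(c) for c in range(4)] for _ in range(4)]
    print(surface_normals(ramp)[(1, 1)])
```
]]></p><p>A metric normal would additionally require scaling the gradients by the pixel footprint at each depth; that step is omitted here to keep the sketch short.</p></div>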
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Multi-Resolution Detection</head><p>For robotic on-the-fly defect inspection in the field, especially using a drone, it is quite challenging to keep a consistent distance between the camera and the surface during image acquisition. Since we can obtain the depth aligned with each RGB frame, we can use this information to adjust the sliding window size based on the depth measurement. Thus, the detection model should be robust to images taken at any distance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As discussed in our previous research <ref type="bibr">[18]</ref>, a fine-tuned VGG model does not perform well in field tests, achieving an average detection accuracy of 70.05%, because the spatial resolution of a region depends on the inspection distance. To tackle this problem, this research introduces a Multi-resolution Detection model, inspired by <ref type="bibr">[24]</ref>, that implements a multi-resolution input image feature pyramid. Given a sliding-window cropped input I, we resize I to 1/2 and 1/4 of its size and perform feature extraction in a parallel framework, that is</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p><formula notation="TeX">f^{s} = \mathrm{CNN}(w I^{s} + b), \quad s \in \{1, 1/2, 1/4\}</formula></p><p>where CNN denotes the CNN feature encoder, and <formula notation="TeX">w</formula> and <formula notation="TeX">b</formula> are symbolic representations of the convolution kernel and bias, respectively. <formula notation="TeX">I^{s}</formula> denotes the input image, where the superscript denotes the corresponding scale, and <formula notation="TeX">f^{s}</formula> represents the output corresponding to <formula notation="TeX">I^{s}</formula>. Because all levels of the pyramid use the same network architecture, the output size also differs with the input size. We therefore up-sample the coarse output features of the 1/2- and 1/4-sized images by factors of 2 and 4, respectively. In this paper, we take the raw sliding window input and resize it to a fixed input size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">&#215; 256</head><p>To reduce the channel dimension after concatenation, we introduce a convolutional layer, then apply an average pooling operation to each channel of its output. A three-layer fully connected convolution is used to regress and predict whether the current region is a defect area or not.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Loss Design and Training</head><p>AdaNet is designed to perform concrete defect detection and classification with distance-adaptive capability, and is trained through a joint approach to find the optimal weights that regress to the expected prediction. InpaintNet is a per-pixel value prediction model, and we evaluate its performance using a photometric loss that compares the predicted normal with the normal ground truth and the predicted depth with the depth ground truth. We jointly optimize over both loss terms.</p><p>The multi-resolution detection model aims at determining the existence of defects in an image frame. It predicts the probability of each class given an input patch. Thus, the loss can be simplified to a cross-entropy form defined over the input patch, the convolutional kernel, and the label of each class. In this paper, we use this cross-entropy loss to perform detection regression for our AdaNet model.</p><p>Training: The training dataset for the multi-resolution DetectionNet contains three classes, as discussed in Section III-A, and we annotate the labels as 0 (background), 1 (crack), and 2 (spalling). Besides initializing the model with pre-trained model parameters, we also augment the dataset using 1) random rotation with several pre-defined angles and 2) gamma correction. For training, we split all the images into three sub-datasets: a training dataset, a validation dataset, and a testing dataset. For InpaintNet, we used all field-collected data. The ground truth depth and normals are obtained through the method described in Section III-B.</p></div>
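<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked illustration of the cross-entropy loss used for the three-class detection problem, the following sketch computes the loss for a single patch from raw class scores. It assumes a softmax over the scores, which is the usual pairing with cross entropy but is not stated explicitly in the text; the function names and the example scores are hypothetical.</p><p><![CDATA[
```python
import math

BACKGROUND, CRACK, SPALLING = 0, 1, 2


def softmax(scores):
    """Convert raw class scores to probabilities (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the ground truth class."""
    return -math.log(softmax(scores)[label])


if __name__ == "__main__":
    scores = [0.2, 2.5, -1.0]   # hypothetical network output for one patch
    print(cross_entropy(scores, CRACK))      # small loss: crack scores highest
    print(cross_entropy(scores, SPALLING))   # larger loss: wrong class favored
```
]]></p><p>Averaging this quantity over a batch of patches gives the training objective; a deep learning framework would normally provide the same computation as a built-in loss.</p></div>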
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. Pose Estimation And 3D Semantic Registration</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Our final goal is to reveal where the defects are in a map by registering concrete defects to the 3D map. In this section, we discuss using an RGB-D camera to perform 3D positioning and semantic 3D reconstruction based on the conditional random field (CRF) method to highlight the concrete defects in the map.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Visual Positioning and Association</head><p>The 3D model of a concrete structure is widely used for structural analysis in civil engineering. Moreover, metric defects can be registered to the 3D model with color-coded overlays. This further assists civil engineers in performing comprehensive condition assessments of concrete structures <ref type="bibr">[32]</ref>. In this paper, we propose a 3D mapping system that takes advantage of visual-inertial (VI) SLAM and deep defect detection.</p><p>As discussed in Section II, the CityFlyer requires high localization accuracy and update frequency to enable stable navigation. In this paper, we introduce an MS-EKF <ref type="bibr">[33]</ref> to fuse high-frequency IMU propagation and low-frequency visual odometry (VO) for real-time pose estimation. For VI fusion, the IMU measurements are used to predict the state transition and the VO observations are used to update the state. The difference in measurement frequency allows us to accommodate the fusion of multiple sensors. The evolving IMU state vector includes the unit quaternion that represents the rotation from the world frame to the IMU frame, the IMU linear velocity and linear acceleration with respect to the world coordinate system, and the biases affecting the accelerometer and gyroscope measurements. The system derivative form is expressed in an east-north-up (ENU) coordinate system (partly following <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>) in terms of the translation from the IMU frame to the world frame, the acceleration measurement, the angular velocity measurement, the time interval, and gravity. 
The acceleration is subject to rotation and translation in the IMU frame, and the angular velocity enters through the matrix product referred to in <ref type="bibr">[33]</ref>.</p><p>Meanwhile, the VO performs pose estimation using the RGB-D measurement and outputs the pose, which consists of a rotation and a translation. Once the VO finishes pose estimation for a frame, we update the state based on the measurement model, which comprises a measurement noise term and a measurement matrix that represents the mapping between the IMU state and the VO pose. The prediction from IMU propagation is then corrected by the EKF update, achieving a 100 Hz pose estimation rate.</p><p>The state estimation error of the VIO will continue to drift, because there is no loop closure to correct the pose when views (observations) overlap. To further correct the pose, we record key-frames (i.e., vertices) based on a motion threshold, each consisting of a key-frame image and a key-frame pose. VIO propagation and update give us the transformation between two consecutive frames, and the relative transformation between key-frames can be derived at the same time. To reduce the drift of the visual odometry, this paper introduces graph optimization to correct the pose drift, based on <ref type="bibr">[35]</ref>. Graph optimization involves the following steps: 1) record the key-frames based on the motion threshold; 2) use image features for loop-closure detection to find the edges (correlations) between pairs of key-frames; 3) perform graph optimization to update all poses simultaneously.</p><p>In the optimization, the information matrix describes the correlation between parameters, and the result is the set of optimized poses. Equation (<ref type="formula">10</ref>) updates the poses of all frames at the same time. 
Once the graph optimization is done, we take the pose error of the last key-frame and apply it as a correction to the current VIO output of each image frame. In this paper, we aim to perform a metric reconstruction and superimpose the defect class on the 3D map for better visualization. For each RGB-D frame, we perform a backward-projection to register the current view measurement to the 3D world.</p></div>
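<div xmlns="http://www.tei-c.org/ns/1.0"><p>The backward-projection of a pixel with a depth measurement into the world frame can be sketched as below. This is a minimal sketch assuming a standard pinhole camera model with focal lengths <code>fx</code>, <code>fy</code> and principal point <code>cx</code>, <code>cy</code>, and a 4 &#215; 4 camera-to-world transform given as nested lists; these parameter names and the numeric values in the usage lines are illustrative assumptions, not calibration values from the paper.</p><p><![CDATA[
```python
def back_project(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Map pixel (u, v) with a metric depth to a 3D point in the world frame.

    fx, fy, cx, cy -- pinhole intrinsics (assumed model)
    cam_to_world   -- 4x4 homogeneous transform as nested lists, row-major
    """
    # Inverse intrinsics: pixel -> point in the camera frame at this depth.
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    p_cam = (x_cam, y_cam, depth, 1.0)
    # Rigid transform: camera frame -> world frame (first three rows only).
    return tuple(
        sum(cam_to_world[row][k] * p_cam[k] for k in range(4))
        for row in range(3)
    )


if __name__ == "__main__":
    # Hypothetical intrinsics and a camera translated by (1, 2, 3) in the world.
    pose = [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    print(back_project(320.0, 240.0, 2.0, 525.0, 525.0, 320.0, 240.0, pose))
```
]]></p><p>In the full system the camera-to-world transform would come from the corrected VIO pose of the frame, so that every labeled pixel lands at a consistent position in the 3D map.</p></div>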
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Spalling and Cracking Fusion Using CRF</head><p><formula notation="TeX">[X, Y, Z]^{T} = T_{c}^{w}\left(d\,K^{-1}[u, v, 1]^{T}\right)</formula></p><p>where (u, v) denotes the pixel coordinate in the image, d is its depth measurement, [X, Y, Z] is the corresponding 3D position in the world coordinate system, <formula notation="TeX">K^{-1}</formula> is the inverse camera intrinsic parameter matrix, and <formula notation="TeX">T_{c}^{w}</formula> denotes the transformation from the camera coordinate system to the world coordinate system. The output of the multi-resolution DetectionNet is defect region information, allowing the defect regions in the 3D model to be labeled with specific colors. In this paper, defect detection in an image is performed using a sliding window approach. Each sliding window defines a region bounding box for which the network outputs the corresponding class probability distribution over the classes. An important assumption we make is that every pixel in a defect region has the same probability distribution as the region itself.</p><p>To fuse a sequence of inspection results, we introduce conditional random fields (CRF) to perform spatial fusion, based on our previous work <ref type="bibr">[36]</ref>. For each image frame, the prediction on each region is performed via AdaNet. The fusion involves the following steps: 1) we build a voxel map and initialize each voxel with equal label probability; 2) each new RGB-D frame yields a new probabilistic image from the detection model, and we use CRF <ref type="bibr">[37]</ref>, <ref type="bibr">[38]</ref> to fuse the label probability distributions.</p><p>For each pixel (u, v), we first perform a warping operation to find the association between the voxel map and the current pixel, and check whether the corresponding voxel is initialized. If not, we first initialize it with an equal distribution. 
Then, with the next frame overlapping the region, we perform a warping via a general homogeneous transformation to get the voxel index in the voxel map. The warping uses the depth measurement of the pixel, the corresponding voxel in the world, the transformation from the world coordinate frame to the current view, and the warping operator that maps the current view to the world coordinate system. With the AdaNet class probability prediction for the pixel, we have the conditional probability distribution. Then, we update the global probability distribution of each voxel following a recursive Bayesian update procedure <ref type="bibr">[38]</ref>:</p><p><formula notation="TeX">P_{t}(l \mid v) = \frac{1}{Z}\, P(l \mid I_{t}, v)\, P_{t-1}(l \mid v)</formula></p><p>where <formula notation="TeX">P(l \mid I_{t}, v)</formula> denotes the probabilistic prediction for voxel <formula notation="TeX">v</formula> at time <formula notation="TeX">t</formula> using AdaNet, which is used to update the voxel's probability distribution, <formula notation="TeX">P_{t-1}(l \mid v)</formula> denotes the probability distribution at time <formula notation="TeX">t-1</formula>, and <formula notation="TeX">Z</formula> is a normalizing constant. Because the predictions for different frames are independent, the update becomes a simple element-wise product over the classes. The posterior update is performed over all visible voxels, and is finally normalized to obtain <formula notation="TeX">P_{t}(l \mid v)</formula>.</p></div>
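<div xmlns="http://www.tei-c.org/ns/1.0"><p>The per-voxel recursive Bayesian label update described above can be sketched as follows. The sketch covers only the multiply-and-normalize step on a sparse voxel map and does not implement a full CRF; the class name <code>VoxelLabelMap</code>, the voxel size, and the example probabilities are assumptions made for illustration.</p><p><![CDATA[
```python
import math

NUM_CLASSES = 3  # background, crack, spalling


def fuse(prior, prediction):
    """Element-wise product of two class distributions, then normalization."""
    posterior = [p * q for p, q in zip(prior, prediction)]
    total = sum(posterior)
    if total == 0.0:
        return list(prior)          # contradictory evidence: keep the prior
    return [p / total for p in posterior]


class VoxelLabelMap:
    """Sparse voxel map storing one class distribution per observed voxel."""

    def __init__(self, voxel_size):
        self.voxel_size = voxel_size    # edge length in meters (assumed value)
        self.labels = {}

    def index(self, point):
        """Integer voxel index of a 3D world point."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def update(self, point, class_probs):
        """Fuse one per-pixel prediction into the voxel containing `point`."""
        key = self.index(point)
        # Unseen voxels start from an equal label probability.
        prior = self.labels.get(key, [1.0 / NUM_CLASSES] * NUM_CLASSES)
        self.labels[key] = fuse(prior, class_probs)
        return self.labels[key]


if __name__ == "__main__":
    voxels = VoxelLabelMap(voxel_size=0.05)
    # Two overlapping frames both favor "crack" for the same surface point.
    voxels.update((1.01, 0.52, 2.00), [0.2, 0.7, 0.1])
    print(voxels.update((1.02, 0.53, 2.01), [0.2, 0.7, 0.1]))
```
]]></p><p>Repeated consistent observations sharpen the distribution of a voxel, while a single outlier prediction is damped by the accumulated prior, which is the practical benefit of fusing over a sequence of frames.</p></div>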
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. Experiments</head><p>In this section, we discuss the AdaNet training details and compare the experimental performance of the depth inpainting model and defect detection model. To verify the effectiveness of our system, we perform several field tests in a manual holding mode for the RGB-D camera and autonomous inspection mode using the CityFlyer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Depth In-Painting Analysis</head><p>We first perform an ablation study on the depth in-painting performance from an accuracy and time performance perspective. Table <ref type="table">II</ref> shows the results of InpaintNet compared to the raw output and in-painted result from a bilateral filter <ref type="bibr">[31]</ref>. We performed four tests with each dataset containing RGB-D frames from planar concrete surfaces. The ground truth was manually obtained by measuring the distance of the camera to the surface plane. In Table <ref type="table">II</ref>, the depth images of Cracks 1 and 3 have large holes which are not removable through a bilateral filter. InpaintNet, however, is able to achieve a more accurate and complete depth in-1 <ref type="url">https://github.com/ccny-ros-pkg/pytorch_Concrete_Inspection</ref>  painting for Cracks 1 and 3. For the depth images of Cracks 2 and 4, have small holes and can be easily filtered through a bilateral filter. A graphic comparison is given in Fig. <ref type="figure">8</ref>, where we can see that InpaintNet is able to fill the big holes, even though it may not able to give precise prediction. Also, compared with bilateral filter, InpaintNet could resolve a smoother normal estimation. The time performance between the two algorithms were compared revealing InpaintNet to be times faster compared with the bilateral approach (as illustrated in Fig. <ref type="figure">9</ref>). The runtime of InpaintNet was seconds on average with a GTX 1080 GPU for each depth frame.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Detection Model Comparative Analysis</head><p>As discussed in Section III-A, we cropped images to obtain training patches, and we made the cropped dataset publicly available to the research community (see footnote 1). The dataset has a total of 26 870 concrete crack image patches, 15 950 concrete spalling image patches, and 46 429 background image patches. We label background as 0, concrete crack as 1, and concrete spalling as 2. Representative cropped images are presented in Fig. <ref type="figure">5</ref>. All network training and testing are carried out on a GPU server with a GTX 1080 GPU and implemented using PyTorch.</p><p>1) Does Multi-Resolution Help? We conducted various comparative experiments between our multi-resolution detection model and other models, especially the F-VGG employed in <ref type="bibr">[19]</ref>. Besides VGGs, we also made comparisons with current state-of-the-art models, including ResNets <ref type="bibr">[25]</ref> and AlexNet <ref type="bibr">[27]</ref>. From the comparative results presented in Table <ref type="table">III</ref>, we can conclude that our multi-resolution model does not achieve the highest learning accuracy, but does obtain the highest testing accuracy. We also conducted a comparative study with the model used in <ref type="bibr">[19]</ref> and list the results in Table <ref type="table">IV</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Inspection of the results in Table IV reveals that our multi-resolution model achieves higher detection accuracy, 8.405% higher on average. This is also illustrated in Fig. <ref type="figure">10</ref>, which shows that the multi-resolution model outputs better coverage predictions than F-ResNet-34.</p><p>2) Does a Deeper Model Have Better Performance? Research has shown that increasing the depth of a neural network can improve the classification accuracy to a certain extent <ref type="bibr">[26]</ref>. However, the model degradation problem occurs if the model is deeper than a suitable limit. The authors of <ref type="bibr">[25]</ref> therefore introduced a deep residual network to overcome the degradation problem, allowing the performance of networks to keep improving with deeper architectures. In this section, we focus on using a well-constructed model with a suitable depth and perform fine-tuning. We do not discuss the degradation problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>96.88%</head><p>Our task is to classify three classes, and the texture differences between crack and spalling are quite distinct. However, some possible challenges are the illumination variations and an insufficient dataset. We perform comparative testing on our multi-resolution model, F-ResNets <ref type="bibr">[25]</ref>, F-VGGs <ref type="bibr">[26]</ref>, and AlexNet <ref type="bibr">[27]</ref>. For a fair comparison, we use the same batch size, number of epochs, learning rate, and loss for all models. The results are given in Table <ref type="table">III</ref>. From the table, it is clear that the deeper a model is, the higher the accuracy it can achieve. Table <ref type="table">III</ref> shows that the highest accuracy was achieved by F-ResNet-101. Another interesting finding is that F-ResNets achieve on average 1.0% higher accuracy than F-VGGs. However, deeper models cannot achieve the best detection performance if the best input cropping practice is not used.</p><p>3) Batch Normalization: In this paper, we also discuss the effect of batch normalization for neural network models. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Batch normalization was proposed to address the internal covariate shift issue and operates on each neuron to normalize its scale during training. This enables the model to converge even with larger learning rates and also removes the need for dropout. In this paper, we compare the performance of F-VGGs with and without batch normalization. The results are given in Table <ref type="table">V</ref>. The results in Table V reveal that batch normalization can improve the accuracy by 0.65% on average. This also indicates that batch normalization can improve the model performance even with less diversity in the data. A quantitative comparison of the detection accuracy in Tables III and IV shows that VGG-Nets are not able to achieve detection performance comparable to that of our multi-resolution detection model.</p></div>
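<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the normalization step concrete, the following is a minimal sketch of a training-mode batch normalization forward pass for fully connected features, written in plain Python. It is an illustration only and not the layer used in the F-VGG experiments, where PyTorch provides its own batch normalization layers; the function name batch_norm and the list-of-lists data layout are assumptions made for the example, while gamma and beta stand for the usual learnable scale and shift.</p><p>
```python
import math
from typing import List


def batch_norm(batch: List[List[float]], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature (column) over the batch, then scale and shift."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (population) variance over the batch, as used during training.
        var = sum((x - mean) ** 2 for x in column) / n
        scale = gamma[j] / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = scale * (column[i] - mean) + beta[j]
    return out
```
</p><p>With gamma set to one and beta set to zero, every output feature has approximately zero mean and unit variance over the batch, regardless of the scale of the input feature.</p></div>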
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Field Tests and Comparisons</head><p>We conducted field tests on a concrete bridge at 155 St Broadway, Upper Manhattan. We performed the inspection under the bridge using a CityFlyer mounted with an RGB-D camera. The CityFlyer also carried a MasterMind computer to perform on-board computation and to stream images to the ground station (a GPU computer for defect detection). Besides the field tests with the CityFlyer, we also manually scanned the concrete surface with the RGB-D camera.</p><p>1) Field Tests (Manual Field Test): In the first stage, we manually carried the RGB-D camera to scan the concrete surface and collect the RGB-D frames for inspection. It should be noted that the VIO system has to be launched to track the motion of the camera so that the target concrete surface can be reconstructed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="80">&#215; 200 &#215; 200</head><p>We collected three sets of data for three different scenarios, with each RGB-D frame having a location tag. Then, we performed defect inspection on each image using our deep inspector. The results are illustrated in Fig. <ref type="figure">11</ref>, where green rectangles denote spalling and cyan rectangles denote cracks. To perform detection, we deployed a sliding window to scan through the whole image with varying region sizes. We can see in the left-most and center images of Fig. <ref type="figure">11</ref> that our model is able to recognize the spalling region and the crack region. The center image further demonstrates the performance of the model, as the spalling region is distinguished from the crack region. These results show how our model can cover the whole defect area in consecutive frames and how this method is able to help civil engineers.</p><p>2) Autonomous Field Test Using CityFlyer: We also performed two sets of field tests using our CityFlyer, and the results are illustrated in Fig. <ref type="figure">10</ref>. In Fig. <ref type="figure">10</ref>, Test 1 is carried out at the entrance of the area under a bridge, and Test 2 is carried out in the middle of the area under the bridge, where the illumination is low.</p><p>For Test 1, the trajectory of the drone is illustrated in the left-most image, which shows how the CityFlyer maneuvered to capture the target area. The defect inspection result is illustrated in the second image from the left, where cyan and green rectangles denote crack and spalling, respectively. The right-most image is the front view of the 3D map (point cloud) with color overlaid on the defects, and the second image from the right shows the same point cloud viewed from the back. We can see that the spalling and cracks are well highlighted.</p><p>The second test was carried out under the bridge, where low illumination affects both inspection and localization. The trajectory of the drone is given in the left-most image of Test 2, and the inspection results at this location are given in the second image from the left and the second image from the right. The second image from the left indicates that our model is able to detect spalling correctly even in a low-illumination environment. However, the second image from the right indicates that it missed a crack region (indicated with a red dashed rectangle) due to low illumination.</p><p>3) Semantic 3D Fusion and Visualization: The semantic 3D highlighted results are illustrated in Fig. <ref type="figure">10</ref>. We back-project the predicted output, together with the corresponding depth image, to the 3D world coordinate frame, and fuse the 3D spatial data over consecutive frames. We use a voxel map to represent the 3D structure information, where each voxel is updated through back-projection. Since we deploy an image-based fusion approach, searching a global probabilistic map is not required, which enables non-GPU computation. The reconstructed 3D map with semantically highlighted areas is illustrated in the right-most images of Fig. <ref type="figure">10</ref>. It can be seen in the figure that the defect regions are well highlighted in green and cyan. This helps civil engineers identify the defect categories as well as their locations.</p></div>
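<div xmlns="http://www.tei-c.org/ns/1.0"><p>The back-projection step used for the 3D fusion can be illustrated with a short Python sketch. It is a simplified example under stated assumptions rather than the system's implementation: it assumes an ideal pinhole camera with hypothetical intrinsics fx, fy, cx and cy, a 4 x 4 camera-to-world pose (the inverse of a world-to-view transformation), and a regular voxel grid anchored at the world origin. The function names are introduced only for this example.</p><p>
```python
import math
from typing import List, Tuple

Matrix4 = List[List[float]]
Point3 = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Pinhole back-projection of pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point_cam: Point3, cam_to_world: Matrix4) -> Point3:
    """Apply a 4x4 homogeneous camera-to-world transform to a 3D point."""
    homogeneous = (point_cam[0], point_cam[1], point_cam[2], 1.0)
    wx, wy, wz = (
        sum(cam_to_world[r][c] * homogeneous[c] for c in range(4))
        for r in range(3)
    )
    return (wx, wy, wz)


def voxel_index(point_world: Point3, voxel_size: float) -> Tuple[int, int, int]:
    """Index of the voxel that contains a world point (grid anchored at the origin)."""
    ix, iy, iz = (int(math.floor(coord / voxel_size)) for coord in point_world)
    return (ix, iy, iz)
```
</p><p>Chaining the three functions for every pixel with a valid depth gives the voxel that should receive the class prediction of that pixel, so only voxels visible in the current image are touched and no search over the global map is needed.</p></div>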
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. Conclusion</head><p>In this paper, we introduced a new automatic concrete structure inspection system using the CityFlyer robot mounted with an RGB-D camera for visual inspection. For visual concrete inspection, we introduced AdaNet to detect defects with a sliding window approach. AdaNet consists of two sub-models: a depth in-painting model (InpaintNet) that fills holes in a depth image, and a multi-resolution defect detection model for concrete inspection. The depth-adaptive multi-resolution detection model considers both distance and resolution effects, aiming to provide robust concrete crack and spalling detection in the field. It achieves on average 8.41% higher detection accuracy than F-VGG and F-ResNets. Meanwhile, we propose using visual SLAM and deep neural network inspection to perform a 3D semantic reconstruction that highlights the defects in a 3D model. Furthermore, we introduce an RGB-D visual-inertial fusion with filtering and global bundle adjustment to perform pose estimation for the CityFlyer state control. The pose information is used to provide location tags for the defects predicted in images. Comparative experiments and field tests indicate that the system is able to perform high-quality detection and reconstruction. For future work, we will try optimal tuning of the hyperparameters of the proposed models via intelligent optimization methods <ref type="bibr">[39]</ref> and also work on pixel-level detection toward metric reconstruction.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 7, NO. 4, JULY 2020</p></note>
		</body>
		</text>
</TEI>
