<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>RawSeg: Grid Spatial and Spectral Attended Semantic Segmentation Based on Raw Bayer Images</title></titleStmt>
			<publicationStmt>
				<publisher>British Machine Vision Conference</publisher>
				<date when="2023-11-28">28 November 2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10542666</idno>
					<idno type="doi"></idno>
					
					<author>Guoyu Lu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Major semantic segmentation approaches are designed for RGB color images, which are interpolated from raw Bayer images. The use of RGB images on the one hand provides abundant scene color information; on the other hand, RGB images are easily observable for human users to understand the scene. The RGB color continuity also facilitates researchers in designing segmentation algorithms, although this becomes unnecessary in end-to-end learning. More importantly, the use of 3 channels adds extra storage and computation burden for neural networks. In contrast, raw Bayer images preserve the primitive color information to the largest extent with just a single channel. The compact design of the Bayer pattern not only potentially yields higher segmentation accuracy by avoiding interpolation, but also significantly decreases the storage requirement and computation time in comparison with standard R, G, B images. In this paper, we propose RawSeg-Net to segment single-channel raw Bayer images directly. Different from RGB color images, which already incorporate neighboring context information during ISP color interpolation, each pixel in a raw Bayer image does not contain any context clues. Based on Bayer pattern properties, RawSeg-Net assigns dynamic attention to Bayer images' spectral frequencies and spatial locations to mitigate classification confusion, and proposes a re-sampling strategy to capture both global and local contextual information. We demonstrate the usability of raw Bayer images in segmentation tasks and the efficiency of RawSeg-Net on multiple datasets.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Figure <ref type="figure">1</ref>: RawSeg-Net is able to achieve accurate scene segmentation from 8-bit (single channel) Bayer pattern image (left). The middle image is the result of the left image overlaid with color Bayer pattern for better observation. Our method precisely segments moving objects (pedestrians, cars) and object boundaries (buildings, roads) (right).</p><p>Scene segmentation is a fundamental and challenging topic in computer vision with a wide range of applications, such as in autonomous driving, augmented reality, medical imaging, etc <ref type="bibr">[1]</ref> [26] <ref type="bibr">[11]</ref>. The vast majority of current semantic segmentation algorithms take the 3-channel color images after image signal processor (ISP) pipelines as inputs. To output RGB color images, the ISP pipeline will consume extra time from raw Bayer images and may damage or lose primitive pixel information captured by the raw camera sensor due to the operations like demosaicing, exposure adjustment, and many other middle processes in ISP <ref type="bibr">[10]</ref>. Raw Bayer images contain all the necessary color and intensity gradient information in a single channel, making them an efficient source for RGB images. They save up to 67% of image storage Figure <ref type="figure">2</ref>: Overview of our proposed RawSeg-Net. Multiple 8-bit raw Bayer images with different grid sizes are input to the backbone to extract low-level features. We further deploy ASPP module to extract the contextual information, introduce spatial coordinate attention to focus on each grid coordinate, and utilize spectral frequency attention to focus on each split spectrum. 
By concatenating different weighted feature maps and fusing the class maps from different grid sizes, the raw Bayer image can be accurately segmented.</p><p>space and can potentially increase the image processing pipeline by eliminating the ISP process, which is a significant time-consuming step. By processing single-channel images, the computation burden and neural network complexity can also be reduced. Therefore, raw Bayer images have several advantages, including completeness and accuracy of color information, efficient storage, fast processing speed, and reduced network complexity. The widely used Bayer pattern, arranged in a repeated 2 &#215; 2 matrix grid containing one red component, one blue component, and two green components, is typically used to generate raw Bayer images.Despite the numerous benefits of raw Bayer images over RGB images, there is currently a shortage of segmentation algorithms specifically designed for Bayer patterns.</p><p>In this work, we demonstrate the usability of raw Bayer images on scene segmentation tasks and propose a semantic segmentation network designated for raw Bayer image RawSeg-Net in order to accurately segment raw Bayer images, as Fig. <ref type="figure">1</ref>. Unlike RGB color images that maintain neighboring contextual information during ISP color interpolation, raw Bayer images' pixels miss the context clues from neighboring locations from spectral and spatial perspectives. Therefore, to effectively utilize Bayer pattern, we explore a spatial coordinate attention mechanism to accurately allocate attention weights to each specific pixel by aggregating diverse feature maps and spectral frequency attention to capture different light wavelengths and high frequency details contained in the raw Bayer image. 
As scene images are commonly composed of objects of various sizes (e.g., buildings as large structures and traffic signs as fine structures), we compose the Bayer grids into different sizes (e.g., one composed grid contains 4 small grids with the same color) and fuse the segmentation outputs obtained with different grid sizes, benefiting from the Bayer grid pattern to precisely segment objects at various scales. With convolution kernels dedicated to Bayer patterns, RawSeg-Net can capture spatial and spectral features at various grid sizes to realize precise segmentation based on raw images. Our method is detailed in Fig. <ref type="figure">2</ref>.</p><p>To sum up, this paper makes the following contributions.</p><p>1) We demonstrate that single-channel raw Bayer images are highly suitable for image segmentation tasks, offering advantages such as reduced storage requirements, faster image processing, and less complex neural networks. 2) We propose novel spatial coordinate attention and spectral frequency attention mechanisms designed specifically for Bayer images, allowing for highly accurate semantic segmentation. 3) We introduce a fusion strategy that leverages different grid sizes of the Bayer pattern to effectively segment objects of varying scales.</p></div>
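The grid composition and scale fusion above are described only in prose, so the sketch below encodes one reading of them: each original 2 × 2 RGGB block becomes a single-color cell of a coarser Bayer pattern, and the two scale-specific predictions are blended with a dynamically learned mask M as M·Y1 + (1 − M)·Y2 (Sec. 3.4). Function names are ours, not from the paper.

```python
import numpy as np

def compose_larger_bayer_rggb(bayer: np.ndarray) -> np.ndarray:
    """Re-compose a same-size mosaic whose Bayer cells are 2x2 pixels each.

    One reading of the paper's prose: each original 2x2 RGGB block becomes a
    single-color cell; the cell's color follows a block-level RGGB pattern,
    and its value is that block's sample of the chosen color.
    """
    H, W = bayer.shape
    out = np.empty_like(bayer)
    offset = {"R": (0, 0), "G": (0, 1), "B": (1, 1)}  # sample positions in RGGB
    block_pattern = [["R", "G"], ["G", "B"]]
    for a in range(H // 2):
        for b in range(W // 2):
            di, dj = offset[block_pattern[a % 2][b % 2]]
            out[2 * a:2 * a + 2, 2 * b:2 * b + 2] = bayer[2 * a + di, 2 * b + dj]
    return out

def fuse_scales(M: np.ndarray, y_seg1: np.ndarray, y_seg2: np.ndarray) -> np.ndarray:
    """Dynamic-mask fusion of the two scale-specific predictions."""
    return M * y_seg1 + (1.0 - M) * y_seg2

rng = np.random.default_rng(0)
bayer = rng.integers(0, 256, (8, 8)).astype(np.uint8)
big = compose_larger_bayer_rggb(bayer)
assert big[0, 0] == big[1, 1] == bayer[0, 0]  # top-left cell is all red samples
```

The fusion step recovers either scale's prediction exactly where the mask saturates to 0 or 1, which is what lets the network favor the fine grid on small structures and the coarse grid on large ones.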
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Semantic Segmentation: Benefiting from the successful usages of deep Convolutional Neural Networks (CNNs) <ref type="bibr">[21]</ref> [25] <ref type="bibr">[15]</ref> [3] <ref type="bibr">[16]</ref>, semantic segmentation has achieved significant improvement towards understanding a complex scene. Fully Convolutional Networks (FCN) <ref type="bibr">[21]</ref> first applied a fully convolutional network in semantic segmentation tasks. Following FCN, extensive research was proposed based on the FCN architecture, such as UNet <ref type="bibr">[25]</ref>, SegNet <ref type="bibr">[2]</ref>, PSPNet <ref type="bibr">[15]</ref> and DeepLab-based <ref type="bibr">[3] [4]</ref> works. Recently, PSANet <ref type="bibr">[37]</ref> proposed a point-wise attention network to learn attention for each feature map position for scene parsing. HRNet <ref type="bibr">[32]</ref> started from a high resolution convolution stem and gradually added high-to-low resolution blocks. In addition to CNN features based on color and texture information for segmentation, depth information is also applied to support segmentation tasks <ref type="bibr">[22,</ref><ref type="bibr">23]</ref>. Large models, like SAM <ref type="bibr">[13]</ref>, are also proposed for segmentation tasks. Existing semantics segmentation schemes are mainly designed for RGB color images without focusing on raw Bayer images, which are the source of RGB images. Context Attention: Contextual information is critical in various vision-based tasks such as semantic segmentation. An increasing number of works have explored contextual dependencies and context-weighted information, especially attention mechanisms. Different strategies are proposed to explore long-term attention dependencies <ref type="bibr">[31]</ref> [28] <ref type="bibr">[33] [6]</ref>. Wang et al. 
<ref type="bibr">[33]</ref> presented a self-attention module with non-local operations to capture long-range dependencies in spatial-temporal dimensions to process videos and images. DANet <ref type="bibr">[6]</ref> applied a dual-attention strategy to combine information from the input images and the final feature maps. Different from attention mechanisms commonly applied to RGB color images, this paper focuses on the affluent contextual relationships contained in the Bayer patterns to better capture the shape and spectral information explicitly existing in raw Bayer images.</p><p>Bayer Pattern: Most of the works using Bayer Color Filter Array (CFA) are designed for image demosaicing, which is to interpolate the vacant red, green and blue values in the raw Bayer pattern images to restore 3-channel RGB color images <ref type="bibr">[17]</ref>  <ref type="bibr">[35]</ref> [24] <ref type="bibr">[20]</ref>. Various clues have been investigated to interpolate RGB color information, such as color difference <ref type="bibr">[5]</ref>, edge direction <ref type="bibr">[14]</ref> and image reconstruction <ref type="bibr">[27]</ref>. Deep learning approaches have also been applied in image demosaicing <ref type="bibr">[30]</ref> [29] <ref type="bibr">[19]</ref>. In particular, Liu et al. <ref type="bibr">[19]</ref> proposed a self-guidance network to use an initially estimated green channel as guidance to recover all missing values in the input image. Another typical application for Bayer images is image restoration. Bayer images have also bee applied to object detection tasks <ref type="bibr">[7]</ref>. Zhou et al. <ref type="bibr">[39]</ref> proposed to restore images from the raw Bayer domain. However, Bayer images have rarely been applied to image segmentation tasks, mainly because Bayer images are not convenient for human eyes to observe.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Raw Bayer Image Segmentation Framework</head><p>RawSeg-Net is specially designed for raw Bayer images to map to pixel-level class annotations. The introduction of dynamic attention mechanisms on the Bayer pattern helps coordinate and split spectral wavelengths under multiple Bayer grid sizes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Raw Bayer Pattern</head><p>Most commercial digital cameras have a single CCD/CMOS sensor that captures the intensity of light, but not its color wavelength. To produce color information, the sensor is overlaid with a Bayer "color filter array" (CFA), which filters the captured pixels and produces different spectral channels. This results in a raw Bayer image I bayer , which is an image mosaic. To recover the full RGB color S from the separate spectral channels S R , S G and S B , where S = S R S G S B , S B and S R each occupy a quarter of all pixels, and S G occupies half of all image pixels arranged in a quincunx lattice. Fig. <ref type="figure">3</ref> shows the zoom-in details of a 20 &#215; 20 region in the captured raw Bayer image by a single CCD sensor equipped with a Bayer pattern filter, as well as a rendering illustration where each sample point is plotted with Bayer color. Demosaicing methods are typically used to interpolate missing color information and recover the full RGB color image from a raw Bayer image. However, in scenes with high contrast and constantly changing colors or objects, demosaicing may result in the loss of details and introduce color artifacts like bleeding and zippering. Furthermore, post-processing stages such as demosaicing can be computationally expensive, which makes raw Bayer images a more cost-effective option for end-to-end semantic segmentation. In contrast to RGB images, raw Bayer images preserve the most primitive color information, making them ideal for semantic segmentation. The Bayer CFA used in typical post-processing steps is illustrated in Fig. <ref type="figure">4</ref>.  Effective utilization of color information is crucial for various computer vision tasks, including segmentation and detection, as it provides a wider spectral perception field with multiple color channels. 
To strengthen the features that encode spectral information and reduce the impact of ineffective features, it is essential to recalibrate them. This is particularly relevant for Bayer images that contain only a single color channel per pixel, as all three RGB channels have already been interpolated from neighboring pixels during the ISP process, which encodes contextual information in the image. However, this contextual information is not encoded in the raw Bayer images. To learn spectral at-tention from the context information, we propose to decompose the image into frequency spectra using Discrete Cosine Transform (DCT), which has been largely used in image and video compression applications. The DCT representation expresses an image as a sum of sinusoids at varying magnitudes and frequencies. Given an input image x of size M &#215; N, the 2D DCT spectrum B &#8712; R M&#215;N is obtained as:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Spectral Frequency Attention Block</head><p>where a 0,0 is 1 &#8730; MN , which corresponds to the lowest frequency component in the left top regions of Fig. <ref type="figure">5</ref>. a p,q is 2 &#8730; MN for all other frequency components of the 2D DCT. B M-1,N-1 corresponds to DCT coefficients of the highest frequency in bottom right regions of Fig. <ref type="figure">5</ref>.</p><p>Given the input feature map X &#8712; R C&#215;H&#215;W , DCT coefficients A &#8712; R F&#215;C&#215;H&#215;W are computed for the selected F frequency components. Reshaping X to 1 &#215;C &#215; H&#215; W , conducting element-wise multiplication with A, and summarizing the output across the spatial coordinate, the embedded frequency matrix will be D &#8712; R C&#215;J , J = H &#215; W. The embedding is then forwarded to choose the maximum frequency response per channel via max pooling. The final weighted feature map output Y is generated by a fully connected layer and sigmoid activation, as shown in Fig. <ref type="figure">6</ref>.  To extract smooth and continuous segmentation boundaries, a larger spatial perception field covering locations with salient and continuous color information is necessary for raw Bayer images where neighboring pixels do not have continuous color changes. To address this, we introduce a spatial attention module, as shown in Fig. <ref type="figure">7</ref>. The raw Bayer image is composed of grids of light-sensitive cells, and the spatial attention block enhances a wide range of contextual information into the local Bayer point. To explore spatial attention in raw Bayer images, we process the input feature X &#8242; &#8712; R C&#215;H&#215;W separately with global average pooling (GAP) and global max pooling (GMP) along the feature channels, and aggregate the results for concatenation. This process is expressed as:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Spatial Attention Module</head><p>where the output of the Pool_block (X &#8242; ) is in a tensor of shape 2 &#215; H &#215;W . The output is then followed by a 5 &#215; 5 convolutional layer and a batch normalization layer. The output is then passed through a sigmoid activation layer (&#8226;) to generate a 1 &#215; H &#215; W attention map. The final weighted attention output is element-wisely multiplied with the original input feature map as:</p><p>where the output feature map Y &#8242; shares the same dimension as the input feature map X &#8242; as C &#215; H &#215; W . With the introduced spectral frequency attention (SFA) and spatial coordinate attention (SCA) blocks, the extracted features from the backbone are able to adaptively emphasize on both Bayer spectrum and coordinate. The SFA and SCA are concatenated together to several convolution layers to generate the final pixel-level estimation map, as shown in Fig. <ref type="figure">8</ref>. The overall objective function for training the RawSeg-Net is a combination of the normal crossentropy loss (between the estimated segmentation output y and the ground truth label y ) and the RMI loss <ref type="bibr">[38]</ref> (between the estimated probability of segmentation labels and the probability of the ground truth labels) as:</p><p>where L ce is the per-pixel cross-entropy loss, and I l (Y, &#7928;) denotes the lower bound of the mutual information of estimated and the ground truth variables. L seg (y, &#7929;) is designed to simultaneously minimize the dissimilarity and maximize the lower bound of the mutual information to enable the estimated segmentation map to achieve high-order consistency with the ground truth segmentation map. Considering the raw Bayer image is composed of pixel level lightsensitive cells, a more focused strategy for processing multi-level grids is essential for the final segmentation output. 
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Multi-grid Re-sampling Strategy</head><p>In contrast with <ref type="bibr">[36]</ref> [8] [9], which use different sampling operations or pyramid pooling to obtain a multi-scale representation, we instead opt for re-sampling the original grid size to a larger-size Bayer grid. More specifically, we group the pixels in each original 2 × 2 Bayer pattern together and assign the same color to the newly composed larger Bayer grid, so that the entire Bayer image is composed of Bayer patterns of larger size. The individual pixel-wise segmentation estimations are then combined with dynamically learned scale-aware weights, followed by a pixel-wise summation to generate the final refined segmentation. The re-sampling strategy for the raw Bayer image and the dynamic-weight-guided output refinement steps are depicted in Fig. <ref type="figure">9</ref>. With this strategy, we observe that the final refined output performs better than the estimation from the original Bayer input on large structures, such as building boundaries and pedestrian roads, and better than the estimation from the re-sampled Bayer input on fine structures, such as lamp poles and traffic signs. With the re-sampling and dynamic weighting strategy, the refined output benefits from both global and local contextual information, as demonstrated in Fig. <ref type="figure">10</ref>.</p><p>The final loss is a combination of L_seg from the raw Bayer image and the re-sampled Bayer image. Assuming the dynamic attention mask and the pixel-wise segmentation output for the input Bayer image are M and Y_seg1, and the corresponding attention mask and segmentation output for the re-sampled Bayer image are 1 − M and Y_seg2, the refined segmentation output can be formulated as:</p><p><formula>Y_ref = M ⊗ Y_seg1 + (1 − M) ⊗ Y_seg2</formula></p><p>Therefore, the final refined objective function L_final is based on Eq. 
4 and is updated to combine the L_seg terms from both the original and the re-sampled Bayer inputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>We evaluate our proposed framework on three datasets: Cityscapes, Mapillary, and a dataset we collected using a NIKON-D3500 digital camera. Cityscapes is a high-resolution dataset consisting of around 5,000 images with pixel-level segmentation annotations for 19 classes, including road, sidewalk, building, person, car, and more. We used the reverse-engineered RGB color images published with the dataset and converted them to 8-bit Bayer images for training and evaluation. The training, validation, and testing sets contain 2,975, 500, and 1,525 images, respectively. We used the most common RGGB Bayer pattern, extracting one channel for each 3-channel pixel in the order of the Bayer pattern. We also collected a real raw Bayer image dataset using a NIKON-D3500 camera. The dataset has the same class categories as Cityscapes, and its partition details are shown in Table <ref type="table">1</ref>. Fig. <ref type="figure">11</ref> displays some samples of the collected raw Bayer images and their corresponding pixel-wise label annotations. Finally, we evaluated our method on Mapillary and our collected dataset with and without re-training. We use a ResNet-50 based network (configured with a stride of 2 and a convolution kernel of 2 × 2 to adapt to the Bayer pattern) and an ASPP module as the backbone for feature extraction. A warm-up of 10 steps and the poly learning rate policy <ref type="bibr">[36]</ref>, which decays the initial learning rate by multiplying it by (1 − iter/total_iter)^0.9, are adopted to help training converge efficiently. Stochastic gradient descent (SGD) with a batch size of 4 is used, and the initial learning rate is set to 5e-3. Limited by GPU memory, we resize images to 1024 × 512 for all experiments. Data augmentation with random horizontal flips and color transforms in brightness, contrast, hue, and saturation is applied. 
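The learning-rate schedule just described can be written down directly; the linear warm-up form below is our assumption, since the paper only states that 10 warm-up steps are used.

```python
def poly_lr(step, total_steps, base_lr=5e-3, power=0.9, warmup_steps=10):
    """Poly policy from the paper: lr = base_lr * (1 - iter/total_iter)^power,
    preceded by a linear warm-up over the first few steps (warm-up form assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (1.0 - step / total_steps) ** power

# The rate rises linearly during warm-up, then decays polynomially to zero.
schedule = [poly_lr(s, 1000) for s in range(1000)]
assert max(schedule) == 5e-3  # peak reached at the end of warm-up
```

The power of 0.9 makes the decay nearly linear but slightly front-loaded, a common choice for segmentation training.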
Additionally, we re-train the state-of-the-art methods <ref type="bibr">[18]</ref> [4] <ref type="bibr">[12]</ref> [6] <ref type="bibr">[37] [32] [34]</ref> on the same Bayer image datasets for fair comparison.</p></div>
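The RGGB channel extraction described above, keeping one color sample per pixel in Bayer-pattern order, can be sketched as follows; the function name is ours. A single channel keeps one third of the RGB samples, which matches the up-to-67% storage saving claimed in the introduction.

```python
import numpy as np

def rgb_to_bayer_rggb(rgb: np.ndarray) -> np.ndarray:
    """Simulate a single-channel RGGB Bayer mosaic from an H x W x 3 RGB image.

    Each output pixel keeps exactly one of the three color samples,
    following the repeating 2x2 pattern:  R G
                                          G B
    """
    h, w, _ = rgb.shape
    bayer = np.empty((h, w), dtype=rgb.dtype)
    bayer[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    bayer[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    bayer[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    bayer[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return bayer

rgb = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
mosaic = rgb_to_bayer_rggb(rgb)
assert mosaic.shape == (4, 4)
assert mosaic.nbytes * 3 == rgb.nbytes  # one channel instead of three
```

Note that this simulation starts from already-demosaiced RGB data, so, unlike a真 sensor readout, it inherits whatever interpolation the original ISP applied; the paper mitigates this by using the reverse-engineered Cityscapes images.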
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Visual and Quantitative Analysis</head><p>We report the mean Intersection over Union (IoU) of each specific category on the simulated Cityscape dataset in Table <ref type="table">2</ref> to compare the proposed method with recent state-of-the-art approaches. Our method outperforms <ref type="bibr">[6]</ref> [37] significantly on large objects such as road and sky while improving the accuracy on relatively tiny objects like pole and bike by a large margin compared to <ref type="bibr">[32]</ref> and <ref type="bibr">[34]</ref>. This performance enhancement is mainly attributed to the design of spectral frequency attention, spatial coordinate attention, and multi-grid resampling strategy that consider both global context information and local pattern shapes. Furthermore, our method achieves high accuracy while significantly increasing the time performance, with a speed of 8.7 fps compared to 1.8-5.6 fps for other methods.</p><p>Method Backbone mIoU(%) Mapillary Dataset (without train) Deeplab-V3[4]. ResNet-50 55.8 HRNetV2-W48[32]. ResNet-50 56.9 DNLNet[34]. ResNet-50 58.2 Ours ResNet-50 59.3 Ours ResNet-101 60.1 Our Collected Dataset (without train) Deeplab-V3[4]. ResNet-50 29.7 HRNetV2-W48[32]. ResNet-50 33.6 DNLNet[34]. ResNet-50 31.2 Ours ResNet-50 42.7 Ours ResNet-101 44.9</p><p>Table 3: Results on Mapillary and our collected datasets.</p><p>We also present a qualitative evaluation of the segmentation results in Fig. <ref type="figure">12</ref>, which shows that our method outperforms other state-of-the-art methods in terms of maintaining accurate segmentation boundaries and preserving object shapes, particularly in comparison with <ref type="bibr">[34]</ref> on road and persons. 
Furthermore, even without re-training on the real collected dataset, our method produces significantly better segmentation results than the other methods, demonstrating the suitability and generalizability of the proposed framework in real-world applications.</p><p>We further validate the performance of the proposed network and test its generalization to different scenes by evaluating it on the Mapillary dataset in Table <ref type="table">3</ref>. Our method achieves about 3.5% mIoU improvement over <ref type="bibr">[4]</ref> with the same ResNet-50 backbone and 4.3% improvement with the ResNet-101 backbone. The improvement is even more significant on the collected dataset, where our method achieves the top performance of 42.7% mIoU, which is 43.7% higher than <ref type="bibr">[4]</ref> and 36.9% higher than <ref type="bibr">[34]</ref> in relative terms.</p><p>Figure <ref type="figure">12</ref>: Qualitative results on Cityscapes (top) and our collected dataset (bottom). For each dataset, from left to right: input raw image; input image with the Bayer pattern overlaid for better observation; our result; results from HRNetV2-W48 <ref type="bibr">[32]</ref> and DNLNet <ref type="bibr">[34]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ablation Study</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Backbone mIoU (%) &#948; (%) <ref type="bibr">[12]</ref> ResNet-50 76. Ablation study on network structures: we investigate the impact of different backbone settings on segmentation performance. As shown in Table <ref type="table">4</ref>, we first evaluate a naive structure without any proposed components, achieving mIoU scores of 75.2% and 77.0% with ResNet-50 and ResNet-101 backbones, respectively. After incorporating the introduced components, our method achieves a significant 5.1% and 4.6% improvement over the naive structure. While changing from ResNet-50 to ResNet-101 backbone only brings a 1.5% improvement for the compared method <ref type="bibr">[12]</ref>, the proposed method gains a 1.3% increase, indicating the improvement mainly comes from components specifically designed for Bayer images rather than deeper network structures.</p><p>Ablation study on network inputs: Table <ref type="table">5</ref> compares the accuracy and computation cost of our proposed method with different types of inputs. Our method achieves higher mIoU performance on raw Bayer images compared to grayscale images, despite both having 8-bit channels. This indicates that the proposed network effectively learns spatial, spectral, and shape information from the Bayer pattern.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Type mIoU (%) &#948; (%) Params (M)</p><p>RefineNet <ref type="bibr">[18]</ref> Gray Table 6: Ablation study on the effect of each component and loss.  The effects of different modules are also illustrated in Fig. <ref type="figure">13</ref>, which demonstrates that with the introduced attention modules and re-sampling strategy, some misclassified categories such as trucks, poles, and traffic signs can be corrected, and object boundaries and details such as cars and persons are clearer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We proposed RawSeg-Net, an end-to-end semantic segmentation network designed to segment raw Bayer images, enabling the elimination of the ISP process in image generation. Our approach uses 8-bit raw Bayer images, leading to large storage reductions and computational efficiency improvements. By introducing Bayer spectral frequency and spatial coordinate attention, as well as a multi-grid re-sampling strategy, we improved segmentation accuracy by combining local and global context information, offering a promising solution for efficient and accurate semantic segmentation of raw Bayer images.</p><p>Ack: This paper is supported by NSF Awards No. 2334624, 2334690, and 2334246.</p></div></body>
		</text>
</TEI>
