<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SWIN-SFTNET: SPATIAL FEATURE EXPANSION AND AGGREGATION USING SWIN TRANSFORMER FOR WHOLE BREAST MICRO-MASS SEGMENTATION</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>April 2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10398748</idno>
					<idno type="doi"></idno>
					<title level='j'>IEEE International Symposium on Biomedical Imaging</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>S. A. Kamran</author><author>K. F. Hossain</author><author>A. Tavakkoli</author><author>G. Bebis</author><author>S. Baker</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Incorporating various mass shapes and sizes in training deep learning architectures has made breast mass segmentation challenging. Moreover, manual segmentation of masses with irregular shapes is time-consuming and error-prone. Although deep neural networks have shown outstanding performance in breast mass segmentation, they fail to segment micro-masses. In this paper, we propose a novel U-Net-shaped transformer-based architecture, called Swin-SFTNet, that outperforms state-of-the-art architectures in breast mammography-based micro-mass segmentation. First, to capture the global context, we design a novel Spatial Feature Expansion and Aggregation (SFEA) block that transforms sequential linear patches into structured spatial features. Next, we combine these with the local linear features extracted by the swin transformer blocks to improve overall accuracy. We also incorporate a novel embedding loss that calculates similarities between linear feature embeddings of the encoder and decoder blocks. With this approach, we achieve a higher segmentation dice score than the state-of-the-art by 3.10% on CBIS-DDSM, 3.81% on InBreast, and 3.13% with the CBIS pre-trained model on the InBreast test data set.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Breast cancer is one of the most dominant cancer types in the world, and mammography has been acknowledged as a vital tool for its early detection. However, asymmetrical shapes, microcalcifications, and small masses complicate automated breast mass segmentation. Additionally, most computer-aided diagnosis (CAD) systems rely on traditional image-processing-based approaches, which are quite error-prone and require manual intervention. Recently, machine learning and deep learning approaches have outperformed these conventional methods <ref type="bibr">[1]</ref> and have become a popular choice for such tasks. Nonetheless, most CAD tools still rely on manually extracted suspicious regions or segments from low-resolution images, and fail to segment micro-masses with accurate contours and high probability.</p><p>Deep neural networks have shown excellent performance in medical image segmentation. Popular networks like U-Net <ref type="bibr">[2]</ref>, FCN <ref type="bibr">[3]</ref>, AUNet <ref type="bibr">[4]</ref>, and ARF-Net <ref type="bibr">[5]</ref> have demonstrated outstanding results for breast mass segmentation from mammography images. These networks implement diverse methods, such as multi-scale feature maps, attention-guided dense upsampling, and additive channel attention, to learn robust feature maps and segment tumors of different sizes with dice scores above 85%. However, the dice score of these systems falls to 5-15% when applied to images with micro-masses.</p><p>One reason for the failure of CNN-based approaches on micro-masses is that they overtly focus on global semantic information. To address such problems, the Vision Transformer (ViT) <ref type="bibr">[6]</ref> was proposed to prioritize local patch-level information. 
Taking 2D image patches with positional embeddings as input, Vision Transformers have outperformed prior approaches on most medical imaging downstream tasks <ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref>. Recently, Swin-UNet has achieved phenomenal results in segmenting organs such as the gallbladder, spleen, and liver. Although Swin-UNet can capture local information correctly for precise boundary segmentation of organs, each of these organs is unique in shape and is not surrounded by similar-looking artifacts. One of the primary problems in segmenting micro-masses in the breast is that surrounding fatty tissue can throw off the model's segmentation boundary and may raise the false-positive rate as well. To address the above issues, we propose a novel transformer network, the Swin Spatial Feature Transformer Network (Swin-SFTNet), and a novel embedding similarity loss, achieving a segmentation dice improvement over the state-of-the-art of 3.10%, 3.81%, and 3.13% on CBIS-DDSM <ref type="bibr">[10]</ref>, InBreast <ref type="bibr">[11]</ref>, and CBIS pre-trained on the InBreast dataset, respectively. Our main contributions are: (1) employing the Swin Transformer as a basic building block to create Swin-SFTNet, which incorporates spatial global and sequential local context information in a multi-scale feature fusion configuration; (2) designing a novel Spatial Feature Expansion and Aggregation block that converts sequential linear patches into structured spatial features, capturing global context information for better micro-mass segmentation; and (3) utilizing a novel embedding loss that calculates similarities between features of the encoder and decoder blocks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHODOLOGY</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Overall Architecture</head><p>The overall architecture of our proposed Swin-SFTNet is illustrated in Fig. <ref type="figure">1</ref>. Swin-SFTNet incorporates an encoder, a decoder, three skip connections between the encoder and decoder, and three parallel SFEA blocks, each followed by a patch-extract and patch-embedding layer before concatenation with the output feature map. Our architecture is an enhanced version of Swin-UNet <ref type="bibr">[7]</ref>, a UNet-like auto-encoder that replaces regular convolution layers with Swin-Transformer blocks <ref type="bibr">[12]</ref>. We first transform the grayscale breast mammography images into RGB, enabling the model to learn essential features. We utilize a patch-embedding layer to transform the input into non-overlapping patches of size 4 &#215; 4. So for three RGB channels, we get a depth dimension of 4 &#215; 4 &#215; 3 = 48. Next, we utilize a dense layer to project the feature dimension into an arbitrary dimension C. Following this layer, we have our encoder blocks, each consisting of two successive swin-transformer blocks and a patch-merging layer. We explain the swin-transformer block in Subsection 2.2. We repeat the encoder blocks three times to successively downsample the feature dimensions from H/4 &#215; W/4 &#215; C to H/32 &#215; W/32 &#215; 8C. To conclude the encoder, we utilize two swin-transformer blocks after the last patch-merging layer.</p><p>Similar to the encoder, we design a symmetric decoder composed of multiple Swin Transformer blocks and patch expanding layers. Each decoder block is concatenated with the skip-connection features from the encoder at the same spatial dimension. As a result, we avoid any loss of spatial information due to successive downsampling. In contrast to the patch merging layer, the patch expanding layer reshapes the feature maps with 2&#215; up-sampling of the spatial dimension. Additionally, it utilizes convolution to halve the depth dimension. 
We repeat the decoder blocks three times to successively upsample the feature dimensions from H/32 &#215; W/32 &#215; 8C back to H/4 &#215; W/4 &#215; C.</p></div>
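The patch-embedding step described above can be illustrated with a short NumPy sketch: an RGB image is cut into non-overlapping 4 &#215; 4 patches, each flattened to a 4 &#215; 4 &#215; 3 = 48-dimensional vector and projected to an arbitrary dimension C by a dense layer. The projection weights here are random placeholders standing in for the trained layer.

```python
import numpy as np

def patch_embed(image, patch=4, C=128, rng=np.random.default_rng(0)):
    """Split an (H, W, 3) image into non-overlapping patch x patch patches,
    flatten each to a patch*patch*3 vector, and project to C dimensions."""
    H, W, ch = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, ch) -> (H/p, W/p, p, p, ch) -> (num_patches, p*p*ch)
    x = image.reshape(H // patch, patch, W // patch, patch, ch)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * ch)
    W_proj = rng.standard_normal((patch * patch * ch, C))  # placeholder dense layer
    return x @ W_proj  # (num_patches, C)

tokens = patch_embed(np.zeros((256, 256, 3)), patch=4, C=128)
print(tokens.shape)  # (4096, 128): a 64 x 64 grid of patches, each projected to C
```

For a 256 &#215; 256 input this yields 4096 tokens of dimension C, matching the top-level feature F described in Subsection 2.3.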
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Swin-Transformer Block</head><p>Traditional multi-head self-attention, as proposed in the Vision Transformer (ViT) <ref type="bibr">[6]</ref>, attends over a single low-resolution window covering the whole feature map and has quadratic computational complexity. In contrast, the Swin Transformer block incorporates shifted-window multi-head self-attention (SW-MSA), which builds hierarchical local feature maps and has linear computation time. The swin transformer block can be described by the following Eq. 1 and Eq. 2.</p><p>Eq. 1 illustrates the first sub-block of the swin transformer, consisting of a LayerNorm layer (&#981;), a window-based multi-head self-attention module (W-MSA), a residual connection (+), and a 2-layer MLP with GELU non-linearity (&#948;). In a similar way, Eq. 2 illustrates the second sub-block of the swin transformer, consisting of a LayerNorm layer (&#981;), a shifted-window multi-head self-attention module (SW-MSA), a residual skip-connection (+), and an MLP with GELU activation (&#948;). Additionally, l denotes the layer number and x is the feature map. </p></div>
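The two sub-blocks described above can be written out using the notation of this subsection (&#981; for LayerNorm, &#948; for the MLP). This is a reconstruction consistent with the standard Swin formulation, not the paper's own typeset equations:

```latex
% Eq. 1: first sub-block (regular windowed attention)
\hat{x}^{l} = \mathrm{W\text{-}MSA}\!\left(\varphi\!\left(x^{l-1}\right)\right) + x^{l-1},
\qquad
x^{l} = \delta\!\left(\varphi\!\left(\hat{x}^{l}\right)\right) + \hat{x}^{l}

% Eq. 2: second sub-block (shifted-window attention)
\hat{x}^{l+1} = \mathrm{SW\text{-}MSA}\!\left(\varphi\!\left(x^{l}\right)\right) + x^{l},
\qquad
x^{l+1} = \delta\!\left(\varphi\!\left(\hat{x}^{l+1}\right)\right) + \hat{x}^{l+1}
```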
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Spatial Feature Expansion and Aggregation</head><p>Although the multi-head self-attention module can capture local contextual information to understand inherent feature representations, consecutive patch merging and expanding layers can degrade the overall global context of the task. The existing skip-connection concatenation cannot solve this problem, as it applies dense layers on sequential patches to create linear projections. To create spatial projections of learnable features, we propose the Spatial Feature Expansion and Aggregation block illustrated in Fig. <ref type="figure">2</ref>. We start with the top-most skip connection, which comes out of the first encoder layer and has a feature output of F &#8712; R D&#215;C , where D = 4096 and C = 128. We apply a patch expanding layer with patch size P that reshapes the sequential patches into a structured spatial feature map E PE .</p></div>
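The expansion-aggregation round trip at the heart of the SFEA block can be sketched in NumPy: a sequence of D = 4096 linear patches is expanded into a spatial feature map, then extracted back into a sequence of 4 &#215; 4 patches. The dense expansion weights and channel sizes are illustrative stand-ins for the trained block, and the convolution stages are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_expand(seq, grid=64, r=4, c_out=32):
    """Expand a (D, C) patch sequence into a (grid*r, grid*r, c_out) spatial
    map: dense projection to r*r*c_out channels, then pixel-shuffle."""
    D, C = seq.shape
    W_up = rng.standard_normal((C, r * r * c_out))  # placeholder dense layer
    x = (seq @ W_up).reshape(grid, grid, r, r, c_out)
    return x.transpose(0, 2, 1, 3, 4).reshape(grid * r, grid * r, c_out)

def patch_extract(spatial, patch=4):
    """Inverse: cut the spatial map back into non-overlapping patch x patch
    patches and flatten each one into a sequence token."""
    H, W, c = spatial.shape
    x = spatial.reshape(H // patch, patch, W // patch, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

seq = rng.standard_normal((4096, 128))  # F in R^(D x C), D = 4096, C = 128
spatial = patch_expand(seq)             # -> (256, 256, 32), as in Sec. 2.3
tokens = patch_extract(spatial)         # -> (4096, 512), i.e. E_T in R^(D x 4C)
print(spatial.shape, tokens.shape)
```

The extracted sequence has depth 4 &#215; 4 &#215; 32 = 512 = 4C, matching the E T &#8712; R D&#215;4C output described in the text before the patch-embedding layer reduces it back to C.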
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Objective Function and Embedding Similarity Loss</head><p>For the binary output of background and masses, we use the binary cross-entropy loss given in Eq. 3. We also use the dice-coefficient loss given in Eq. 4 for better segmentation output. For the dice coefficient, we use &#949; = 1.0 in the numerator and denominator to address division by zero. Here, E symbolizes the expected value given p (prediction) and y (ground-truth).</p><p>Finally, the embedding feature loss is calculated by obtaining positional and patch features from the transformer encoder layers E and decoder layers D by passing the image through the network, as shown in Eq. 5. Here, Q stands for the number of features extracted from the embedding layers of the transformer encoder.</p><p>We combine Eq. 3, 4, and 5 to configure our ultimate cost function as provided in Eq. 6. Here, &#955; is the weight for each loss.</p></div>
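A NumPy sketch of how the combined objective could be computed: binary cross-entropy (Eq. 3), dice loss with &#949; = 1.0 (Eq. 4), and an embedding term averaged over Q paired encoder/decoder features (Eq. 5), combined with per-term weights &#955; (Eq. 6). The mean-squared distance used for the embedding similarity is our assumption; the exact similarity measure is not reproduced here.

```python
import numpy as np

EPS = 1.0  # epsilon in the dice loss (Eq. 4), as stated in the text

def bce_loss(p, y):
    """Binary cross-entropy over prediction p and ground truth y (Eq. 3)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def dice_loss(p, y):
    """Dice-coefficient loss with eps in numerator and denominator (Eq. 4)."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + EPS) / (np.sum(p) + np.sum(y) + EPS))

def embedding_loss(enc_feats, dec_feats):
    """Assumed similarity over Q paired encoder/decoder embeddings (Eq. 5)."""
    return float(np.mean([np.mean((e - d) ** 2)
                          for e, d in zip(enc_feats, dec_feats)]))

def total_loss(p, y, enc_feats, dec_feats,
               l_bce=0.4, l_dice=0.6, l_emb=0.01):
    """Weighted combination of the three terms (Eq. 6)."""
    return (l_bce * bce_loss(p, y)
            + l_dice * dice_loss(p, y)
            + l_emb * embedding_loss(enc_feats, dec_feats))

y = np.zeros((8, 8)); y[2:4, 2:4] = 1.0  # tiny "micro-mass" ground-truth mask
p = y.copy()                             # perfect prediction
feats = [np.ones((4, 4))]                # one matched feature pair (Q = 1)
print(round(total_loss(p, y, feats, feats), 6))  # ~0.0 for a perfect prediction
```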
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here, the spatial dimension P = 256 and N = 1, so the resultant spatial dimension becomes 256 &#215; 256 &#215; 32. In a similar manner, we expand the features from the 2nd and 3rd skip connections. We apply a 2D convolution block on the expanded features, followed by element-wise addition of the features from E PE . In a similar manner, we apply another identical 2D convolution block on E C1 to get a feature output and add the features from E PE element-wise to get the final output. These two convolution operations help extract global spatial context information that we further combine with our decoder's local patch-level information. Following this operation, we utilize a 4 &#215; 4 patch-extraction operation to convert the feature into a 2D sequence feature output, E T &#8712; R D&#215;4C . After that, we use a patch-embedding layer to make the feature dimension the same as the decoder's paired output, so the output feature map becomes E K &#8712; R D&#215;C . Next, we concatenate the feature from the decoder's patch expanding layer with E K . We do this for all of our skip connections, so the output feature map becomes O &#8712; R D&#215;2C . Here, we use three different values of C = [128, 256, 512] for the three skip connections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset</head><p>We evaluated our model with three publicly available datasets. We used CBIS-DDSM <ref type="bibr">[10]</ref> and InBreast <ref type="bibr">[11]</ref>, two whole-mammography segmentation datasets. All images are resized to 256 &#215; 256 using bilinear interpolation, and the masks are resized to the same size using the nearest-neighbor technique. Both datasets contain craniocaudal (CC) and mediolateral oblique (MLO) views of breasts. From the CBIS-DDSM dataset, we separate 849 training and 69 test images based on subtlety grades of 4 and 5. The masses in the test images are less than 200 pixels in size, which is 0.3% of the whole image. The subtlety defines the visual challenge of annotating the masses for the clinician, on a 1-5 grading where 1 = ungradable and 5 = most gradable. We use OpenCV's contour-based technique to remove artifacts, and for enhancement, we use CLAHE. The InBreast dataset contains 107 images, which we split into 90 training and 17 test images. The test images are separated based on containing masses of 100 pixels or smaller. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Hyper-parameter</head><p>We chose &#955; bce = 0.4, &#955; dice = 0.6, and &#955; emb = 0.01 (Eq. 6). For the optimizer, we used Adam with a learning rate of &#945; = 0.0001, &#946; 1 = 0.9, and &#946; 2 = 0.999. We used TensorFlow 2.8 to train the model in mini-batches with a batch size of b = 8 for 100 epochs, which took around 1 hour on an NVIDIA A30 GPU. The inference time is 41 milliseconds per image.</p></div>
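For reference, the training settings above can be collected into a single configuration mapping; only the values stated in the text are used, while the key names are our own shorthand.

```python
# Hyper-parameters as reported in Sec. 3.2 (key names are illustrative).
CONFIG = {
    "loss_weights": {"bce": 0.4, "dice": 0.6, "emb": 0.01},  # lambdas in Eq. 6
    "optimizer": {"name": "adam", "lr": 1e-4, "beta_1": 0.9, "beta_2": 0.999},
    "batch_size": 8,
    "epochs": 100,
}

# The two segmentation-loss weights sum to 1, with the embedding term acting
# as a small regularizer on top.
w = CONFIG["loss_weights"]
assert abs(w["bce"] + w["dice"] - 1.0) < 1e-9
print(CONFIG["optimizer"]["lr"])
```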
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Quantitative Evaluation</head><p>For micro-mass segmentation tasks, we compare our model with three state-of-the-art architectures, AUNet <ref type="bibr">[4]</ref>, ARF-Net <ref type="bibr">[5]</ref>, and Swin-UNet <ref type="bibr">[7]</ref>, on CBIS-DDSM and InBreast, as given in Table 1. AUNet utilizes attention-guided dense upsampling to retain important spatial features lost due to bilinear up-sampling. In contrast, ARF-Net uses a Selective Receptive Field Module (SRFM) to fuse multi-scale and multi-receptive-field information and is the current state-of-the-art for both breast segmentation datasets. Finally, Swin-UNet combines the swin transformer blocks presented in <ref type="bibr">[12]</ref> with a UNet-like structure to reach high precision in multi-organ segmentation. ARF-Net and AUNet show high performance gains over previous approaches. However, the prediction is skewed because the test set contains more images with large masses and few micro-masses. We designed the experiment to emphasize micro-mass segmentation, so we sorted the images based on tumor size. For InBreast, we chose the last portion for testing (smaller than 100 px), which has the smallest breast masses. For CBIS-DDSM, we discarded the leftover large-mass test images (larger than 200 px), as we had separate training images. As Table 1 shows, our model achieves the best dice scores on both CBIS-DDSM and InBreast (best results given in red). Moreover, in the qualitative comparison in Fig. <ref type="figure">3</ref>, our model can segment harder and smaller masses than the other architectures. We also performed an ablation study of the embedding loss for the two datasets, provided in Table 2. With the novel loss function, we gain 3.14%, 6.33%, and 0.86% for CBIS-DDSM, InBreast, and the CBIS pre-trained model, respectively. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSION</head><p>In this paper, we proposed Swin-SFTNet with a novel Spatial Feature Expansion and Aggregation (SFEA) block, which captures the global context of the images and fuses it with the local patch-wise features. Moreover, we integrated a novel embedding loss that computes the similarities between the encoder and decoder blocks' patch-level features. Our model outperforms other architectures on micro-mass segmentation tasks on two popular datasets.</p></div></body>
		</text>
</TEI>
