<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SegImgNet: Segmentation-Guided Dual-Branch Network for Retinal Disease Diagnoses</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI</publisher>
				<date when="2025-05-28">05/28/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652106</idno>
					<idno type="doi">10.1609/aaaiss.v5i1.35547</idno>
					<title level='j'>Proceedings of the AAAI Symposium Series</title>
<idno type="ISSN">2994-4317</idno>
<biblScope unit="volume">5</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Xinwei Luo</author><author>Songlin Zhao</author><author>Yun Zong</author><author>Yong Chen</author><author>Gui-Shuang Ying</author><author>Lifang He</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Retinal images play a crucial role in diagnosing various diseases, as retinal structures provide essential diagnostic information. However, effectively capturing structural features while integrating them with contextual information from retinal images remains a challenge. In this work, we propose a segmentation-guided dual-branch network for retinal disease diagnosis using retinal images and their segmentation maps, named SegImgNet. SegImgNet incorporates a segmentation module to generate multi-scale retinal structural feature maps from retinal images. The classification module employs two encoders to independently extract features from segmented images and retinal images for disease classification. To further enhance feature extraction, we introduce the Segmentation-Guided Attention (SGA) block, which leverages feature maps from the segmentation module to refine the classification process. We evaluate SegImgNet on the public AIROGS dataset and the private e-ROP dataset. Experimental results demonstrate that SegImgNet consistently outperforms existing methods, underscoring its effectiveness in retinal disease diagnosis.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Retinal imaging, particularly fundus photography, is a noninvasive technique widely used in ophthalmology to capture detailed visualizations of retinal structures. By analyzing these images, clinicians can diagnose not only ocular diseases but also systemic conditions such as hypertension and diabetes <ref type="bibr">(Li et al. 2023;</ref><ref type="bibr">Tan et al. 2024)</ref>. However, manual interpretation by ophthalmologists is costly, time-consuming, and subject to variability, potentially leading to delays in patient care and inconsistent diagnoses. Therefore, there is an urgent need for automated tools to improve disease detection efficiency through retinal image analysis.</p><p>Deep learning has emerged as a promising tool for automating disease detection using retinal images <ref type="bibr">(Zhou et al. 2023;</ref><ref type="bibr">Huang et al. 2023;</ref><ref type="bibr">Zhao et al. 2023</ref>). These methods typically leverage established computer vision architectures and employ transfer learning to adapt them for various medical applications, as illustrated in Figure <ref type="figure">1</ref>(a). For example, RETFound <ref type="bibr">(Zhou et al. 2023)</ref>, built on the Vision Transformer (ViT) architecture, is pretrained on large-scale datasets and later fine-tuned on retinal image datasets for disease detection. However, despite their effectiveness, these approaches focus primarily on modeling the overall data distribution of retinal images rather than on highlighting structural features of the retina. Critical diagnostic features are often embedded in the fine-grained structural details of the retina: elements that may not significantly impact the overall data distribution but are essential for accurate disease diagnosis. 
Consequently, compared to natural image classification tasks, retinal disease diagnosis requires models with a stronger ability to capture and interpret key structural features. To address this challenge, a common strategy is to segment key retinal structures from retinal images <ref type="bibr">(Li and Liu 2022;</ref><ref type="bibr">Almeida et al. 2024;</ref><ref type="bibr">Wang et al. 2021a;</ref><ref type="bibr">Sivapriya et al. 2024)</ref>. By isolating diagnostically significant structures, the model can focus on extracting relevant features, as shown in Figure <ref type="figure">1</ref>(b). For example, <ref type="bibr">(Almeida et al. 2024</ref>) utilizes a customized image processing technique to segment retinal blood vessels and feed them into DenseNet121 for disease classification, while <ref type="bibr">(Sivapriya et al. 2024)</ref> employs ResEAD2Net for blood vessel segmentation and subsequently applies multiple machine learning algorithms to the segmented data for disease prediction. Although these methods improve attention to segmented regions, they ignore valuable information from complementary image areas, potentially limiting overall diagnostic performance.</p><p>To extract more comprehensive features, recent studies have integrated both segmentation results and retinal images for disease diagnosis <ref type="bibr">(Alam et al. 2023;</ref><ref type="bibr">Joshi, Sharma, and Dutta 2024;</ref><ref type="bibr">Xiong et al. 2025)</ref>. 
Specifically, some approaches fuse segmented and raw images into a single input and then feed it into an encoder for classification, as shown in Figure <ref type="figure">1</ref>(c), while others process segmented and raw images through separate encoders to extract features for disease classification, as shown in Figure <ref type="figure">1(d)</ref>. For example, <ref type="bibr">(Alam et al. 2023</ref>) stacks segmentation maps and retinal images into a single input for GoogleNet, whereas VisionDeep-AI <ref type="bibr">(Joshi, Sharma, and Dutta 2024;</ref><ref type="bibr">Xiong et al. 2025)</ref> concatenates features extracted from segmented images and retinal images using separate EfficientNet or ResNet50 models. However, these methods lack explicit interactions between segmentation and classification feature spaces. As a result, retinal anatomical features are not fully leveraged to enhance the learned representations in the classification model, limiting the model's ability to incorporate prior structural information for improved disease diagnosis.</p><p>In this paper, we propose SegImgNet, a deep learning framework for retinal disease classification that integrates both retinal images and segmentation maps. By leveraging multi-scale structural feature maps obtained from segmentation along with original retinal images, SegImgNet enhances classification performance. The framework consists of two main components: a segmentation module and a classification module. The segmentation module, based on the U-Net <ref type="bibr">(Ronneberger, Fischer, and Brox 2015)</ref> architecture, generates retinal structure feature maps. The classification module includes a segmented image encoder, a raw image encoder, a classifier, and Segmentation-Guided Attention (SGA) blocks. The segmented image encoder extracts disease-related local features, while the raw image encoder captures broader global contextual information. 
Both encoders are built on the ConvNeXt architecture, and the classifier combines their outputs into a unified representation for disease classification. Additionally, the SGA block enhances feature extraction by generating attention maps from structural segmentation, allowing the model to focus on critical retinal details. Extensive experiments on the public AIROGS and private e-ROP datasets demonstrate that SegImgNet consistently outperforms existing state-of-the-art methods for retinal disease diagnosis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Our Approach</head><p>Figure <ref type="figure">2</ref> illustrates the architecture of SegImgNet, which consists of two main components: a segmentation module and a classification module. The details of these two modules are introduced below.</p></div>
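The two-branch layout described above can be sketched in PyTorch. This is an illustrative skeleton only: the stub encoders stand in for the ConvNeXt backbones, and all layer widths, module names, and the absence of SGA blocks here are simplifying assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class DualBranchClassifier(nn.Module):
    """Hypothetical sketch of SegImgNet's dual-branch classification:
    features from a segmented-image encoder and a raw-image encoder are
    concatenated and fed to an MLP. Encoders are stubbed with tiny conv
    stacks; the real model uses ConvNeXt backbones with SGA blocks."""
    def __init__(self, feat_dim=64, num_classes=2):
        super().__init__()
        # Stand-in for the segmented image encoder (1-channel input).
        self.seg_encoder = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Stand-in for the raw image encoder (3-channel RGB input).
        self.raw_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, num_classes))

    def forward(self, x_seg, x_raw):
        h_local = self.seg_encoder(x_seg)    # structural features
        h_global = self.raw_encoder(x_raw)   # contextual features
        h_cls = torch.cat([h_local, h_global], dim=1)
        return self.mlp(h_cls)               # logits; softmax at inference

model = DualBranchClassifier()
logits = model(torch.randn(2, 1, 256, 256), torch.randn(2, 3, 256, 256))
```

The key design point the sketch captures is that the two encoders never share weights, so each branch is free to specialize before fusion.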
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Segmentation Module</head><p>The segmentation module f_seg(•) employs a U-Net architecture to generate retinal structure feature maps. U-Net utilizes a symmetric encoder-decoder architecture with skip connections, enabling it to capture both low-level spatial details and high-level abstract features. This structure ensures precise localization of retinal structures while preserving fine-grained anatomical details.</p><p>The U-Net encoder consists of multiple convolutional layers followed by downsampling operations, progressively reducing spatial resolution while enhancing feature abstraction. This hierarchical representation enables the model to capture retinal structures across multiple scales, which is essential for detecting both fine-grained details and broader pathological patterns. The decoder, on the other hand, reconstructs the segmented image by gradually upsampling the encoded features, restoring spatial details lost during downsampling. Skip connections bridge the corresponding encoder and decoder layers, allowing high-resolution features from the encoder to be directly merged with upsampled features in the decoder. These connections help preserve fine-grained structural information, which is crucial for accurately delineating retinal regions.</p><p>Specifically, given a retinal image x ∈ R^(H×W×C_raw), where H, W, and C_raw denote the height, width, and channel size of the raw image, respectively, the corresponding segmented image x_seg ∈ R^(H×W×C_seg) and multi-scale retinal structural feature maps {h_seg^(i) ∈ R^((H/2^i)×(W/2^i)×C_i)}_{i=1}^{L} are obtained as follows: (x_seg, {h_seg^(i)}_{i=1}^{L}) = f_seg(x), where C_seg represents the channel size of the segmented image, and L represents the number of feature scales, which is empirically set to 4 in this study <ref type="bibr">(Li et al. 2024)</ref>.</p></div>
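A minimal stand-in for f_seg illustrates the interface this section describes: alongside the segmentation map, the module exposes L = 4 feature maps at progressively halved resolutions. The channel widths, depth, and single-convolution decoder head below are placeholder assumptions, not the U-Net configuration used in the paper.

```python
import torch
import torch.nn as nn

class TinySegModule(nn.Module):
    """Illustrative stand-in for the U-Net segmentation module f_seg:
    returns (x_seg, [h_seg^(1), ..., h_seg^(L)]) where the i-th feature
    map has spatial size H/2^i x W/2^i, as in the paper's formulation."""
    def __init__(self, c_raw=3, c_seg=1, base=8, levels=4):
        super().__init__()
        self.downs = nn.ModuleList()
        c_in = c_raw
        for i in range(levels):
            c_out = base * 2 ** i
            self.downs.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()))
            c_in = c_out
        # Single upsampling head standing in for the full U-Net decoder
        # (the real decoder upsamples gradually with skip connections).
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2 ** levels, mode='bilinear',
                        align_corners=False),
            nn.Conv2d(c_in, c_seg, 1))

    def forward(self, x):
        feats = []
        h = x
        for down in self.downs:
            h = down(h)
            feats.append(h)            # h_seg^(i), i = 1..L
        x_seg = torch.sigmoid(self.head(h))
        return x_seg, feats

seg = TinySegModule()
x_seg, feats = seg(torch.randn(1, 3, 256, 256))
```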
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Classification Module</head><p>The classification module extracts structural features from segmentation maps and contextual representations from raw retinal images for disease diagnoses. It consists of a segmented image encoder, a raw image encoder, a classifier, and SGA blocks. Each component is detailed below. Segmented Image Encoder: The segmented image encoder extracts fine-grained structural representations from the output of the segmentation module while incorporating segmentation priors at multiple stages. Here we use ConvNeXt (Liu et al. 2022) as a feature extractor or backbone for this encoder. Each stage of the feature extractor is equipped with a Segmentation-Guided Attention (SGA) block, which enhances attention to retinal structural features. By selectively emphasizing relevant features and filtering out less informative regions, the SGA block ensures that the extracted representations retain critical anatomical details essential for accurate disease classification and improved diagnostic reliability. The final segmentation map feature representations are obtained from the last stage of the feature extractor, where segmentation-guided information is further enriched with anatomical details. Specifically, the SGA block builds on the approach in (Li et al. 2024), utilizing convolution operations and a sigmoid activation function to refine feature extraction. It enhances the intermediate feature maps of the segmented image encoder by integrating segmentation-derived structural information. 
Given the output feature map h_local^(i) from the i-th stage of the feature extractor and the corresponding retinal structural feature map h_seg^(i), the SGA block produces an enhanced representation, formulated as: ĥ_local^(i) = h_local^(i) ⊙ σ(Conv_3×3(h_seg^(i))), where Conv_3×3(•) represents a convolutional layer with a kernel size of 3 × 3 used to adjust the dimensions of h_seg^(i) to match those of h_local^(i), σ(•) denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication.</p><p>Raw Image Encoder: The raw image encoder is designed to extract global contextual representations from retinal images, complementing the structural features extracted by the segmented image encoder. Like the segmented image encoder, it employs ConvNeXt as the backbone. However, unlike the segmented image encoder, which processes segmented images with segmentation-derived feature map enhancement, the raw image encoder focuses on capturing broader disease-relevant patterns within the retinal image. In particular, the raw image encoder is not equipped with SGA blocks, ensuring that it does not emphasize the same structural features as the segmented image encoder. This design preserves feature complementarity by allowing the segmented image encoder to prioritize segmentation-guided structural information. The final global feature representations are obtained from the deepest stage of ConvNeXt, where high-level disease-relevant information is encoded while retaining spatial context.</p><p>Classifier: After obtaining the segmented image feature embedding h_local and the raw image feature embedding h_global from the encoders, the classifier concatenates them to form a comprehensive feature embedding h_cls for disease classification. It then applies a Multilayer Perceptron (MLP) followed by a softmax activation function to classify diseases based on the feature embedding h_cls. Specifically, the probability of the k-th disease, ŷ_k, is computed as follows: ŷ_k = exp(f_k(h_cls)) / Σ_{j=1}^{K} exp(f_j(h_cls)), where K denotes the total number of classes, and f_k(h_cls) represents the MLP output for class k.</p></div>
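A sketch of the SGA block follows the description above (3 × 3 convolution, sigmoid attention map, element-wise multiplication). How the paper aligns mismatched spatial sizes is not fully specified, so the bilinear interpolation step here is an assumption, as are the channel widths in the usage example.

```python
import torch
import torch.nn as nn

class SGABlock(nn.Module):
    """Sketch of the Segmentation-Guided Attention block: a 3x3
    convolution projects the segmentation feature map h_seg^(i) to the
    classification branch's channel width, a sigmoid turns the result
    into an attention map, and element-wise multiplication refines the
    stage's feature map h_local^(i)."""
    def __init__(self, c_seg, c_local):
        super().__init__()
        self.proj = nn.Conv2d(c_seg, c_local, kernel_size=3, padding=1)

    def forward(self, h_local, h_seg):
        # Resize segmentation features if spatial sizes disagree
        # (assumption: bilinear interpolation suffices for alignment).
        if h_seg.shape[-2:] != h_local.shape[-2:]:
            h_seg = nn.functional.interpolate(
                h_seg, size=h_local.shape[-2:], mode='bilinear',
                align_corners=False)
        attn = torch.sigmoid(self.proj(h_seg))   # values in (0, 1)
        return h_local * attn                    # refined feature map

sga = SGABlock(c_seg=16, c_local=64)
out = sga(torch.randn(1, 64, 32, 32), torch.randn(1, 16, 64, 64))
```

Because the attention values lie in (0, 1), the block can only suppress or pass through activations, which matches its described role of filtering out less informative regions.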
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Overall Loss Function</head><p>To address the class imbalance commonly found in medical datasets, we employ a Weighted Cross-Entropy (WCE) loss function to train SegImgNet. This loss function assigns higher penalties to misclassified minority-class samples, mitigating the dominance of majority classes and improving the model's ability to detect rare disease cases. The WCE loss is defined as: L_WCE = -(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} w_k y_k^(n) log ŷ_k^(n), where N is the number of input samples, w_k denotes the weight assigned to class k, and y_k^(n) and ŷ_k^(n) denote the ground-truth label indicator and the predicted probability of class k for the n-th sample, respectively.</p></div>
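A weighted cross-entropy of this kind can be written in a few lines of PyTorch. The sketch below normalizes by the total sample weight so it matches the built-in F.cross_entropy with a weight argument; the toy class weights are illustrative, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted cross-entropy: each class k carries a weight w_k so
    that minority-class errors incur a larger penalty. Normalizing by
    the total weight matches F.cross_entropy(..., weight=...)."""
    log_probs = torch.log_softmax(logits, dim=1)            # log ŷ^(n)
    nll = -log_probs[torch.arange(len(targets)), targets]   # -log ŷ_y^(n)
    w = class_weights[targets]                               # w_y per sample
    return (w * nll).sum() / w.sum()

# Toy example: class 1 (minority) weighted more heavily.
weights = torch.tensor([0.3, 0.7])
logits = torch.randn(8, 2)
targets = torch.tensor([0, 1, 0, 0, 1, 0, 0, 0])
loss = weighted_cross_entropy(logits, targets, weights)
```

In practice one would simply pass `weight=` to `torch.nn.CrossEntropyLoss`; the explicit form is shown only to make the per-sample weighting visible.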
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment: Experimental Setup</head><p>Datasets: We evaluated SegImgNet on two datasets: the public AIROGS dataset and the private e-ROP dataset. • The AIROGS dataset <ref type="bibr">(Steen et al. 2023</ref>) is an improved glaucoma dataset consisting of a balanced subset of standardized retinal images. It is derived from the Rotterdam EyePACS AIROGS set, which contains 113,893 color retinal images from 60,357 subjects across approximately 500 different sites with heterogeneous ethnicities. These retinal images were labeled as glaucomatous or healthy based on clinical evaluations performed by glaucoma specialists. For this study, we used 4,950 publicly available retinal images, including 2,475 glaucomatous images and 2,475 healthy images.</p><p>• The e-ROP dataset originates from the Telemedicine Methods for Evaluating Acute Retinopathy of Prematurity (e-ROP) study <ref type="bibr">(Quinn et al. 2014)</ref>, which collected retinal images from 1,257 infants admitted to neonatal intensive care units across 13 centers in North America. These images were captured using wide-angle retinal cameras during scheduled diagnostic examinations. Each retinal image was labeled as either preandplus or normal by experienced ophthalmologists. In this study, we used 7,811 center-view retinal images, including 990 preandplus images and 6,821 normal images. Evaluation Metrics: We evaluated model performance using six standard metrics: Area Under the Receiver Operating Characteristic Curve (AUC) to assess discriminative ability, sensitivity (true positive rate) to quantify disease detection capability, specificity (true negative rate) to measure the ability to identify healthy cases, precision (positive predictive value) to evaluate diagnostic confidence, F1-score to balance precision and recall, and accuracy to reflect overall classification performance. 
Implementation Details: To ensure a fair comparison, we conducted five-fold cross-validation on each dataset, partitioning the labeled images into 80% training data and 20% test data. The training data was further divided into a training set and a validation set in a 3:1 ratio, maintaining the original class distribution for hyperparameter tuning. To mitigate class imbalance in the training set, we employed the Random OverSampling Examples (ROSE) <ref type="bibr">(Hayaty, Muthmainah, and Ghufran 2020)</ref> technique to balance the number of images in each class. Additionally, we applied data augmentation techniques, including image flipping, cropping, and scaling, to the training set to improve the model's generalization ability. For consistency, all retinal images were resized to 256 × 256 pixels.</p><p>All compared models were implemented using the PyTorch framework. The segmentation components were pretrained on 933 samples from six public retinal vessel segmentation datasets: FIVES <ref type="bibr">(Jin et al. 2022)</ref>, DRIVE <ref type="bibr">(Staal et al. 2004</ref>), STARE <ref type="bibr">(Hoover, Kouznetsova, and Goldbaum 2000)</ref>, CHASEDB1 <ref type="bibr">(Budai et al. 2013a)</ref>, HRF <ref type="bibr">(Budai et al. 2013b)</ref>, and Retinal Blood Vessel Segmentation <ref type="bibr">(Wang et al. 2021b)</ref>. The classification components were pre-trained on the ImageNet dataset, except for RETFound, which was trained on its custom dataset.</p><p>All experiments were accelerated using NVIDIA RTX A5000 GPUs. Model optimization was performed using the Adam optimizer. To enhance performance, we conducted a grid search to fine-tune key hyperparameters, including the learning rate, batch size, and disease class weight in the weighted cross-entropy loss function. 
The learning rate was explored within the range 5 × 10^-5 to 1 × 10^-3, batch sizes were selected from {16, 32, 64, 128}, and class weights were varied from 0.5 to 0.9 with a step size of 0.1. We set the maximum number of training epochs to 200, with early stopping applied if validation performance did not improve within 20 epochs. The best-performing model checkpoint on the validation set was selected for testing.</p></div>
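The data protocol described above (five-fold cross-validation with a stratified 3:1 train/validation split of each fold's training portion) can be sketched with scikit-learn. Dataset loading, ROSE oversampling, augmentation, and the training loop itself are omitted; the function name and toy label vector are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def make_splits(labels, seed=0):
    """Yield (train, validation, test) index arrays per fold:
    five stratified folds give an 80/20 train-test partition, and the
    training portion is further split 3:1 into train and validation
    sets, preserving the class distribution throughout."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for trainval_idx, test_idx in skf.split(labels, labels):
        train_idx, val_idx = train_test_split(
            trainval_idx, test_size=0.25,              # 3:1 ratio
            stratify=labels[trainval_idx], random_state=seed)
        yield train_idx, val_idx, test_idx

# Toy imbalanced label vector: 60 healthy (0), 40 diseased (1).
labels = [0] * 60 + [1] * 40
for tr, va, te in make_splits(labels):
    pass  # train with early stopping (patience 20) on each fold
```

Oversampling would be applied to `tr` only, after the split, so that duplicated samples never leak into the validation or test sets.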
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Results</head><p>Comparisons with Baselines: Table <ref type="table">1</ref> presents the disease classification performance of all compared models across the two datasets. Specifically, we have the following observations: SegImgNet consistently outperforms all baselines across key metrics on both datasets, demonstrating its superior capability to distinguish between disease and normal cases. While SegImgNet achieves slightly lower specificity (0.843 ± 0.042) and accuracy (0.857 ± 0.013) compared to VisionDeep-AI (0.885 ± 0.010 and 0.865 ± 0.010, respectively) on the e-ROP dataset, it remains highly competitive on these two metrics. More importantly, while VisionDeep-AI exhibits higher specificity and accuracy, it falls short in other critical metrics, particularly sensitivity (0.731 ± 0.028 for VisionDeep-AI vs. 0.831 ± 0.027 for SegImgNet). This lower sensitivity increases the risk of missed diagnoses, which can lead to delayed treatment. In medical applications, sensitivity is crucial, as missing a disease diagnosis can have far more severe consequences than misclassifying a healthy individual. Notably, SegImgNet achieves the highest sensitivity among all baselines on both datasets, confirming its effectiveness in clinical decision-making. Figure <ref type="figure">3</ref> shows the visualization of intermediate feature maps from the segmented image encoder of the top three models (SegImgNet, VisionDeep-AI, and Multi-GlaucNet) across the two datasets. We selected the feature maps produced by each model's second downsampling layer and visualized four representative channels, chosen based on their mean and variance. The visualization results demonstrate that our approach achieves higher structural clarity and consistency compared to the other two approaches. 
Specifically, SegImgNet more distinctly delineates prominent edges and anatomical structures, thereby enhancing its capability to preserve and highlight morphological features for accurate retinal analysis. Ablation Study: Here we investigated the contribution of each key component in SegImgNet, including the segmented image encoder, raw image encoder, and SGA block. Table <ref type="table">2</ref> presents the performance of different model variants: "w/o segmented image encoder" excludes the segmented image encoder, "w/o raw image encoder" removes the raw image encoder, and "w/o SGA" omits the SGA block. The results demonstrate that each component is essential for optimal performance. Removing the segmented image encoder significantly reduces the model's ability to capture retinal structural features, while eliminating the raw image encoder weakens its capacity to extract global contextual information. Furthermore, the absence of the SGA block degrades classification performance, highlighting the importance of multi-scale retinal structural feature maps in enhancing representation learning. The complete SegImgNet model, incorporating all components, achieves the highest performance, emphasizing the importance of integrating local and global feature extraction with attention-based enhancement. These findings confirm that each module plays a critical role in maximizing disease classification accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this study, we introduce SegImgNet, a deep learning model that integrates local retinal structural features from segmented images with global contextual information from raw images for disease classification. Extensive experiments on public and private datasets show that SegImgNet outperforms existing methods, demonstrating the effectiveness of segmentation-guided attention for feature enhancement. Our findings highlight the potential of incorporating retinal structural priors into deep learning frameworks to improve the robustness of AI-driven medical imaging. Future work will focus on optimizing feature fusion, expanding the model to broader ophthalmic applications, and improving generalization across diverse clinical datasets.</p></div></body>
		</text>
</TEI>
