<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Semi-Supervised Multimodal Multi-Instance Learning for Aortic Stenosis Diagnosis</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>04/14/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10591820</idno>
					<idno type="doi">10.1109/ISBI60581.2025.10981205</idno>
					
					<author>Zhe Huang</author><author>Xiaowei Yu</author><author>Benjamin S Wessler</author><author>Michael C Hughes</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Automated interpretation of ultrasound imaging of the heart (echocardiograms) could improve the detection and treatment of aortic stenosis (AS), a deadly heart disease. However, existing deep learning pipelines for assessing AS from echocardiograms have two key limitations. First, most methods rely on limited 2D cineloops, thereby ignoring widely available Spectral Doppler imaging that contains important complementary information about pressure gradients and blood flow abnormalities associated with AS. Second, obtaining labeled data is difficult. There are often far more unlabeled echocardiogram recordings available, but these remain underutilized by existing methods. To overcome these limitations, we introduce Semi-supervised Multimodal Multiple-Instance Learning (SMMIL), a new deep learning framework for automatic interpretation for structural heart diseases like AS. During training, SMMIL can combine a smaller labeled set and an abundant unlabeled set of both 2D and Doppler modalities to improve its classifier. When deployed, SMMIL can combine information from all available images to produce an accurate study-level diagnosis of this life-threatening condition. Experiments demonstrate that SMMIL outperforms recent alternatives, including two medical foundation models.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Aortic stenosis (AS) is a degenerative heart valve condition that leads to obstructed blood flow, affecting over 12.6 million adults. With timely diagnosis and appropriate surgical valve replacement, AS can be a treatable condition with very low mortality <ref type="bibr">[1]</ref>. Unfortunately, timely diagnosis of AS remains a challenge <ref type="bibr">[2]</ref>. Currently, up to 2/3 of symptomatic AS patients may never get referred for care <ref type="bibr">[3]</ref>. Improved detection could help reduce the estimated 102,700 deaths caused by AS annually.</p><p>Ultrasound (US) images of the heart, known as echocardiograms, are considered the gold standard for AS diagnosis. Automated study-level analysis of echocardiograms represents an opportunity for improved detection of AS. However, developing automated AS detection algorithms faces several major challenges that our present work addresses:</p><p>• Mimicking expert synthesis of multiple images. Routine US scans produce many images of the heart from diverse viewpoints. Clinicians review all available images, using the most relevant ones to assess the health of the aortic valve. Developing models to emulate this intricate, multi-image review process is non-trivial. An entire study may receive a positive AS diagnosis even though some component images do not individually show signs of AS. We employ Multiple-Instance Learning (MIL) to form one comprehensive study-level prediction from multiple images, mimicking the expert process.</p><p>• Integrating across modalities. Effective AS diagnosis in clinical settings requires the integration of information from two modalities: spectral Doppler and 2D cine series <ref type="bibr">[4]</ref> (see Fig. <ref type="figure">1</ref>). However, the fusion of these two modalities has been underexplored for AS diagnosis. 
We bridge this gap by designing a multimodal attention pooling mechanism that leverages the full spectrum of diagnostic information available to clinicians. Other modalities (3D, M-mode) are sometimes available, but are less relevant to AS and outside the scope of our study.</p><p>• Overcoming data limitations. Deep learning classifiers rely heavily on access to large labeled datasets, which are often scarce in medical imaging applications like our AS task. We adapt our Multimodal Multi-Instance Learning framework with semi-supervised learning (SSL) to jointly learn from a labeled set and an additional unlabeled set of 5,386 scans, improving our MMIL classifier beyond what is possible with only the limited labeled set.</p><p>We address all three issues in a holistic manner. For a version of this article with an expanded supplement, see our preprint <ref type="bibr">[5]</ref>.</p><p>Related Works. Automated detection of aortic stenosis using machine learning has drawn considerable interest in recent years. Holste et al. <ref type="bibr">[6]</ref> and Dai et al. <ref type="bibr">[7]</ref> focus on detecting AS using only pre-selected PLAX cineloops from each study, while Ginsberg et al. <ref type="bibr">[8]</ref> use both PLAX and PSAX. Huang et al. <ref type="bibr">[9]</ref> and Wessler et al. <ref type="bibr">[10]</ref> avoid the need to prefilter views by training separate classifiers for view type and AS.</p><p>Encoders. We use two modality-specific encoders: Swin Transformer-T <ref type="bibr">[17]</ref> for spectral Doppler images and Video Swin Transformer-T <ref type="bibr">[18]</ref> for 2D cineloops.</p><p>Pooling layer for Doppler Branch. Given all per-instance vectors $h_k$ from all $K$ Dopplers in a study, we wish to map to one overall Doppler-specific representation vector $z^{D} \in \mathbb{R}^M$. We form $z^{D}$ via attention pooling inspired by ABMIL <ref type="bibr">[16]</ref>: $z^{D} = \sum_{k=1}^{K} \tilde{a}_k h_k$, where $\tilde{a}_k = \frac{\exp(w^\top \tanh(\tilde{U} h_k))}{\sum_{j=1}^{K} \exp(w^\top \tanh(\tilde{U} h_j))}$.</p><p>Here, $w$ and $\tilde{U}$ are trainable parameters and the attention weight vector $\{\tilde{a}_k\}_{k=1}^{K}$ sums to one by construction.</p><p>Pooling layer for 2D Branch. 
Given all per-instance vectors $h_k$ from all 2D cineloops in a study, we use the supervised attention pooling proposed in SAMIL <ref type="bibr">[13]</ref> to obtain an overall 2D-specific representation vector $z^{2D} \in \mathbb{R}^M$: $z^{2D} = \sum_{k=1}^{K} c_k h_k$, with $c_k \propto a_k b_k$.</p><p>Here, separate attention modules produce two distinct vectors of normalized attention weights, $A = \{a_k\}_{k=1}^{K}$ and $B = \{b_k\}_{k=1}^{K}$. The parameters of the first module, $U_a, w_a$, are trained in supervised fashion as in <ref type="bibr">[13]</ref>, to favor PLAX and PSAX views (the view types that clinicians use to diagnose AS in practice). Parameters of the second module, $U_b, w_b$, are free to learn optimal attention allocation for the overall AS diagnostic task. Constructing the ultimate attention weight $c_k$ via $c_k \propto a_k b_k$ ensures that irrelevant views receive low weights (due to low $a_k$ values), while relevant views may receive varying attention (due to the flexibility of $b_k$). Supervision for $A$ comes from a pretrained view-type classifier <ref type="bibr">[13]</ref>, which produces view relevance scores $r(x_k)$ for the likelihood of each 2D video being PLAX or PSAX. Temperature scaling and renormalizing gives the vector $R = \{r_1, \ldots, r_K\}$. We minimize the KL-divergence from $R$ to $A$: $\mathcal{L}_{SA} = \mathrm{KL}(R \,\|\, A) = \sum_{k=1}^{K} r_k \log \frac{r_k}{a_k}$.</p><p>Supervision for the 2D branch is possible because view-type classifiers for 2D cineloops are readily available. For spectral Doppler, off-the-shelf view classifiers or even datasets with view labels are not available. Should such resources become available, our Doppler pooling branch could easily benefit from analogous supervision (e.g. steering attention toward aortic valve Dopplers).</p><p>Multimodal Fusion. Modeling complex relationships between modalities via multimodal fusion is a popular strategy, with approaches typically categorized into early, intermediate, and late fusion. Multimodal fusion specifically for AS diagnosis has been understudied. 
We adopt an intermediate fusion strategy, building the patient-study embedding $s \in \mathbb{R}^M$ via an attention-weighted average of the 2D embedding $z^{2D}$ and the Doppler embedding $z^{D}$: $s = \alpha_{2D} z^{2D} + \alpha_{D} z^{D}$, with $\alpha_m = \frac{\exp(w_s^\top \tanh(U_s z^{m}))}{\sum_{m' \in \{2D,\,D\}} \exp(w_s^\top \tanh(U_s z^{m'}))}$ for $m \in \{2D, D\}$.</p><p>Parameters $w_s, U_s$ learn the optimal relative weight to put on the Doppler or 2D representation in a study-specific fashion.</p><p>Output layer. A linear-softmax layer maps each study embedding $s \in \mathbb{R}^M$ to a probability vector $\rho \in \Delta^3$ indicating the chance of each of 3 AS severity levels (none, early, significant).</p><p>Given $N$ bag-label pairs $(X_i, y_i)$, we train SMMIL parameters $\theta$ (including weights for all encoders, attention pooling, and output layers) to minimize cross-entropy plus the SA loss from Eq. (<ref type="formula">3</ref>) with hyperparameter $\lambda &gt; 0$: $\min_\theta \sum_{i=1}^{N} \mathrm{CE}(y_i, \rho(X_i; \theta)) + \lambda \mathcal{L}_{SA}$.</p></div>
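The attention pooling and intermediate fusion layers described above can be sketched in PyTorch. This is an illustrative sketch, not the authors' released code: the module names, hidden size, and toy dimensions below are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """ABMIL-style attention pooling: K instance vectors -> one bag vector."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.U = nn.Linear(dim, hidden, bias=False)  # plays the role of U-tilde
        self.w = nn.Linear(hidden, 1, bias=False)    # plays the role of w

    def forward(self, h):                            # h: (K, dim)
        scores = self.w(torch.tanh(self.U(h)))       # (K, 1)
        a = F.softmax(scores, dim=0)                 # weights sum to one over K
        return (a * h).sum(dim=0), a.squeeze(-1)     # bag vector, attention weights


class IntermediateFusion(nn.Module):
    """Attention-weighted average of the 2D and Doppler bag embeddings."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.Us = nn.Linear(dim, hidden, bias=False)
        self.ws = nn.Linear(hidden, 1, bias=False)

    def forward(self, z2d, zdop):                    # each: (dim,)
        z = torch.stack([z2d, zdop])                 # (2, dim)
        alpha = F.softmax(self.ws(torch.tanh(self.Us(z))), dim=0)
        return (alpha * z).sum(dim=0)                # study embedding s


# Toy usage: 4 Doppler instances with M=64-dim embeddings (illustrative sizes).
M = 64
pool = AttentionPool(M)
fuse = IntermediateFusion(M)
z_dop, att = pool(torch.randn(4, M))                 # Doppler branch pooling
s = fuse(torch.randn(M), z_dop)                      # fuse with a fake 2D embedding
head = nn.Linear(M, 3)                               # 3 AS severity levels
rho = F.softmax(head(s), dim=-1)                     # probability vector on the simplex
```

The same `AttentionPool` shape also covers the 2D branch if its softmax weights are replaced by the product weights $c_k \propto a_k b_k$ from two such modules.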
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Semi-supervised Curriculum for Multimodal MIL</head><p>Labeled TTEs are not available for many patients (each TMED-2 training split has only 360 studies). Unlabeled TTEs are more abundant and in practice easier to obtain.</p><p>To leverage the unlabeled data, we turn to Semi-Supervised Learning (SSL) <ref type="bibr">[19]</ref>. Common SSL approaches include pseudo-labeling (PL) <ref type="bibr">[20]</ref>, consistency regularization <ref type="bibr">[21]</ref>, and hybrid methods <ref type="bibr">[22]</ref>. Integrating SSL with our MMIL presents unique challenges, such as formulating appropriate objectives for the unlabeled set and managing substantial GPU memory requirements. To overcome these challenges, we build upon a pseudo-label-based method called curriculum labeling <ref type="bibr">[23]</ref>. Let $D_L$ denote the labeled set and $D_U$ denote the unlabeled set. Fig. 1 (top) illustrates our SSL workflow. Training proceeds in several rounds. Each round trains an MMIL architecture to convergence on the available set of (pseudo-)labeled data. In the first round, only the actual labeled set $D_L$ is used. In subsequent rounds, we train on the union of the labeled set $D_L$ and a subset of the unlabeled data $D_U$ pseudo-labeled by the model from the previous round. Pseudo-labels are computed for each unlabeled bag by taking the class with maximum predicted probability; the maximum probability value itself is retained as the associated confidence. Selection keeps only the unlabeled bags with the highest confidence, stepping linearly to the top 20% for round 2, the top 40% for round 3, and so on until 100% of unlabeled bags are selected in the final round. Following <ref type="bibr">[23]</ref>, the parameter vector $\theta$ is freshly initialized at random to begin each round, to avoid confirmation bias.</p></div>
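The curriculum schedule above can be sketched as follows. Here `train_fn` and `predict_fn` are hypothetical stand-ins for retraining the MMIL model from a fresh initialization and for computing bag-level class probabilities; this is a sketch of the selection schedule under those assumptions, not the authors' released implementation.

```python
import numpy as np


def curriculum_ssl(train_fn, predict_fn, labeled, unlabeled, n_rounds=6):
    """Curriculum pseudo-labeling: each round retrains on the labeled set plus
    the most confident pseudo-labeled bags, growing the kept fraction by 20%
    per round (top 20% in round 2, 40% in round 3, ..., 100% in the final round).
    """
    model = train_fn(labeled)                       # round 1: labeled data only
    for r in range(2, n_rounds + 1):
        probs = predict_fn(model, unlabeled)        # (N_unlabeled, n_classes)
        conf = probs.max(axis=1)                    # confidence = max probability
        pseudo = probs.argmax(axis=1)               # pseudo-label = argmax class
        frac = min(1.0, 0.2 * (r - 1))              # linearly growing fraction
        k = int(round(frac * len(unlabeled)))
        keep = np.argsort(-conf)[:k]                # most confident bags first
        augmented = labeled + [(unlabeled[i], int(pseudo[i])) for i in keep]
        # Fresh re-initialization each round (train_fn builds a new model),
        # following the curriculum-labeling recipe to avoid confirmation bias.
        model = train_fn(augmented)
    return model
```

With the default `n_rounds=6`, the final round trains on the labeled set plus all pseudo-labeled unlabeled bags, matching the 20%-per-round schedule described above.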
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">EXPERIMENTS, RESULTS, AND ANALYSIS</head><p>Implementation. For spectral Doppler, we pad each image to center it around the zero-velocity line, with image height spanning from -450 to 450 cm/second, then resize the image to 160 × 200. For 2D cineloops, we use the released 112 × 112 videos without additional processing, taking only the first 8 frames of each cineloop as the video. We implement our framework in PyTorch. We train the model end-to-end on one 80 GB NVIDIA A100 GPU. For all experiments, we use the SGD optimizer with momentum 0.9, set the batch size to 1 bag, and set λ to 10. We search several hyperparameters (learning rate in {5e-4, 5e-5}; weight decay in {1e-4, 1e-5}; temperature τ in {0.5, 0.05}) based on validation set performance. Once trained, inference takes around 0.3 seconds per patient-study.</p><p>Table <ref type="table">1</ref>. Balanced accuracy evaluation of 3-level AS severity classification on TMED-2, across test sets of the 3 released splits. Left: using all 2D instances. Right: using only 2D instances with view labels. Results marked "-" were not reported in the original work. Balanced accuracy on test (All 2D), by split 1 / 2 / 3 and avg (std): Holste et al. <ref type="bibr">[6]</ref>: 62.1 / 65.1 / 70.3, 65.9 (3.4); ABMIL <ref type="bibr">[16]</ref>: 58.5 / 60.4 / 61.6, 60.2 (1.3); Set Transf. <ref type="bibr">[24]</ref>: 61.0 / 62.6 / 62.6, 62.1 (0.8); DSMIL <ref type="bibr">[25]</ref>: 60.1 / 67.6 / 73.1, 66.9 (5.3); SAMIL <ref type="bibr">[13]</ref>: 72…</p><p>Evaluation of 3-level AS on TMED-2. We compare with various strong alternatives, including general MIL models and dedicated AS diagnosis models <ref type="bibr">[6]</ref> <ref type="bibr">[10]</ref> <ref type="bibr">[12]</ref> <ref type="bibr">[11]</ref>. Comparisons to past work on TMED-2 are complicated by variations in how 2D instances were treated. 
First, some works use all available 2D images (All 2D), while others examine only the 2D images/videos with associated view labels (ViewLOnly 2D, ≈ 48% of all 2D); we trained and evaluated SMMIL on both versions to ensure fair comparison. Second, past TMED-2 data releases have provided only one still-frame image from each 2D instance, rather than our present focus on video. We thus report both SMMIL-ID (Image 2D instances + Dopplers) and SMMIL-VD (Video 2D instances + Dopplers).</p><p>Table <ref type="table">1</ref> reports the balanced accuracy across the 3 splits of TMED-2. Our SMMIL-ID achieves significant improvements, which we suggest primarily come from the Doppler modality. Using video (suffix "-VD") adds further modest gains. On the harder All 2D version, SMMIL-ID averages 82% balanced accuracy compared to 72.6% for the best alternative (SAMIL). On the ViewLOnly version, SMMIL-ID beats recent work <ref type="bibr">[12]</ref> by 2.7 points and <ref type="bibr">[10]</ref> by 10 points.</p><p>Ablation. Two key components of SMMIL are the use of unlabeled data with SSL and the incorporation of spectral Dopplers. We assess the impact of each component in Tab. 1 (left). Spectral Doppler adds 9 percentage points to overall balanced accuracy, while SSL adds almost 4. Each modality alone (2D or Doppler) is worse than modeling both jointly.</p><p>External validation. Tab. 2 compares our SMMIL-VD to alternatives on binary detection tasks on the external validation set. SMMIL-VD outperforms alternatives, often by a wide margin (e.g. a +5 point gain in AUROC for no vs. some AS).</p><p>Medical Foundation Models. Recently, generalist medical foundation models (MFMs) <ref type="bibr">[26,</ref><ref type="bibr">27]</ref> have shown some promise, but their readiness for echocardiogram interpretation remains an open question. 
We evaluate two state-of-the-art MFMs, Med-Flamingo <ref type="bibr">[26]</ref> and Rad-FM <ref type="bibr">[27]</ref>, on our AS task with zero-shot and Chain-of-Thought prompting <ref type="bibr">[28]</ref>.</p><p>In Table <ref type="table">1</ref> (right), we find that both MFMs score below 57% balanced accuracy, barely better than random chance, despite substantial effort in prompt engineering. In particular, Med-Flamingo fails to produce sensible outputs as inputs vary. While both models do well at naming the body part (heart) or the imaging type (ultrasound), MFMs remain nascent <ref type="bibr">[27]</ref> and will require further effort to succeed off-the-shelf at echocardiogram-based AS diagnosis.</p></div>
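For reference, the balanced accuracy reported throughout this section is the unweighted mean of per-class recall. A minimal sketch follows; the toy labels are illustrative, not drawn from the paper's data.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so a classifier that always
    predicts the majority class cannot score well on imbalanced labels."""
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(recalls))


# Toy 3-level example (none=0, early=1, significant=2).
y_true = np.array([0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2])
# Per-class recalls: 2/3, 1.0, 1.0 -> balanced accuracy 8/9 ≈ 0.889.
```

Because the metric averages recall over classes rather than over examples, a 3-class random guesser scores about 1/3 regardless of how imbalanced the AS severity labels are.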
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONCLUSION</head><p>We proposed SMMIL, a new deep learning framework for automated interpretation of echocardiograms. We demonstrated how combining modalities (spectral Doppler and 2D cineloop) and SSL on abundant unlabeled data can improve AS detection. SMMIL may be broadly applied to other medical tasks with multiple instances of several data types.</p><p>Compliance with Ethical Standards. This study was performed in line with the principles of the Declaration of Helsinki. Approval for this retrospective study of deidentified images collected during routine care was granted by the Tufts Health Sciences IRB (MODCR-14-12678).</p></div></body>
		</text>
</TEI>
