<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Uncertainty-Aware Bayesian Deep Learning with Noisy Training Labels for Epileptic Seizure Detection</title></titleStmt>
			<publicationStmt>
				<publisher>Springer Nature Switzerland</publisher>
				<date>10/03/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10580691</idno>
					<idno type="doi">10.1007/978-3-031-73158-7_1</idno>
					
					<author>Deeksha M Shama</author><author>Archana Venkataraman</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Supervised learning has become the dominant paradigm in computer-aided diagnosis. Generally, these methods assume that the training labels represent "ground truth" information about the target phenomena. In actuality, the labels, often derived from human annotations, are noisy and unreliable. This aleatoric uncertainty poses significant challenges for modalities such as electroencephalography (EEG), in which "ground truth" is difficult to ascertain without invasive experiments. In this paper, we propose a novel Bayesian framework to mitigate the effects of aleatoric label uncertainty in the context of supervised deep learning. Our target application is EEG-based epileptic seizure detection. Our framework, called BUNDL, leverages domain knowledge to design a posterior distribution for the (unknown) "clean labels" that automatically adjusts based on the data uncertainty. Crucially, BUNDL can be wrapped around any existing detection model and trained using a novel KL divergence-based loss function. We validate BUNDL on both a simulated EEG dataset and the Temple University Hospital (TUH) corpus using three state-of-the-art deep networks. In all cases, BUNDL improves seizure detection performance over existing noise mitigation strategies.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Deep learning is becoming ubiquitous in computer-aided diagnosis with medical imaging data. The most common paradigm is supervised learning, which relies on labeled training data to learn complex and generalizable patterns. The underlying assumption of supervised learning is that the training labels provide accurate "ground truth" information about the phenomena of interest. In actuality, these labels can be unreliable due to noise, human error, and subjective biases. This unreliability can be viewed as aleatoric uncertainty arising from the complex imaging data and the manual label annotations. Aleatoric uncertainty is further exacerbated in modalities that are not easily readable through visual inspection, such as genomics data, fMRI connectomics, and electroencephalography (EEG). In particular, epileptic seizure detection using EEG has seen extensive adoption of supervised AI models <ref type="bibr">[27,</ref><ref type="bibr">2,</ref><ref type="bibr">5,</ref><ref type="bibr">4,</ref><ref type="bibr">19,</ref><ref type="bibr">9,</ref><ref type="bibr">14]</ref> but also has high aleatoric uncertainty due to environmental noise and data artifacts <ref type="bibr">[6,</ref><ref type="bibr">3]</ref>. As a consequence, the manually annotated seizure interval and onset location are ambiguous, and AI models trained on such noisy labels inadvertently learn misleading features, resulting in poor generalization performance.</p><p>Considerable progress has been made within the AI community to quantify uncertainty <ref type="bibr">[1,</ref><ref type="bibr">12,</ref><ref type="bibr">13]</ref> and to adaptively de-noise the data while still trusting the given labels <ref type="bibr">[20]</ref>. More recently, there has been work that tackles the problem of learning with uncertain labels <ref type="bibr">[23]</ref>. These methods can be broadly grouped into three categories. 
The first category is data pruning, in which a pretraining step is used to re-weight or remove noisy samples prior to training <ref type="bibr">[26,</ref><ref type="bibr">16,</ref><ref type="bibr">17]</ref>. These methods can lead to degeneracies in complex data modalities by removing critical data points. The second category proposes "robust architectures" that use an auxiliary deep network to estimate the transition from clean label to noisy label via a joint optimization <ref type="bibr">[7]</ref>, EM strategies <ref type="bibr">[15]</ref>, or sequentially trained multi-networks <ref type="bibr">[25]</ref>. These methods suffer from increased time complexity, scalability issues, and over-fitting on smaller datasets. The third category devises "robust loss functions" which use prior information for self-supervised label adjustment <ref type="bibr">[10]</ref>, model regularization to prevent overfitting <ref type="bibr">[28]</ref>, and loss correction via bootstrapping <ref type="bibr">[18,</ref><ref type="bibr">21]</ref>. However, these methods do not model instance-dependent label transitions, which is often the case for complex data modalities.</p><p>We propose a novel statistical framework called Bayesian UNcertainty-aware Deep Learning (BUNDL) to mitigate the impact of label inaccuracies when training deep networks. BUNDL assumes a conditional Bernoulli distribution for the unknown "clean labels" that alters the predicted posterior probability when uncertainty is higher in the data. Our target application is epileptic seizure detection from scalp EEG data. Accordingly, we leverage domain knowledge that the seizure is generally over-segmented by clinicians <ref type="bibr">[3,</ref><ref type="bibr">8]</ref> and infer the clean label posterior from the observed EEG by minimizing a novel KL divergence loss function. 
BUNDL is developed in a network-agnostic manner and is validated on both simulated EEG data and the publicly available Temple University Hospital (TUH) corpus <ref type="bibr">[22]</ref> using three state-of-the-art deep networks for seizure detection. Our results show that BUNDL achieves better seizure detection performance than existing noise mitigation strategies while being network-independent and easily adapted to new applications. We also conduct a secondary analysis of seizure onset zone (SOZ) localization and demonstrate that using BUNDL to "correct" the seizure detection task also improves localization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Unified Framework to Mitigate Aleatoric Label Noise</head><p>Fig. <ref type="figure">1</ref> (left) illustrates the graphical model that underlies BUNDL. For simplicity, we assume label corruption depends only on the current input, not the entire training dataset. Mathematically, we let $x \in \mathbb{R}^{N_d}$ be a random variable for the input data of dimension $N_d$, $\hat{y} \in \{0, 1\}$ be a Bernoulli random variable corresponding to the given noisy label, and $y \in \{0, 1\}$ be the unobserved clean label. In our case, $x$ is a single frame of multichannel EEG, $\hat{y}$ is the corresponding clinician-annotated seizure versus baseline classification, and $y$ is the unobserved ground truth seizure information. Our generative process (Fig. <ref type="figure">1</ref>, left) assumes that the clean labels $y$ are used to generate the input data $x$ and that both $x$ and $y$ inform the noisy labels $\hat{y}$. Finally, both $x$ and $\hat{y}$ are observed during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Learning the Clean Label Posterior via Distributional Loss</head><p>Our goal during modified supervised training is to learn a function $f_\theta(\cdot)$, parameterized by $\theta$, that infers the clean label posterior distribution given the input data, i.e., $p(y \mid x)$. This function can then be applied to new test data samples.</p><p>We train $f_\theta(\cdot)$ using the labeled pairs $(x, \hat{y})$ by first defining the noisy label posterior distribution as $p(\hat{y} \mid x) = B(\hat{y}; p_g)$, where $B(\cdot)$ denotes a Bernoulli distribution and the parameter $p_g$ is the clinician-provided (binary) label of baseline versus seizure activity. We note that if the labels were instead provided as a continuous measure of clinician confidence, then these could be used as $p_g$.</p><p>We account for label uncertainty by defining the posterior distribution of the clean label $y$ given the observed data and noisy label pair $(x, \hat{y})$ as follows:</p><formula notation="TeX">p(y \mid x, \hat{y}) = B\big(y;\; (1 - z^{\hat{y}}_x)\, p_g + z^{\hat{y}}_x\, \bar{f}_\theta(x)\big) \qquad (1)</formula><p>where $z^{\hat{y}}_x$ is the instance-dependent aleatoric uncertainty and $\bar{f}_\theta(x)$ is the average of the model prediction $f_\theta(\cdot)$ across multiple stochastic forward passes through the model using dropout. Intuitively, when $z^{\hat{y}}_x$ is low, the model trusts the given labels $p_g$, since manual annotations of clean samples are reliable. When $z^{\hat{y}}_x$ is high, the model leans towards its own self-supervision $\bar{f}_\theta(x)$.</p><p>By marginalizing $\hat{y}$ in Eq. (<ref type="formula">1</ref>), we obtain the clean label posterior $p(y \mid x) = B(y; p_{y_c})$, where the Bernoulli parameter $p_{y_c}$ is computed as follows:</p><formula notation="TeX">p_{y_c} = p_g \big( z^1_x\, \bar{f}_\theta(x) + p_g (1 - z^1_x) \big) + (1 - p_g) \big( p_g (1 - z^0_x) + \bar{f}_\theta(x)\, z^0_x \big) \qquad (2)</formula><p>We implement the function $f_\theta(\cdot)$ using a deep neural network. The optimal parameters $\theta^*$ are learned by minimizing the KL divergence between the Bernoulli distribution parameterized by Eq. (<ref type="formula">2</ref>) and the deep network prediction $Q(y \mid x; \theta) = B(y; f_\theta(x))$ for all samples in the training dataset:</p><formula notation="TeX">\theta^* = \arg\min_\theta \sum_{x} \mathrm{KL}\big( B(y; p_{y_c}) \,\|\, B(y; f_\theta(x)) \big) \qquad (3)</formula><p>Computing Uncertainty from Monte-Carlo Dropout (MCD): As shown in Fig. <ref type="figure">1</ref> (right), BUNDL estimates the posterior distribution of the clean labels through multiple forward passes of the input EEG through the deep network. These passes generate "posterior estimates" with the randomness provided by dropout <ref type="bibr">[24]</ref>. We estimate the data uncertainty $z^{\hat{y}}_x$ as the empirical entropy (rescaled between 0 and 1) of these posterior samples <ref type="bibr">[12]</ref>. From a practical standpoint, the network output is random in the initial stages of training and cannot provide a good uncertainty estimate. Therefore, we pretrain the network with a subset of data away from the seizure onset and set $z^{\hat{y}}_x = 0$. This way, we avoid learning from highly uncertain EEG data around change points in the early stages of training. We then update the network parameters for another 30 epochs using the $z^{\hat{y}}_x$ continuously predicted by the model. We also use a small tolerance of 0.001 for $p_g$ to ensure numerical stability. The dropout rate is set to 0.2, and 20 posterior samples are used in every epoch. The learning rate is chosen through a grid search over the range $[10^{-6}, 10^{-2}]$, selecting the best-performing rate across nested 10-fold cross validation. See the supplement for the algorithms.</p><p>Domain Knowledge for Seizure Detection: The seizure interval tends to be "over-segmented" by clinicians due to the comparatively higher cost of missing a seizure than introducing false positives <ref type="bibr">[3,</ref><ref type="bibr">8]</ref>. Consequently, we opt to trust the control class labels and set $z^0_x = 0$ during training only. The seizure class, which tends to be overestimated, is unreliable, so we quantify $z^1_x$ using the entropy of the posterior samples. The clean label parameter therefore reduces to $p_{y_c} = p_g \big( z^1_x\, \bar{f}_\theta(x) + p_g (1 - z^1_x) \big)$<ref type="foot" target="#foot_0">3</ref>.</p></div>
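<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quantities in this section can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the "rescaled entropy" is the base-2 binary entropy of the averaged dropout predictions (which already lies between 0 and 1), that the KL divergence is taken from the clean label distribution to the network prediction, and that the names mc_dropout_uncertainty, clean_label_param, bernoulli_kl, P_G_TOL, and TINY are hypothetical helpers introduced only for illustration.</p><p><![CDATA[
```python
import math

P_G_TOL = 1e-3   # tolerance applied to p_g, as mentioned in the text
TINY = 1e-12     # hypothetical guard against log(0)


def mc_dropout_uncertainty(samples):
    """Return (mean prediction, binary entropy of that mean in bits).

    Assumption: the 'rescaled entropy' is the base-2 binary entropy of the
    averaged dropout predictions, which already lies in [0, 1].
    """
    mean = sum(samples) / len(samples)
    clipped = min(max(mean, TINY), 1.0 - TINY)
    entropy = -(clipped * math.log2(clipped)
                + (1.0 - clipped) * math.log2(1.0 - clipped))
    return mean, entropy


def clean_label_param(p_g, f_bar, z1, z0=0.0):
    """Bernoulli parameter of the clean label after marginalizing the noisy label."""
    p_g = min(max(p_g, P_G_TOL), 1.0 - P_G_TOL)
    # Contribution of the seizure-labeled case (noisy label = 1).
    seizure_term = p_g * (z1 * f_bar + p_g * (1.0 - z1))
    # Contribution of the baseline-labeled case (noisy label = 0).
    baseline_term = (1.0 - p_g) * (p_g * (1.0 - z0) + f_bar * z0)
    return seizure_term + baseline_term


def bernoulli_kl(p, q):
    """KL divergence KL(B(p) || B(q)) between two Bernoulli distributions."""
    p = min(max(p, TINY), 1.0 - TINY)
    q = min(max(q, TINY), 1.0 - TINY)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Usage: one seizure-labeled window with 20 hypothetical dropout outputs.
dropout_outputs = [0.35, 0.55, 0.45, 0.60, 0.40] * 4
f_bar, z1 = mc_dropout_uncertainty(dropout_outputs)
p_yc = clean_label_param(p_g=1.0, f_bar=f_bar, z1=z1)
loss = bernoulli_kl(p_yc, 0.5)  # 0.5 stands in for a single forward pass
print(round(f_bar, 3), round(z1, 3), round(p_yc, 3), round(loss, 4))
```
]]></p><p>With zero uncertainty the clean label parameter collapses to the (tolerance-clipped) given label, and with maximal uncertainty on a seizure-labeled window it moves towards the averaged network prediction, which is the behavior described above.</p></div>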
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Evaluation Setup</head><p>We use the optimal parameters $\theta^*$ learned for $f_\theta(\cdot)$ during training to directly estimate the clean label parameter $p_{y_c}$ from new testing data. We report metrics at the window and seizure levels for detailed comparison. At the window level, we compare the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), aggregated across individual time points. During cross validation, we select the learning rate and detection threshold for each method to allow no more than 3 minutes of false positives per hour in the validation dataset. At the seizure level, we report the false positive rate (FPR, in min/hour), the sensitivity, and the latency (in seconds) of seizure prediction. We perform repeated 10-fold cross validation with patient-level splits.</p><p>Baseline Algorithms: We compare BUNDL with two state-of-the-art approaches to handle noisy labels: (1) Self Adaptive learning (SelfAdapt) <ref type="bibr">[10]</ref>, which alters the clean label posterior using self-supervision independent of data uncertainty, and (2) Noisy Adaptation Layer (NAL) <ref type="bibr">[7]</ref>, which builds a robust model to jointly estimate the clean label and noisy label posteriors using an EM-like algorithm. For NAL, we use the "complex model" variant, as it is better suited to the task. Finally, our baseline is the traditional cross-entropy loss (CEL), which does not assume any label noise. We use the state-of-the-art DeepSOZ model <ref type="bibr">[14]</ref> for all comparisons.</p><p>Network Comparison: Since BUNDL is model-agnostic, we use three recently published networks for seizure detection to compare the performance of BUNDL against traditional CEL-based training. These networks are as follows:</p><p>1. DeepSOZ <ref type="bibr">[14]</ref> consists of a transformer that employs self-attention to extract features across EEG channels in a time window. The global features from the transformer are then processed by an LSTM to predict seizure activity, whereas the channel-wise features are passed through a pooling layer to localize the SOZ.</p><p>2. TGCN <ref type="bibr">[4]</ref> employs eight layers of spatio-temporal convolutions (STC) consisting of 1-D convolutions to process EEG within each channel's neighborhood. This is followed by three linear layers to predict seizure activity.</p><p>3. CNN <ref type="bibr">[5]</ref> also consists of 1-D convolutions and batch normalization, but it processes all channels of an EEG time window together, unlike TGCN, which operates within a neighborhood. This is followed by an LSTM layer to predict seizure activity.</p><p>Extension to Seizure Onset Zone (SOZ) Localization: SOZ localization is crucial for real-world epilepsy management. Unlike seizure detection, the goal of SOZ localization is to identify the EEG channels from which the seizure activity originates, as shown in Fig. <ref type="figure">2</ref>. We hypothesize that accounting for noise in the seizure detection labels will help us stabilize and/or improve SOZ localization. We compare the performance of DeepSOZ for SOZ localization when the seizure detection branch is trained with BUNDL against when it is trained with CEL. We report accuracy averaged at the seizure level and the patient level, as in <ref type="bibr">[14]</ref>.</p></div>
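<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold selection rule and the seizure-level metrics described in the evaluation setup can be sketched as follows. This is an illustrative reading, not the evaluation code used in the paper. It assumes one-second windows, a single annotated seizure per recording, a latency measured from the annotated onset to the first window at or after it that crosses the threshold, and a selection rule that picks the lowest candidate threshold within the false positive budget. All function names are hypothetical.</p><p><![CDATA[
```python
def false_positive_minutes_per_hour(probs, labels, threshold, window_sec=1.0):
    """Minutes of false-positive detections per hour of recording."""
    fp_windows = sum(
        1 for p, y in zip(probs, labels) if p >= threshold and y == 0
    )
    total_hours = len(probs) * window_sec / 3600.0
    return (fp_windows * window_sec / 60.0) / total_hours


def select_threshold(probs, labels, candidates, max_fp_min_per_hour=3.0):
    """Lowest candidate threshold whose validation FPR stays within budget.

    Assumption: falls back to the largest candidate if none satisfies it.
    """
    for t in sorted(candidates):
        rate = false_positive_minutes_per_hour(probs, labels, t)
        if rate <= max_fp_min_per_hour:
            return t
    return max(candidates)


def detection_latency(probs, labels, threshold, window_sec=1.0):
    """Seconds from the annotated onset to the first detection at or after it.

    Returns None if there is no annotated seizure or it is never detected.
    """
    if 1 not in labels:
        return None
    onset = labels.index(1)
    for i in range(onset, len(probs)):
        if probs[i] >= threshold:
            return (i - onset) * window_sec
    return None


# Usage on a tiny hypothetical validation recording (one value per second).
val_probs = [0.1, 0.6, 0.2, 0.4, 0.9, 0.8]
val_labels = [0, 0, 0, 1, 1, 1]
chosen = select_threshold(val_probs, val_labels, candidates=[0.3, 0.5, 0.7])
print(chosen, detection_latency(val_probs, val_labels, chosen))
```
]]></p><p>In practice the false positive budget would be evaluated over all validation recordings in a fold rather than a single short recording, and the same loop would be repeated for each candidate learning rate.</p></div>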
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Epilepsy Datasets</head><p>Simulated Data:<ref type="foot" target="#foot_1">4</ref> We use the SEREEGA framework <ref type="bibr">[11]</ref> in MATLAB to simulate 10-minute EEG recordings for 120 subjects with 1-10 seizure events per subject. We select the 10-20 montage channels from the 32-channel New York Head (ICBM-NY) leadfield matrix. Background activity is generated via four sources in each brain quadrant with continuous ERSP data in alpha, beta, and occasional slow waves, and one randomly located source for spike artifacts (6-14 Hz). Background sources include multiple randomly chosen white noise intervals. Seizure source locations vary for each subject, with continuous intervals of spikes and sharp waves (2.5-4 Hz) having uniformly varying onset times and durations. We generate noisy training labels by oversegmenting seizures by 30% and use the clean labels for evaluation. We choose to oversegment seizures because this matches our clinical hypothesis, and we defer other types of label noise to future work.</p><p>TUH Corpus:<ref type="foot" target="#foot_2">5</ref> TUH <ref type="bibr">[22]</ref> is a public dataset with 642 seizure recordings from 120 patients (Table <ref type="table">1</ref>). Each recording contains EEG data in the 10-20 montage, along with clinician annotations of the temporal seizure interval and the seizure onset channels. For preprocessing, we resample the EEG signals to 200 Hz to ensure uniformity. We filter the signals (1.6-30 Hz) and apply a clipping mechanism to remove high-intensity artifacts, setting the threshold at two standard deviations from the mean. All signals are normalized to have zero mean and unit variance within a recording. We crop recordings to a duration of 10 minutes centered around the seizure interval while ensuring a uniform distribution of onset times, and we segment the input EEG into one-second non-overlapping windows.</p></div>
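<div xmlns="http://www.tei-c.org/ns/1.0"><p>Parts of the data preparation above can be sketched in plain Python for a single channel. This is a simplified illustration under stated assumptions, not the authors' pipeline: resampling and the 1.6-30 Hz band-pass filter are omitted, normalization is applied after clipping, and the 30% oversegmentation is assumed to be split as evenly as possible before and after a single contiguous seizure interval (the text does not specify how the extension is distributed). The function names are hypothetical.</p><p><![CDATA[
```python
import statistics


def clip_and_normalize(signal, n_std=2.0):
    """Clip samples beyond n_std standard deviations, then z-score the result."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    low, high = mean - n_std * std, mean + n_std * std
    clipped = [min(max(v, low), high) for v in signal]
    c_mean = statistics.fmean(clipped)
    c_std = statistics.pstdev(clipped)
    if c_std == 0.0:
        return [0.0] * len(clipped)
    return [(v - c_mean) / c_std for v in clipped]


def to_windows(signal, fs=200, window_sec=1.0):
    """Split a signal into non-overlapping windows, dropping any remainder."""
    size = int(fs * window_sec)
    count = len(signal) // size
    return [signal[i * size:(i + 1) * size] for i in range(count)]


def oversegment(labels, fraction=0.30):
    """Lengthen one contiguous seizure interval by `fraction` of its duration.

    Assumption: the extra windows are split between the start and the end.
    """
    if 1 not in labels:
        return list(labels)
    start = labels.index(1)
    end = len(labels) - labels[::-1].index(1)  # exclusive end index
    extra = round(fraction * (end - start))
    before = extra // 2
    after = extra - before
    new_start = max(0, start - before)
    new_end = min(len(labels), end + after)
    return [1 if new_start <= i < new_end else 0 for i in range(len(labels))]


# Usage: 30 one-second windows with a 10-second seizure in the middle.
clean = [0] * 10 + [1] * 10 + [0] * 10
noisy = oversegment(clean)
print(sum(clean), sum(noisy))
```
]]></p><p>The clean labels would be kept aside for evaluation on simulated data, while the oversegmented labels would play the role of the given noisy labels during training.</p></div>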
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Performance Comparisons with BUNDL</head><p>Seizure Detection Accuracy: Table <ref type="table">2</ref> compares BUNDL using DeepSOZ with the three comparison methods (SelfAdapt, NAL, and CEL) on the simulated data. The performance metrics are computed against the ground-truth clean labels for simulated data. We observe that BUNDL achieves the best overall performance, with a mean AUROC of 0.97 and a mean AUPRC of 0.95 against the clean labels at the one-second window level. While NAL achieves similar window-level performance, BUNDL shows considerable gains at the seizure level. Specifically, BUNDL reduces both latency and false positive rate (FPR) by 50% over NAL. The SelfAdapt and CEL methods have similar overall performance, which indicates that SelfAdapt is failing to capture and/or mitigate label noise. Table <ref type="table">3</ref> reports the seizure detection accuracy of all four methods (BUNDL and baselines) on the TUH dataset. In this case, the performance metrics are computed against the (presumed noisy) given labels, as we do not have access to "clean labels" for real-world data. This mismatch between the model outputs (i.e., estimated clean labels) and the given labels results in a uniform decrease in window-level AUROC and AUPRC as compared to the simulated data. At the seizure level, BUNDL has the lowest FPR (2.3 min/hour) and onset latency (7.6 sec), which is a significant improvement over the baseline methods. We note that the baseline methods achieve a slightly higher sensitivity than BUNDL. This can be partially attributed to the mismatched evaluation against the noisy given labels. In addition, the increased FPR for the baselines suggests that they are over-segmenting the seizure, which can also lead to higher sensitivity. Fig. <ref type="figure">3</ref> illustrates the predicted "clean" seizure probability $p_{y_c}$ across time for EEG from a patient in TUH. BUNDL correctly predicts the seizure, whereas the other methods have increased false positives (red) and latency.</p><p>Applying BUNDL to Different DL Models: Table <ref type="table">4</ref> compares BUNDL with CEL (i.e., no noise mitigation) using three state-of-the-art deep networks. On the simulated data, BUNDL is uniformly better than CEL in all metrics. On the TUH dataset, BUNDL uniformly improves FPR and latency for all models. Once again, we notice a slightly lower sensitivity for BUNDL, which can partially be attributed to computing the TUH evaluation metrics against the noisy clinician labels. We note that the sensitivity of BUNDL was highest in the simulated experiment, for which we had access to the "ground truth" clean labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Label Transition in TUH:</head><p>We analyze the joint distribution $p(\hat{y}, y \mid x)$, obtained by multiplying the predicted $p(y \mid \hat{y}, x)$ with the $p(\hat{y} \mid x)$ given by the test labels, to understand the behavior of the noisy and predicted "clean" labels. Around the onset region (i.e., 30 seconds before the annotated onset), the clinician labels are determined to be false positives by BUNDL with probability 0.21 (lower left quadrant), which reflects the ambiguous seizure start. During the seizure, this probability rises to 0.383 due to the increased uncertainty in the EEG at seizure time points, as seen in Fig. <ref type="figure">4</ref>(a), indicating that the clinician may be mislabeling the seizure onset. Effectively, BUNDL hedges towards baseline for highly uncertain intervals.</p><p>SOZ Localization: At the seizure level, DeepSOZ trained using CEL has a localization accuracy of 0.601&#177;0.176, which improves to 0.633&#177;0.149 using BUNDL.</p><p>Similarly, at the patient level, accounting for noisy labels using BUNDL increases the accuracy from 0.591&#177;0.144 (CEL) to 0.620&#177;0.111. This suggests that reducing ambiguities in detection helps to improve localization, as shown for one patient in Fig. <ref type="figure">5</ref>. More examples are plotted in the supplementary material.</p><p>Fig. 5: Temporal seizure detection and seizure onset localization of two patients predicted by (a) DeepSOZ-BUNDL and (b) DeepSOZ-CEL. The predicted SOZ is given by the red heatmap. The true SOZ is marked in black and mentioned underneath.</p></div>
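<div xmlns="http://www.tei-c.org/ns/1.0"><p>The label transition analysis can be illustrated with a small sketch that builds a 2x2 joint table over noisy and clean labels. It is only a plausible reading of the procedure: it assumes binary test labels, so that the noisy label distribution places all of its mass on the given label, and it averages the per-window products over the windows of interest. The function name and the example values are hypothetical and are not taken from the paper.</p><p><![CDATA[
```python
def joint_label_table(p_clean_given_noisy, noisy_labels):
    """Average joint table over windows, indexed as table[noisy][clean].

    p_clean_given_noisy[i] is the predicted probability that the clean
    label of window i is 1, given its noisy label and its EEG.
    """
    table = [[0.0, 0.0], [0.0, 0.0]]
    count = len(noisy_labels)
    for p_one, y_hat in zip(p_clean_given_noisy, noisy_labels):
        # With a binary given label, p(noisy | x) is 1 for that label.
        table[y_hat][1] += p_one / count
        table[y_hat][0] += (1.0 - p_one) / count
    return table


# Usage with hypothetical predictions for three windows.
table = joint_label_table([0.8, 0.6, 0.1], [1, 1, 0])
# table[1][0] is the mass where the clinician marked seizure (noisy = 1)
# but the model's clean label is baseline (clean = 0).
print(table)
```
]]></p><p>Restricting the input windows to a region, such as those near the annotated onset or those inside the annotated seizure, gives one table per region, which is the kind of comparison discussed above.</p></div>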
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>We have introduced BUNDL, a novel statistical framework to mitigate the impact of noisy training labels. Specifically, BUNDL designs the transition between clean and noisy labels using a novel probabilistic model and facilitates robust training with a distributional loss. Using the backdrop of epileptic seizure detection, we demonstrated that training with BUNDL achieves improved performance as compared to state-of-the-art techniques on both simulated and real EEG data. Moreover, BUNDL can be used to train any existing deep learning model with minimal overhead. In an exploratory evaluation, we also demonstrated that addressing label noise in seizure detection enhances seizure onset zone localization, which is a critical step in epilepsy management. BUNDL is developed in a model-agnostic way and is easily adaptable. Future work will explore its use in other medical imaging applications with various label noise types.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0"><p>$(1 - p_g)\big( p_g (1 - z^0_x) + \bar{f}_\theta(x)\, z^0_x \big) = 0$ when $p_g \in \{0, 1\}$</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1"><p>Simulated data and code repository</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2"><p>TUH dataset.</p></note>
		</body>
		</text>
</TEI>
