<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Multi-module Recurrent Convolutional Neural Network with Transformer Encoder for ECG Arrhythmia Classification</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/27/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10321625</idno>
					<idno type="doi">10.1109/bhi50953.2021.9508527</idno>
					<title level='j'>2021 IEEE International Conference on Biomedical and Health Informatics (BHI)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Minh Duc Le</author><author>Vidhiwar Singh Rathour</author><author>Quang Sang Truong</author><author>Quan Mai</author><author>Patel Brijesh</author><author>Ngan Le</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The automatic classification of electrocardiogram (ECG) signals has played an important role in the diagnosis and prediction of cardiovascular diseases. Deep neural networks (DNNs), particularly Convolutional Neural Networks (CNNs), have excelled in a variety of intelligent tasks including biomedical and health informatics. Most of the existing approaches either partition the ECG time series into a set of segments and apply 1D-CNNs or divide the ECG signal into a set of spectrogram images and apply 2D-CNNs. These studies, however, suffer from the limitation that temporal dependencies between 1D segments or 2D spectrograms are not considered during network construction. Furthermore, meta-data including gender and age has not been well studied in these works. To address those limitations, we propose multi-module Recurrent Convolutional Neural Networks (RCNNs) consisting of both CNNs to learn spatial representations and Recurrent Neural Networks (RNNs) to model temporal relationships. Our multi-module RCNNs architecture is designed as an end-to-end deep framework with four modules: (i) a time-series module based on 1D RCNNs, which extracts spatio-temporal information from ECG time series; (ii) a spectrogram module based on 2D RCNNs, which learns visual-temporal representations of ECG spectrograms; (iii) a metadata module, which vectorizes age and gender information; (iv) a fusion module, which semantically fuses the information from the three modules above using a transformer encoder. Ten-fold cross validation was used to evaluate the approach on the MIT-BIH arrhythmia database (MIT-BIH) under different network configurations. The experimental results show that our proposed multi-module RCNNs with transformer encoder achieve state-of-the-art performance with 99.14% F1 score and 98.29% accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Cardiovascular diseases are the leading cause of death in the USA <ref type="bibr">[7]</ref>. An electrocardiogram (ECG) records the electrical activity of the heart, thereby providing a summative evaluation of cardiac electrical activity. It has been estimated that up to 300 million ECGs are recorded annually in Europe alone <ref type="bibr">[23]</ref>; this enormous amount of ECG data highlights the importance of computer-aided interpretation. High-accuracy computer-aided interpretation can save expert clinicians considerable time and effort and reduce the number of misdiagnoses.</p><p>Deep neural networks (DNNs) <ref type="bibr">[37]</ref>, inspired by information processing and distributed communication nodes in biological systems, have received massive interest in both academia and industry over the past decade. They are computational models comprising multiple layers, in which the output of one layer is the input of the next. The hierarchy of layers enables the network to learn increasingly abstract, higher-level representations of the input data. DNNs have shown dominant performance in various intelligent tasks including biomedical <ref type="bibr">[8]</ref> and health informatics <ref type="bibr">[3]</ref>  <ref type="bibr">[17]</ref>. In the last decade, various DNN-based methods have been employed in ECG-based automatic arrhythmia classification. Convolutional Neural Networks (CNNs) are the most favored method <ref type="bibr">[4]</ref> and can be categorized into two main groups: 1D CNNs on time series and 2D CNNs on time-frequency spectrograms. The former uses raw ECGs as input <ref type="bibr">[2]</ref> [25] <ref type="bibr">[15]</ref>  <ref type="bibr">[16]</ref>, splitting each ECG signal into multiple smaller segments, which are then classified in the prediction step. 
The second approach focuses on the frequency characteristics of the ECG signal, using its time-frequency spectrogram as the input of a 2D CNN for classification <ref type="bibr">[12]</ref> [14] <ref type="bibr">[36] [35]</ref>. Although CNN-based approaches have proven effective for arrhythmia classification, they suffer from the following limitations.</p><p>Lack of temporal relationship: Both 1D CNNs on time series and 2D CNNs on spectrograms first partition an ECG signal into a set of 1D segments or 2D spectrograms at different times. Then, a CNN-based network is applied to each 1D segment or 2D spectrogram. There is no mechanism to model the temporal relations between these segments or spectrograms within the same sample coming from one patient.</p><p>Meta-data is not taken into consideration: The ECG signal is represented in a high-dimensional space, while meta-data is given as a binary value (i.e. gender) or a scalar (age). Combining the high-dimensional space of the ECG signal (either time series or spectrogram) with the very low-dimensional space of meta-data is challenging. Most existing works do not take meta-data into account.</p><p>Single module: Most existing works are single-module, i.e. they target either time series with 1D CNN frameworks or spectrograms with 2D CNN frameworks. None of the previous works explores how to fuse multiple modules to inherit the merits of both time series and spectrograms.</p><p>To address the aforementioned limitations, we propose multi-module Recurrent Convolutional Neural Networks (RCNNs) with a transformer encoder. Our network uses LSTM <ref type="bibr">[11]</ref> as the RNN and contains four modules, as follows. (i) time series module by 1D RCNNs: in this module, 1D CNNs are first used to extract spatial information from each time series segment, and an LSTM is then used to model the temporal relations between 1D segments. 
(ii) spectrogram module by 2D RCNNs: given an ECG signal, spectrograms at different times are extracted by the Short-Time Fourier Transform (STFT). 2D CNNs are used to learn visual representations in the spatial domain, and an LSTM network is applied to model the temporal information between spectrograms within an ECG signal; (iii) meta-data module: an autoencoder featurizes/vectorizes the metadata to learn semantic information from both age and gender; (iv) fusion module: the information from the three modules is then fused by a transformer encoder. The entire network is illustrated in Fig. <ref type="figure">1</ref>, where each module is shown as one colored block.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Recently, researchers have made major efforts to use DL-based techniques to outperform specialist cardiologists in ECG interpretation. Various ideas have been proposed, and Convolutional Neural Networks (CNNs) have been widely implemented in automatic arrhythmia diagnosis. Yildirim <ref type="bibr">[39]</ref> proposed a novel approach to classify 10-second ECG signal fragments involving 17 classes. Hannun <ref type="bibr">[2]</ref> also proposed an end-to-end DL approach to classify 12 rhythm classes using single-lead ECG recordings. Although the work achieved good results, it raised the question of whether DNNs would be useful in a realistic clinical setting, where 12-lead ECGs are the clinical standard. Ribeiro <ref type="bibr">[25]</ref> partially addressed the question by presenting a DNN model using 12-lead ECG recordings to classify 6 types of abnormalities. Recurrent neural networks (RNNs) are also widely applied to arrhythmia diagnosis due to their highly dynamic behavior. Wang <ref type="bibr">[33]</ref> proposed a global and updatable classification scheme named Global Recurrent Neural Network (GRNN). Zhang <ref type="bibr">[40]</ref> introduced a patient-specific ECG classification method using RNNs to learn the time correlation among ECG signal points. Long short-term memory (LSTM) and its variant, the gated recurrent unit (GRU), are among the best DNN candidates in ECG classification <ref type="bibr">[6]</ref>, <ref type="bibr">[26]</ref> and <ref type="bibr">[31]</ref>.</p><p>The aforementioned studies show that an end-to-end DNN can successfully learn complex representative features of ECG signals with little or no dependence on manual feature extraction. Although the end-to-end approach extracts the "deep features" automatically along the network layers, it neglects one important feature of ECG, the frequency response. 
The importance of ECG frequency content has been recognized since the beginning of the 20th century <ref type="bibr">[5]</ref> [34], and it is still studied in medical research today, such as <ref type="bibr">[28]</ref> and <ref type="bibr">[29]</ref>. Among the various time-frequency transformation methods used for ECG feature extraction, the Short-time Fourier Transform (STFT) is extensively used to obtain the spectral content of the ECG. Several efforts have been made to exploit the frequency characteristics of ECGs. Huang <ref type="bibr">[12]</ref> used STFT-based spectrograms and a 2D CNN for ECG arrhythmia classification. Each ECG signal is transformed into a 2D spectrogram image, which is subsequently fed into a 2D CNN for image classification. Xia <ref type="bibr">[36]</ref>  <ref type="bibr">[35]</ref> proposed using the STFT and the stationary wavelet transform (SWT) to obtain a two-dimensional (2-D) matrix input suitable for deep CNNs. Yildirim <ref type="bibr">[38]</ref> proposed a wavelet-sequence-based deep bidirectional LSTM network model.</p><p>The work mentioned above focused merely on ECG signal characteristics. Other important characteristics such as the patient's physical state (e.g. age, gender) are not considered <ref type="bibr">[4]</ref>. Macfarlane <ref type="bibr">[20]</ref> showed that ECG interval measurements, including QRS duration, heart rate, QT dispersion, and selected Q-wave durations, are highly influenced by the patient's gender, age and race. Therefore, age and gender differences in the ECG should be incorporated into a variety of criteria for ECG interpretation <ref type="bibr">[21]</ref>. In this paper, we propose an ECG arrhythmia classification method using multiple modalities: the ECG signal, its frequency response and demographic factors (age and gender).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED MULTI-MODULE RCNNS WITH TRANSFORMER ENCODER</head><p>Our proposed network consists of four modules, i.e. the Time Series Module, Spectrogram Module, Metadata Module and Fusion Module, which are detailed as follows:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Time Series Module: 1D RCNNs</head><p>This module aims to extract spatio-temporal information from an ECG time series signal. Let $X$ be a recorded ECG signal; $X$ is partitioned into $n$ segments, i.e. $X = \{x_i\}_{i=1}^{n}$, each of length $l$. There are two steps in this module. In the first step, the spatial feature of each segment is extracted by 1D CNNs. We use a function $F$ to represent the 1D CNNs, which transforms each input segment $x_i$ into a spatial representation vector: $f_i = F(x_i)$.</p><p>In the second step, a bidirectional LSTM (BLSTM) <ref type="bibr">[9]</ref> is applied to model the temporal relations between 1D segments.</p><p>where $f_i \in \mathbb{R}^L$. In our experiments, we set $l = 360$, $n =$</p></div>
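<div xmlns="http://www.tei-c.org/ns/1.0"><p>The partitioning step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name partition_signal is hypothetical, and the choices of non-overlapping segments and of dropping trailing samples that do not fill a whole segment are assumptions, since the text only states that the signal is split into n segments of length l (with l = 360 in the experiments).</p>
<eg xml:space="preserve"><![CDATA[
```python
from typing import List, Sequence


def partition_signal(signal: Sequence[float], segment_length: int = 360) -> List[List[float]]:
    """Split a 1D signal into consecutive, non-overlapping segments.

    Assumption: trailing samples that do not fill a whole segment are dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    n_segments = len(signal) // segment_length
    return [
        list(signal[i * segment_length:(i + 1) * segment_length])
        for i in range(n_segments)
    ]


# Toy recording: 1000 samples, i.e. a little under 3 seconds at 360 Hz.
toy_signal = [float(t) for t in range(1000)]
segments = partition_signal(toy_signal, segment_length=360)
print(len(segments), len(segments[0]))
```
]]></eg>
<p>Each resulting segment plays the role of one $x_i$ that would be passed to the 1D CNNs, and the ordered list of segments is what a recurrent layer would consume.</p></div>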
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Spectrogram Module: 2D RCNNs</head><p>This module uses 2D time-frequency responses of the ECG as input. The Short-Time Fourier Transform (STFT) is utilized to extract the time-frequency responses of the ECG. The method involves sliding a small window over the signal and then performing a discrete Fourier transform for each corresponding window. The equation for the STFT is shown in Eq. 3, where $S$ is the STFT function and $g(n - m)$ is the window function. Usually a Hann or a Gaussian window is used, and the width of the window is specified by $m$.</p></div>
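<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sliding-window procedure described above can be sketched in plain Python as follows. This is a hedged illustration rather than the paper's code: the names hann_window and stft_magnitude, the parameters width and hop, and the toy sinusoid are all assumptions made for the example. The sketch uses a periodic Hann window and a direct discrete Fourier transform of each frame, and it returns magnitudes only, so the phase factor that depends on the window position is not represented.</p>
<eg xml:space="preserve"><![CDATA[
```python
import cmath
import math
from typing import List, Sequence


def hann_window(width: int) -> List[float]:
    """Periodic Hann window of the given width."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / width) for n in range(width)]


def stft_magnitude(signal: Sequence[float], width: int, hop: int) -> List[List[float]]:
    """Magnitude STFT: one row per window position, one column per frequency bin.

    Each frame is multiplied by the window and passed through a direct
    O(width^2) discrete Fourier transform; only bins 0..width//2 are kept.
    """
    window = hann_window(width)
    spectrogram = []
    for start in range(0, len(signal) - width + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(width)]
        row = []
        for k in range(width // 2 + 1):
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / width)
                for n in range(width)
            )
            row.append(abs(coeff))
        spectrogram.append(row)
    return spectrogram


# Toy input: a sinusoid with exactly 4 cycles per 32-sample window.
toy = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(128)]
spec = stft_magnitude(toy, width=32, hop=16)
peak_bins = [row.index(max(row)) for row in spec]
print(len(spec), peak_bins)
```
]]></eg>
<p>The rows of spec, taken in order, form the kind of time-indexed sequence of spectra that the 2D CNNs and the LSTM in this module would operate on. In practice a fast Fourier transform would replace the direct sum used here for clarity.</p></div>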
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Metadata Module: Autoencoder</head><p>In addition to the ECG signal, metadata is studied in our network. Unlike the ECG signal, which is presented as a long time series, metadata is presented by two scalar values corresponding to gender ($g$) and age ($a$). In order to featurize the metadata, we first utilize the word2vec technique to convert the metadata into vectors. We use a function $W$ to represent word2vec, which transforms an input $x$ into a vector</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>] is employed to re-weight these features by a proper ratio. This helps the overall model to know which information should be emphasized more in order to better semantically fuse these features. The final feature $f$ is generally computed as follows</p><p>where $w_0$, $\{w_i\}$ and $\{w_i\}$ are learnt by the Transformer Encoder [<ref type="bibr">32</ref>]. Finally, we employ a fully connected layer with softmax to convert the output $f \in \mathbb{R}^L$ into categorical probabilities:</p><p>$$\hat{f} = \theta(f), \quad \hat{f} \in \mathbb{R}^K \qquad (6)$$</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTAL RESULTS</head><p>Datasets: The MIT-BIH Arrhythmia dataset consists of 48 thirty-minute, two-lead ECG recordings from 47 subjects. The recordings are digitized at a sampling frequency of 360 Hz. The database contains a total of 20 labels. In our experiments, we follow an experimental setup similar to <ref type="bibr">[12]</ref>, i.e. we choose the five most common labels, namely Normal beat (N), Left bundle branch block beat (L), Right bundle branch block beat (R), Premature ventricular contraction (V) and Atrial premature beat (A), and group all the other beats as Others. This dataset has 2 leads and lead V5 is used. The training:validation split is 90:10, and the label that occurs most often was used as the sample class.</p><p>Metrics: The $F_1$-score is computed as the harmonic mean of the precision and recall:</p><p>$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$</p><p>Accuracy is a measure of how well the model performs classification. It is the fraction of correct predictions among the total number of predictions.</p></div>
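<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two metrics defined above can be computed as in the following minimal Python sketch. The function names accuracy and f1_per_class and the toy label lists are illustrative assumptions, not part of the paper. The text does not state how the F1 score is aggregated across classes, so the sketch reports one F1 value per class and leaves any averaging open.</p>
<eg xml:space="preserve"><![CDATA[
```python
from typing import Dict, Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_per_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-class F1: harmonic mean of precision and recall for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Precision and recall default to 0.0 when their denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Made-up toy labels using beat symbols from the text (N, V, A).
y_true = ["N", "N", "N", "V", "V", "A"]
y_pred = ["N", "N", "V", "V", "V", "N"]
print(accuracy(y_true, y_pred))
print(f1_per_class(y_true, y_pred))
```
]]></eg>
<p>In this toy example four of the six predictions are correct, and the class that is never predicted correctly receives an F1 of zero, which shows why accuracy alone can hide poor performance on rare beat types.</p></div>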
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Performance Comparison</head><p>In this section, we first examine the effectiveness of RCNNs compared to CNNs as shown in Table <ref type="table">I</ref>  CNNs, 2D CNNs, RNNs. In this experiment, the different methods are evaluated on different numbers of classes, while our approach is evaluated on six classes, i.e. the five most common classes and one "other" class covering the remaining 15 labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>In this paper, we proposed an ECG arrhythmia classification method based on multi-module Recurrent Convolutional Neural Networks (RCNNs). The experiments were conducted on six classes (the five most common classes and one class for all others) from the MIT-BIH arrhythmia database. Our network takes time series, spectrogram and metadata all into consideration. The proposed multi-module RCNNs are able to model both spatial information through CNNs and temporal information through LSTM. Our experiments have shown that metadata plays an important role in improving classification performance. Our multi-module network outperforms most state-of-the-art approaches on the same dataset, with an F1-score of 99.14% and an accuracy of 98.29%.</p></div></body>
		</text>
</TEI>
