<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Audio-Visual Event Localization in the Wild</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10105230</idno>
					<idno type="doi"></idno>
					<title level='j'>IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops</title>
<idno>2160-7508</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yapeng Tian</author><author>Jing Shi</author><author>Bochen Li</author><author>Zhiyao Duan</author><author>Chenliang Xu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In this paper, we study a family of audio-visual event temporal localization tasks as a proxy for the broader audio-visual scene understanding problem in unconstrained videos. We pose and seek to answer the following questions: (Q1) Does inference jointly over auditory and visual modalities outperform inference over them independently? (Q2) How does the result vary under noisy training conditions? (Q3) How does knowing one modality help model the other modality? (Q4) How do we best fuse information over both modalities? (Q5) Can we locate the content in one modality given its observation in the other modality? The individual questions may have been studied in the literature, but we are not aware of any work that conducts a systematic study to answer them as a whole.</p><p>In particular, we define an audio-visual event as an event that is both visible and audible in a video segment, and we establish three tasks to explore the aforementioned research questions: 1) supervised audio-visual event localization, 2) weakly-supervised audio-visual event localization, and 3) event-agnostic cross-modality localization. The first two tasks aim to predict which temporal segments of an input video contain an audio-visual event and what category the event belongs to. The weakly-supervised setting assumes that we have no access to temporal event boundaries during training, only a video-level event tag. Q1-Q4 are explored within these two tasks. In the third task, we aim to temporally locate the visual sound source within a video given a sound segment, and vice versa, which answers Q5.</p><p>We propose both baselines and novel algorithms to solve the above three tasks. For the first two tasks, we start with a baseline model that treats them as a sequence labeling problem. 
We use CNNs to encode audio and visual inputs, adapt an LSTM <ref type="bibr">[6]</ref> to capture temporal dependencies, and apply a Fully Connected (FC) network to make the final predictions. On top of this baseline model, we introduce an audio-guided visual attention mechanism to verify whether audio can help attend to visual features; as a side output, it also indicates the spatial locations of sounding objects. Furthermore, we investigate several audio-visual feature fusion methods and propose a novel dual multimodal residual fusion network that achieves the best fusion results. For weakly-supervised learning, we formulate the task as Multiple Instance Learning (MIL) <ref type="bibr">[10]</ref> and modify our network structure by adding a MIL pooling layer. To address the harder cross-modality localization task, we propose an audio-visual distance learning network that measures the relatedness of any given pair of audio and visual content. It projects audio and visual features into subspaces of the same dimension, and a contrastive loss <ref type="bibr">[5]</ref> is used to train the network.</p><p>Observing that there is no publicly available dataset directly suitable for our tasks, we collect a large video dataset that consists of 4143 10-second videos with both audio and video tracks for 28 audio-visual events and annotate their temporal boundaries. Videos in our dataset originate from YouTube and are therefore unconstrained. 
Our extensive experiments support the following findings: modeling the auditory and visual modalities jointly outperforms modeling them independently; audio-visual event localization under noisy training conditions can still achieve promising results; the audio-guided visual attention captures semantic regions covering sounding objects well and can even distinguish audio-visually unrelated videos; temporal alignment is important for audio-visual fusion; the proposed dual multimodal residual network is effective for the fusion task; and strong correlations between the two modalities enable cross-modality localization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Dataset and Problems</head><p>Audio-Visual Event Dataset: To the best of our knowledge, there is no publicly available dataset directly suitable for our purpose. Therefore, we introduce the Audio-Visual Event (AVE) dataset, a subset of AudioSet <ref type="bibr">[4]</ref>, that contains 4143 videos covering 28 event categories. Videos in AVE are temporally labeled with audio-visual event boundaries, and each video contains at least one 2s long audio-visual event.</p><p>The dataset covers a wide range of audio-visual events (e.g., man speaking, woman speaking, dog barking, playing guitar, and frying food) from different domains, e.g., human activities, animal activities, music performances, and vehicle sounds. We provide examples from different categories and show the statistics in Fig. <ref type="figure">1</ref>. Each event category contains a minimum of 60 videos and a maximum of 188 videos, and 66.4% of the videos in AVE contain audio-visual events that span the full 10 seconds.</p><p>Fully and Weakly-Supervised Event Localization: The goal of event localization is to predict the event label for each segment of an input video sequence, where each segment contains both audio and visual tracks. Concretely, we split a video sequence into T non-overlapping segments {V_t, A_t}_{t=1}^T, where each segment is 1s long (since our event boundaries are labeled at the second level), and V_t and A_t denote the visual content and its corresponding audio counterpart in a video segment, respectively. Let <formula notation="TeX">y_t = \{ y_t^k \mid y_t^k \in \{0, 1\},\ k = 1, \ldots, C,\ \sum_{k=1}^{C} y_t^k = 1 \}</formula></p><p>be the event label for that video segment. Here, C is the total number of AVE events plus one background label. Unlike in the supervised setting, in the weakly-supervised setting we only have access to a video-level event tag during training, yet we still aim to predict segment-level labels during testing. The weakly-supervised task allows us to reduce the reliance on well-annotated data for audio, visual, and audio-visual modeling. 
</p><p>Cross-Modality Localization: In the cross-modality localization task, given a segment of one modality (auditory/visual), we would like to find the position of its synchronized content in the other modality (visual/auditory). Concretely, for visual localization from audio (A2V), given an l-second audio segment &#194; from {A_t}_{t=1}^T, where l &lt; T, we want to find its synchronized l-second visual segment within {V_t}_{t=1}^T. Similarly, for audio localization from visual content (V2A), given an l-second video segment V from {V_t}_{t=1}^T, we would like to find its synchronized l-second audio segment within {A_t}_{t=1}^T.</p></div>
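<div xmlns="http://www.tei-c.org/ns/1.0"><p>The A2V task defined above can be read as a search over all l-second windows of the visual track. The following Python sketch illustrates that reading under stated assumptions: audio and visual segments are assumed to be already embedded as vectors in a shared space, a window is scored by the sum of per-segment Euclidean distances, and the names euclidean and localize_a2v are hypothetical. It is an illustration only, not the authors' implementation.</p>
<p><code lang="python"><![CDATA[
```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def localize_a2v(audio_query, visual_feats, dist=euclidean):
    """Slide an l-segment audio query over T visual segments.

    Returns (best_start, best_score), where best_start is the index of the
    first visual segment of the closest window. Scoring a window by the sum
    of per-segment distances is an assumption of this sketch.
    """
    l, T = len(audio_query), len(visual_feats)
    if l == 0 or l > T:
        raise ValueError("query length must satisfy 0 < l <= T")
    best_start, best_score = 0, float("inf")
    for start in range(T - l + 1):
        window = visual_feats[start:start + l]
        score = sum(dist(a, v) for a, v in zip(audio_query, window))
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy example (made-up features): five 1s visual segments, one 2s audio query.
visual = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
query = [[2.1, 0.0], [2.9, 0.0]]
start, score = localize_a2v(query, visual)
print(start, round(score, 3))
```
]]></code></p>
<p>In this toy example the query is closest to the window that starts at segment index 2. V2A can be handled by the same function with the roles of the two modalities swapped, and an exact-match evaluation would count a prediction as correct only when the returned start index equals the ground-truth start.</p></div>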
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Overview of Proposed Methods</head><p>Audio-Visual Event Localization Network: Our network consists of five main modules: feature extraction, audio-guided visual attention, temporal modeling, multimodal fusion, and temporal labeling (see Fig. <ref type="figure">2(a)</ref>). The feature extraction module uses pre-trained CNNs to extract visual features v_t = [v_t^1, ..., v_t^k] &#8712; R^{d_v &#215; k} and audio features a_t &#8712; R^{d_a} from each V_t and A_t, respectively. Here, d_v denotes the number of CNN visual feature maps, k is the vectorized spatial dimension of each feature map, and d_a denotes the dimension of the audio features. We use an audio-guided visual attention model to generate a context vector v_t^att &#8712; R^{d_v}. Two separate LSTMs take v_t^att and a_t as inputs to model temporal dependencies in the two modalities, respectively. For an input feature vector F_t at time step t, the LSTM updates a hidden state vector h_t and a memory cell state vector c_t, where F_t refers to v_t^att or a_t in our model. To evaluate the proposed attention mechanism, we compare against baseline models without attention, which feed globally average-pooled visual features and audio features directly into the LSTMs. To better combine the two modalities, we introduce a multimodal fusion network, which learns the audio-visual representation h_t^* from the visual and audio hidden state output vectors h_t^v and h_t^a. This joint audio-visual representation is used to predict the event category of each video segment: a shared FC layer with a Softmax activation function predicts a probability distribution over the C event categories for the input segment, and the whole network can be trained with a multi-class cross-entropy loss. 
</p><p>Audio-Guided Visual Attention: Given that attention mechanisms have shown superior performance in many applications such as neural machine translation <ref type="bibr">[3]</ref> and image captioning <ref type="bibr">[12,</ref><ref type="bibr">9]</ref>, we use one to implement our audio-guided visual attention. The attention network adaptively learns which visual regions in each segment of a video to look at for the corresponding sounding object or activity. Concretely, we define an attention function f_att that is adaptively learned from the visual feature map v_t and the audio feature vector a_t. At each time step t, the visual context vector v_t^att is computed by: <formula notation="TeX">v_t^{att} = f_{att}(a_t, v_t) = \sum_{i=1}^{k} w_t^i v_t^i,</formula></p><p>where w_t is an attention weight vector corresponding to the probability distribution over the k visual regions that are attended by the audio counterpart. The attention weights are computed by an MLP with a Softmax activation function. The attention map visualizations show that the audio-guided attention mechanism can adaptively capture the location of the sound source, and it also improves temporal localization accuracy.</p></div>
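<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the audio-guided visual attention described above, the following Python sketch computes Softmax weights over k visual regions and returns their weighted sum as the context vector. The text only states that the weights come from an MLP with a Softmax activation; the particular one-hidden-layer scoring form used here, score_i = w_f . tanh(W_v v_i + W_a a), and the parameter names W_v, W_a and w_f are assumptions made for this sketch, not the authors' exact architecture.</p>
<p><code lang="python"><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def audio_guided_attention(v_regions, a, W_v, W_a, w_f):
    """Attend over k visual regions using one audio feature vector.

    v_regions: list of k region vectors, each of length d_v
    a:         audio feature vector of length d_a
    W_v, W_a:  hypothetical projection matrices into a shared hidden space
    w_f:       hypothetical vector mapping the hidden layer to a scalar score
    Returns (context, weights): context has length d_v, weights sum to 1.
    """
    a_proj = matvec(W_a, a)
    scores = []
    for v in v_regions:
        v_proj = matvec(W_v, v)
        hidden = [math.tanh(p + q) for p, q in zip(v_proj, a_proj)]
        scores.append(sum(f * h for f, h in zip(w_f, hidden)))
    weights = softmax(scores)
    d_v = len(v_regions[0])
    context = [sum(w * v[j] for w, v in zip(weights, v_regions))
               for j in range(d_v)]
    return context, weights


# Toy example with made-up numbers: k = 3 regions, d_v = 2, d_a = 1.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [0.5]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[1.0], [-1.0]]
w_f = [1.0, 1.0]
context, weights = audio_guided_attention(regions, audio, W_v, W_a, w_f)
print(weights, context)
```
]]></code></p>
<p>Because the weights form a probability distribution over regions, they can be reshaped to the spatial layout of the feature map and displayed as an attention map, which is how the spatial side output mentioned in the text can be inspected.</p></div>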
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Audio-Visual Feature Fusion:</head><p>To combine features from the visual and audio modalities, we introduce a Dual Multimodal Residual Network (DMRN). Given audio and visual features h_t^a and h_t^v from the LSTMs, the DMRN computes the updated audio and visual features: <formula notation="TeX">h_t^{a\prime} = \tanh(h_t^a + f(h_t^a, h_t^v)), \quad h_t^{v\prime} = \tanh(h_t^v + f(h_t^a, h_t^v)),</formula></p><p>where f(&#8226;) is an additive fusion function, and the average of h_t^{a&#8242;} and h_t^{v&#8242;} is used as the joint representation h_t^* for labeling the video segment. The update strategy in DMRN both preserves useful information in the original modality and adds complementary information from the other modality.</p><p>Weakly-Supervised Event Localization: To address weakly-supervised event localization, we formulate it as a MIL problem and extend our framework to handle the noisy training condition. Since only video-level labels are available, we infer a label for each audio-visual segment pair in the training phase and aggregate these individual predictions into a video-level prediction by MIL pooling as in <ref type="bibr">[11]</ref>: <formula notation="TeX">\hat{m} = g(m_1, m_2, \ldots, m_T) = \frac{1}{T} \sum_{t=1}^{T} m_t,</formula></p><p>where m_1, ..., m_T are the predictions from the last FC layer of our audio-visual event localization network, and g(&#8226;) averages over all predictions. The probability distribution over event categories for the video sequence is computed by applying the Softmax to the pooled prediction. During testing, we predict the event category of each segment from its computed m_t.</p><p>Cross-Modality Localization: To address the cross-modality localization problem, we propose an audio-visual distance learning network (AVDLN), as illustrated in Fig. <ref type="figure">2</ref>(b). Our network measures the distance D_&#952;(V_i, A_i) for a given pair of V_i and A_i. At test time, for visual localization from audio (A2V), we use a sliding window method. Let {V_i, A_i}_{i=1}^N be N training samples and {y_i}_{i=1}^N be their labels, where V_i and A_i are a pair of 1s visual and audio segments and y_i &#8712; {0, 1}. 
Here, y_i = 1 means that V_i and A_i are synchronized. The AVDLN learns to measure distances between these pairs. In practice, we use the Euclidean distance as the metric and the contrastive loss <ref type="bibr">[5]</ref> to optimize the AVDLN.</p><p>Table <ref type="table">1</ref>. Event localization prediction accuracy (%) on the AVE dataset. A, V, V-att, A+V, and A+V-att denote models that use audio, visual, attended visual, audio-visual, and attended audio-visual features, respectively. W-models are trained in a weakly-supervised manner. Note that all audio-visual models fuse features by concatenating the outputs of the LSTMs.</p><p>Audio and Visual: From Tab. 1, we observe that A outperforms V and that W-A is also better than W-V. This indicates that audio features are more effective than visual features for audio-visual event localization on the AVE dataset. However, at the level of individual events, audio is not always better than visual. V is better than A for some events (e.g., car, motorcycle, train, bus), most of which occur outdoors. Audio in these videos can be very noisy: several different sounds may be mixed together (e.g., people cheering alongside a racing car), and the sound of interest may have very low intensity (e.g., a horse heard from far away). Under these conditions, visual information provides more discriminative and accurate cues for understanding events in videos. Conversely, A is much better than V for some events (e.g., dog, man and woman speaking, baby crying), where sounds provide clear cues for recognition. For example, if we hear a barking sound, we know that there may be a dog. We also observe that A+V is better than both A and V, and that W-A+V is better than both W-A and W-V. From the above results and analysis, we conclude that the auditory and visual modalities provide complementary information for understanding events in videos.</p></div>
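<div xmlns="http://www.tei-c.org/ns/1.0"><p>The DMRN update and the MIL pooling step described under Audio-Visual Feature Fusion and Weakly-Supervised Event Localization can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the additive fusion function f is taken to be a plain element-wise sum (the text does not specify whether it has learned parameters), tanh is used as the nonlinearity of the residual update, and the function names are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math


def additive_fusion(h_a, h_v):
    """Assumed form of f: a plain element-wise sum of the two hidden states."""
    return [x + y for x, y in zip(h_a, h_v)]


def dmrn_update(h_a, h_v, f=additive_fusion):
    """Residual update of both modalities, then their average as h_star.

    Each modality keeps its own hidden state (the residual path) and adds
    the fused signal; tanh as the nonlinearity is an assumption here.
    """
    fused = f(h_a, h_v)
    h_a_new = [math.tanh(x + z) for x, z in zip(h_a, fused)]
    h_v_new = [math.tanh(y + z) for y, z in zip(h_v, fused)]
    h_star = [(p + q) / 2.0 for p, q in zip(h_a_new, h_v_new)]
    return h_a_new, h_v_new, h_star


def mil_pool(segment_preds):
    """Average T segment-level prediction vectors into one video-level vector."""
    T = len(segment_preds)
    C = len(segment_preds[0])
    return [sum(m[c] for m in segment_preds) / T for c in range(C)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with made-up numbers.
h_a_new, h_v_new, h_star = dmrn_update([0.0, 1.0], [1.0, 0.0])

# Two segments (T = 2), two categories (C = 2): pool, then apply Softmax.
pooled = mil_pool([[1.0, 0.0], [3.0, 2.0]])
video_probs = softmax(pooled)
print(h_star, pooled, video_probs)
```
]]></code></p>
<p>During weakly-supervised training only video_probs would be compared with the video-level tag, while at test time the per-segment predictions are used directly, which is what lets a model trained on video-level tags still output segment-level labels.</p></div>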
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Results</head><p>Audio-Guided Visual Attention: The quantitative results (see Tab. 1) show that V-att is much better than V (a 3.3% absolute improvement) and that A+V-att outperforms A+V by 1.3%. We show qualitative results of our attention method in Fig. <ref type="figure">3</ref>. We observe that a range of semantic regions across many different categories and examples are attended by sound, which validates that our attention network can learn which visual regions to look at for sounding objects. Interestingly, in some examples the audio-guided visual attention tends to focus on the sounding regions, such as a man's mouth or the head of a crying boy, rather than on whole objects.</p><p>Audio-Visual Fusion: We compare our fusion method, DMRN, with several network-based multimodal fusion methods: Additive, Maxpooling (MP), Gated, Multimodal Bilinear (MB), and Gated Multimodal Bilinear (GMB) in <ref type="bibr">[7]</ref>, Gated Multimodal Unit (GMU) in <ref type="bibr">[2]</ref>, Concatenation (Concat), and MRN <ref type="bibr">[8]</ref>. We explore three fusion strategies: early, late, and decision fusion. Early fusion methods directly fuse audio features from pre-trained CNNs with attended visual features; late fusion methods fuse audio and visual features from the outputs of the two LSTMs; and decision fusion methods fuse the two modalities just before the Softmax layer. Table <ref type="table">2</ref> shows the audio-visual event localization prediction accuracy of the different multimodal feature fusion methods on the AVE dataset. Our DMRN model in the late fusion setting achieves better performance than all compared methods. We also observe that late fusion is better than both early fusion and decision fusion. The superiority of late fusion over early fusion demonstrates that temporal modeling before audio-visual fusion is useful. 
Auditory and visual modalities are not completely aligned, and temporal modeling can implicitly learn certain alignments between the two modalities, which helps the audio-visual feature fusion task. Decision fusion can be regarded as a type of late fusion that uses lower-dimensional features (with dimension equal to the number of categories). Late fusion outperforms decision fusion, which validates that processing the features of each modality separately and then learning a joint representation at a middle layer, rather than at the bottom layer, is an effective way to fuse them.</p><p>Full and Weak Supervision: As expected, supervised models are better than weakly-supervised ones, but quantitative comparisons show that the weakly-supervised approaches still achieve promising event localization performance. This demonstrates the effectiveness of the MIL framework and validates that the audio-visual event localization task can be addressed even under noisy conditions.</p><p>Cross-Modality Localization: We compare our method, AVDLN, with DCCA <ref type="bibr">[1]</ref> on the cross-modality localization tasks A2V (visual localization from an audio segment query) and V2A (audio localization from a visual segment query).</p><p>AVDLN achieves prediction accuracies of 44.8% on A2V and 36.6% on V2A, compared with 34.8% and 34.1% for DCCA. Our AVDLN thus outperforms DCCA on both tasks, with the larger margin on A2V. Even under the strict evaluation metric (which counts only exact matches), our models show promising results on both subtasks, which further demonstrates that there are strong correlations between the audio and visual modalities.</p></div></body>
		</text>
</TEI>
