<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>01/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10521034</idno>
					<idno type="doi">10.1109/WACV56688.2023.00231</idno>
					
					<author>Dennis Fedorishin</author><author>Deen Dayal Mohan</author><author>Bhavin Jawade</author><author>Srirangaraj Setlur</author><author>Venu Govindaraju</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps to capture the correlation between the two modalities to localize the source of the sound. In a video, oftentimes, the objects exhibiting movement are the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>* Equal contribution; authors listed in alphabetical order.</p><p>Figure <ref type="figure">1</ref>. Given a video with audio, the goal of sound source localization is to localize the object/region producing the sound in a video frame. Our method introduces optical flow as an informative prior to improve visual sound source localization performance.</p><p>In recent years, the field of audio-visual understanding has become a very active area of research. This can be attributed to the large amount of video data produced as user-generated content on social media and other platforms. Recent methods in audio-visual understanding have leveraged popular deep learning techniques to solve challenging problems such as action recognition <ref type="bibr">[13]</ref>, deepfake detection <ref type="bibr">[34]</ref>, and other tasks. Given a video, one such task in audio-visual understanding is to locate the object in the visual space that is generating the prominent audio content. When observing a natural scene, it is often trivial for a human to localize the region/object from which the sound originates. One of the main reasons for this is the binaural nature of human hearing. However, the majority of audio-visual data in digital media is monaural, which complicates audio localization tasks. Furthermore, naturally occurring videos do not have explicit annotations of the location of the audio source in the image. This
makes training deep neural networks to understand audio-visual associations for localization challenging.</p><p>Owing to the success of self-supervised learning (SSL) in vision <ref type="bibr">[8,</ref><ref type="bibr">16]</ref>, language <ref type="bibr">[9,</ref><ref type="bibr">26]</ref> and other multi-modal applications <ref type="bibr">[2,</ref><ref type="bibr">22]</ref>, recent methods in sound source localization <ref type="bibr">[6,</ref><ref type="bibr">30]</ref> have adopted SSL-based methods to overcome the need for annotations. One such method <ref type="bibr">[6]</ref> finds the cosine similarity between the audio and visual representations extracted convolutionally at different spatial locations in the images. It relies on self-supervised training by creating positive and negative associations from these predicted similarity matrices. This bootstrapping approach has been shown to improve sound source localization.</p><p>Following this finding, the majority of recent approaches in visual sound source localization have focused on creating robust optimization objectives for better audio-visual associations. However, one interesting aspect of the problem that has received relatively little attention is the creation of informative priors to improve the association of the audio to the correct "sounding object" (the object producing the sound). Priors can be viewed as potential regions in the image from which the sound may originate. We can draw parallels to work in two-stage object detection methods, in which region proposal networks are used to identify regions in the image space that could potentially be objects. However, generating potential candidate regions for sound source localization is more challenging because the generated priors should be relevant from a multi-modal perspective. 
In order to generate these informative priors for where sounds may originate, we leverage optical flow.</p><p>The intuition behind using optical flow to create an enhanced prior is that optical flow models patterns of apparent motion of objects. This is important because, most often, an object moving in a video tends to be the sound source. Enforcing a constraint that prioritizes objects in relative motion can therefore lend itself to better sound source localizations. This paper proposes an optical flow-based localization network that creates informative priors for performing superior sound source localization. The contributions of this paper are as follows:</p><p>1. We explore the need for creating informative priors for visual sound source localization, which is a complementary research direction to prior methods. 2. We propose the use of optical flow as an additional source of information to create informative priors. 3. We design an optical flow-based localization network that uses cross-attention to form stronger audio-visual associations for visual sound source localization. 4. We run extensive experiments on two benchmark datasets, VGG Sound and Flickr SoundNet, and demonstrate the effectiveness of our method. Our method consistently achieves superior results over the state-of-the-art. We perform rigorous ablation studies and provide quantitative and qualitative results, showing the superiority of our novel localization network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Generating robust multi-modal representations through joint audio-visual learning is an active area of research that has found application in multiple audio-visual tasks. Initial works in the area focused on probabilistic approaches. In <ref type="bibr">[17]</ref>, the audio-visual signals were modeled as samples from a multivariate Gaussian process, and audio-visual synchrony was defined as the mutual information between the modalities. <ref type="bibr">[12]</ref> focused on first learning a lower-dimensional subspace that maximized mutual information between the two modalities. Furthermore, they explored the relationship between these audio-visual signals using non-parametric density estimators. <ref type="bibr">[20]</ref> proposed a spatio-temporal segmentation mechanism that relied on the velocity and acceleration of moving objects as visual features and used canonical correlation analysis to associate the audio with relevant visual features. In recent years, deep learning-based methods have been used to explore the creation of better bimodal representations. They mostly employ two-stream networks to encode each modality individually and use contrastive loss-based supervision to align the two representations <ref type="bibr">[19]</ref>. Methods like <ref type="bibr">[1,</ref><ref type="bibr">32]</ref> used source separation to localize audio via motion trajectory-based fusion and synchronization. Furthermore, <ref type="bibr">[25]</ref> addressed the problem of separating multiple sound sources from unconstrained videos by creating coarse- to fine-grained alignment of audio-visual representations. Additionally, methods like <ref type="bibr">[24,</ref><ref type="bibr">25]</ref> use class-specific saliency maps. <ref type="bibr">[33]</ref> uses class attention maps to help generate saliency maps that are used for better sound source localization. 
More recently, methods have focused on creating objective functions specific to sound localization. <ref type="bibr">[6]</ref> introduced the concept of a tri-map, which incorporates background mining techniques into the self-supervised learning setting. The tri-map contains an area of positive correlation, an area of no correlation (background), and an ignore zone to avoid uncertain areas in the visual space. <ref type="bibr">[30]</ref> introduced a negative-free method for sound localization by mining explicit positives. Further, this method uses a predictive coding technique to create a better feature alignment between the audio and visual modalities. These recent methods mainly focus on creating stronger optimization objectives for visual sound source localization. A complementary direction in the research landscape is to explore creating more informative priors for audio-visual association. In this paper, we explore one such idea, which leverages optical flow. The authors of <ref type="bibr">[3]</ref> have explored the use of optical flow in the context of certain audio-visual tasks, like retrieval. In this work, we explore the use of optical flow as an informative prior for visual sound source localization.</p><p>Optical flow provides a means to estimate pixel-wise motion between consecutive frames. Early works <ref type="bibr">[5,</ref><ref type="bibr">18,</ref><ref type="bibr">31]</ref> presented optical flow prediction as an energy minimization problem with several objective terms utilizing continuous optimization. Optical flow maps can be broadly divided into two types: sparse and dense. Sparse optical flow represents the motion of salient features in a frame, whereas dense optical flow represents motion flow vectors for the whole frame. 
Earlier methods for sparse optical flow estimation include the Lucas-Kanade algorithm <ref type="bibr">[21]</ref>, which utilizes brightness constancy equations to optimize a least-squares approximation under the assumption that flow remains locally smooth and the relative displacement of neighboring pixels is constant. Farneback <ref type="bibr">[10]</ref> proposed a dense optical flow estimation technique in which quadratic polynomials were used to approximate the pixel neighborhoods of two frames, and these polynomials were then used to compute global displacement. FlowNet <ref type="bibr">[11]</ref> proposed the first CNN-based approach toward estimating optical flow maps, computing static cross-correlation between intermediate convolutional feature maps for two consecutive frames and up-scaling them to extract optical flow maps.</p></div>
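The Lucas-Kanade least-squares step described above can be illustrated with a minimal NumPy sketch. The helper `lucas_kanade_patch` is hypothetical (not from any of the cited implementations) and assumes a single patch with constant displacement and no pyramid:

```python
import numpy as np

def lucas_kanade_patch(I0, I1):
    """Illustrative single-patch Lucas-Kanade step.

    Under brightness constancy, Ix*u + Iy*v + It = 0 at every pixel;
    assuming one constant displacement (u, v) for the whole patch,
    we solve the stacked system A [u, v]^T = -b in the least-squares sense.
    """
    Ix = np.gradient(I0, axis=1).ravel()   # spatial gradient along x
    Iy = np.gradient(I0, axis=0).ravel()   # spatial gradient along y
    It = (I1 - I0).ravel()                 # temporal difference
    A = np.stack([Ix, Iy], axis=1)
    v, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return v  # estimated (u, v) displacement of the patch
```

On a linear intensity ramp shifted by one pixel, the recovered horizontal displacement is 1, matching the brightness-constancy assumption exactly.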
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>In this section, we will first present the formulation of the sound source localization problem under a supervised setting. Following this, we will describe the current self-supervised approach, motivate the need for better localization proposals for sound source localization, and subsequently elaborate on the design and implementation of our novel optical flow-based sound source localization network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Statement</head><p>Given a video consisting of both the audio and visual modality, visual sound source localization aims to find the spatial region in the visual modality that generated the audio. Consider a video consisting of N frames. Let the image corresponding to a video frame be I, where I ∈ ℝ^{W_i × H_i × 3}, and let A be the spectrogram representation generated from the audio around the frame, where A ∈ ℝ^{W_a × H_a × 1}. The problem of audio localization can be thought of as finding the region in I that has a high association/correlation with A. More formally, this can be written as: f_v = Φ(I; θ_i), f_a = Ψ(A; θ_j), P(I, A) = ω(f_v, f_a) (1)</p><p>where Φ(I; θ_i) and Ψ(A; θ_j) correspond to convolutional neural network-based feature extractors associated with the visual and audio modalities, and f_v ∈ ℝ^{m × n × c} and f_a ∈ ℝ^{m × n × c} are the corresponding lower-dimensional feature maps, respectively. ω is the function that finds the association between the two modalities, and P(I, A) is the region in the original image space containing the source that generated the audio. It is important to note that extrapolating the association in the feature space to the corresponding region of the original image space (i.e., P(I, A)) is trivial. Given the above-mentioned feature maps, one way of finding an association between the feature representations is: S(x, y) = ⟨GAP(f_a), f_v(x, y)⟩ / (‖GAP(f_a)‖ ‖f_v(x, y)‖), x ∈ {1, …, m}, y ∈ {1, …, n} (2)</p><p>where GAP(f_a) is the global-average-pooled representation of the audio feature map. S represents the cosine similarity of this audio representation to each spatial location in the visual feature map. Here m and n are the width and height of the feature map. 
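The similarity map of Eq 2 can be sketched in a few lines of NumPy. This is an illustrative helper under assumed (m, n, c) feature layouts, not the authors' implementation:

```python
import numpy as np

def similarity_map(f_v, f_a):
    """Cosine similarity between the global-average-pooled audio feature
    and each spatial location of the visual feature map (cf. Eq 2).

    f_v: visual features, shape (m, n, c)
    f_a: audio features,  shape (m, n, c)
    Returns S with shape (m, n).
    """
    g = f_a.mean(axis=(0, 1))                      # GAP(f_a), shape (c,)
    g = g / (np.linalg.norm(g) + 1e-8)             # unit-normalize audio vector
    v = f_v / (np.linalg.norm(f_v, axis=-1, keepdims=True) + 1e-8)
    return v @ g                                   # (m, n) cosine similarities
```

A visual location whose feature vector points in the same direction as the pooled audio vector scores close to 1, making it the peak of the localization map.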
If a binary mask M ∈ ℝ^{m × n × 1} generated from a ground truth indicating positive and negative regions of audio-visual correspondence is available, we can formulate the learning objective in a supervised setting: for a given sample k (with an image frame I_k and audio A_k) in the dataset, the positive and negative responses can be defined as P_k = (1/|M|) ⟨M, S_{k→k}⟩, N_k = (1/|1 − M|) ⟨1 − M, S_{k→k}⟩ + (1/(mn)) ⟨1, S_{k→j}⟩ (3)</p><p>Here S_{k→k} refers to the cosine similarity S from Eq 2 when using I_k and A_k. Similarly, S_{k→j} is the cosine similarity when the image and audio are not from the same video. ⟨•, •⟩ denotes the inner product. The final learning objective has a similar formulation to <ref type="bibr">[23]</ref>: L = −(1/B) Σ_k log(exp(P_k) / (exp(P_k) + exp(N_k))) (4)</p></div>
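A hedged sketch of this masked contrastive objective (Eqs 3–4) for a single sample, again as an illustrative NumPy helper rather than the paper's code; the exact weighting of the mismatched-pair term is an assumption:

```python
import numpy as np

def contrastive_loss(S_pos, S_neg, M):
    """Masked contrastive objective for one sample (sketch of Eqs 3-4).

    S_pos: similarity map for a matching image/audio pair, shape (m, n)
    S_neg: similarity map for a mismatched pair,           shape (m, n)
    M:     binary mask of positive regions,                shape (m, n)
    """
    pos = (M * S_pos).sum() / (M.sum() + 1e-8)                 # response inside M
    neg_in = ((1 - M) * S_pos).sum() / ((1 - M).sum() + 1e-8)  # background of same pair
    neg_cross = S_neg.mean()                                   # mismatched pair
    neg = neg_in + neg_cross
    # InfoNCE-style: push the positive response above the negatives
    return -np.log(np.exp(pos) / (np.exp(pos) + np.exp(neg)))
```

The loss decreases when similarity concentrates inside the mask and drops in the background and in mismatched pairs.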
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Self-Supervised Localization</head><p>In most real-world scenarios, the ground truth necessary to generate the binary mask M would be missing. Hence there is a need for a training objective that does not rely on explicit ground truth annotations. One way to achieve this is to replace the ground truth mask with a generated pseudo mask, as proposed in <ref type="bibr">[6]</ref>. The pseudo mask can be generated by binarizing the similarity matrix S based on a threshold. More specifically, given S_{k→k} from Eq 2, the pseudo mask can be written as: M̂ = σ((S_{k→k} − ϵ) / τ) (5)</p><p>where ϵ is a scalar threshold. σ denotes the sigmoid function, which maps similarity values in S_{k→k} below the threshold toward 0 and those above the threshold toward 1. τ is the temperature controlling the sharpness. Additionally, <ref type="bibr">[6]</ref> further refines the pseudo mask by eliminating potentially noisy associations. This is done by considering separate positive and negative thresholds above and below the similarity values that are considered reliable. If a value lies between these thresholds, it is considered a noisy association and is subsequently ignored. More formally: M̂_p = σ((S_{k→k} − ϵ_p) / τ), M̂_n = σ((ϵ_n − S_{k→k}) / τ) (6)</p><p>Figure <ref type="figure">2</ref>. Overview of our optical flow-based sound localization method. Given a chosen frame of a video and the audio surrounding that frame, we extract features from both modalities, which are then used to attend towards sounding objects in the frame. We further compute the dense optical flow field from the chosen and subsequent frame and use flow features to attend towards moving objects in the frame.</p><p>Here ϵ_p and ϵ_n are the positive and negative thresholds, respectively. 
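The tri-map pseudo masks of Eqs 5–6 amount to two soft thresholds on the similarity map. A minimal sketch, using the threshold values reported later in Section 4.3 as defaults (`trimap` is a hypothetical helper name):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trimap(S, eps_p=0.65, eps_n=0.4, tau=0.03):
    """Soft positive/negative pseudo masks (sketch of Eqs 5-6).

    Similarities above eps_p give a near-1 positive mask, those below
    eps_n a near-1 negative mask; values between the two thresholds are
    suppressed in both masks, i.e. effectively ignored.
    """
    m_pos = sigmoid((S - eps_p) / tau)   # ~1 where S > eps_p
    m_neg = sigmoid((eps_n - S) / tau)   # ~1 where S < eps_n
    return m_pos, m_neg
```

With τ = 0.03 the transitions are sharp: a similarity of 0.5 lands in the ignore zone, contributing almost nothing to either mask.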
Once the positive and negative responses are computed, the overall training objective is similar to Eq 4.</p><p>In the above approach, it is logical to bootstrap the prediction and perform self-supervised training only if the pseudo masks of Eq 5 generated in the initial training iterations resemble the ground truth. However, this is not guaranteed, since the feature extractors associated with the individual modalities (in Eq 1) are randomly initialized. Therefore, a high or low value in the similarity matrix S_{k→k} during the initial iterations of self-supervised training may not correspond to informative positive or negative regions, since the feature extractors are not trained. If a feature extractor is initialized with pretrained weights from a classification task (for example, the visual extractor on ImageNet), the network will often activate on objects in the image. Considering this characteristic as an object-centric prior, it may be useful for self-supervised sound localization, as the most salient objects in a frame are often the ones emitting the sound. However, situations may arise where the source of the audio is not the most salient object in the frame. This would produce sub-optimal associations S_{k→k} in the initial iterations, which, when used for self-supervised training as in Eq 6, would lead to sub-optimal performance. As a result, there is a need to construct more meaningful priors when computing S_{k→k} to improve audio-visual associations, subsequently improving self-supervised learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Optical-Flow Based Localization Network</head><p>Having motivated the need for meaningful priors that enable better audio-visual associations, we approach the problem from an object detection viewpoint. In earlier object detection methods such as R-CNN <ref type="bibr">[15]</ref> and Fast R-CNN <ref type="bibr">[14]</ref>, selective search was used to generate region proposals. Selective search provided a set of probable locations where an object of interest may be present. An alternative to selective search-based approaches is a two-stage approach as in <ref type="bibr">[27]</ref>, using region proposal networks. Most of these region proposal networks have auxiliary training objectives in order to produce regions containing potential objects. Using these objectives to generate potential regions of interest in a self-supervised setting becomes challenging. Furthermore, generating candidate regions using selective search or regular region proposal networks, based only on the visual modality, might not be well suited for enforcing priors for a cross-modal task such as visual sound source localization.</p><p>As a better alternative, we use optical flow to generate informative localization proposals. Optical flow computed from frames of a video can efficiently capture the objects that are moving. Most often, these objects are the source of the sound. Capturing optical flow in the pixel space can often provide a good prior to improve audio-visual association. Furthermore, since optical flow tends to focus on the relative motion of objects rather than on the salient objects, it can complement the priors of the pre-trained vision model, which tends to focus on the latter. 
We design a network, as shown in Figure <ref type="figure">2</ref>, which takes in optical flow computed between two adjacent video frames and generates regions in the feature map f_v that act as priors to create better audio-visual associations. The localization network is comprised of a cross-attention between the feature representations extracted from the image and flow modalities. Given the flow feature representation f_f and visual feature representation f_v, we project these feature representations using separate projection layers to create two tensors K_v and Q_f. β is computed as an outer product of the tensors K_v and Q_f along the channel dimension. That is, if K_v, Q_f ∈ ℝ^{m × n × d}, then the resulting β ∈ ℝ^{m × n × d × d} is computed as below: β(x, y) = K_v(x, y) ⊗ Q_f(x, y), β(x, y) ∈ ℝ^{d × d} (7)</p><p>The softmax function is applied to the final dimension to normalize the attention matrix. The goal is to compute the attention to be applied at each spatial location, thus yielding a cross-attention matrix of size d × d for each spatial location. We compute another tensor from the visual modality, V_v ∈ ℝ^{m × n × d}. For each spatial location in V_v, we have a d-dimensional representation, which we multiply by the corresponding d × d attention matrix in β. That is: E(x, y) = softmax(β(x, y)) V_v(x, y) (8)</p><p>Finally, E is projected back to produce the final cross-attended proposal prior E_p ∈ ℝ^{m × n × c}. In order to impose this prior for performing the audio-visual association, we add E_p to the visual feature map f_v, as shown in Figure <ref type="figure">2</ref>. The enhanced audio-visual association can be written as: S(x, y) = ⟨GAP(f_a), (f_v ⊕ E_p)(x, y)⟩ / (‖GAP(f_a)‖ ‖(f_v ⊕ E_p)(x, y)‖) (9)</p><p>where ⊕ denotes element-wise addition. Once the enhanced audio-visual association is obtained, we use Eq 6 to compute the positive and negative responses. We train the entire network (feature extractors and localization network) end-to-end using the optimization objective mentioned in Eq 4.</p></div>
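The per-location cross-attention described in this section can be sketched with `einsum`: a d × d outer product of the visual key and flow query at each spatial location, softmax-normalized over the last axis, then applied to the visual value vector. This NumPy sketch is illustrative only (the actual network uses learned projections and a final projection back to c channels, omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def flow_cross_attention(K_v, Q_f, V_v):
    """Per-location flow/visual cross-attention (sketch of Sec. 3.3).

    K_v, Q_f, V_v: (m, n, d) projections.
    beta[x, y] is the d x d outer product of K_v[x, y] and Q_f[x, y];
    after a softmax over its last axis, E[x, y] = beta[x, y] @ V_v[x, y].
    """
    beta = np.einsum('mni,mnj->mnij', K_v, Q_f)   # (m, n, d, d) outer products
    beta = softmax(beta, axis=-1)                 # normalize attention rows
    E = np.einsum('mnij,mnj->mni', beta, V_v)     # (m, n, d) attended values
    return E
```

Note that the attention is computed independently at each of the m × n spatial locations, which is what lets the flow features re-weight the visual channels locally rather than globally.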
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets</head><p>For training and evaluating our proposed model, we follow prior work in this area and use two large-scale audiovisual datasets:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">Flickr SoundNet</head><p>Flickr SoundNet <ref type="bibr">[4]</ref> is a collection of over 2 million unconstrained videos collected from the Flickr platform. To directly compare against prior works, we construct two subsets of 10k and 144k videos that are preprocessed into extracted image-audio pairs, described further in Section 4.3. The Flickr SoundNet evaluation dataset consists of 250 image-audio pairs with labeled bounding boxes localized on the sound source in the image, manually annotated by <ref type="bibr">[28]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">VGG Sound</head><p>VGG Sound <ref type="bibr">[7]</ref> is a dataset of 200k video clips spread across 309 sound categories. Similar to Flickr SoundNet, we construct subsets of 10k and 144k image-audio pairs to train our proposed model. For evaluation, we utilize the VGG Sound Source <ref type="bibr">[6]</ref> dataset, which contains 5000 annotated image-audio pairs spanning 220 sound categories. Compared to the Flickr SoundNet test set, which has about 50 sound categories, VGG Sound Source has significantly more sound categories, making it a more challenging scenario for sound localization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluation Metrics</head><p>For fair comparisons against prior works, we use two metrics to quantify audio localization performance: Consensus Intersection Over Union (cIoU) and Area Under Curve of cIoU scores (AUC) <ref type="bibr">[28]</ref>. cIoU quantifies localization performance by measuring the intersection over union of a ground-truth annotation and a localization map, where the ground truth is an aggregation of multiple annotations, providing a single consensus. AUC is calculated as the area under the curve of cIoU success rates at thresholds varying from 0 to 1. In our experiments, we report cIoU at a threshold of 0.5, denoted cIoU_0.5, and AUC scores, denoted AUC_cIoU.</p></div>
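A simplified sketch of these two metrics. The published cIoU weights a consensus over several annotators; here the ground truth is assumed to already be a single binary consensus mask, so this is an approximation for illustration:

```python
import numpy as np

def ciou(loc_map, gt_mask, thr=0.5):
    """Simplified consensus IoU: binarize the localization map at `thr`
    and compute IoU against a binary consensus ground-truth mask."""
    pred = loc_map >= thr
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def auc_of_ciou(scores, taus=np.linspace(0, 1, 21)):
    """Area under the success curve: the fraction of samples whose cIoU
    exceeds each threshold, averaged over the thresholds."""
    scores = np.asarray(scores)
    return np.mean([(scores >= t).mean() for t in taus])
```

A perfect localization map yields cIoU 1.0 and, over a test set of such maps, an AUC of 1.0; the reported numbers in Tables 1 and 2 are these quantities averaged over the respective test sets.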
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Implementation Details</head><p>In this paper, sound source localization is defined as localizing an excerpt of audio to its origin location in an image frame, both extracted from their respective video clip. For both Flickr SoundNet and VGG Sound, we extract the middle frame of a video along with 3 seconds of audio centered around the middle frame and a calculated dense optical flow field to construct an image-flow-audio pair. For the image frames, we resize images to 224 × 224 and perform random cropping and horizontal flipping data augmentations. To calculate an optical flow field corresponding to the middle frame, we take the middle frame and subsequent frame of a video V, denoted by V_t and V_{t+1} respectively, and use the Gunnar Farneback <ref type="bibr">[10]</ref> algorithm to generate a 2-channel flow field corresponding to horizontal and vertical flow vectors denoting movement magnitude. We similarly perform random cropping and horizontal flipping of the flow fields, performed consistently with the image augmentations. For audio, we sample 3 seconds of the video at 16kHz and construct a log-scaled spectrogram using a bin size of 256, an FFT window of 512 samples, and a stride of 274 samples, resulting in a shape of 257 × 300.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Method | Training Set | cIoU_0.5 | AUC_cIoU
Attention <ref type="bibr">[28]</ref> | Flickr 10k | 0.436 | 0.449
CoarseToFine <ref type="bibr">[25]</ref> | Flickr 10k | 0.522 | 0.496
AVObject <ref type="bibr">[1]</ref> | Flickr 10k | 0.546 | 0.504
LVS * <ref type="bibr">[6]</ref> | Flickr 10k | 0.730 | 0.578
SSPL <ref type="bibr">[30]</ref> | Flickr 10k | 0.743 | 0.587
HTF (Ours) | Flickr 10k | 0.860 | 0.634
Attention <ref type="bibr">[28]</ref> | Flickr 144k | 0.660 | 0.558
DMC <ref type="bibr">[19]</ref> | Flickr 144k | 0.671 | 0.568
LVS * <ref type="bibr">[6]</ref> | Flickr 144k | 0.702 | 0.588
LVS † <ref type="bibr">[6]</ref> | Flickr 144k | 0.697 | 0.560
HardPos <ref type="bibr">[29]</ref> | Flickr 144k | 0.762 | 0.597
SSPL <ref type="bibr">[30]</ref> | Flickr 144k | 0.759 | 0.610
HTF (Ours) | Flickr 144k | 0.865 | 0.639
LVS * <ref type="bibr">[6]</ref> | VGGSound 144k | 0.719 | 0.587
HardPos <ref type="bibr">[29]</ref> | VGGSound 144k | 0.768 | 0.592
SSPL <ref type="bibr">[30]</ref> | VGGSound 144k | 0.767 | 0.605
HTF (Ours) | VGGSound 144k | 0.848 | 0.640</p><p>Table 1. Quantitative results on the Flickr SoundNet testing dataset, where models are trained on the two training subsets of Flickr SoundNet and on VGG Sound 144k. "*" denotes our faithful reproduction of the method, and "†" denotes our evaluation reproduction using officially provided model weights.</p><p>Following <ref type="bibr">[6]</ref>, we use ResNet18 backbones as the visual and audio feature extractors. Similarly, we use ResNet18 as the optical flow feature extractor. We pretrain the visual and flow feature extractors on ImageNet and leave the audio network randomly initialized. During training, we keep the visual feature extractor parameters frozen. 
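The log-spectrogram preprocessing of Section 4.3 can be approximated with a plain NumPy STFT. The framing and padding conventions below are assumptions, not the paper's exact pipeline (its reported 300-frame output for 3 s at 16 kHz implies additional padding that is omitted here):

```python
import numpy as np

def log_spectrogram(audio, n_fft=512, hop=274):
    """Log-scaled magnitude spectrogram, sketching Sec. 4.3's parameters:
    512-sample FFT window, 274-sample stride, 257 frequency bins.
    No center-padding is applied, so the frame count differs slightly
    from the paper's reported 257 x 300 shape."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=-1))   # (n_frames, n_fft//2 + 1)
    return np.log(mag + 1e-6).T                  # (257, n_frames)
```

For 3 seconds at 16 kHz (48000 samples), this framing yields 174 frames of 257 bins each; center-padding to reach exactly 300 frames is left to the reader's STFT library of choice.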
For all experiments, we train the model using the Adam optimizer with a learning rate of 10^-3 and a batch size of 128. We train for 100 epochs on both the 10k and 144k sample subsets of Flickr SoundNet and VGGSound. We set ϵ_p = 0.65, ϵ_n = 0.4, and τ = 0.03, as described in Eq 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Quantitative Evaluation</head><p>In this section, we compare our method against prior works <ref type="bibr">[1,</ref><ref type="bibr">6,</ref><ref type="bibr">19,</ref><ref type="bibr">25,</ref><ref type="bibr">28,</ref><ref type="bibr">29,</ref><ref type="bibr">30]</ref> on standardized experiments for self-supervised visual sound source localization. Results for various training configurations are reported in Tables <ref type="table">1</ref> and <ref type="table">2</ref> for the Flickr SoundNet and VGG Sound Source testing datasets, respectively.</p><p>As shown in Tables <ref type="table">1</ref> and 2, our method, HTF, significantly outperforms all prior methods, setting a new state-of-the-art for self-supervised sound source localization. On</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Method | Training Set | cIoU_0.5 | AUC_cIoU
Attention <ref type="bibr">[28]</ref> | VGGSound 10k | 0.160 | 0.283
LVS * <ref type="bibr">[6]</ref> | VGGSound 10k | 0.297 | 0.358
SSPL <ref type="bibr">[30]</ref> | VGGSound 10k | 0.314 | 0.369
HTF (Ours) | VGGSound 10k | 0.393 | 0.398
Attention <ref type="bibr">[28]</ref> | VGGSound 144k | 0.185 | 0.302
AVObject <ref type="bibr">[1]</ref> | VGGSound 144k | 0.297 | 0.357
LVS * <ref type="bibr">[6]</ref> | VGGSound 144k | 0.301 | 0.361
LVS † <ref type="bibr">[6]</ref> | VGGSound 144k | 0.288 | 0.359
HardPos <ref type="bibr">[29]</ref> | VGGSound 144k | 0.346 | 0.380
SSPL <ref type="bibr">[30]</ref> | VGGSound 144k | 0.339 | 0.380
HTF (Ours) | VGGSound 144k | 0.394 | 0.400
Table 2. Quantitative results on the VGG Sound Source testing dataset, where models are trained on the two training subsets of VGG Sound. Table 3. Quantitative results on the VGG Sound Source testing dataset on heard and unheard class subsets. Each model is trained on 50k samples belonging to 110 (heard) classes.</p><p>the Flickr testing set, we achieved an improvement of 11.7% cIoU and 4.7% AUC when trained on 10k Flickr samples, and 10.6% cIoU and 2.9% AUC when trained with 144k Flickr samples. Similarly, on the VGG Sound Source testing set, we improve by 7.9% cIoU and 2.9% AUC when trained on 10k VGG Sound samples, and 5.5% cIoU and 2.0% AUC when trained on 144k samples. Further, we investigate the robustness of our method by evaluating it across the VGG Sound and Flickr SoundNet datasets. Specifically, we train our model with 144k VGG Sound samples and test on the Flickr SoundNet test set. Comparing against <ref type="bibr">[6,</ref><ref type="bibr">29,</ref><ref type="bibr">30]</ref>, we significantly outperform all methods, as shown in Table <ref type="table">1</ref>, which shows our model is capable of generalizing well across datasets. We further investigate our method's robustness by testing on sound categories that are disjoint from those seen during training. Following <ref type="bibr">[6]</ref>, we sample 110 sound categories from VGG Sound for training and test on the same 110 categories (heard) used during training and on 110 other disjoint (unheard) sound categories. 
As shown in Table <ref type="table">3</ref>, we outperform <ref type="bibr">[6]</ref> on both the heard and unheard testing subsets. In addition, we highlight that performance on the unheard subset slightly exceeds that on the heard subset, showing that our model performs well on unheard sound categories.</p><p>Utilizing a self-supervised loss formulation similar to <ref type="bibr">[6]</ref>, we see that our method significantly outperforms it on both testing datasets across all training setups and experiments. We highlight that these improvements come from incorporating a more informative prior, based on optical flow, into the sound localization objective. In Section 4.6, we further investigate the direct influence of incorporating optical flow along with our other design choices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Qualitative Evaluation</head><p>In Figure <ref type="figure">3</ref>, we visualize and compare sound localizations of LVS <ref type="bibr">[6]</ref> and our method on the Flickr SoundNet and VGG Sound Source test sets. As shown, our method can accurately localize various types of sound sources. Comparing against LVS <ref type="bibr">[6]</ref>, we observe localization improvements across multiple samples, specifically where sounding objects exhibit a high flow magnitude through movement. For example, in the first column, LVS <ref type="bibr">[6]</ref> localizes only a small portion of the sounding vehicle, while our method localizes the entire vehicle, where a significant magnitude of flow is exhibited. In the fifth column, our method more accurately localizes the two crowds in the stadium, both of which are sound sources exhibiting movement.</p><p>However, it is also important to investigate samples where little optical flow is present. It is possible that a frame in a video exhibits little movement, for example, a stationary car or person emitting noise. In these cases, there is no meaningful optical flow to localize towards. In Figure <ref type="figure">4</ref>, we see that even in the absence of significant optical flow, our method still localizes on par with or better than LVS <ref type="bibr">[6]</ref>. This reinforces that optical flow acts as an optional prior: areas of high movement, when present, help the model localize better, but they are not required. In the following section, we further investigate the exact effects of introducing priors like optical flow into the self-supervised framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Ablation Studies</head><p>In this section, we explore the implications of our design choices with multiple ablation studies. As explained in Section 3.2, we explore the need for informative priors to train a self-supervised audio localization network. We introduce optical flow as one of these priors, in addition to pretraining the vision network on ImageNet to provide an object-centric prior. In Table <ref type="table">4</ref>, we study the individual effects of each of these design choices, namely adding the flow attention mechanism, using ImageNet weights for the vision encoder, and freezing the vision encoder during training.</p><p>When training the model without any priors (model 4.a), we see that performance suffers, as there is little meaningful information for the self-supervised objective. However, when simply adding the optical flow attention previously described (model 4.d), we see a large performance improvement, as the network can now use optical flow to better localize: moving objects are often the ones emitting sound. Similarly, when using ImageNet pretrained weights (model 4.b), we see a significant performance improvement, as the model now has an object-centric prior, where salient objects in an image are often the ones emitting sound. When combining both priors (model 4.e), we see even further performance improvements, which shows the importance of incorporating multiple informative priors into the self-supervised sound localization objective. We further explore the effects of freezing the vision encoder during training. As previously mentioned, a network pretrained on a classification task such as ImageNet will often have high activations around salient objects (an object-centric prior). When training in a self-supervised setting, the network may drift from its original weights and instead develop a less object-centric focus, which may be suboptimal for sound source localization. 
When freezing the vision encoder in the non-flow setting (model 4.c), we see performance decrease slightly compared to the unfrozen counterpart (model 4.b). However, when freezing it in the optical flow setting (model 4.f), we see a slight improvement over the flow setting with an unfrozen vision encoder (model 4.e). We infer that constraining the vision encoder to retain its object-centric characteristics, while the flow encoder reasons about and attends to other parts of the image, produces a more informative representation and improves localization performance.</p><p>Finally, we explore variations of the optical flow encoder to better understand how the optical flow information is being used. We replace the learnable ResNet18 encoder with a single max pooling layer to see whether the mere presence of movement is still informative for localizing sounds. As shown in Table <ref type="table">5</ref>, when using a simple max pooling layer (model 5.a), we still observe a significant performance improvement over the networks without optical flow (models 4.a-c). However, a learnable encoder, such as a ResNet18 network, improves further over the max pooling layer. While the max pooling layer only captures the presence of movement at a particular location, a learnable encoder allows deeper reasoning about the flow information. For example, the eighth column in Figure <ref type="figure">3</ref> shows an optical flow field where the sounding object (a tractor) is not moving but the environment around it is. In this case, the max pooling encoder biases the network away from the sounding object, whereas a learnable encoder can better reason about the flow in the given frame, improving overall localization performance.</p></div>
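The max pooling ablation above can be sketched as follows. This is a simplified stand-in for the learned flow attention, not the paper's implementation: the function names (`max_pool2x2`, `flow_attend`) are illustrative, and there is no softmax or learned weighting. The non-learnable "encoder" reduces the flow-magnitude field to coarse presence-of-motion scores, which then reweight a coarse visual feature map.

```python
def max_pool2x2(grid):
    """Non-learnable flow 'encoder' from the ablation: 2x2 max pooling
    keeps only the local presence of motion (assumes even H and W)."""
    h, w = len(grid), len(grid[0])
    return [[max(grid[i][j], grid[i][j + 1],
                 grid[i + 1][j], grid[i + 1][j + 1])
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def flow_attend(features, flow_mag):
    """Reweight a coarse visual feature map by pooled flow magnitude.
    A softmax-free stand-in for the learned attention mechanism."""
    attn = max_pool2x2(flow_mag)
    total = sum(sum(row) for row in attn) or 1.0  # guard: static scene
    return [[f * a / total for f, a in zip(frow, arow)]
            for frow, arow in zip(features, attn)]

# Motion in the top-right quadrant pulls attention there.
flow_mag = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
features = [[1.0, 1.0],
            [1.0, 1.0]]
attended = flow_attend(features, flow_mag)
```

This sketch also exposes the failure mode discussed above: if the background moves while the sounding object is static (the tractor example), the pooled magnitudes pull attention away from the object, whereas a learnable encoder can learn to discount such flow fields.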
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we introduce a novel self-supervised sound source localization method that uses optical flow to aid in localizing the sounding object in a video frame. In a video, moving objects are often the ones making the sound. We take advantage of this observation by using optical flow as a prior in the self-supervised learning setting. We formulate the self-supervised objective and describe the cross-attention mechanism of optical flow over the corresponding video frame. We evaluate our approach on standard datasets, compare against prior works, and show state-of-the-art results across all experiments and evaluations. Further, we conduct extensive ablation studies to show the necessity and effect of including informative priors, such as optical flow, in the self-supervised sound localization objective.</p><p>While we explore optical flow in this work, other priors may be explored to further improve sound source localization. For example, pretraining the audio encoder can likely provide a better understanding of the class of sound being emitted, which can then be used to help localize toward that specific object. Further, improving the optical flow generation, for example, by using improved flow estimation methods or by aggregating flow across multiple frames, can strengthen the optical flow signal and ultimately improve overall localization performance. We leave the exploration of these hypotheses for future work.</p></div>
		</body>
		</text>
</TEI>
