<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Scene-Aware Audio Rendering via Deep Acoustic Analysis</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10181766</idno>
					<idno type="doi">10.1109/tvcg.2020.2973058</idno>
					<title level='j'>IEEE Transactions on Visualization and Computer Graphics</title>
<idno>1077-2626</idno>
<biblScope unit="volume">26</biblScope>
<biblScope unit="issue">5</biblScope>					

					<author>Zhenyu Tang</author><author>Nicholas J. Bryan</author><author>Dingzeyu Li</author><author>Timothy R. Langlois</author><author>Dinesh Manocha</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Fig. 1: Given a natural sound in a real-world room that is recorded using a cellphone microphone (left), we estimate the acoustic material properties and the frequency equalization of the room using a novel deep learning approach (middle). We use the estimated acoustic material properties for generating plausible sound effects in the virtual model of the room (right). Our approach is general and robust, and works well with commodity devices.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Auditory perception of recorded sound is strongly affected by the acoustic environment it is captured in. Concert halls are carefully designed to enhance the sound on stage, even accounting for the effects an audience of human bodies will have on the propagation of sound <ref type="bibr">[2]</ref>. Anechoic chambers are designed to remove acoustic reflections and propagation effects as much as possible. Home theaters are designed with acoustic absorption and diffusion panels, as well as with careful speaker and seating arrangements <ref type="bibr">[47]</ref>.</p><p>The same acoustic effects are important when creating immersive effects for virtual reality (VR) and augmented reality (AR) applications. It is well known that realistic sounds can improve a user's sense of presence and immersion <ref type="bibr">[33]</ref>. There is considerable work on interactive sound propagation in virtual environments based on geometric and wave-based methods <ref type="bibr">[7,</ref><ref type="bibr">43,</ref><ref type="bibr">53,</ref><ref type="bibr">72]</ref>. Furthermore, these techniques are increasingly used to generate plausible sound effects in VR systems and games, including Microsoft Project Acoustics, Oculus Spatializer, and Steam Audio. However, these methods are limited to synthetic scenes where an exact geometric representation of the scene and acoustic material properties are known a priori.</p><p>In this paper, we address the problem of rendering realistic sounds that are similar to recordings of real acoustic scenes. These capabilities have been explored through multi-modal estimation and optimization <ref type="bibr">[52]</ref> and scene-aware audio in 360&#176; videos <ref type="bibr">[35]</ref>. However, these approaches either require a separate recording of an IR, or produce audio results that are perceptually different from recorded scene audio. Important acoustic properties can be extracted from IRs, including the reverberation time (T 60 ), which is defined as the time it takes for a sound to decay 60 decibels <ref type="bibr">[32]</ref>, and the frequency-dependent amplitude level or equalization (EQ) <ref type="bibr">[22]</ref>.</p><p>Fig. <ref type="figure">2</ref>: Our pipeline: Starting with an audio-video recording (left), we estimate the 3D geometric representation of the environment using standard computer vision methods. We use the reconstructed 3D model to simulate new audio effects in that scene. To ensure our simulation results perceptually match recorded audio in the scene, we automatically estimate two acoustic properties from the audio recordings: the frequency-dependent reverberation time (T 60 ) of the environment, and a frequency-dependent equalization curve. The T 60 is used to optimize the frequency-dependent absorption coefficients of the materials in the scene. The frequency equalization filter is applied to the simulated audio, and accounts for the missing wave effects in geometric acoustics simulation. We use these parameters for interactive scene-aware audio rendering (right).</p><p>Main Results: We present novel algorithms to estimate two important environmental acoustic properties from recorded sounds (e.g., speech). Our approach uses commodity microphones and does not need to capture any IRs. The first property is the frequency-dependent T 60 . This is used to optimize absorption coefficients for geometric acoustic (GA) simulators for audio rendering. Next, we estimate a frequency equalization filter to account for wave effects that cannot be modeled accurately using geometric acoustic simulation algorithms. 
This equalization step is crucial to ensuring that our GA simulator outputs perceptually match existing recorded audio in the scene.</p><p>Estimating the equalization filter without an IR is challenging since it is not only speaker dependent, but also scene dependent, which poses extra difficulties in terms of dataset collection. For a model to predict the equalization filtering behavior accurately, we need a large amount of diverse speech data and IRs. Our key idea is a novel dataset augmentation process that significantly increases room equalization variation. With robust room acoustic estimation as input, we present a novel inverse material optimization algorithm to estimate the acoustic properties. We propose a new objective function for material optimization and show that it models the IR decay behavior better than the technique by Li et al. <ref type="bibr">[35]</ref>. We demonstrate our ability to add new sound sources in regular videos. Similar to visual relighting examples where new objects can be rendered with photorealistic lighting, we enable audio reproduction in any regular video with existing sound, with applications to mixed reality experiences. We highlight the performance of our algorithms on many challenging benchmarks.</p><p>We show the importance of matched T 60 and equalization in our perceptual user study (&#167;5). In particular, our perceptual evaluation results show that: (1) Our T 60 estimation method is perceptually comparable to all past baseline approaches, even though we do not require an explicit measured IR; (2) Our EQ estimation method improves the performance of our T 60 -only approach by a statistically significant amount (&#8776; 10 rating points on a 100-point scale); and (3) Our combined method (T 60 +EQ) outperforms the average room IR (T 60 = 0.5 seconds with uniform EQ) by a statistically significant amount (+10 rating points); the average room IR is the only reasonable comparable baseline we could conceive that does not require an explicit IR estimate. 
To the best of our knowledge, ours is the first method to predict IR equalization from raw speech data and validate its accuracy. Our main contributions include:</p><p>&#8226; A CNN-based model to estimate the frequency-dependent T 60 and equalization filter from real-world speech recordings.</p><p>&#8226; An equalization augmentation scheme for training to improve prediction robustness.</p><p>&#8226; A derivation of a new optimization objective that better models the IR decay process for inverse material optimization.</p><p>&#8226; A user study to compare and validate our performance with current state-of-the-art audio rendering algorithms. Our study is used to evaluate the perceptual similarity between the recorded sounds and our rendered audio.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Cohesive audio in mixed reality environments (when there is a mix of real and virtual content) is more difficult to achieve than in fully virtual environments. This stems from the difference between "Plausibility" in VR and "Authenticity" in AR <ref type="bibr">[29]</ref>. Visual cues dominate acoustic cues, so a mismatch between a sound and the environment in which it is seen is perceived less strongly than a mismatch between two sounds. Recently, Li et al. introduced scene-aware audio to optimize simulator parameters to match the room acoustics from existing recordings <ref type="bibr">[35]</ref>. By leveraging visual information for acoustic material classification, Schissler et al. demonstrated realistic audio for 3D-reconstructed real-world scenes <ref type="bibr">[52]</ref>. However, both of these methods still require explicit measurement of IRs. In contrast, our proposed pipeline works with any input speech signal and commodity microphones.</p><p>Sound simulation can be categorized into wave-based methods and geometric acoustics. While wave-based methods generally produce more accurate results, it remains an open challenge to build a real-time universal wave solver. Recent advances such as parallelization via rectangular decomposition <ref type="bibr">[38]</ref>, pre-computed acceleration structures <ref type="bibr">[36]</ref>, and coupling with geometric acoustics <ref type="bibr">[48,</ref><ref type="bibr">73]</ref> are used for interactive applications. It is also possible to precompute low-frequency wave-based propagation effects in large scenes <ref type="bibr">[45]</ref>, and to perceptually compress them to reduce runtime requirements <ref type="bibr">[44]</ref>. Even with these massive speedups and a real-time runtime engine, these methods still require tens of minutes to hours of pre-computation depending on the size of the scene and the frequency range chosen, making them impractical for augmented reality scenarios and difficult to include in an optimization loop to estimate material parameters. With interactive applications as our goal, most game engines and VR systems tend to use geometric acoustic simulation methods <ref type="bibr">[7,</ref><ref type="bibr">53,</ref><ref type="bibr">54,</ref><ref type="bibr">72]</ref>. These algorithms are based on fast ray tracing and perform specular and diffuse reflections <ref type="bibr">[50]</ref>. Some techniques have been proposed to approximate low-frequency diffraction effects using ray tracing <ref type="bibr">[48,</ref><ref type="bibr">66,</ref><ref type="bibr">69]</ref>. Our approach can be combined with any interactive audio simulation method, though our current implementation is based on bidirectional ray tracing <ref type="bibr">[7]</ref>. The sound propagation algorithms can also be used for acoustic material design optimization for synthetic scenes <ref type="bibr">[37]</ref>.</p><p>Fig. <ref type="figure">3</ref>: The simulated and recorded frequency responses in the same room at a sample rate of 44.1 kHz. Note that the recorded response has noticeable peaks and notches compared with the relatively flat simulated response. This is mainly caused by room equalization. Missing proper room equalization leads to discrepancies in audio quality and overall room acoustics.</p><p>The efficiency of deep neural networks has been shown in audio/video-related tasks that are challenging for traditional methods <ref type="bibr">[17,</ref><ref type="bibr">21,</ref><ref type="bibr">24,</ref><ref type="bibr">61,</ref><ref type="bibr">71]</ref>. Hershey et al. showed that it is feasible to use CNNs for large-scale audio classification problems <ref type="bibr">[23]</ref>. Many deep neural networks require a large amount of training data. Salamon et al. 
used data augmentation to improve environmental sound classification <ref type="bibr">[49]</ref>. Similarly, Bryan estimates the T 60 and the direct-to-reverberant ratio (DRR) from a single speech recording via augmented datasets <ref type="bibr">[5]</ref>. Tang et al. trained CRNN models purely on synthetic spatial IRs that generalize to real-world recordings <ref type="bibr">[63]</ref><ref type="bibr">[64]</ref><ref type="bibr">[65]</ref>. We strategically design an augmentation scheme to address the challenge of equalization's dependence on both IRs and speaker voice profiles, which is fully complementary to prior data-driven methods.</p><p>Acoustic simulators require a set of well-defined material properties. The material absorption coefficient is one of the most important parameters <ref type="bibr">[4]</ref>, ranging from 0 (total reflection) to 1 (total absorption). When a reference IR is available, it is straightforward to adjust room materials to match the energy decay of the simulated IR to the reference IR <ref type="bibr">[35]</ref>. Similarly, Ren et al. optimized linear modal analysis parameters to match given recordings <ref type="bibr">[46]</ref>. A probabilistic damping model for audio-material reconstruction has been presented for VR applications <ref type="bibr">[60]</ref>. Unlike all previous methods, which require a clean IR recording for accurate estimation and optimization of boundary materials, we infer typical material parameters including T 60 values and equalization from raw speech signals using a CNN-based model.</p><p>Analytical gradients can significantly accelerate the optimization process. With similar optimization objectives, it was shown that additional gradient information can boost the speed by more than a factor of ten <ref type="bibr">[35,</ref><ref type="bibr">52]</ref>. The speed gain shown by Li et al. <ref type="bibr">[35]</ref> is impressive, and we further improve the accuracy and speed of the formulation. 
More specifically, the original objective function evaluated energy decay relative to the first ray received (the direct sound if there were no obstacles). However, energy estimates can be noisy due to both the oscillatory nature of audio and simulator noise. Instead, we fit a line to the ray energies and optimize its slope toward the desired energy decay (defined by the T 60 ), which we found to be more robust.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DEEP ACOUSTIC ANALYSIS: OUR ALGORITHM</head><p>In this section, we give an overview of our proposed method for scene-aware audio rendering. We begin with background information, describe how we capture room geometry, and then discuss how we estimate the frequency-dependent room reverberation and equalization parameters directly from recorded speech. Finally, we discuss how we use the estimated acoustic parameters to optimize the acoustic materials so that our virtual acoustic model is calibrated to real-world recordings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Background</head><p>To explain the motivation of our approach, we briefly elaborate on the most difficult parts of previous approaches, upon which our method improves. Previous methods require an impulse response of the environment to estimate acoustic properties <ref type="bibr">[35,</ref><ref type="bibr">52]</ref>. Recording an impulse response is a non-trivial task. The most reliable methods involve playing and recording Golay codes <ref type="bibr">[19]</ref> or sine sweeps <ref type="bibr">[18]</ref>, both of which play loud and intrusive audio signals. Also required are a fairly high-quality speaker and microphone with a flat frequency response, low harmonic distortion, and little crosstalk. The speaker and microphone should be acoustically separated from surfaces, i.e., they should not be placed directly on tables (else surface vibrations could contaminate the signal). Clock drift between the source and microphone must be accounted for <ref type="bibr">[6]</ref>. Alternatively, balloon pops or hand claps have been proposed for easier IR estimation, but these require significant post-processing and are still very obtrusive <ref type="bibr">[1,</ref><ref type="bibr">56]</ref>. In short, correctly recording an IR is not easy, which makes it challenging to add audio in scenarios such as augmented reality, where the environment is not known beforehand and estimation must be done interactively to preserve immersion.</p><p>Geometric acoustics is a high-frequency approximation to the wave equation. It is a fast method, but it assumes that wavelengths are small compared to objects in the scene, while ignoring pressure effects <ref type="bibr">[50]</ref>. It misses several important wave effects such as diffraction and room resonance. Diffraction occurs when sound paths bend around objects that are of similar size to the wavelength. 
Resonance is a pressure effect in which certain wavelengths are reinforced or diminished by the room geometry, creating peaks or notches in the frequency spectrum through constructive or destructive interference <ref type="bibr">[12]</ref>.</p><p>We model these effects with a linear finite impulse response (FIR) equalization filter <ref type="bibr">[51]</ref>. We compute the discrete Fourier transform of the recorded IR over all frequencies, following <ref type="bibr">[35]</ref>. Instead of filtering directly in the frequency domain, we design a linear-phase EQ filter with 32 ms delay to compactly represent this filter at 7 octave bin locations. We then blindly estimate this compact representation of the frequency spectrum of the impulse response as discrete frequency gains, without specific knowledge of the input sound or room geometry. This is a challenging estimation task. Since the convolution of two signals (the IR and the input sound) is equivalent to multiplication in the frequency domain, estimating the frequency response of the IR is equivalent to estimating one multiplicative factor of a product without constraining the other. We rely on this approach to recognize a compact representation of the frequency response magnitude in different environments.</p><p>Fig. <ref type="figure">4</ref>: Network architecture for T 60 and EQ prediction. Two models are trained for T 60 and EQ, which have the same components except that the output layers have different dimensions customized for the octave bands they use.</p></div>
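To make this compact representation concrete, the per-octave gains of an IR can be read off its spectrum. The following is a minimal numpy sketch; the function name and the band edges at half an octave around each center are our illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def octave_eq(ir, fs, centers=(62.5, 125, 250, 500, 1000, 2000, 4000)):
    """Mean spectrum magnitude (dB) of an IR in each octave band,
    reported relative to the 1 kHz band (whose entry is 0 dB)."""
    spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(ir)) + 1e-12)
    f = np.fft.rfftfreq(len(ir), 1.0 / fs)
    gains = []
    for c in centers:
        # Octave band: [c/sqrt(2), c*sqrt(2))
        band = (f >= c / np.sqrt(2.0)) & (f < c * np.sqrt(2.0))
        gains.append(spec_db[band].mean())
    gains = np.asarray(gains)
    return gains - gains[centers.index(1000)]
```

For an ideal impulse (flat spectrum) all relative gains are 0 dB; real IRs exhibit the peaks and notches discussed above.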
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Geometry Reconstruction</head><p>We begin by estimating the room geometry. In our experiments, we utilize the ARKit-based iOS app MagicPlan<ref type="foot">foot_0</ref> to acquire the basic room geometry. A sample reconstruction is shown in Figure <ref type="figure">5</ref>. With computer vision research evolving rapidly, we believe constructing geometry proxies from video input will become even more robust and easily accessible <ref type="bibr">[3,</ref><ref type="bibr">74]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Learning Reverberation and Equalization</head><p>We use a convolutional neural network (Figure <ref type="figure">4</ref>) to predict room equalization and reverberation time (T 60 ) directly from a speech recording. Training requires a large number of speech recordings with known T 60 and room equalization. The standard practice is to generate speech recordings from known real-world or synthetic IRs <ref type="bibr">[14,</ref><ref type="bibr">28]</ref>. Unfortunately, large-scale IR datasets do not currently exist due to the difficulty of IR measurement; most publicly available IR datasets have fewer than 1000 IR recordings. Synthetic IRs are easy to obtain and can be used, but again lack wave-based effects and suffer from other simulation deficiencies. Recent work has addressed this issue by combining real-world IR measurements with augmentation to increase the diversity of existing real-world datasets <ref type="bibr">[5]</ref>. This work, however, only addresses T 60 and DRR augmentation, and lacks a method to augment the frequency equalization of existing IRs. We address this with the augmentation method proposed in Section 3.3.2. First, however, we discuss our neural network method for estimating both T 60 and equalization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Octave-Based Prediction</head><p>Most prior work takes the full-frequency range as input for prediction. For example, one closely related work <ref type="bibr">[5]</ref> only predicts one T 60 value for the entire frequency range (full-band). However, sound propagates and interacts with materials differently at different frequencies. To this end, we define our learning targets over several octaves. Specifically, we calculate T 60 at 7 sub-bands centered at {125, 250, 500, 1000, 2000, 4000, 8000}Hz. We found prediction of T 60 at the 62.5Hz band to be unreliable due to low signal-to-noise ratio (SNR). During material optimization, we set the 62.5Hz T 60 value to the 125Hz value. Our frequency equalization estimation is done at 6 octave bands centered at {62.5, 125, 250, 500, 2000, 4000}Hz. As we describe in &#167;3.3.2, we compute equalization relative to the 1kHz band, so we do not estimate it. When applying our equalization filter, we set bands greater than or equal to 8kHz to -50dB. Given our target sampling rate of 16kHz and the limited content of speech in higher octaves, this did not affect our estimation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Data Augmentation</head><p>We use the following datasets as the basis for our training and augmentation.</p><p>&#8226; ACE Challenge: 70 IRs and noise audio <ref type="bibr">[15]</ref>;</p><p>&#8226; MIT IR Survey: 271 IRs <ref type="bibr">[68]</ref>;</p><p>&#8226; DAPS dataset: 4.5 hours of 20 speakers' speech (10 males and 10 females) <ref type="bibr">[40]</ref>.</p><p>First, we use the method in <ref type="bibr">[5]</ref> to expand the T 60 and direct-to-reverberant ratio (DRR) range of the 70 ACE IRs, resulting in 7000 synthetic IRs with a balanced T 60 distribution between 0.1-1.5 seconds. Ground-truth T 60 values can be computed directly from IRs in a variety of ways. We follow the methodology of Karjalainen et al. <ref type="bibr">[27]</ref> when computing the T 60 from real IRs with a measurable noise floor. This method was found to be the most robust estimator of T 60 from real IRs in recent work <ref type="bibr">[15]</ref>. The final composition of our dataset is listed in Table <ref type="table">2</ref>.</p><p>While we know the common range of real-world T 60 values, there is limited literature giving statistics about room equalization. Therefore, we analyzed the equalization range and distribution of the 271 MIT survey IRs as guidance for data augmentation. The equalization of frequency bands is computed relative to the 1kHz octave. This is a common practice <ref type="bibr">[70]</ref>, unless expensive equipment is used to obtain calibrated acoustic pressure readings.</p><p>For our equalization augmentation procedure, we first fit a normal distribution (mean and standard deviation) to each sub-band amplitude of the MIT IR dataset as shown in Figure <ref type="figure">6</ref>. Given this set of parametric model estimates, we iterate through our training and validation IRs. For each IR, we extract its original EQ. 
We then randomly sample a target EQ according to our fit models (independently per frequency band), calculate the distance between the source and target EQ, and then design an FIR filter to compensate for the difference. For simplicity, we use the window method for FIR filter design <ref type="bibr">[59]</ref>. Note that we do not require a perfect filter design method; we simply need a procedure to increase the diversity of our data. Also note that we intentionally sample our augmented IRs to have a larger variance than the recorded IRs to further increase the variety of our training data.</p><p>We compute the log Mel-frequency spectrogram for each four-second audio clip, which is commonly used for speech-related tasks <ref type="bibr">[9,</ref><ref type="bibr">16]</ref>.</p></div>
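The augmentation step above can be sketched as follows, assuming per-band normal fits (mu, sigma, in dB) like those of Figure 6. The function name, the dense-grid interpolation, and the 1.25 variance inflation factor are our illustrative choices; the paper only specifies sampling with larger-than-recorded variance and window-method FIR design:

```python
import numpy as np

def eq_augment_filter(src_eq_db, mu, sigma, band_hz,
                      num_taps=512, fs=16000, rng=None):
    """Sample a target EQ from per-band normal fits (inflated variance)
    and return window-method FIR taps whose magnitude response
    approximates the dB difference between target and source EQ."""
    if rng is None:
        rng = np.random.default_rng()
    target_eq_db = rng.normal(mu, 1.25 * np.asarray(sigma))
    diff_db = target_eq_db - np.asarray(src_eq_db)
    # Interpolate the per-octave correction onto a dense half-spectrum
    # grid, linear in dB over log-frequency.
    n_fft = 4096
    f = np.fft.rfftfreq(n_fft, 1.0 / fs)
    logf = np.log2(np.maximum(f, 1.0))
    mag = 10.0 ** (np.interp(logf, np.log2(band_hz), diff_db) / 20.0)
    # Window method: zero-phase impulse response, shifted to be causal,
    # truncated and tapered with a Hann window.
    h = np.roll(np.fft.irfft(mag, n=n_fft), num_taps // 2)[:num_taps]
    return h * np.hanning(num_taps)
```

The augmented IR is then np.convolve(ir, h), after which speech can be convolved with the augmented IR to generate training examples.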
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">Network Architecture and Training</head><p>We use a network architecture that differs only in the final layer for T 60 and room equalization estimation. Six 2D convolutional layers are used sequentially to reduce both the time and frequency resolution of features until they have approximately the same dimension. Each conv layer is immediately followed by a rectified linear unit (ReLU) <ref type="bibr">[41]</ref> activation function, 2D max pooling, and batch normalization. The output from the conv layers is flattened to a 1D vector and connected to a fully connected layer of 64 units, with a dropout rate of 50% to lower the risk of overfitting. The final output layer has 7 fully connected units to predict a vector of length 7 for T 60 , or 6 fully connected units to predict a vector of length 6 for frequency equalization. This network architecture is inspired by Bryan <ref type="bibr">[5]</ref>, where it was used to predict full-band T 60 . We updated the output layer to predict the more challenging sub-band T 60 , and also discovered that the same architecture predicts equalization well.</p><p>For training the network, we use the mean square error (MSE) loss with the ADAM optimizer <ref type="bibr">[30]</ref> in Keras <ref type="bibr">[10]</ref>. The maximum number of epochs is 500, with an early stopping mechanism. We choose the model with the lowest validation error for further evaluation on the test set. Our model architecture is shown in Figure <ref type="figure">4</ref>.</p></div>
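A sketch of this architecture in Keras (which the paper uses) might look as follows. The input spectrogram shape, channel counts, and kernel sizes are our guesses, since only the overall recipe is specified (six conv blocks, a 64-unit dense layer, 50% dropout, and a 7- or 6-unit output):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_estimator(n_outputs, input_shape=(256, 64, 1)):
    """Six conv blocks (Conv2D -> ReLU -> MaxPool -> BatchNorm),
    then Flatten -> Dense(64) -> Dropout(0.5) -> Dense(n_outputs)."""
    model = models.Sequential([tf.keras.Input(shape=input_shape)])
    for channels in (8, 8, 16, 16, 32, 32):
        model.add(layers.Conv2D(channels, 3, padding="same"))
        model.add(layers.ReLU())
        model.add(layers.MaxPooling2D(2))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_outputs))  # 7 for sub-band T60, 6 for EQ
    return model

t60_model = build_estimator(7)
t60_model.compile(optimizer="adam", loss="mse")
```

The same builder with n_outputs=6 yields the EQ model, matching the paper's statement that the two models share all components except the output layer.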
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Acoustic Material Optimization</head><p>Our goal is to optimize the absorption coefficients of a set of room materials, at the same octave bands as our T 60 estimator in &#167; 3.3.1, to match the sub-band T 60 of the simulated sound with the target predicted in &#167; 3.3.</p><p>Ray Energy. We borrow notation from <ref type="bibr">[35]</ref>. Briefly, a geometric acoustic simulator generates a set of sound paths, each of which carries an amount of sound energy. Each material m i in a scene is described by a frequency-dependent absorption coefficient &#961; i . A path leaving the source is reflected by a set of materials before it reaches the listener. The energy fraction that is received by the listener along path j is e_j = &#946;_j &#8719;_{k=1..N_j} (1 - &#961;_{m_k}), where m k is the material the path intersects on the k-th bounce, N j is the number of surface reflections for path j, and &#946; j accounts for air absorption (dependent on the total length of the path). Our goal is to optimize the set of absorption coefficients &#961; i to match the energy distribution of the paths e j to that of the environment's IR. Again similar to <ref type="bibr">[35]</ref>, we assume the energy decrease of the IR follows an exponential curve, which is a linear decay in dB space. The slope of this decay line in dB space is m = -60/T 60 .</p><p>Objective Function. We propose the objective function J = (m&#770; - m)&#178;, where m&#770; is the slope of the best-fit line of the ray energies on a decibel scale: m&#770; = &#8721;_i (t_i - t&#772;)(y_i - y&#772;) / &#8721;_i (t_i - t&#772;)&#178;, with y_i = 10 log 10 (e_i) and t_i the arrival time of path i. We found this objective to be more robust than previous methods. Specifically, in comparison with Equation (3) in <ref type="bibr">[35]</ref>, we see that Li et al. tried to match the slope of the energies relative to e 0 , forcing e 0 to be at the origin on a dB scale. However, we only care about the energy decrease, and not the absolute scale of the values from the simulator. 
We found that allowing the absolute scale to move and only optimizing the slope of the best-fit line produces a better match to the target T 60 . We minimize J using the L-BFGS-B algorithm <ref type="bibr">[75]</ref>. The gradient of J with respect to an absorption coefficient &#961; n is given by &#8706;J/&#8706;&#961;_n = 2 (m&#770; - m) &#8706;m&#770;/&#8706;&#961;_n, where m&#770; is the fitted slope, m = -60/T 60 is the target slope, and &#8706;m&#770;/&#8706;&#961;_n = &#8721;_i (t_i - t&#772;)(&#8706;y_i/&#8706;&#961;_n) / &#8721;_i (t_i - t&#772;)&#178;, with &#8706;y_i/&#8706;&#961;_n = -(10/ln 10) &#183; N_{i,n}/(1 - &#961;_n), where N i,n is the number of reflections of path i off material n and y_i = 10 log 10 (e_i) is the path energy in dB.</p></div>
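The slope-matching objective and its analytic gradient can be written compactly. Below is a numpy sketch under our assumed closed forms (hits[i, n] counts the reflections of path i off material n; all names are illustrative); the returned pair can be handed to a standard L-BFGS-B implementation such as scipy.optimize.minimize with jac=True:

```python
import numpy as np

def objective_and_grad(rho, hits, t, beta, t60_target):
    """Objective J = (m_hat - m)^2 and its gradient dJ/drho.
    rho:  absorption coefficient per material (one frequency band)
    hits: hits[i, n] = reflections of path i off material n
    t:    arrival time of each path; beta: air absorption per path."""
    # Path energy in dB: y_i = 10 log10(beta_i * prod_n (1-rho_n)^hits)
    y = 10.0 * np.log10(beta) + (10.0 / np.log(10.0)) * (hits @ np.log(1.0 - rho))
    tc = t - t.mean()
    denom = tc @ tc
    m_hat = tc @ (y - y.mean()) / denom  # best-fit slope in dB/s
    m = -60.0 / t60_target               # target decay slope
    J = (m_hat - m) ** 2
    # dy_i/drho_n = -(10/ln10) * hits[i, n] / (1 - rho_n)
    dy = -(10.0 / np.log(10.0)) * hits / (1.0 - rho)
    grad = 2.0 * (m_hat - m) * (tc @ dy) / denom
    return J, grad
```

A finite-difference check confirms the analytic gradient; because only the slope enters J, any uniform rescaling of the path energies leaves the objective unchanged, which is exactly the scale invariance argued for above.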
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">ANALYSIS AND APPLICATIONS</head><head n="4.1">Analysis</head><p>Speed. We implement our system on an Intel Xeon(R) CPU @3.60GHz and an NVIDIA GTX 1080 Ti GPU. Our neural network inference runs at 222 frames per second (FPS) on 4-second sliding windows of audio due to its compact design (only 18K trainable parameters). Optimization runs twice as fast with our improved objective function. The sound rendering is based on the real-time geometric bidirectional sound path tracing from Cao et al. <ref type="bibr">[7]</ref>.</p><p>Sub-band T 60 prediction. We first evaluate our T 60 blind estimation model and achieve a mean absolute error (MAE) of 0.23s on the test set (MIT IRs). While the 271 IRs in the test set have a mean T 60 of 0.49s with a standard deviation (STD) of 0.85s at the 125Hz sub-band, the highest sub-band (8000Hz) only has a mean T 60 of 0.33s with a STD of 0.24s, which reflects a narrow subset of our T 60 augmentation range. We also note that the validation MAE on ACE IRs is 0.12s, which indicates that our validation set and test set come from different distributions. Another error source is the inaccurate labeling of low-frequency sub-band T 60 , as shown in Figure <ref type="figure">7</ref>, but we do not filter any outliers in the test set. In addition, our data is intended to cover frequency ranges up to 8000Hz, but human speech has less energy in the high-frequency range <ref type="bibr">[67]</ref>, which results in low signal energy in these sub-bands and makes learning more difficult.</p><p>Material Optimization. When we optimize the room material absorption coefficients according to the predicted T 60 of a room, our optimizer efficiently modifies the simulated energy curve to a desired energy decay rate (T 60 ), as shown in Figure <ref type="figure">8</ref>. 
We also fix the room configuration, set the target T 60 to values uniformly distributed between 0.2s and 2.5s, and evaluate the T 60 of the simulated IRs. The relationship between the target and output T 60 is shown in Figure <ref type="figure">9</ref>, in which our simulation closely matches the target, demonstrating that our optimization can match a wide range of T 60 values.</p><p>To test the real-world performance of our acoustic matching, we recorded ground-truth IRs in 5 benchmark scenes and then compared the method in <ref type="bibr">[35]</ref>, which requires a reference IR, with our method, which does not. Benchmark scenes and results are summarized in Table <ref type="table">3</ref>. We apply the EQ filter to the simulated IR as a last step. Overall, we obtain a prediction MAE of 3.42dB on our test set, whereas before augmentation the MAE was 4.72dB under the same training conditions, which confirms the effectiveness of our EQ augmentation. The perceptual impact of the EQ filter step is evaluated in &#167;5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Comparisons</head><p>We compare our work with two related projects, Schissler et al. <ref type="bibr">[52]</ref> and Kim et al. <ref type="bibr">[29]</ref>, whose high-level goals are similar to ours but whose specific approaches are different.</p><p>Material optimization is a key step in both our method and Schissler et al. <ref type="bibr">[52]</ref>. One major difference is that we additionally compensate for wave effects explicitly with an equalization filter. Figure <ref type="figure">10</ref> shows the difference in spectrograms: without the filter, the high-frequency equalization is not properly accounted for. Our method better replicates the rapid decay in the high-frequency range. For an audio comparison, please refer to our supplemental video.</p><p>Table <ref type="table">3</ref>: Benchmark results for acoustic matching. These real-world rooms are of different sizes and shapes, and contain a wide variety of acoustic materials such as brick, carpet, glass, metal, wood, and plastic, which make the problem acoustically challenging. We compare our method with <ref type="bibr">[35]</ref>. Our method does not require a reference IR and still obtains similar T 60 and EQ errors in most scenes compared with their method. We also achieve faster optimization speed. Note that the input audio to our method is already noisy and reverberant, whereas <ref type="bibr">[35]</ref> requires a clean reference IR.</p><p>We also want to highlight the importance of optimizing T 60 . In <ref type="bibr">[29]</ref>, a CNN is used for object-based material classification. Default materials are assigned to a limited set of objects. Without optimizing specifically for the audio objective, the resulting sound might not blend in seamlessly with the existing audio. In Figure <ref type="figure">11</ref>, we show that our method produces audio that matches the decay tail better, whereas <ref type="bibr">[29]</ref> produces a longer reverb tail than the recorded ground truth.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Applications</head><p>Acoustic Matching in Videos Given a video recorded in an acoustic environment, our method can analyze the room acoustic properties from the noisy, reverberant audio in the video. The room geometry can be estimated from the video <ref type="bibr">[3]</ref> if the user has no access to the room for measurement. During post-processing, we can simulate sound that is similar to the recorded sound in the room. Moreover, virtual characters or speakers, such as the ones shown in Figure <ref type="figure">1</ref>, can be added to the video, generating sound that is consistent with the real-world environment. Fig. <ref type="figure">11</ref>: We demonstrate the importance of T 60 optimization on the audio amplitude waveform. Our method optimizes the material parameters based on the input audio and matches the tail shape and decay amplitude of the recorded sound, whereas the visual-based object materials from Kim et al. <ref type="bibr">[29]</ref> fail to compensate for the audio effects.</p><p>Real-time Immersive Augmented Reality Audio Our method runs in real time and can be integrated into modern AR systems. AR devices are capable of capturing real-world geometry and can stream audio input to our pipeline. At interactive rates, we can optimize and update the material properties, as well as the room EQ filter. Our method is not hardware-dependent and can be used with any AR device that provides geometry and audio, enabling a more immersive listening experience. Fig. <ref type="figure">12</ref>: A screenshot of the MUSHRA-like web interface used in our user study. The design is from Cartwright et al. <ref type="bibr">[8]</ref>.</p><p>Real-world Computer-Aided Acoustic Design Computer-aided design (CAD) software has been used for designing architectural acoustics, usually before construction is done, in a predictive manner <ref type="bibr">[31,</ref><ref type="bibr">42]</ref>. 
But when given an existing real-world environment, adapting to the current conditions is challenging for traditional CAD software, because acoustic measurement can be tedious and error-prone. With our method, room materials and EQ properties can be estimated from simple input and then fed to other acoustic design applications to improve the room acoustics, such as material replacement, source and listener placement <ref type="bibr">[39]</ref>, and soundproofing setup.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">PERCEPTUAL EVALUATION</head><p>We perceptually evaluated our approach using a critical listening test. For this test, we studied the perceptual similarity between a reference speech recording and speech recordings convolved with simulated impulse responses. We used the same speech content for the reference and all stimuli under testing, evaluating how well we can reconstruct identical speech content in a given acoustic scene. This is useful for understanding the absolute performance of our approach compared to the ground-truth results.</p></div>
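Creating such stimuli amounts to convolving one dry speech recording with each scene's (simulated or measured) IR. A minimal sketch, with scipy's FFT convolution and peak normalization for level matching (the normalization scheme is our assumption, not specified above):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_stimulus(dry_speech, ir, peak=0.99):
    """Convolve dry speech with an impulse response and peak-normalize,
    so every stimulus in a trial plays back at a comparable level."""
    wet = fftconvolve(dry_speech, ir, mode="full")
    return wet * (peak / np.max(np.abs(wet)))
```

Convolving the same dry recording with each condition's IR guarantees that only the acoustic rendering, not the speech content, differs across stimuli.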
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Design and Procedure</head><p>For our test, we adopted the multiple stimulus with hidden reference and anchor (MUSHRA) methodology from the ITU-R BS.1534-3 recommendation <ref type="bibr">[57]</ref>. MUSHRA provides a protocol for the subjective assessment of intermediate quality levels of audio systems <ref type="bibr">[57]</ref> and has been adopted for a wide variety of audio processing tasks such as audio coding, source separation, and speech synthesis evaluation <ref type="bibr">[8,</ref><ref type="bibr">55]</ref>.</p><p>In a single MUSHRA trial, participants are presented with a high-quality reference signal and asked to compare the quality (or similarity) of three to twelve stimuli on a 0-100 point scale using a set of vertical sliders, as shown in Figure <ref type="figure">12</ref>. The stimuli must contain a hidden reference (identical to the explicit reference), two anchor conditions (low-quality and high-quality), and any additional conditions under study (a maximum of nine). The hidden reference and anchors help participants calibrate their ratings relative to one another and are also used to filter out inaccurate assessors in a post-screening process. MUSHRA tests serve a similar purpose to mean opinion score (MOS) tests <ref type="bibr">[58]</ref>, but require fewer participants to obtain statistically significant results.</p><p>We performed our studies using Amazon Mechanical Turk (AMT), resulting in a MUSHRA-like protocol <ref type="bibr">[8]</ref>. In recent years, web-based MUSHRA-like tests have become a standard methodology and have been shown to perform equivalently to full, in-person tests <ref type="bibr">[8,</ref><ref type="bibr">55]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Participants</head><p>We recruited 269 participants on AMT to rate one or more of our five acoustic scenes under testing, following the approach proposed by Cartwright et al. <ref type="bibr">[8]</ref>. To increase the quality of the evaluation, we pre-screened the participants for our tests. First, we required that all participants had a minimum of 1000 approved Human Intelligence Task (HIT) assignments and at least 97 percent of all their assignments approved. Second, all participants had to pass a hearing screening test verifying that they were listening over devices with an adequate frequency response. This was done by asking participants to listen to two separate eight-second recordings consisting of a 55Hz tone, a 10kHz tone, and zero to six tones of random frequency. If any user failed to count the number of tones correctly on two or more attempts, they were not allowed to proceed. Out of the 269 participants who attempted our test, 261 passed.</p></div>
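A tone-counting screening clip of this kind can be synthesized directly. The sketch below is our own reconstruction: tone duration, level, random-frequency range, and placement are assumptions not stated above. It builds an eight-second clip containing a 55 Hz tone, a 10 kHz tone, and a chosen number of random tones.

```python
import numpy as np

def screening_clip(n_random=3, sr=44100, dur=8.0, seed=0):
    """Hearing-screening clip: a 55 Hz tone, a 10 kHz tone, and
    n_random tones at random frequencies, each placed at a random
    offset (tones may overlap). A listener with full-range playback
    should count 2 + n_random tones."""
    rng = np.random.default_rng(seed)
    clip = np.zeros(int(dur * sr))
    t_tone = np.arange(int(0.8 * sr)) / sr  # 0.8 s per tone (assumed)
    freqs = [55.0, 10000.0] + list(rng.uniform(200.0, 8000.0, n_random))
    for f in freqs:
        start = int(rng.integers(0, len(clip) - len(t_tone)))
        clip[start:start + len(t_tone)] += 0.3 * np.sin(2.0 * np.pi * f * t_tone)
    return clip, len(freqs)
```

Because the 55 Hz and 10 kHz tones sit near the edges of typical consumer playback ranges, miscounting them reveals an inadequate frequency response.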
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Training</head><p>After passing our hearing screening test, each user was presented with a one-page training task. For this, the participant was provided with two sets of recordings. The first set consisted of three recordings: a reference, a low-quality anchor, and a high-quality anchor. The second set consisted of the full set of recordings used for the given MUSHRA trial, albeit without the vertical sliders present. To proceed to the actual test, participants were required to listen to each recording in full. In total, we estimated the training time to be approximately two minutes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Stimuli</head><p>For our test conditions, we simulated five different acoustic scenes. For each scene, a separate MUSHRA trial was created; in AMT terms, each scene was presented as a separate HIT per user. For each MUSHRA trial or HIT, we tested the following stimuli: hidden reference, low-quality anchor, mid-quality anchor, Baseline T 60 , Baseline T 60 +EQ, Proposed T 60 , and Proposed T 60 +EQ.</p><p>As noted in the ITU-R BS.1534-3 specification <ref type="bibr">[57]</ref>, both the reference and the anchors have a significant effect on the test results; they must resemble the artifacts produced by the systems under test and must be designed carefully. For our work, we set the hidden reference to an identical copy of the explicit reference (as required), which consisted of speech convolved with the ground-truth IR for each acoustic scene. We set the low-quality anchor to completely anechoic, non-reverberated speech, and the mid-quality anchor to speech convolved with an impulse response with a 0.5-second T 60 across frequencies (typical of a conference room) and uniform equalization.</p><p>For our baseline comparison, we included two baseline approaches following previous work <ref type="bibr">[35]</ref>. More specifically, our Baseline T 60 leverages the geometric acoustics method proposed by Cao et al. <ref type="bibr">[7]</ref> as well as the materials analysis calibration method of Li et al. <ref type="bibr">[35]</ref>. Our Baseline T 60 +EQ extends this with the additional frequency equalization analysis <ref type="bibr">[35]</ref>. These two baselines directly correspond to the proposed materials optimization (Proposed T 60 ) and equalization prediction (Proposed T 60 +EQ) subsystems in our work. The key difference is that we estimate the parameters necessary for both steps blindly from speech.</p></div>
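A mid-quality anchor of this kind needs no measurement at all. A minimal sketch of one common construction (our own choice of synthesis; the study's exact anchor generation may differ): Gaussian noise shaped by an exponential envelope, which yields the same 0.5 s T 60 in every band, i.e. uniform equalization.

```python
import numpy as np

def anchor_ir(t60=0.5, sr=16000, length_s=1.0, seed=0):
    """Frequency-flat synthetic IR: white Gaussian noise under an
    exponential envelope whose energy falls 60 dB after t60 seconds,
    giving the same T60 in every octave band."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * sr)) / sr
    # amplitude 10**(-3 t / t60) -> energy -60 dB at t = t60
    return rng.standard_normal(len(t)) * 10.0 ** (-3.0 * t / t60)
```

Convolving the dry speech with `anchor_ir()` produces the mid-quality anchor stimulus described above.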
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">User Study Results</head><p>When analyzing the results of our listening test, we post-filtered the responses following the ITU-R BS.1534-3 specification <ref type="bibr">[57]</ref>. More specifically, we excluded assessors who rated the hidden reference condition lower than a score of 90 for more than 15% of the test items, or who rated the mid-range (or low-range) anchor higher than a score of 90 for more than 15% of the test items.</p><p>Using this post-filtering, we reduced our collected data to 70 unique participants and 108 unique test trials, spread across our five acoustic scene conditions. Among these participants, 24 were female and 46 were male, with an average age of 36.0 years and a standard deviation of 10.2 years. We show box plots of our results in Figure <ref type="figure">13</ref>. The median ratings for each stimulus were: Baseline T 60 (62.0), Baseline T 60 +EQ (85.0), Low-Anchor (40.5), Mid-Anchor (59.0), Proposed T 60 (61.5), Proposed T 60 +EQ (71.0), and Hidden Reference (99.5). As seen, the Low-Anchor and Hidden Reference outline the range of user scores for our test. Fig. <ref type="figure">13</ref>: Box plot results for our listening test. Participants were asked to rate how similar each recording was to the explicit reference. All recordings have the same content, but different acoustic conditions. Note our Proposed T 60 and Proposed T 60 +EQ are both better than the Mid-Anchor by a statistically significant amount (&#8776;10 rating points on a 100-point scale). Among our proposed approaches, the Proposed T 60 +EQ method achieves the highest overall listening test performance. We then see that our Proposed T 60 and Proposed T 60 +EQ methods outperform the mid-anchor. 
Our proposed T 60 method is comparable to the Baseline T 60 method, and our proposed T 60 +EQ method outperforms our proposed T 60 -only method.</p><p>To understand the statistical significance, we performed a repeated measures analysis of variance (ANOVA) to compare the effect of our stimuli on user ratings. The Hidden Reference and Low-Anchor serve calibration and filtering purposes and are not included in the following statistical tests, leaving 5 groups for comparison. Bartlett's test did not show a violation of homogeneity of variances (&#967; 2 = 4.68, p = 0.32). A one-way repeated measures ANOVA shows significant differences (F(4, 372) = 29.24, p &lt; 0.01) among group mean ratings. To identify the source of the differences, we further conducted multiple post-hoc paired t-tests with Bonferroni correction <ref type="bibr">[26]</ref>. We observe the following results: a) There is no significant difference between Baseline T 60 and Proposed T 60 (t(186) = -1.72, p = 0.35), suggesting that we cannot reject the null hypothesis of identical average scores between prior work (which uses manually measured IRs) and ours; b) There is a significant difference between Baseline T 60 +EQ and Proposed T 60 +EQ (t(186) = -5.09, p &lt; 0.01), suggesting our EQ method has a statistically different (lower) average; c) There is a significant difference between Proposed T 60 and Proposed T 60 +EQ (t(186) = -2.91, p = 0.02), suggesting our EQ method significantly improves performance over our proposed T 60 -only subsystem; d) There is a significant difference between Mid-Anchor and Proposed T 60 +EQ (t(186) = -3.78, p &lt; 0.01), suggesting our method is statistically different from (higher performing than) simply using an average room T 60 and uniform equalization.</p><p>In summary, we see that our proposed T 60 computation method is comparable to prior work, albeit we perform the estimation directly from a short speech recording rather than relying on 
intrusive IR measurement schemes. Further, our complete proposed system (Proposed T 60 +EQ) outperforms both the mid-anchor and the Proposed T 60 system alone, demonstrating the value of EQ estimation. Finally, we note that our Proposed T 60 +EQ method does not perform as well as prior work, largely due to the EQ estimation subsystem. This result, however, is expected, as prior work requires manual IR measurements, which yield perfect EQ estimation. This is in contrast to our work, which directly estimates both T 60 and EQ parameters from recorded speech, enabling a drastically improved interaction paradigm for matching acoustics in several applications.</p></div>
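The post-screening rule and the Bonferroni-corrected post-hoc tests described above can be sketched as follows. This is our own illustration assuming scipy; the condition names are placeholders, not the study's actual data layout.

```python
import numpy as np
from scipy import stats

def keep_assessor(ratings):
    """ITU-R BS.1534-3-style post-screening: exclude an assessor who
    rated the hidden reference below 90 on more than 15% of test items,
    or rated the mid/low anchor above 90 on more than 15% of items.
    `ratings` maps a condition name to scores across test items."""
    ref_low = np.mean(np.asarray(ratings["hidden_ref"]) < 90) > 0.15
    anchor_high = any(np.mean(np.asarray(ratings[name]) > 90) > 0.15
                      for name in ("low_anchor", "mid_anchor"))
    return not (ref_low or anchor_high)

def posthoc_paired_tests(groups):
    """Paired t-tests over all condition pairs, with Bonferroni
    correction (each p-value multiplied by the number of comparisons,
    clipped to 1)."""
    names = sorted(groups)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    results = {}
    for a, b in pairs:
        t, p = stats.ttest_rel(groups[a], groups[b])
        results[(a, b)] = (float(t), min(1.0, float(p) * len(pairs)))
    return results
```

Bonferroni correction keeps the family-wise error rate at the nominal level when several pairwise comparisons are made on the same participants.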
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION AND FUTURE WORK</head><p>We present a new pipeline to estimate, optimize, and render immersive audio in video and mixed reality applications. We present novel algorithms to estimate two important acoustic environment characteristics: the frequency-dependent reverberation time and the equalization filter of a room. Our multi-band octave-based prediction model works in tandem with our equalization augmentation and provides robust input to our improved materials optimization algorithm. Our user study validates the perceptual importance of our method. To the best of our knowledge, ours is the first method to predict IR equalization from raw speech data and validate its accuracy.</p><p>Limitations and Future Work. To achieve a perfect acoustic match, one would expect the real-world validation error to be zero. In reality, zero error is a sufficient but not a necessary condition. In our evaluation tests, we observe that small validation errors still allow for plausible acoustic matching. While reducing the prediction error is an important direction, it is also useful to investigate the perceptual error thresholds for acoustic matching in different tasks or applications. Moreover, temporal prediction coherence is not part of our evaluation process; given a sliding window of audio recordings, our model might predict temporally incoherent T 60 values. Utilizing this coherence to improve prediction accuracy is an interesting direction for future work.</p><p>Modeling real-world characteristics in simulation is a non-trivial task: as in previous work along this line, our simulator does not fully recreate the real world in terms of precise details. For example, we did not consider the speaker or microphone response curves in our simulation. In addition, sound sources are modeled as omnidirectional sources <ref type="bibr">[7]</ref>, whereas real sources exhibit directional patterns. 
It remains an open research challenge to perfectly replicate and simulate the real world in a simulator.</p><p>Like all data-driven methods, our learned model performs best on the same kind of data on which it was trained. Augmentation is useful because it broadens the existing dataset so that the learned model can extrapolate to unseen data. However, defining the range of augmentation is not straightforward. We use the MIT IR dataset as the baseline for our augmentation process; in certain cases, this assumption might not generalize well to extreme room acoustics, and better, more universal augmentation algorithms are needed. Our method focuses on estimation from speech signals due to their pervasiveness and importance. It would be useful to explore how well the estimation works on other audio domains, especially for frequency ranges outside typical human speech. This could further increase the usefulness of our method, e.g., by estimating acoustic properties from ambient/HVAC noise instead of requiring a speech signal.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>https://www.magicplan.app/</p></note>
		</body>
		</text>
</TEI>
