<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Robot Learning</publisher>
				<date>11/04/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10557731</idno>
					<idno type="doi"></idno>
					
					<author>Z Liu</author><author>C Chi</author><author>E Cousineau</author><author>N Kuppuswamy</author><author>B Burchfiel</author><author>S Song</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Audio signals provide rich information for the robot interaction and object properties through contact. This information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments by learning from diverse in-the-wild human demonstrations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1 panels: a) Contact Events, b) Contact Modes, c) Surface Materials, d) Object States</head><note type="other">Surface labels in the figure: furry, rough</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><note type="other">8th Conference on Robot Learning (CoRL 2024), Munich, Germany.</note><p>Selecting and executing good contact is at the core of robot manipulation. However, most vision-based robotic systems today are limited in their ability to sense and use contact information. In this work, we propose a robotic system that learns contact through a common yet under-explored modality: audio. Our first insight is that audio signals provide rich contact information. During a manipulation task, audio feedback can reveal several key pieces of information about the interaction and object properties, including:</p><p>&#8226; Contact events and modes: From wiping on a surface to flipping an object with a spatula, audio feedback captures salient and distinct signals that can be used for detecting contact events and characterizing contact modes (Fig. <ref type="figure">1 a</ref>, <ref type="figure">b</ref>).</p><p>&#8226; Surface materials: Audio signals can be used to characterize the surface material through contact with the object. In contrast, both image sensors and vision-based tactile sensors require high spatial resolution to capture subtle texture differences (e.g., the 'hook' and 'loop' sides of velcro tapes) (Fig. <ref type="figure">1 c</ref>).</p><p>&#8226; Object states and properties: Without the need for direct contact, audio signals can provide complementary information about the object state and physical properties beyond visual observation (Fig. <ref type="figure">1 d</ref>).</p><p>Second, audio data is scalable for data collection and policy learning. This is because acoustic sensors (i.e., contact microphones) are cheap, robust, and readily available to purchase. Audio signals also have standardized coding formats that can be easily integrated into existing video recording and storage pipelines (e.g., MP4 files). 
These nice properties make it possible to collect audio in the wild with low-cost data collection devices, such as a hand-held gripper, without the need for a robot. On the other hand, alternative ways to sense contact, such as tactile sensors, are relatively more expensive, fragile, and require expert knowledge to use.</p><p>Given the richness and scalability of audio data, we propose a versatile robot learning system, ManiWAV, that leverages audio feedback for contact-rich robot manipulation tasks. Building upon the portable hand-held data collection device UMI <ref type="bibr">[1]</ref>, we redesign one gripper finger to embed a piezoelectric contact microphone that senses audio vibrations through contact with solid objects. The audio signals can be easily streamed to the GoPro camera through a mic port and stored synchronously with vision data in MP4 files. With this intuitive design, one can demonstrate a wide range of manipulation tasks with synchronous vision and audio feedback at a low time and maintenance cost.</p><p>To learn from the collected demonstrations, one key challenge is to bridge the audio domain gap between in-the-wild data and actual robot deployment due to test-time noises (Fig. <ref type="figure">2 b</ref>). To achieve this goal, we propose a data augmentation strategy that encourages learning of task-relevant audio representation. In addition, we propose an end-to-end sensorimotor learning network to encode and fuse the vision and audio data, with a diffusion policy <ref type="bibr">[2]</ref> head for action prediction.</p><p>We demonstrate the capability of our proposed system on four contact-rich manipulation tasks: wipe shape from whiteboard, flip bagel with spatula, pour objects from cup, and strap wires with velcro tape. We also show that our system can generalize to unseen in-the-wild environments by leveraging in-the-wild data collected from diverse environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Tactile Sensing for Contact-rich Manipulation. As a functional equivalent to the human touch modality, tactile sensing has long been studied to provide feedback for the robot's physical interactions <ref type="bibr">[3]</ref>. Ranging from six-axis force/torque sensors <ref type="bibr">[4,</ref><ref type="bibr">5]</ref> to camera-based tactile sensors <ref type="bibr">[6,</ref><ref type="bibr">7,</ref><ref type="bibr">8,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">11]</ref> and tactile skins <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>, tactile sensors take various forms and provide information of contact forces <ref type="bibr">[16]</ref>, object geometry <ref type="bibr">[17,</ref><ref type="bibr">18]</ref>, etc. Many recent works have incorporated tactile feedback into robot learning pipelines for contact-rich manipulation tasks by learning better visuo-tactile representations <ref type="bibr">[19,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">24]</ref>. However, most tactile sensors are limited in their reproducibility given the cost and requirement for domain knowledge to use. In this work, we find that acoustic sensors such as piezoelectric contact microphones can provide alternative tactile feedback at a much lower cost and higher availability.</p><p>Acoustic Sensing in Controlled Environments. Sound is an important information carrier of the physical environment. Acoustic sensing can be categorized into active and passive. Active acoustic sensing is done by emitting a vibration waveform with a speaker, which is then received by a microphone. 
Prior works have embedded active acoustic sensors on objects <ref type="bibr">[25]</ref>, robot arms <ref type="bibr">[26,</ref><ref type="bibr">27]</ref>, parallel gripper <ref type="bibr">[28]</ref> or soft finger <ref type="bibr">[29]</ref> to sense object material, shape, and contact interactions. Passive acoustic sensing captures sounds generated from interactions, and prior works have shown that including audio as input to end-to-end robot learning algorithms improves the performance of manipulation tasks <ref type="bibr">[30,</ref><ref type="bibr">31,</ref><ref type="bibr">32,</ref><ref type="bibr">33]</ref>. However, the collection of audio data in prior works requires a controlled environment, where the data is collected through teleoperation, with sensors attached to either the robot or the object. To address this limitation, we propose an 'ear-in-hand' gripper design with passive acoustic sensing to collect human demonstrations without the need for a robot, making it promising to collect in-the-wild audio-visual data at a lower cost of time.</p><p>Policy Learning from Multisensory Data. Multisensory observations (e.g., vision, audio, tactile) allow robots to better perceive the physical environment and guide action planning <ref type="bibr">[19,</ref><ref type="bibr">34,</ref><ref type="bibr">35,</ref><ref type="bibr">36]</ref>. However, many prior works require pre-defined state abstractions <ref type="bibr">[37,</ref><ref type="bibr">38]</ref> to learn the control policy and are thus limited in their ability to generalize to new tasks. Recently, end-to-end models have been proposed to take in multisensory inputs and output robot actions through a behavior cloning approach <ref type="bibr">[33,</ref><ref type="bibr">31,</ref><ref type="bibr">32,</ref><ref type="bibr">30]</ref>. 
We extend upon prior works and propose an audio data augmentation strategy to bridge the domain gap between in-the-wild data and robot data during deployment, together with an end-to-end policy learning network to effectively learn task-relevant audio-visual representations from multimodal human demonstrations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head><p>We propose a data collection and policy learning framework to learn contact-rich manipulation tasks from vision and audio. On the data collection front, our goal is to easily collect in-the-wild demonstrations with clean and salient contact signals. To achieve this, we extend the UMI <ref type="bibr">[1]</ref> data collection device by embedding a piezoelectric contact microphone in one gripper finger to stream audio data synchronously with the vision data through the GoPro camera mic port.</p><p>On the algorithm front, one key challenge is to bridge the audio domain gap between the collected demonstrations and feedback received during robot deployment, as illustrated in Fig. <ref type="figure">2 (b</ref>). Another challenge is to learn a robust and task-relevant audio-visual representation that can effectively guide the downstream policy. To address these challenges, we propose a data augmentation strategy to bridge the audio domain gap and a transformer-based model that learns from human demonstrations with vision and audio feedback.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Ear-in-Hand Hardware Design</head><p>Our data collection device is built on the Universal Manipulation Interface (UMI) <ref type="bibr">[1]</ref>. UMI is a portable and low-cost hand-held gripper designed to collect human demonstrations in the wild. The collected data can be used to train a visuomotor policy that is directly deployable on a robot.</p><p>We redesign the 3D-printed parallel jaw gripper on the device to embed a piezoelectric contact microphone under high-friction grip tape wrapped around the finger. The microphone is connected to the 3.5 mm external mic port on the GoPro camera media mod. Fig. <ref type="figure">2</ref> (a) shows the hand-held gripper design. Audio is recorded at 48000 Hz and stored synchronously with 60 Hz image data as MP4 files. During robot deployment, the same parallel jaw gripper with the embedded contact microphone is mounted on a UR5 robot arm, shown in Fig. <ref type="figure">2 (c</ref>). The images and audio are streamed in real time through an Elgato HD60 X external capture card into an Ubuntu 22.04.3 desktop.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Policy Design</head><p>We propose an end-to-end closed-loop sensorimotor learning model that takes in RGB images and audio, and outputs 10-DoF robot actions (3D end effector position, end effector orientation represented in 6D <ref type="bibr">[39]</ref>, and 1D gripper openness).</p><p>Audio Data Augmentation. One key challenge is that the audio signals received during real-time robot deployment are very different from the data collected by the hand-held gripper, resulting in a large domain gap between the training and test scenarios, as illustrated in Fig. <ref type="figure">2 (b</ref>). This is mostly because of 1) nonlinear robot motor noises during deployment and 2) out-of-distribution sounds generated by the robot interaction (e.g., accidentally colliding with an object). To address the domain gap, the key is to augment the training data with noises and guide the model to focus on the invariant task-relevant signals and ignore unpredictable noises. In particular, we randomly sample audio as background noises from ESC-50 <ref type="bibr">[40]</ref>. The sounds are normalized to the same scale as the collected sound in the training dataset. We also record 10 samples of robot motor noises under randomly sampled trajectories with the same contact microphone location as at deployment time. The background noises and robot noises are overlaid on the original audio signal, each with a probability of 0.5. In our experiments, we show that this simple yet effective approach yields better policy performance by biasing the model toward task-relevant audio signals.</p><p>Interestingly, the attention maps of the vision encoder (Fig. <ref type="figure">4</ref>) show that a policy co-trained with audio attends more to the task-relevant regions (the shape of the drawing or the free space inside the pan). In contrast, the vision-only policy often overfits to background structures as a shortcut to estimate contact (e.g., the edge of the whiteboard, table, and room structures).</p></div>
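<div xmlns="http://www.tei-c.org/ns/1.0"><p>The augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: waveforms are plain Python lists, and the function names, the RMS-based rescaling (one possible reading of "normalized to the same scale"), and the random cropping of noise clips are all assumptions made for the example. Only the two noise sources and the 0.5 overlay probability come from the text.</p>
<p><code lang="python"><![CDATA[
```python
import math
import random


def rms(x):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0


def match_scale(noise, reference):
    """Rescale `noise` so its RMS equals that of `reference`.

    Assumption: RMS matching stands in for the paper's unspecified
    normalization of noise clips to the scale of the training audio.
    """
    level = rms(noise)
    if level == 0.0:
        return list(noise)
    gain = rms(reference) / level
    return [v * gain for v in noise]


def random_crop(noise, length, rng):
    """Pick a random window of `length` samples from a non-empty clip,
    tiling the clip first if it is shorter than `length`."""
    if len(noise) < length:
        noise = list(noise) * (length // len(noise) + 1)
    start = rng.randrange(len(noise) - length + 1)
    return noise[start:start + length]


def augment(audio, background_clips, robot_clips, rng, p=0.5):
    """Overlay a background clip and a robot-noise clip on `audio`,
    each independently with probability `p` (0.5 in the paper)."""
    out = list(audio)
    for pool in (background_clips, robot_clips):
        if pool and rng.random() < p:
            clip = random_crop(rng.choice(pool), len(audio), rng)
            clip = match_scale(clip, audio)
            out = [a + b for a, b in zip(out, clip)]
    return out
```
]]></code></p>
<p>In a training loop, one would call <code>augment</code> on each audio window with pools of background clips (ESC-50 in the paper) and recorded robot motor noise, so that each sample sees neither, one, or both noise sources.</p></div>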
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure labels (model overview): CLIP ViT Image Encoder, AST Encoder</head><p>Vision Encoder. We use a CLIP-pretrained ViT-B/16 model <ref type="bibr">[41]</ref> to encode the RGB images. The images are resized to 224 x 224 resolution with random crop and color jitter augmentation. Images are sampled at 20 Hz and we take the images from the past 2 timesteps. Each image is encoded separately using the classification token feature.</p><p>Audio Encoder. We use the audio spectrogram transformer (AST) <ref type="bibr">[42]</ref> to encode the audio input. AST, similar to a ViT model, leverages the attention mechanism to learn better audio representations from spectrogram patches. The intuition behind using a transformer encoder instead of a CNN-based encoder as seen in prior works <ref type="bibr">[31,</ref><ref type="bibr">33,</ref><ref type="bibr">30]</ref> is that the 'shift invariance' that a CNN leverages is less suitable for audio spectrograms, as shifts in either the time or the frequency domain can significantly change the audio information. In our experiments, we show that training the transformer encoder from scratch outperforms both pre-trained and from-scratch CNN models.</p><p>The audio signal (from the last 2-3 seconds, depending on the task) is first resampled from 48 kHz to 16 kHz, which helps filter high-frequency noises and increase the frequency resolution of task-relevant signals on the spectrogram. The waveform is then converted to a log-mel spectrogram using an FFT size and window length of 400, a hop length of 160, and 64 mel filterbanks. The log-mel spectrograms are linearly normalized to the range [-1, 1]. We use the classification token feature extracted from the last hidden layer of the AST encoder.</p><p>Sensory Fusion. We fuse the vision and audio features using a transformer encoder in a similar fashion as Li et al. 
<ref type="bibr">[33]</ref> to leverage the attention mechanism to weigh the features adaptively at different stages of the task (e.g., vision is important for moving to the target object, whereas audio is important upon contact). We concatenate the output features and downsample the dimension to 768 with a linear projection layer. Finally, we concatenate the end effector poses (20 Hz) from the past 2 timesteps to the audio-visual feature.</p><p>Policy Learning. To model the multimodality intrinsic to human demonstrations, we choose to use a diffusion model with UNet encoders as proposed by Chi et al. <ref type="bibr">[2]</ref> as the policy head, conditioned on the observation representation mentioned above in each denoising step. The entire model (Fig. <ref type="figure">3</ref>), including the above-mentioned encoders, is end-to-end trained using the noise prediction MSE loss on future robot trajectories of 16 steps.</p><p>Audio Latency Matching. During data collection, the vision and audio data are synchronized when recording through GoPro. During deployment, we calibrated the audio latency to be 0.23; details can be found in the appendix. We adopt an approach similar to Chi et al. <ref type="bibr">[1]</ref> to compensate for this delay.</p></div>
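<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the audio preprocessing numbers above concrete, the sketch below works out the spectrogram size implied by the stated parameters and shows one way to normalize linearly to [-1, 1]. It is an illustration under assumptions: the helper names are hypothetical, the frame count depends on whether the STFT pads the signal (both conventions are shown, since the text does not say which is used), and min/max normalization per spectrogram is only one possible reading of "linearly normalized".</p>
<p><code lang="python"><![CDATA[
```python
def resampled_length(num_samples, src_rate=48000, dst_rate=16000):
    """Number of samples after resampling from src_rate to dst_rate."""
    return num_samples * dst_rate // src_rate


def num_frames(num_samples, n_fft=400, hop=160, center=False):
    """Number of STFT frames for a waveform of `num_samples` samples.

    center=False: frames must fit entirely inside the signal.
    center=True:  the signal is padded by n_fft // 2 on both sides
                  (a common default in audio libraries).
    """
    if center:
        return 1 + num_samples // hop
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop


def normalize_linear(spec):
    """Map a 2D list of log-mel values linearly onto [-1, 1].

    Assumption: per-spectrogram min/max scaling; fixed dataset-wide
    bounds would be an equally plausible choice.
    """
    flat = [v for row in spec for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in spec]
    scale = 2.0 / (hi - lo)
    return [[(v - lo) * scale - 1.0 for v in row] for row in spec]


# A 3-second window recorded at 48 kHz and resampled to 16 kHz.
samples_16k = resampled_length(3 * 48000)
frames_no_pad = num_frames(samples_16k)
frames_padded = num_frames(samples_16k, center=True)
```
]]></code></p>
<p>With 64 mel filterbanks, the encoder input for a 3-second window is then a 64 x <code>frames_no_pad</code> (or 64 x <code>frames_padded</code>) array, with a 10 ms hop and a 25 ms window at 16 kHz.</p></div>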
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>We study four contact-rich manipulation tasks to show the different advantages of learning from audio feedback, such as detecting contact events and modes (flipping and wiping), or sensing object state (pouring) and surface material (taping). In each task, we test the policy under different scenarios and compare with alternative approaches to validate the robustness and generalizability of our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Flipping Task</head><p>The robot is tasked to flip a bagel in a pan from facing down to facing upward using a spatula. To perform this task successfully, the robot needs to sense and switch between different contact modes: precisely insert the spatula between the bagel and the pan, maintain contact while sliding, and start to tilt up the spatula when the bagel is in contact with the edge of the pan.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data collection:</head><p>We collect two types of demonstrations for this task: 283 in-lab demonstrations and an additional 274 in-the-wild demonstrations collected in 7 different environments using different pans.</p><p>Test Scenarios: We run 20 rollouts for each policy. To ensure a fair comparison, we use the same set of robot and object configurations when evaluating different methods. We reproduce the same object configuration by overlaying the objects' positions on a reference image captured in the camera view. The test configurations can be grouped into four categories.</p><p>&#8226; T1: Variations in task configuration: different initial robot and object configurations (14 / 20).</p><p>&#8226; T2: Audio perturbation by playing different types of noises in the background (2 / 20).</p><p>&#8226; T3: Generalization to unseen table height (4 / 20).</p><p>&#8226; T4: Generalization to two unseen in-the-wild environments: a black stovetop and a white countertop. The latter is more challenging due to its unstructured background and the lack of similar training data.</p><p>Comparisons: In this task, we focus on comparing our results with several ablations of the network design:</p><p>&#8226; Vision only: the original diffusion policy conditioned on image observations.</p><p>&#8226; MLP policy: uses an MLP with three hidden layers (following Li et al. <ref type="bibr">[33]</ref>) instead of action diffusion. The model takes the observation representation and outputs the future action trajectory.</p><p>&#8226; ResNet: uses a ResNet18 encoder to encode the audio log-mel spectrograms, with an additional CoordConv layer <ref type="bibr">[43]</ref> following Li et al. <ref type="bibr">[33]</ref>.</p><p>&#8226; AVID: Following the approach by Mejia et al. 
<ref type="bibr">[30]</ref>, uses a 9-layer CNN audio encoder pre-trained on AudioSet <ref type="bibr">[44]</ref> using Audio-Visual Instance Discrimination (AVID) <ref type="bibr">[45]</ref>.</p><p>&#8226; In-the-wild: For in-the-wild evaluation (T4), we compare our model trained with in-the-wild demonstrations (blue bar in Fig. <ref type="figure">5</ref>) against the same model trained on in-lab demonstrations only.</p><p>In-the-wild data enables generalization to unseen in-the-wild environments. As shown in Fig. <ref type="figure">5</ref>, the policy trained on in-the-wild data significantly outperforms the policy trained on in-lab data in the two unseen environments, as the scene diversity in the in-the-wild data allows the policy to generalize better to new environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Wiping Task</head><p>In this task, the robot is tasked to wipe a shape (e.g., heart, square) drawn on a whiteboard. The robot can start in any initial configuration above the whiteboard and grasps an eraser held parallel to the board. The main challenge of the task is that the robot needs to exert a reasonable amount of contact force on the whiteboard while moving the eraser along the shape. We collect 119 demonstrations in total for this task.</p><p>Comparisons: In addition to the vision only baseline, we evaluate the following alternatives for processing and learning from audio data:</p><p>&#8226; MLP fusion: uses an MLP with 2 hidden layers to fuse the vision and audio features instead of a transformer encoder. This approach was used by Du et al. <ref type="bibr">[31]</ref>.</p><p>&#8226; Noise masking: trained without noise augmentation; instead, audio frequencies below 500 Hz (the UR5 control frequency) are masked out.</p><p>&#8226; No noise aug: trained without augmenting the audio data with noises.</p><p>Test Scenarios: We run 20 rollouts for each policy. In addition to the T1 (5 / 20) and T2 (4 / 20) test cases as described above, we also test generalization to unseen table heights, erasers, and drawing shapes (T3, 11 / 20). A detailed breakdown of the test scenarios can be found in the appendix.</p><p>Findings: The quantitative results and typical failure cases are visualized in Fig. <ref type="figure">6</ref>. We find that: 1) Contact audio improves robustness and generalizability. Using only images from the wrist-mounted camera, it is hard to infer whether the eraser is contacting the board or not, whereas incorporating the contact audio improves the overall success rate from 40% to 85%. The [Vision only] policy fails to generalize to unseen table heights and unseen erasers. 
To understand the behaviors of the vision only policy and our policy better, we visualize the attention map of the vision encoder in Fig. <ref type="figure">4</ref> and find that the model trained with audio attends better to task-relevant features. 2) Noise augmentation is an effective strategy to bridge the audio domain gap and increase the system's robustness to out-of-distribution sounds. Without noise augmentation, the robot is less robust to noises during test time and does not generalize well to unseen table heights, unseen erasers, and unseen shapes. Another alternative we study is simply masking out the robot noise regions. Surprisingly, this approach yields slightly better results than [No noise aug] by removing the domain gap caused by robot motor noises. However, this alternative does not successfully address other noises during the robot execution, as mentioned in Section 3.2. Lastly, we show 3) the advantage of using a transformer to fuse the vision and audio features compared to using an MLP. A typical failure in [MLP fusion] is that the robot fails to wipe in the area of the shape and stops when the shape is not completely wiped off.</p></div>
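<div xmlns="http://www.tei-c.org/ns/1.0"><p>For intuition about the [Noise masking] baseline, the sketch below shows what masking frequencies below 500 Hz amounts to on a linear-frequency spectrogram computed with the FFT size (400) and sample rate (16 kHz) stated in Section 3.2. It is an assumption-laden illustration: the text does not say whether the mask is applied to linear FFT bins or to mel bins, and the function names and the fill value are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
def bin_frequencies(n_fft=400, sample_rate=16000):
    """Center frequency in Hz of each one-sided FFT bin."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


def mask_below(spec, freqs, cutoff_hz=500.0, fill=0.0):
    """Return a copy of `spec` (one row per frequency bin) in which every
    row whose bin frequency is below `cutoff_hz` is replaced by `fill`.

    For a log-magnitude spectrogram, `fill` would typically be the
    minimum log value rather than 0.0.
    """
    return [
        [fill] * len(row) if f < cutoff_hz else list(row)
        for row, f in zip(spec, freqs)
    ]


freqs = bin_frequencies()
# Bins are spaced sample_rate / n_fft = 40 Hz apart, so the cutoff
# removes the lowest bins only and leaves the rest untouched.
masked_bins = sum(1 for f in freqs if f < 500.0)
```
]]></code></p>
<p>Because such a mask is fixed in frequency, it can only remove noise concentrated in that band, which is consistent with the observation above that it does not address other noises arising during execution.</p></div>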
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Pouring Task</head><p>The robot is tasked to pick up the white cup and pour dice out to the pink cup if the white cup is not empty. When finished pouring, the robot needs to place the empty cup down to a designated location. The challenge of the task is that the robot cannot observe whether there are dice in the cup or not given the camera viewpoint both before and after the pouring action. We collect 145 demonstrations for this task, with a 'shaking' action to generate vibrations that can be captured by the contact microphone if there are dice inside the cup.</p><p>Comparisons: In addition to the vision only baseline, we study how the length of the audio input affects the policy performance with two ablations: one using 1s history audio and one using 10s history audio.</p><p>Test Scenarios: We run 12 rollouts for each policy. In addition to the T1 and T2 scenarios, we also tested generalization scenarios (T3) to an unseen number of dice (e.g., no dice or &gt; 6 dice) and unseen objects (e.g., screws or beans).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Findings:</head><p>The quantitative results and typical failure cases are visualized in Fig. <ref type="figure">7</ref>. Our key findings in the experiments are: 1) Combined with an information-seeking action, audio can provide critical state information beyond visual observations. As illustrated in the figure, the vision only policy fails to execute the pour action as it cannot infer whether there are dice in the cup or not, whereas the policy trained with audio feedback can leverage vibrations to infer the state information and generalize to objects with similar sounds (e.g., screws). 2) Policy performance is sensitive to audio history length. As shown in the results, [1s audio] yields significantly lower performance since the shortened audio window does not contain sufficient information to guide robot actions. [10s audio] shows that even though a longer audio window contains sufficient information, it increases the complexity of the learning process and requires a higher capacity model.</p><p>[Fig. 8, taping task. Task definition: Init, Touch every tape, Grasp the 'hook' tape, Apply tape, Final. Typical failure cases: All: miss place; All: miss touch; Vision only: wrong tape.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Taping Task</head><p>The robot is tasked to choose the 'hook' tape from several velcro tapes (either 'hook' or 'loop') and strap wires by attaching the 'hook' tape to a 'loop' tape underneath the wires. We collect 193 demonstrations in total, with a 'sliding' primitive where we use the tip of the gripper finger to slide along the tape. We run 10 rollouts in total. In each rollout, the robot is presented with 2-4 velcro tapes in random order, with at least one 'hook' tape.</p><p>Comparisons: In addition to vision only, we compare the following baselines:</p><p>&#8226; Env Mic: Instead of using a contact microphone, we mount a Rode VideoMic GO II directional microphone on the GoPro camera for both data collection and deployment to collect the audio signals.</p><p>&#8226; Noise Reduction: Instead of applying training-time noise augmentation, we evaluate an alternative method that uses a test-time noise reduction algorithm. The algorithm estimates a noise threshold for each frequency and applies a smoothed mask on the spectrogram <ref type="bibr">[46]</ref>. More details can be found in the appendix.</p><p>Findings. 1) The contact microphone is sufficiently sensitive to different surface materials. As shown in Fig. <ref type="figure">8</ref>, the [Vision only] policy makes random decisions and yields a 20% success rate. The system that uses an environment microphone to collect audio data achieves similar results to the [Vision only] method, as it fails to pick up the subtle differences between the surface materials. In contrast, by leveraging the contact microphone, our method is able to reliably guide the robot to pick up the correct tape. 2) Training-time noise augmentation is more effective than test-time noise reduction. This is because the noise cancellation algorithm degrades the signal, resulting in a domain gap between training and testing. 
On the other hand, our noise augmentation method preserves the frequency distribution of the original signals. More visualization can be found in the appendix.</p></div>
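<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea behind the [Noise Reduction] baseline (a per-frequency noise threshold followed by a smoothed mask) can be illustrated with the simplified spectral-gating sketch below. This is not the implementation referenced in [46]: the one-pole smoothing filter, the threshold rule, and all parameter values and function names are assumptions chosen to keep the example short. It operates on a magnitude spectrogram stored as one list of frames per frequency channel.</p>
<p><code lang="python"><![CDATA[
```python
def smooth_forward_backward(x, alpha=0.2):
    """One-pole IIR low-pass filter run forward and then backward over a
    sequence, which smooths it without introducing a time shift."""
    y = list(x)
    for i in range(1, len(y)):
        y[i] = alpha * y[i] + (1.0 - alpha) * y[i - 1]
    for i in range(len(y) - 2, -1, -1):
        y[i] = alpha * y[i] + (1.0 - alpha) * y[i + 1]
    return y


def gate_mask(channel, thresh_ratio=1.5, alpha=0.2):
    """Soft mask in [0, 1] for one frequency channel.

    The time-smoothed channel serves as a running noise estimate; frames
    that exceed `thresh_ratio` times this estimate are treated as signal.
    The resulting 0/1 mask is smoothed so the gate opens and closes
    gradually.
    """
    noise_level = smooth_forward_backward(channel, alpha)
    hard = [
        1.0 if v > thresh_ratio * n else 0.0
        for v, n in zip(channel, noise_level)
    ]
    return smooth_forward_backward(hard, alpha)


def reduce_noise(spec, thresh_ratio=1.5, alpha=0.2):
    """Apply the soft gate independently to every frequency channel."""
    return [
        [v * m for v, m in zip(ch, gate_mask(ch, thresh_ratio, alpha))]
        for ch in spec
    ]
```
]]></code></p>
<p>A steady hum never exceeds its own smoothed level and is gated out, while a short contact transient passes through attenuated. That attenuation is one way such a gate can alter the signal relative to the training data, in line with the domain gap discussed above.</p></div>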
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Limitations and Future Directions</head><p>On the data front, even though the contact microphone can pick up a wide range of audio signals and is robust by design against environment noises in the background, it may not be useful in scenarios where the interaction does not generate salient signals (e.g., for deformable objects such as cloth, or for quasi-static tasks), and weak signals can easily become imperceptible under robot motor noises during deployment. On the policy front, the current method does not leverage the fact that audio signals are received at a higher frequency than images and can be used to learn more reactive behaviors. Future work can consider a hierarchical network architecture <ref type="bibr">[47]</ref> that infers higher frequency actions from audio inputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Audio signals reveal rich information about the robot interaction and object properties, which can ease the learning of contact-rich robot manipulation policies. We present ManiWAV, an in-the-wild audio-visual data collection and policy learning framework. We design an 'ear-in-hand' gripper to collect human demonstrations that can be directly used to train robot policies through behavior cloning. By learning an effective audio-visual representation as the condition for the diffusion policy, our method outperforms several alternative approaches on four contact-rich manipulation tasks and generalizes to unseen in-the-wild environments.</p><p>[Fig. 11 panels: (a) Empty Cup, (b) Cup with Dice, (c) Cup with Screws.]</p><p>Our method is able to reliably guide the policy to pour objects and place the cup, whereas the baselines either never execute the pour action or never execute the place action, resulting in a low substep success rate. We also find that the policy can generalize to unseen objects such as screws. In Fig. <ref type="figure">11</ref>, we visualize the audio spectrogram when the robot 'shakes' the cup. We can observe that the spectrogram for an empty cup is distinct from that of a non-empty cup, and a cup with screws generates a similar audio pattern to a cup with dice upon shaking. We hypothesize that the audio features for the cup with screws and the cup with dice are also close in the audio feature space, leading to similar policy behavior. Videos of the policy rollouts can be found on the project website.</p><p>A breakdown of the success rate for each substep in the taping task is shown in Tab. 2. 'Touch' is successful if the robot slides along the tape while maintaining contact, 'Sense' is successful if the robot chooses the correct tape, 'Pick' is successful if the robot grasps the tape, and 'Place' is successful if the robot places the tape on top of the wires. 
By leveraging audio feedback to infer the object surface material (whether the tape is a 'hook' or 'loop'), our method is able to reliably guide the policy to choose the correct tape, whereas the baselines make random decisions, as shown in the 'Sense' step success rate.</p><p>Noise Reduction Algorithm. We use the non-stationary noise reduction method introduced in <ref type="bibr">[46]</ref>. The algorithm computes a spectrogram from the audio waveform, and then applies an IIR filter forward and backward on each frequency channel to obtain a time-smoothed version of the spectrogram. A mask is then computed based on the spectrogram by estimating a noise threshold for each frequency band of the signal/noise. Finally, a smoothed, inverted version of the mask is applied to the original spectrogram to cancel the noise.</p><p>Using a [ResNet] or [AVID] audio encoder causes the spatula to 'displace' most of the time, likely because the model is not sensitive enough to the sound feedback of the spatula touching the bottom of the pan. As a result, the robot keeps moving downward and causes the spatula to displace.</p></div></body>
		</text>
</TEI>
