

Title: Pay Attention! - Robustifying a Deep Visuomotor Policy Through Task-Focused Visual Attention
Abstract:
Several recent studies have demonstrated the promise of deep visuomotor policies for robot manipulator control. Despite impressive progress, these systems are known to be vulnerable to physical disturbances, such as accidental or adversarial bumps that make them drop the manipulated object. They also tend to be distracted by visual disturbances such as objects moving in the robot’s field of view, even if the disturbance does not physically prevent the execution of the task. In this paper, we propose an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused visual Attention (TFA). The manipulation task is specified with a natural language text such as “move the red bowl to the left”. This allows the visual attention component to concentrate on the current object that the robot needs to manipulate. We show that even in benign environments, the TFA allows the policy to consistently outperform a variant with no attention mechanism. More importantly, the new policy is significantly more robust: it regularly recovers from severe physical disturbances (such as bumps causing it to drop the object) from which the baseline policy, i.e., with no visual attention, almost never recovers. In addition, we show that the proposed policy performs correctly in the presence of a wide class of visual disturbances, exhibiting a behavior reminiscent of human selective visual attention experiments.
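To make the described architecture concrete, the sketch below shows one plausible way a language-conditioned attention module could feed a visuomotor policy: the instruction is encoded into a query vector that weights spatial image features before they reach the motor head. All layer sizes, names, and the dot-product attention mechanism are illustrative assumptions, not the architecture from the paper.

```python
# Hypothetical sketch of a language-conditioned visual attention module
# for a visuomotor policy. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskFocusedAttentionPolicy(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, feat_dim=64, action_dim=7):
        super().__init__()
        # Encode the task instruction, e.g. "move the red bowl to the left".
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, feat_dim, batch_first=True)
        # A small convolutional encoder for the camera image.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 5, stride=2), nn.ReLU(),
        )
        # Map attended visual features to a motor command.
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image, instruction_tokens):
        feats = self.conv(image)                          # (B, C, H, W)
        B, C, H, W = feats.shape
        feats = feats.view(B, C, H * W)                   # spatial feature grid
        _, h = self.text_rnn(self.embed(instruction_tokens))
        query = h[-1].unsqueeze(1)                        # (B, 1, C) text query
        # Attention weights over image locations, conditioned on the text.
        attn = F.softmax(torch.bmm(query, feats), dim=-1)  # (B, 1, H*W)
        attended = torch.bmm(attn, feats.transpose(1, 2)).squeeze(1)
        return self.policy(attended), attn.view(B, H, W)

policy = TaskFocusedAttentionPolicy()
action, attention_map = policy(torch.randn(1, 3, 64, 64),
                               torch.randint(0, 1000, (1, 6)))
```

In a sketch of this kind, the attention map depends jointly on the instruction and the image, so image regions that do not match the instruction receive low weight; that is the intuition behind the robustness to visual distractors reported in the abstract.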
Award ID(s):
1741431
NSF-PAR ID:
10111643
Author(s) / Creator(s):
Date Published:
Journal Name:
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Page Range / eLocation ID:
4254-4262
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Feature-based attention is known to enhance visual processing globally across the visual field, even at task-irrelevant locations. Here, we asked whether attention to object categories, in particular faces, shows similar location-independent tuning. Using EEG, we measured the face-selective N170 component of the EEG signal to examine neural responses to faces at task-irrelevant locations while participants attended to faces at another task-relevant location. Across two experiments, we found that visual processing of faces was amplified at task-irrelevant locations when participants attended to faces relative to when participants attended to either buildings or scrambled face parts. The fact that we see this enhancement with the N170 suggests that these attentional effects occur at the earliest stage of face processing. Two additional behavioral experiments showed that it is easier to attend to the same object category across the visual field relative to two distinct categories, consistent with object-based attention spreading globally. Together, these results suggest that attention to high-level object categories shows similar spatially global effects on visual processing as attention to simple, individual, low-level features. 
  2. Background

    In Physical Human–Robot Interaction (pHRI), the need to learn the robot’s motor-control dynamics is associated with increased cognitive load. Eye-tracking metrics can help understand the dynamics of fluctuating mental workload over the course of learning.

    Objective

    The aim of this study was to test eye-tracking measures’ sensitivity and reliability to variations in task difficulty, as well as their performance-prediction capability, in physical human–robot collaboration tasks involving an industrial robot for object comanipulation.

    Methods

    Participants (9M, 9F) learned to co-perform a virtual pick-and-place task with a bimanual robot over multiple trials. Joint stiffness of the robot was manipulated to increase motor-coordination demands. The psychometric properties of eye-tracking measures and their ability to predict performance were investigated.

    Results

    Stationary Gaze Entropy and pupil diameter were the most reliable and sensitive measures of workload associated with changes in task difficulty and learning. Increased task difficulty was more likely to result in a robot-monitoring strategy. Eye-tracking measures were able to predict the occurrence of success or failure in each trial with 70% sensitivity and 71% accuracy.

    Conclusion

    The sensitivity and reliability of eye-tracking measures were acceptable, although values were lower than those observed in cognitive domains. Measures of gaze behaviors indicative of visual monitoring strategies were most sensitive to task difficulty manipulations and should be explored further for the pHRI domain, where motor control and internal-model formation will likely be strong contributors to workload.

    Application

    Future collaborative robots could adapt to the human operator’s cognitive state and skill level, as estimated from eye-tracking measures of workload and visual attention.

     
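Since the Results above single out Stationary Gaze Entropy as one of the most reliable workload measures, a minimal sketch may help. It assumes the common Shannon-entropy definition over the distribution of fixations across areas of interest (AOIs); the AOI labels and function name are hypothetical.

```python
# Minimal sketch of stationary gaze entropy, assuming the standard
# Shannon-entropy definition over fixation counts per AOI.
import numpy as np

def stationary_gaze_entropy(fixation_aois):
    """Entropy (bits) of the fixation distribution over AOIs.

    Higher values indicate dispersed visual scanning; lower values
    indicate gaze concentrated on few regions (e.g. robot monitoring).
    """
    _, counts = np.unique(fixation_aois, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: gaze concentrated on the robot vs. spread across the workspace.
print(stationary_gaze_entropy(["robot"] * 8 + ["object", "target"]))   # low
print(stationary_gaze_entropy(["robot", "object", "target", "hand"]))  # high
```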
  3. Multiple types of memory guide attention: Both long-term memory (LTM) and working memory (WM) effectively guide visual search. Furthermore, both types of memories can capture attention automatically, even when detrimental to performance. It is less clear, however, how LTM and WM cooperate or compete to guide attention in the same task. In a series of behavioral experiments, we show that LTM and WM reliably cooperate to guide attention: Visual search is faster when both memories cue attention to the same spatial location (relative to when only one memory can guide attention). LTM and WM competed to guide attention in more limited circumstances: Competition only occurred when these memories were in different dimensions – particularly when participants searched for a shape and held an accessory color in mind. Finally, we found no evidence for asymmetry in either cooperation or competition: There was no evidence that WM helped (or hindered) LTM-guided search more than the other way around. This lack of asymmetry was found despite differences in LTM-guided and WM-guided search overall, and differences in how two LTMs and two WMs compete or cooperate with each other to guide attention. This work suggests that, even if only one memory is currently task-relevant, WM and LTM can cooperate to guide attention; they can also compete when distracting features are salient enough. This work elucidates interactions between WM and LTM during attentional guidance, adding to the literature on costs and benefits to attention from multiple active memories. 
  4. Many goal-directed actions that require rapid visuomotor planning and perceptual decision-making are affected in older adults, causing difficulties in the execution of many functional activities of daily living. Visuomotor planning and perceptual identification are mediated by the dorsal and ventral visual streams, respectively, but it is unclear how age-induced changes in sensory processing in these streams contribute to declines in visuomotor decision-making performance. Previously, we showed that in young adults, task demands influenced movement strategies during visuomotor decision-making, reflecting differential integration of sensory information between the two streams. Here, we asked whether older adults would exhibit deficits in interactions between the two streams during demanding motor tasks. Older adults (n = 15) and young controls (n = 26) performed reaching or interception movements toward virtual objects. In some blocks of trials, participants also had to select an appropriate movement goal based on the shape of the object. Our results showed that older adults corrected fewer initial decision errors during both reaching and interception movements. During the interception decision task, older adults made more decision- and execution-related errors than young adults, which were related to early initiation of their movements. Together, these results suggest that older adults have a reduced ability to integrate new perceptual information to guide online action, which may reflect impaired ventral-dorsal stream interactions. NEW & NOTEWORTHY Older adults show declines in vision, decision-making, and motor control, which can lead to functional limitations. We used a rapid visuomotor decision task to examine how these deficits may interact to affect task performance. Compared with healthy young adults, older adults made more errors in both decision-making and motor execution, especially when the task required intercepting moving targets. This suggests that age-related declines in integrating perceptual and motor information may contribute to functional deficits.
  5. An option is a short-term skill consisting of a control policy for a specified region of the state space, and a termination condition recognizing leaving that region. In prior work, we proposed an algorithm called Deep Discovery of Options (DDO) to discover options to accelerate reinforcement learning in Atari games. This paper studies an extension to robot imitation learning, called Discovery of Deep Continuous Options (DDCO), where low-level continuous control skills parametrized by deep neural networks are learned from demonstrations. We extend DDO with: (1) a hybrid categorical–continuous distribution model to parametrize high-level policies that can invoke discrete options as well as continuous control actions, and (2) a cross-validation method that relaxes DDO’s requirement that users specify the number of options to be discovered. We evaluate DDCO in a simulation of a 3-link robot in the vertical plane pushing a block with friction and gravity, and in two physical experiments on the da Vinci surgical robot: needle insertion, where a needle is grasped and inserted into a silicone tissue phantom, and needle bin picking, where needles and pins are grasped from a pile and categorized into bins. In the 3-link arm simulation, results suggest that DDCO can take 3x fewer demonstrations to achieve the same reward as a baseline imitation learning approach. In the needle insertion task, DDCO succeeded 8/10 times, compared to 6/10 for the next most accurate imitation learning baseline. In the surgical bin picking task, the learned policy successfully grasped a single object in 66 out of 99 attempted grasps and, in all but one case, successfully recovered from failed grasps by retrying a second time.
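A minimal sketch of the hybrid categorical–continuous idea from point (1) above: a policy head that either selects one of K discrete options or emits a continuous control action. The sizes, names, and Gaussian parametrization are illustrative assumptions, not DDCO's exact model.

```python
# Hypothetical sketch of a hybrid categorical-continuous policy head:
# a categorical choice over K options plus one "direct control" slot,
# paired with a Gaussian over continuous actions.
import torch
import torch.nn as nn

class HybridPolicy(nn.Module):
    def __init__(self, state_dim=10, num_options=3, action_dim=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        # One categorical slot per discrete option, plus one slot that
        # means "emit a continuous action directly".
        self.choice_logits = nn.Linear(64, num_options + 1)
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.trunk(state)
        choice = torch.distributions.Categorical(logits=self.choice_logits(h))
        action = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return choice, action

policy = HybridPolicy()
choice, action = policy(torch.randn(1, 10))
k = choice.sample()   # which option to invoke (or direct control)
u = action.sample()   # continuous action, used when k selects direct control
```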