BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation

Shah, Rutav; Yu, Albert; Zhu, Yifeng; Zhu, Yuke; Martín-Martín, Roberto

This content will become publicly available on January 31, 2026

To operate at a building scale, service robots must perform very long-horizon mobile manipulation tasks by navigating to different rooms, accessing different floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we propose BUMBLE, a unified VLM-based framework integrating open-world RGBD perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms multiple baselines in long-horizon building-wide tasks that require sequencing up to 12 ground truth skills spanning 15 minutes per trial. BUMBLE achieves 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from different starting rooms and floors. Our user study demonstrates 22% higher satisfaction with our method than state-of-the-art mobile manipulation methods. Finally, we demonstrate the potential of using increasingly capable foundation models to push performance further.

Award ID(s): 2145283, 2318065

PAR ID: 10570015

Author(s) / Creator(s): Shah, Rutav; Yu, Albert; Zhu, Yifeng; Zhu, Yuke; Martín-Martín, Roberto

Publisher / Repository: IEEE

Date Published: 2025-01-31

Format(s): Medium: X

Location: IEEE International Conference on Robotics and Automation

Sponsoring Org: National Science Foundation

The DOI is not currently available.