FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

arXiv: 2604.03139v1
Authors

Mingao Tan, Yiyang Li, Shanze Wang, Xinming Zhang, Wei Zhang

Abstract

Current vision-language navigation methods face substantial bottlenecks regarding heterogeneous robot compatibility, real-time performance, and navigation safety. Furthermore, they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav: a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation, which innovatively integrates vision-language models (VLMs) with the proposed architecture. The cerebellum module, a high-frequency end-to-end module, develops a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, wheeled robots) to improve navigation efficiency while significantly reducing collision risk. The cerebrum module constructs a three-layer reasoning model and leverages VLMs to build an end-to-end detection and verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined IDs and improving task success rates in both simulation and real-world environments. Additionally, the framework supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on MP3D, HM3D, and OVON benchmarks demonstrate that FSUNav achieves state-of-the-art performance on object, instance image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.

Paper Summary

Problem
Current vision-language navigation methods face significant challenges when navigating real-world environments. These challenges include:

* Inability to work with heterogeneous robots (e.g., wheeled, quadruped, humanoid)
* Insufficient real-time performance, making it difficult to meet the low-latency requirements of robotic navigation
* Limited attention to safety mechanisms, such as collision avoidance and dynamic obstacle avoidance
* Inability to understand open-vocabulary instructions, limiting applicability in open-world settings
Key Innovation
The FSUNav framework addresses these challenges by proposing a Cerebrum-Cerebellum architecture that integrates vision-language models (VLMs) with a high-frequency end-to-end module (Cerebellum) and a three-layer reasoning model (Cerebrum). The Cerebellum module develops a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms and improving navigation efficiency while reducing collision risk. The Cerebrum module constructs a unified three-layer reasoning architecture, enabling zero-shot understanding and localization of object categories, instance images, and natural language instructions.
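The division of labor described above can be sketched as a two-rate control loop: a slow, VLM-driven cerebrum that proposes and verifies goals, and a fast DRL cerebellum that turns the current waypoint into safe, platform-agnostic velocity commands. This is a minimal illustrative sketch; every class and method name here is hypothetical and not the authors' actual API, and the cerebrum/cerebellum bodies are stubs standing in for the VLM query and the learned policy.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: object          # camera frame
    depth: object        # depth/LiDAR reading used for collision avoidance
    instruction: str     # open-vocabulary goal: text, description, or image reference

class Cerebrum:
    """Low-frequency reasoning: detect and verify the goal with a VLM (stubbed)."""
    def plan(self, obs: Observation) -> dict:
        # (hypothetical) query a VLM for a candidate goal, then verify the
        # detection end-to-end before committing to a waypoint
        return {"waypoint": (1.0, 2.0), "goal_found": False}

class Cerebellum:
    """High-frequency DRL local planner shared across robot platforms (stubbed)."""
    def act(self, obs: Observation, waypoint) -> dict:
        # (hypothetical) map depth + waypoint to a safe velocity command;
        # the same policy would drive wheeled, quadruped, and humanoid bases
        return {"v": 0.5, "w": 0.1}

class FSUNavLoop:
    """Run the cerebrum every `cerebrum_period` ticks, the cerebellum every tick."""
    def __init__(self, cerebrum, cerebellum, cerebrum_period=10):
        self.cerebrum = cerebrum
        self.cerebellum = cerebellum
        self.period = cerebrum_period
        self.plan = None
        self.tick = 0

    def step(self, obs: Observation) -> dict:
        # Refresh the slow plan only periodically; act on it at full rate.
        if self.plan is None or self.tick % self.period == 0:
            self.plan = self.cerebrum.plan(obs)
        self.tick += 1
        return self.cerebellum.act(obs, self.plan["waypoint"])
```

The point of the sketch is the rate split: the expensive VLM call sits outside the inner loop, so the collision-avoiding policy can keep emitting commands at control frequency even while the cerebrum is still reasoning about the goal.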
Practical Impact
The FSUNav framework has significant practical implications for deploying vision-language navigation on heterogeneous robotic platforms. It can:

* work with diverse robotic platforms (e.g., wheeled, quadruped, humanoid)
* achieve real-time performance with low latency
* prioritize safety mechanisms, such as collision avoidance and dynamic obstacle avoidance
* understand open-vocabulary instructions and adapt to new situations

These capabilities make it well suited for practical deployment in complex dynamic environments.
Analogy / Intuitive Explanation
Imagine a robotic navigator that can understand natural language instructions, such as "Find the red ball in the room." The Cerebrum module is like a librarian that can understand the language and retrieve relevant information about the object (red ball). The Cerebellum module is like a GPS system that can use this information to navigate to the object, avoiding obstacles and adapting to changing environments. The FSUNav framework is like a combination of these two systems, enabling the robotic navigator to understand and execute complex instructions in real-time.
Paper Information
Categories: cs.RO
arXiv ID: 2604.03139v1
