ROCKET-1

Master the visual-temporal context prompting model for open-world interactions.

ROCKET-1 is a Visual-Language Model (VLM) designed for embodied decision-making in open-world environments. It connects a high-level VLM with a low-level policy model through a visual-temporal context prompting protocol: object segmentations drawn from past and current observations guide the policy's interactions with the environment. This design taps the visual-language reasoning capabilities of VLMs, enabling agents to solve complex creative tasks, especially those requiring spatial understanding. Experiments in Minecraft show that ROCKET-1 lets agents accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making.
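To make the protocol concrete, here is a minimal sketch of visual-temporal context prompting: recent frames are stacked together with a binary segmentation mask of the target object as an extra channel, so the mask (rather than text) tells the policy which object to interact with. All names here (Observation, build_context) are hypothetical illustrations; ROCKET-1's actual interfaces may differ.

```python
from dataclasses import dataclass
from collections import deque
import numpy as np

@dataclass
class Observation:
    frame: np.ndarray  # RGB image, shape (H, W, 3)
    mask: np.ndarray   # binary segmentation of the target object, shape (H, W)

def build_context(history: deque, current: Observation, window: int = 4) -> np.ndarray:
    """Stack the most recent observations, each with its object mask
    appended as a fourth channel, into a fixed-length temporal context."""
    obs = list(history)[-(window - 1):] + [current]
    stacked = [np.dstack([o.frame, o.mask[..., None]]) for o in obs]
    # Pad with zeros at the start of an episode so the context length is fixed.
    while len(stacked) < window:
        stacked.insert(0, np.zeros_like(stacked[0]))
    return np.stack(stacked)  # shape (window, H, W, 4)

# Usage: at each step, a VLM (e.g. via a segmentation model it prompts)
# produces the target-object mask; the policy consumes the context.
history: deque = deque(maxlen=16)
frame = np.zeros((128, 128, 3), dtype=np.uint8)
mask = np.zeros((128, 128), dtype=np.uint8)
context = build_context(history, Observation(frame, mask))
history.append(Observation(frame, mask))
print(context.shape)  # (4, 128, 128, 4)
```

The key design choice this sketch illustrates is that the VLM communicates intent to the policy purely through segmentation in pixel space, sidestepping the ambiguity of describing spatial targets in natural language.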