ROCKET-1
Master the visual-temporal context prompting model for open-world interactions.
Common Product, Programming, Visual-Language Model, Embodied Decision-Making
ROCKET-1 is a low-level policy model for embodied decision-making in open-world environments. It connects Visual-Language Models (VLMs) with policy models through a visual-temporal context prompting protocol, which guides policy-environment interactions using object segmentation from past and current observations. This unlocks the visual-language reasoning capabilities of VLMs, enabling them to solve complex creative tasks, especially those that depend on spatial understanding. Experiments in Minecraft show that the approach lets agents accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making.
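To make the idea of a visual-temporal context more concrete, here is a minimal sketch of how recent observations might be paired with object segmentation masks before being handed to a policy. It is not ROCKET-1's actual code: the function name `build_context`, the `window` parameter, and the use of small grayscale grids in place of real image frames are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # H x W pixel intensities (grayscale, for brevity)
Mask = List[List[int]]   # H x W binary mask: 1 marks the highlighted object


def build_context(
    frames: List[Frame],
    masks: List[Optional[Mask]],
    window: int,
) -> List[Tuple[Frame, Mask]]:
    """Pair each of the last `window` frames with its segmentation mask.

    Hypothetical helper: frames without a mask get an all-zero mask so that
    every step in the context has the same shape.
    """
    if len(frames) != len(masks):
        raise ValueError("frames and masks must have the same length")
    if window <= 0:
        raise ValueError("window must be positive")

    context = []
    for frame, mask in zip(frames[-window:], masks[-window:]):
        if mask is None:
            # No object highlighted at this step: use an empty mask.
            mask = [[0] * len(row) for row in frame]
        context.append((frame, mask))
    return context


# Three 2x2 observations; the first has no segmentation available.
frames = [[[10, 20], [30, 40]], [[11, 21], [31, 41]], [[12, 22], [32, 42]]]
masks = [None, [[0, 1], [0, 0]], [[0, 1], [0, 1]]]

# Keep only the two most recent steps as the policy's context.
context = build_context(frames, masks, window=2)
```

In this sketch the masks play the role of the prompt: they mark which object in past and current frames the policy should attend to, while the policy itself (not shown) would map the paired frames and masks to actions.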
ROCKET-1 Visit Over Time
Monthly Visits: 231
Bounce Rate: 47.66%
Pages per Visit: 1.4
Visit Duration: 00:00:50