Video Language Planning
Complex, long-term visual planning
Common Product, Video, Visual Planning, Multi-Modal
Video Language Planning (VLP) is an algorithm that performs complex, long-horizon visual planning by training vision-language models and text-to-video models. VLP takes a long-horizon task instruction and the current image observation as input and outputs a detailed multi-modal (video and language) plan describing how to complete the task. It can generate long-horizon video plans across a range of robotics domains, from multi-object rearrangement to multi-camera, dual-arm dexterous manipulation. The generated video plans can be converted into real robot actions through goal-conditioned policies. Experiments show that VLP substantially improves the success rate on long-horizon tasks compared with previous methods.
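The input/output flow described above can be illustrated with a minimal Python sketch. It is not the VLP implementation: every name in it (PlanStep, make_plan, plan_to_actions, and the toy_ stand-ins for the vision-language model, the text-to-video model, and the goal-conditioned policy) is hypothetical, and a one-number "frame" stands in for an image.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "frame" is a list of floats standing in for an image.
Frame = List[float]


@dataclass
class PlanStep:
    subgoal: str         # language part of the plan
    frames: List[Frame]  # video part of the plan


def make_plan(
    instruction: str,
    observation: Frame,
    propose_subgoal: Callable[[str, Frame], str],
    generate_video: Callable[[Frame, str], List[Frame]],
    horizon: int,
) -> List[PlanStep]:
    """Alternate between proposing a text subgoal and generating a short
    video for it, starting each step from the last generated frame."""
    steps: List[PlanStep] = []
    current = observation
    for _ in range(horizon):
        subgoal = propose_subgoal(instruction, current)
        frames = generate_video(current, subgoal)
        steps.append(PlanStep(subgoal, frames))
        current = frames[-1]
    return steps


def plan_to_actions(
    steps: List[PlanStep],
    observation: Frame,
    goal_policy: Callable[[Frame, Frame], float],
) -> List[float]:
    """Turn each planned frame into an action by treating it as a goal.
    Assumption: every action reaches its goal frame exactly."""
    actions: List[float] = []
    current = observation
    for step in steps:
        for goal in step.frames:
            actions.append(goal_policy(current, goal))
            current = goal
    return actions


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_propose_subgoal(instruction: str, frame: Frame) -> str:
    return f"{instruction}: advance from {frame[0]}"


def toy_generate_video(frame: Frame, subgoal: str) -> List[Frame]:
    # Two frames per subgoal, each moving the scalar state by 0.5.
    return [[frame[0] + 0.5], [frame[0] + 1.0]]


def toy_goal_policy(current: Frame, goal: Frame) -> float:
    # The "action" is the displacement needed to reach the goal frame.
    return goal[0] - current[0]


if __name__ == "__main__":
    plan = make_plan("move block", [0.0], toy_propose_subgoal,
                     toy_generate_video, horizon=2)
    print(plan_to_actions(plan, [0.0], toy_goal_policy))
```

In this sketch the plan keeps both modalities side by side, so each language subgoal stays attached to the frames generated for it. Handing planned frames to the policy as goals, one at a time, is one simple reading of "goal-conditioned" and is an assumption made here for illustration.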
Video Language Planning Visits Over Time
Monthly Visits: 672
Bounce Rate: 52.77%
Pages per Visit: 1.5
Visit Duration: 00:01:26