Video Language Planning

Complex, long-term visual planning

Common Product, Video, Visual Planning, Multi-Modal

Video Language Planning (VLP) is an algorithm that performs complex, long-term visual planning by training vision-language models and text-to-video models. Given a long-term task instruction and the current image observation, VLP outputs a detailed multi-modal (video and language) plan describing how to complete the task. It can generate long-term video plans across a range of robotics domains, from multi-object rearrangement to multi-camera, dual-arm dexterous manipulation. The generated video plans can be converted into real robot actions through a goal-conditioned policy. Experiments show that VLP significantly improves success rates on long-term tasks compared to previous methods.
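The flow described above (instruction and image in, interleaved language and video plan out, then a goal-conditioned policy to produce actions) can be illustrated with a minimal Python sketch. This is not the VLP implementation: the names `plan`, `execute`, `PlanStep`, and the three toy model functions are hypothetical stand-ins, frames are reduced to one-element lists of floats, and the fixed `num_steps` loop is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # toy stand-in for an image observation (assumption)


@dataclass
class PlanStep:
    subgoal: str         # language part of the plan
    frames: List[Frame]  # video part of the plan


def plan(instruction: str, observation: Frame,
         propose_subgoal: Callable[[str, Frame], str],
         generate_video: Callable[[Frame, str], List[Frame]],
         num_steps: int) -> List[PlanStep]:
    """Alternate between a language model and a video model to build a plan."""
    steps = []
    current = observation
    for _ in range(num_steps):
        subgoal = propose_subgoal(instruction, current)  # hypothetical VLM call
        frames = generate_video(current, subgoal)        # hypothetical text-to-video call
        steps.append(PlanStep(subgoal, frames))
        current = frames[-1]  # continue planning from the last predicted frame
    return steps


def execute(steps: List[PlanStep], observation: Frame,
            policy: Callable[[Frame, Frame], float]) -> List[float]:
    """Turn each planned frame into an action with a goal-conditioned policy."""
    actions = []
    for step in steps:
        for goal in step.frames:
            actions.append(policy(observation, goal))
            observation = goal  # toy assumption: the action reaches the goal frame
    return actions


# Toy models, purely illustrative.
def toy_subgoal(instruction: str, frame: Frame) -> str:
    return f"{instruction} (from {frame[0]:.1f})"


def toy_video(frame: Frame, subgoal: str) -> List[Frame]:
    return [[frame[0] + 0.5], [frame[0] + 1.0]]


def toy_policy(obs: Frame, goal: Frame) -> float:
    return goal[0] - obs[0]


steps = plan("stack blocks", [0.0], toy_subgoal, toy_video, num_steps=2)
actions = execute(steps, [0.0], toy_policy)
```

In a real system the three toy functions would be replaced by trained models, and the policy would act on camera images rather than scalar stand-ins.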

Video Language Planning Alternatives

OpenCompass Multi-modal Leaderboard — Real-time updated leaderboard of multi-modal model performance

4M — Multi-modal and Multi-task Model Training Framework

Griffon — High-resolution multi-modal perception LVLM

Mobile-Agent — Autonomous Multi-Modal Mobile Device Agent

Fuyu-8B — A small multi-modal model that supports image and text generation

DevMind AI — Multi-Modal AI Development Assistant

Unified-IO 2 — A unified multi-modal generation model

UniVG — Unified Multi-Modal Video Generation System

Kimi-VL — A highly efficient open-source mixture-of-experts visual language model with multi-modal reasoning capabilities.

Reka Core — Powerful multi-modal LLM, commercial solution.

Kosmos-2 — A world-facing multi-modal large language model

Magma-8B — Magma-8B is a multi-modal AI model developed by Microsoft that processes image and text inputs to generate text outputs.

Media2Face — Multi-modal Guided Co-speech Facial Animation Generation

Any GPT — A multi-modal large-scale language model

VCoder — VCoder is a visual perception model that can improve the performance of multi-modal large language models on object-level visual tasks.

SEED-Story — Multi-modal Long-form Story Generation Model

MagicAvatar — Multi-modal Avatar Generation and Animation

Silo — Multi-modal conversation, text-to-image

MNN-LLM Android App — A lightweight multi-modal language model Android application.

Mini-Gemini — A multi-modal AI model with both image understanding and generation capabilities.

Video-MME — The first comprehensive benchmark for evaluating the performance of Multi-Modal Large Language Models (MLLMs) in video analysis.

Google Gemini.co — Google's largest and most powerful multi-modal AI model

Runway gen2 — A multi-modal artificial intelligence system that can generate new videos based on text, images, or video clips.

Janus-Pro-1B — Janus-Pro-1B is an autoregressive framework for unified multi-modal understanding and generation.

Multi-modal Large Language Models — Provides a comprehensive evaluation of MLLMs

EgoLife — EgoLife is a long-term, multi-modal, multi-view daily life AI assistant project aimed at advancing research in long-term context understanding.

Migician — Migician is a multi-modal large language model focusing on multi-image localization, capable of achieving free-form, precise multi-image localization.

HPT — HPT is an innovative multi-modal LLM framework launched by HyperGAI, designed to understand and process various input modalities including text, images, and videos.

SmolVLM — An efficient open-source visual language model