Pandora
A general world model that takes natural language actions and produces video states
Common Product, Video, Natural Language Processing, Video Generation
Pandora is a step towards a general world model: it simulates world states by generating video and lets users control the video content at any time with natural language. Unlike previous text-to-video models, Pandora accepts free-form text actions at any point during video generation, enabling real-time control. This capability fulfills the promise of world models to support interactive content generation and more robust reasoning and planning. Pandora generates videos across many domains, including indoor and outdoor, natural and urban, human and robot, and 2D and 3D environments. Instruction tuning on high-quality data also lets the model learn actions in one domain and apply them in another, unseen domain. Because the model is autoregressive, Pandora can generate videos longer than those it was trained on. As a preliminary step towards a general world model, Pandora still has limitations: it can fail to generate consistent videos, simulate complex scenarios, respect common sense and physical laws, or follow instructions and actions. Even so, it demonstrates considerable potential for video generation and natural language control.
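The interaction pattern described above (autoregressive generation with a text action allowed at any step) can be illustrated with a toy sketch. This is not Pandora's code or API: `ToyWorldModel`, `next_chunk`, and `rollout` are hypothetical names, and frames are plain strings standing in for video frames. It only shows how each new chunk is conditioned on everything generated so far plus the most recent action, so the rollout can be steered midway and can run for as many chunks as requested.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Frame = str  # stand-in for a video frame; a real model would use tensors


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a world model, not Pandora's interface."""
    chunk_len: int = 2

    def next_chunk(self, history: List[Frame], action: Optional[str]) -> List[Frame]:
        # A real model would condition a video generator on the history and
        # the action text. Here each frame just records its index and the
        # action in force when it was produced.
        start = len(history)
        label = action or "continue"
        return [f"frame{start + i}:{label}" for i in range(self.chunk_len)]


def rollout(model: ToyWorldModel, first_frame: Frame,
            actions: Dict[int, str], num_chunks: int) -> List[Frame]:
    """Autoregressive rollout with optional text actions before any chunk."""
    frames = [first_frame]
    current: Optional[str] = None
    for step in range(num_chunks):
        # A newly supplied action replaces the previous one; otherwise the
        # previous action stays in force.
        current = actions.get(step, current)
        frames.extend(model.next_chunk(frames, current))
    return frames


video = rollout(
    ToyWorldModel(chunk_len=2),
    "frame0:init",
    actions={0: "walk forward", 2: "turn left"},
    num_chunks=3,
)
print(video)
```

In this sketch the action given before the third chunk changes the labels of the last two frames only, mirroring the idea of steering a video that is already in progress. The number of chunks is chosen by the caller, so the output length is not tied to any fixed clip length.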