Meta recently launched Emu Video, which adopts a factorized (decomposed) generation approach: it first generates an image from the text prompt, then generates a video conditioned on both that image and the text. Emu Video keeps the parameters of a pretrained text-to-image model frozen and trains only newly added temporal parameters for the text-to-video task. A multi-stage training strategy is used to improve the quality of the generated videos and their consistency with the text. In human evaluations, Meta reported that Emu Video's 4-second videos surpassed Gen-2 in both quality and semantic consistency.
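
To make the "frozen text-to-image backbone plus trainable temporal parameters" idea concrete, here is a minimal PyTorch-style sketch. It is an illustration under assumed details, not Meta's actual implementation: the `TemporalAttentionBlock`, the block depth, the channel width, and the optimizer settings are all hypothetical, and only the general pattern (freeze the pretrained spatial weights, train the inserted temporal layers) reflects the description above.

```python
import torch
import torch.nn as nn


class TemporalAttentionBlock(nn.Module):
    """Hypothetical temporal layer inserted alongside frozen spatial blocks.

    It attends only across the frame axis, so the pretrained per-frame
    (image) computation is left untouched.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * height * width, frames, channels) -- attention over frames
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + h  # residual connection


def build_trainable_temporal_params(t2i_model: nn.Module,
                                    channels: int = 320,
                                    depth: int = 4):
    """Freeze the pretrained text-to-image model and create new temporal
    layers whose parameters are the only ones passed to the optimizer."""
    # Freeze every pretrained (spatial / text-to-image) parameter.
    for p in t2i_model.parameters():
        p.requires_grad = False

    # Newly added temporal layers are the only trainable parameters.
    temporal_layers = nn.ModuleList(
        TemporalAttentionBlock(channels) for _ in range(depth)
    )
    optimizer = torch.optim.AdamW(temporal_layers.parameters(), lr=1e-4)
    return temporal_layers, optimizer


if __name__ == "__main__":
    # Stand-in for a pretrained text-to-image backbone.
    dummy_t2i = nn.Sequential(nn.Conv2d(4, 320, 3, padding=1))
    temporal_layers, optim = build_trainable_temporal_params(dummy_t2i)
    print(sum(p.numel() for p in temporal_layers.parameters()), "trainable temporal parameters")
```

In this kind of setup, the first stage (text-to-image) can be run with the frozen backbone alone, and the second stage (image-plus-text-to-video) reuses the same backbone per frame while the temporal layers learn motion, which is the practical benefit of the factorized design described above.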