DiTCtrl

Explore attention control in multimodal diffusion transformers for tuning-free, multi-prompt long video generation.

Common Product, Video, Video Generation, Multimodal
DiTCtrl is a video generation model built on the Multimodal Diffusion Transformer (MM-DiT) architecture. It generates coherent videos from multiple sequential prompts without additional training. By analyzing MM-DiT's attention mechanism, it achieves precise semantic control and shares attention between prompts, producing videos with smooth transitions and consistent object motion. Its main advantages are that it requires no training, handles multi-prompt video generation, and can produce cinematic transition effects. DiTCtrl also introduces MPVBench, a new benchmark designed specifically for evaluating multi-prompt video generation.

DiTCtrl Alternatives