Are you tired of AI-generated short videos that, while realistic, always seem to lack a certain "je ne sais quoi"? A groundbreaking new technology called Long Context Tuning (LCT) is changing the game, enabling AI video generation models to direct multi-shot narrative videos, much like movies and TV shows, seamlessly switching between different shots to create more coherent and engaging storylines.


Previously, top-tier AI video generation models like Sora, Kling, and Gen-3 could create realistic single-shot videos up to a minute long. However, this fell far short of the needs of real-world narrative videos composed of multiple shots (like a scene in a movie). A film scene typically consists of a series of single shots capturing different aspects of the same continuous event.

For example, the iconic scene in Titanic where Jack and Rose meet on the deck comprises four main shots: a close-up of Jack turning around, a medium shot of Rose speaking, a wide shot of Rose walking towards Jack, and a close-up of Jack embracing Rose from behind. Generating such a scene requires ensuring high consistency in visual aspects (e.g., character features, background, lighting, and tone) and temporal dynamics (e.g., the rhythm of character movements and the smoothness of camera motion) across different shots to maintain narrative fluidity.

Researchers have proposed various methods to bridge the gap between single-shot generation and multi-shot narratives, but most have limitations. Some methods rely on inputting key visual elements (such as character identities and backgrounds) to enforce visual consistency across shots, but struggle to control more abstract elements like lighting and tone. Others first generate a set of coherent keyframes and then use an image-to-video (I2V) model to independently synthesize each shot, which makes it difficult to guarantee temporal consistency between shots, and sparse keyframes also limit the effectiveness of conditioning.

LCT addresses these challenges. It's like giving a pre-trained single-shot video diffusion model a "super brain," enabling it to process longer context information and directly learn inter-shot coherence from scene-level video data. LCT's core innovations include:

Extended Full Attention Mechanism: LCT extends the full attention mechanism, originally applied to individual shots, to encompass all shots within a scene. This means the model can simultaneously "attend" to all visual and textual information across the entire scene while generating video, better understanding and maintaining cross-shot dependencies.
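The idea can be illustrated with a toy NumPy sketch (all names here are illustrative, not the paper's code): rather than running attention within each shot separately, the tokens of every shot in the scene are concatenated and one full attention pass covers them all, so a query in one shot can attend to keys and values in every other shot.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scene_full_attention(shot_tokens):
    """Toy sketch: concatenate all shots' tokens and run a single
    full attention over the whole scene, so cross-shot dependencies
    are visible to every token."""
    x = np.concatenate(shot_tokens, axis=0)  # (total_tokens, d)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # scene-wide attention scores
    return softmax(scores) @ x

# three "shots" with different token counts, feature dim 8
rng = np.random.default_rng(0)
shots = [rng.normal(size=(n, 8)) for n in (4, 6, 5)]
out = scene_full_attention(shots)
print(out.shape)  # (15, 8)
```

In a real model this is multi-head attention over latent video patches plus text tokens; the sketch only shows the key change of scope, from per-shot sequences to the scene-level sequence.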

Interleaved 3D Positional Embeddings: To distinguish tokens (basic units of text and video) in different shots, LCT introduces interleaved 3D Rotary Position Embeddings (RoPE). This is like giving each shot and its internal tokens unique "labels," allowing the model to recognize each shot's individuality while understanding its relative position within the entire scene.
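One plausible way to picture this (a hedged sketch, not the paper's exact scheme) is to give every video token a (t, h, w) coordinate for a 3D RoPE, with each shot's temporal coordinates offset by a gap so that the embedding can tell shots apart while still encoding their relative order in the scene:

```python
import numpy as np

def shot_positions_3d(num_shots, frames_per_shot, height, width, shot_gap=4):
    """Hedged sketch: assign each video token a (t, h, w) coordinate,
    offsetting the temporal axis per shot (with a small gap) so a
    rotary embedding can distinguish shots yet preserve their order.
    `shot_gap` is an illustrative parameter, not from the paper."""
    positions = []
    t_offset = 0
    for _ in range(num_shots):
        for t in range(frames_per_shot):
            for h in range(height):
                for w in range(width):
                    positions.append((t_offset + t, h, w))
        t_offset += frames_per_shot + shot_gap  # gap marks the shot boundary
    return np.array(positions)

pos = shot_positions_3d(num_shots=2, frames_per_shot=3, height=2, width=2)
print(pos.shape)        # (24, 3)
print(pos[:, 0].max())  # 9: shot 2 starts at t = 3 + 4, its last frame is t = 9
```

The "interleaved" part refers to how each shot's text and video tokens are arranged together in the sequence; the coordinates above only illustrate the per-shot offsetting idea.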

Asynchronous Noise Scheduling: LCT innovatively applies independent diffusion timesteps to each shot. This allows the model to learn dynamic dependencies between different shots during training and more effectively utilize cross-shot information. For example, when one shot has a lower noise level, it can naturally serve as a rich source of visual information, guiding the denoising process of other shots with higher noise levels. This strategy also facilitates subsequent visual conditional inputs and joint generation.
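A minimal sketch of the training-time sampling, assuming a toy linear noise schedule (the function and schedule here are illustrative, not the paper's implementation): each shot in a scene draws its own diffusion timestep independently, so some shots end up lightly noised and can serve as visual references for heavily noised ones.

```python
import numpy as np

def noisy_scene(shot_latents, num_train_steps=1000, rng=None):
    """Hedged sketch of asynchronous noise scheduling: every shot
    gets an independently sampled timestep, then is noised to that
    level under a toy linear schedule."""
    rng = rng or np.random.default_rng()
    noised, timesteps = [], []
    for z in shot_latents:
        t = int(rng.integers(0, num_train_steps))  # independent per shot
        alpha = 1.0 - t / num_train_steps          # toy linear schedule
        eps = rng.normal(size=z.shape)
        noised.append(np.sqrt(alpha) * z + np.sqrt(1.0 - alpha) * eps)
        timesteps.append(t)
    return noised, timesteps

# three shot latents, each (4 tokens, dim 8)
shots = [np.zeros((4, 8)) for _ in range(3)]
noised, ts = noisy_scene(shots, rng=np.random.default_rng(0))
print(len(noised), ts)
```

Because timesteps differ per shot, a shot with a small t (little noise) naturally conditions the denoising of a shot with a large t, which is also what makes clean visual inputs easy to plug in at inference.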

Experimental results show that single-shot models tuned with LCT excel at generating coherent multi-shot scenes and exhibit surprising new capabilities. For example, they can generate compositions from given character identities and environment images, even without prior training on such tasks. Furthermore, the LCT model supports autoregressive shot extension, enabling both continuous single-shot extension and multi-shot extension with shot transitions. This is particularly useful for long-video creation, as it decomposes long-video generation into multiple scene segments that users can modify interactively.

Moreover, the researchers found that after LCT, models with bidirectional attention can be further fine-tuned to use context-causal attention. This modified mechanism keeps bidirectional attention within each shot, but across shots information flows only from earlier shots to later ones. This unidirectional flow allows efficient use of a KV-cache (reusing the cached attention keys and values of already-generated shots) during autoregressive generation, significantly reducing computational cost.
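The attention pattern described above can be written as a block mask (a sketch under the stated structure, not the paper's code): full attention inside each shot's block, causal attention between shot blocks.

```python
import numpy as np

def shot_causal_mask(shot_lengths):
    """Sketch of context-causal attention: tokens attend freely
    (bidirectionally) within their own shot, but across shots only
    earlier shots are visible to later ones. True = may attend."""
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start_q = 0
    for qi, lq in enumerate(shot_lengths):
        start_k = 0
        for ki, lk in enumerate(shot_lengths):
            if ki <= qi:  # own shot (bidirectional) or any earlier shot
                mask[start_q:start_q + lq, start_k:start_k + lk] = True
            start_k += lk
        start_q += lq
    return mask

# shot 1 has 2 tokens, shot 2 has 3: rows 0-1 see only shot 1,
# rows 2-4 see both shots
m = shot_causal_mask([2, 3])
print(m.astype(int))
```

Because a later shot never changes the keys and values of earlier shots under this mask, those keys and values can be cached once and reused for every subsequently generated shot, which is exactly why autoregressive extension becomes cheap.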

As shown in Figure 1, LCT technology can be directly applied to short film production, enabling scene-level video generation. Even more exciting, it has also spawned several emerging capabilities, including interactive multi-shot directing, single-shot extension, and zero-shot compositional generation, even though the model was never trained for these specific tasks. Figure 2 shows an example of scene-level video data, which includes a global prompt (describing characters, environment, and story outline) and specific event descriptions for each shot.

In summary, Long Context Tuning (LCT) opens new avenues for more practical visual content creation by expanding the context window of single-shot video diffusion models, enabling them to directly learn scene-level coherence from data. This technology not only enhances the narrative ability and coherence of AI-generated videos but also provides new ideas for future long-video generation and interactive video editing. We have reason to believe that future video creation will become more intelligent and creative thanks to advancements in technologies like LCT.

Project Address: https://guoyww.github.io/projects/long-context-video/

Paper Address: https://arxiv.org/pdf/2503.10589