Lumina-T2X
A unified text-to-any-modality generation framework
Common Product, Image, Text-to-Image, Text-to-Video
Lumina-T2X is a text-to-any-modality generation framework that converts text descriptions into images, videos, multi-view 3D images, and synthesized speech. It is built on the Flow-based Large Diffusion Transformer (Flag-DiT) architecture, which scales to 7 billion parameters and to sequence lengths of up to 128,000 tokens. Lumina-T2X encodes images, videos, multi-view renderings of 3D objects, and audio spectrograms into a unified spatiotemporal latent token space, which allows it to generate outputs at any resolution, aspect ratio, and duration.
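As a rough illustration of why a shared token space permits arbitrary resolutions, aspect ratios, and durations, the sketch below counts the tokens a clip would occupy and compares that count with the 128,000-token limit quoted above. The function names, the 8x latent downsampling factor, and the 2x2 patch size are assumptions chosen for illustration; they are not taken from this page or confirmed as the Lumina-T2X configuration.

```python
from math import ceil

# Sequence-length limit quoted in the description above.
MAX_TOKENS = 128_000


def token_count(frames, height, width, patch=2, downsample=8):
    """Estimate tokens for a clip of `frames` images of size height x width.

    Assumption (hypothetical values): each frame is first reduced by
    `downsample` in both dimensions, then cut into patch x patch tiles,
    with one token per tile.
    """
    latent_h = ceil(height / downsample)
    latent_w = ceil(width / downsample)
    return frames * ceil(latent_h / patch) * ceil(latent_w / patch)


def fits(frames, height, width, **kwargs):
    """Return True if the clip's token count stays within MAX_TOKENS."""
    return token_count(frames, height, width, **kwargs) <= MAX_TOKENS


if __name__ == "__main__":
    # A square image and a wide panorama with the same pixel area
    # occupy the same number of tokens under these assumptions.
    print(token_count(1, 1024, 1024))
    print(token_count(1, 512, 2048))
    # A short low-resolution clip fits; a long high-resolution clip does not.
    print(fits(16, 512, 512))
    print(fits(128, 1024, 1024))
```

Under these assumptions, only the total token count matters: an image, a panorama, or a video is simply a sequence of a different length, so the same model can in principle handle all of them up to the sequence limit.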
Lumina-T2X Visit Over Time
Monthly Visits: 494,758,773
Bounce Rate: 37.69%
Pages per Visit: 5.7
Visit Duration: 00:06:29