Lumina-T2X

A unified text-to-any-modality generation framework

Common Product, Image, Text-to-Image, Text-to-Video
Lumina-T2X is a text-to-any-modality generation framework that converts text descriptions into images, videos, multi-view 3D renderings, and synthetic speech. It is built on the Flow-based Large Diffusion Transformer (Flag-DiT) architecture, which scales to 7 billion parameters and to sequence lengths of up to 128,000 tokens. Lumina-T2X encodes images, videos, multi-view 3D objects, and audio spectrograms into a unified spatiotemporal latent token space, which lets it generate outputs at any resolution, aspect ratio, and duration.
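
To make the idea of a unified token space more concrete, here is a rough sketch of how an output of a given size might translate into a token count and be checked against the 128,000-token sequence length mentioned above. The downsampling factor, patch size, and function names are illustrative assumptions, not values or APIs taken from Lumina-T2X, and temporal compression is ignored.

```python
from math import ceil

# Assumed values for illustration only; the description above does not state them.
VAE_DOWNSAMPLE = 8     # hypothetical spatial downsampling of the latent encoder
PATCH_SIZE = 2         # hypothetical patch edge, measured in latent pixels
MAX_TOKENS = 128_000   # sequence length quoted in the description


def token_count(height: int, width: int, frames: int = 1) -> int:
    """Tokens needed for `frames` images of height x width pixels."""
    stride = VAE_DOWNSAMPLE * PATCH_SIZE
    rows = ceil(height / stride)
    cols = ceil(width / stride)
    return frames * rows * cols


def fits(height: int, width: int, frames: int = 1) -> bool:
    """True if the flattened token sequence stays within MAX_TOKENS."""
    return token_count(height, width, frames) <= MAX_TOKENS


if __name__ == "__main__":
    print(token_count(1024, 1024))           # square image
    print(token_count(512, 2048))            # wide panorama, same token budget
    print(fits(720, 1280, frames=16))        # short video clip
    print(fits(720, 1280, frames=128))       # longer clip exceeds the budget
```

Under these assumptions, a 1024x1024 image and a 512x2048 panorama cost the same number of tokens. This illustrates why a model that operates on flattened token sequences does not need to be tied to one aspect ratio.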

Lumina-T2X Visit Over Time

Monthly Visits: 488,643,166

Bounce Rate: 37.28%

Pages per Visit: 5.7

Visit Duration: 00:06:37

Lumina-T2X Alternatives