Unified-IO 2
A unified multi-modal generation model
Common Product, Image, Multi-Modal, Transformer
Unified-IO 2 is a unified multi-modal generation model that can understand and generate images, text, audio, and actions. It uses a single encoder-decoder Transformer that represents inputs and outputs of every modality (images, text, audio, actions, and so on) in a shared semantic space. The model is trained from scratch on large-scale multi-modal pre-training data and optimized with multi-modal denoising objectives. To learn a wide range of skills, it is then fine-tuned on 120 existing datasets, supplemented with prompts and data augmentation. Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results across 30+ benchmarks covering image generation and understanding, text understanding, video and audio understanding, and robotic manipulation.
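To make the idea of a shared representation and a denoising objective more concrete, here is a minimal toy sketch in Python. It is not Unified-IO 2's actual tokenization, vocabulary, or training code. The id ranges, the `MASK_ID` sentinel, the masking ratio, and the function names `to_shared_tokens` and `make_denoising_pair` are all hypothetical and chosen only for illustration. The sketch shows two things: tokens from different modalities placed in one shared id space, and a corrupted input paired with the original tokens a model would be trained to reconstruct.

```python
import random

# Hypothetical vocabulary layout: each modality gets its own id range
# inside one shared vocabulary. The sizes are illustrative only.
MODALITY_OFFSETS = {"text": 0, "image": 1000, "audio": 2000, "action": 3000}
RANGE_SIZE = 1000
MASK_ID = 4000  # hypothetical sentinel id marking a corrupted position


def to_shared_tokens(modality, local_ids):
    """Shift modality-local ids (0..RANGE_SIZE-1) into the shared vocabulary."""
    offset = MODALITY_OFFSETS[modality]
    for i in local_ids:
        if not 0 <= i < RANGE_SIZE:
            raise ValueError(f"local id out of range: {i}")
    return [offset + i for i in local_ids]


def make_denoising_pair(tokens, mask_ratio, rng):
    """Corrupt a token sequence and return (corrupted_input, targets).

    targets maps each masked position to its original token, which is
    what a decoder would be trained to reconstruct.
    """
    if not tokens:
        return [], {}
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = corrupted[p]
        corrupted[p] = MASK_ID
    return corrupted, targets


# One mixed sequence: a short caption followed by image patch codes.
sequence = (
    to_shared_tokens("text", [5, 17, 42])
    + to_shared_tokens("image", [7, 7, 301, 12, 88])
)
corrupted, targets = make_denoising_pair(
    sequence, mask_ratio=0.25, rng=random.Random(0)
)
```

In this toy setup a single sequence can mix text and image tokens, and the same corruption routine applies regardless of modality. A real model would use learned tokenizers or encoders per modality and a more elaborate corruption scheme than uniform masking.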
Unified-IO 2 Visit Over Time
Monthly Visits: 115
Bounce Rate: 98.90%
Page per Visit: 1.0
Visit Duration: 00:00:00